python - Informative features not returning Cyrillic characters

Tags: python nltk feature-extraction python-unicode

I have now switched to Python 3.6, but when I show the most informative features and my feature extractor tries to print Russian, I end up with gibberish.

Most Informative Features
  three_last_letters = 'Ð¾Ì'            noun : verb   =      6.6 : 1.0
  three_last_letters = 'Ð³Ð'            noun : verb   =      5.4 : 1.0
  three_last_letters = 'ÐµÐ'            noun : verb   =      4.7 : 1.0
  three_last_letters = 'Ð¼Ð'            noun : verb   =      4.4 : 1.0
  three_last_letters = 'Ð½Ñ'            noun : verb   =      3.5 : 1.0

As for the feature extractor itself:

def POS_features(word):
    # Use the last three characters of the word as the only feature.
    return {'three_last_letters': word[-3:]}

print(POS_features(u'Богатир'))

This prints тир just fine. Is there anything I can do to get the informative features to return Russian characters as well?

Best answer

I figured out what I was doing wrong:

import nltk
vocab = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    "C:\\Users\\Admin\\AppData\\Roaming\\nltk_data\\corpora\\russian\\vocab", r'.*\.txt', cat_pattern=r'^(noun|verb)', encoding="utf8"
)

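When I imported my vocab folder, I had been reading it as latin-1. With the encoding set to utf8, as in the call above, all is well and the Cyrillic characters are returned to me: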

 Most Informative Features
      three_last_letters = 'ать'            verb : noun   =     15.2 : 1.0
      three_last_letters = 'де'             noun : verb   =      2.6 : 1.0
      three_last_letters = 'сть'            noun : verb   =      1.5 : 1.0
      three_last_letters = 'пра'            noun : verb   =      1.4 : 1.0
      three_last_letters = 'ина'            noun : verb   =      1.4 : 1.0
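
For anyone wondering why the wrong encoding produces exactly this kind of gibberish, here is a minimal sketch using only the standard library. It assumes the corpus files are UTF-8 on disk. The temporary file and the `read_words` helper are hypothetical stand-ins for the real corpus reader, not part of NLTK. Each Cyrillic letter takes two bytes in UTF-8, and latin-1 maps every byte to its own character, so the word doubles in length and `word[-3:]` slices through the middle of a letter.

```python
import os
import tempfile


def POS_features(word):
    # Same feature extractor as in the question.
    return {'three_last_letters': word[-3:]}


def read_words(path, encoding):
    # Hypothetical helper: read one word per line, decoding with `encoding`.
    with open(path, encoding=encoding) as f:
        return [line.rstrip('\n') for line in f]


# Write a tiny UTF-8 word list to a temporary file (stand-in for the corpus).
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('Богатир\n')
    path = tmp.name

wrong = read_words(path, 'latin-1')  # UTF-8 bytes decoded as latin-1
right = read_words(path, 'utf-8')    # decoded with the matching encoding
os.remove(path)

# 7 Cyrillic letters become 14 latin-1 characters.
print(len(right[0]), len(wrong[0]))

# With the right encoding the feature is the last three letters.
print(POS_features(right[0]) == {'three_last_letters': 'тир'})

# Strings that were already mis-decoded can be repaired by reversing the step.
repaired = wrong[0].encode('latin-1').decode('utf-8')
print(repaired == right[0])
```

The repair on the last lines only works because latin-1 maps all 256 byte values losslessly. Passing the correct `encoding` when the files are read, as in the answer above, is the cleaner fix.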

This question, "python - Informative features not returning Cyrillic characters", comes from Stack Overflow: https://stackoverflow.com/questions/43850175/
