python - word2vec encoding error

Running my code produces the following error:

Traceback (most recent call last):
  File "test.py", line 21, in <module>
    print model.most_similar(positive=['男人'])
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 660, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word '\xe7\x94\xb7\xe4\xba\xba' not in vocabulary"

Here is my code:

    # -*- coding: utf-8 -*-
    from gensim.models import word2vec
    import logging

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    sentences = word2vec.Text8Corpus('/tmp/text8')
    model = word2vec.Word2Vec(sentences, size=200)
    model.most_similar(['男人'])

Best answer

"It works after the following change: model.most_similar([u'男人'])"

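This suggests you were passing a UTF-8 encoded byte string rather than a unicode string. A good practice is to decode input to unicode as soon as it comes in, and to encode again only on output.

To do that, call .decode('utf-8') on your string.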
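The "decode on input, encode on output" advice can be sketched as follows. This is only an illustration under stated assumptions: `vocab` is a plain dict standing in for a vocabulary keyed by unicode strings (it is not gensim's actual data structure), and `to_unicode` is a hypothetical helper, not part of any library.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for a vocabulary keyed by unicode strings.
# Assumption: the real model stores its words as unicode, as the answer implies.
vocab = {u'男人': 0, u'女人': 1}


def to_unicode(text, encoding='utf-8'):
    """Return text as a unicode string, decoding it first if it is a byte string."""
    if isinstance(text, bytes):
        return text.decode(encoding)
    return text


# The UTF-8 bytes that appear in the KeyError message from the question.
raw = b'\xe7\x94\xb7\xe4\xba\xba'

# A byte string does not match the unicode key, so the lookup fails.
found_raw = raw in vocab

# After decoding, the same word is found.
word = to_unicode(raw)
found_decoded = word in vocab

# Encode only at the output boundary, e.g. before writing to a file or socket.
encoded_again = word.encode('utf-8')
```

Normalizing every incoming word through one helper like this keeps byte strings from reaching the lookup, which is the same effect the `u'男人'` literal achieves for a hard-coded word.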

Regarding python - word2vec encoding error, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31740413/
