python - Convert word2vec bin file to text

Tags: python c gensim word2vec

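From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4GB) is in a binary format that is not useful to me. Tomas Mikolov assures us that "It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it's rather trivial to read the binary file." Unfortunately, I don't know enough C to understand http://word2vec.googlecode.com/svn/trunk/distance.c.

Supposedly gensim can do this too, but all the tutorials I've found seem to be about converting from text, not the other way around.

Can someone suggest modifications to the C code, or instructions for getting gensim to emit text?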

Best answer

I use this code to load the binary model and then save it to a text file:

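from gensim.models.keyedvectors import KeyedVectors

# Load the vectors from the binary word2vec format (the file must be gunzipped first or passed as .bin.gz).
model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
# Write them back out in the plain-text word2vec format.
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)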

References: API and nullege.
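After a conversion of this size it can be worth confirming that the text file is well formed. The following is a minimal sketch, not part of the original answer, assuming the standard word2vec text layout: a header line "vocab_size vector_size" followed by one line per word, with the word and its values separated by single spaces. The function name `check_word2vec_text` is hypothetical.

```python
def check_word2vec_text(txt_path, encoding="utf-8"):
    """Stream through a word2vec text file and verify its shape.

    Assumes the header is "<vocab_size> <vector_size>" and that each later
    line is "<word> <v1> ... <vN>" separated by single spaces.
    Returns (vocab_size, vector_size) or raises ValueError on a mismatch.
    """
    with open(txt_path, encoding=encoding) as f:
        vocab_size, vector_size = map(int, f.readline().split())
        rows = 0
        for line_no, line in enumerate(f, start=2):
            # Strip the newline (and any trailing space) before splitting.
            parts = line.rstrip().split(" ")
            if len(parts) != vector_size + 1:
                raise ValueError(
                    f"line {line_no}: expected {vector_size} values, "
                    f"got {len(parts) - 1}"
                )
            rows += 1
    if rows != vocab_size:
        raise ValueError(f"header says {vocab_size} words, file has {rows}")
    return vocab_size, vector_size
```

Because the file is read line by line, memory use stays small even for a multi-gigabyte output such as the GoogleNews vectors.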

Note:

The code above works with newer versions of gensim. For older versions, I used this code:

from gensim.models import word2vec

model = word2vec.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
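For readers who would rather not install gensim, the binary layout that distance.c reads can also be parsed directly. The sketch below is not from the original answer. It assumes the usual word2vec binary layout: an ASCII header "vocab_size vector_size", then for each word its bytes terminated by a space, followed by vector_size 32-bit floats. It also assumes little-endian byte order, which is what the C tool produces on common x86 machines. The function name `convert_word2vec_bin_to_text` is hypothetical.

```python
import struct


def convert_word2vec_bin_to_text(bin_path, txt_path, encoding="utf-8"):
    """Stream a word2vec binary file into the word2vec text format.

    Assumes little-endian float32 values; returns (vocab_size, vector_size).
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding=encoding) as dst:
        # Header line: "<vocab_size> <vector_size>\n" in ASCII.
        vocab_size, vector_size = map(int, src.readline().split())
        dst.write(f"{vocab_size} {vector_size}\n")

        fmt = f"<{vector_size}f"          # little-endian float32 values
        nbytes = struct.calcsize(fmt)

        for _ in range(vocab_size):
            # Read the word one byte at a time until the separating space.
            word_bytes = bytearray()
            while True:
                ch = src.read(1)
                if not ch:
                    raise EOFError("file ended while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":           # skip the newline left after the previous vector
                    word_bytes.extend(ch)
            word = word_bytes.decode(encoding, errors="replace")

            # Read exactly one vector's worth of bytes and unpack it.
            data = src.read(nbytes)
            if len(data) != nbytes:
                raise EOFError("file ended while reading a vector")
            values = struct.unpack(fmt, data)

            dst.write(word + " " + " ".join(f"{v:.6f}" for v in values) + "\n")
    return vocab_size, vector_size
```

The file is processed one word at a time, so the whole model never has to fit in memory. The output uses six decimal places, so it may differ in precision from what gensim writes.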

This question about converting a word2vec bin file to text comes from Stack Overflow: https://stackoverflow.com/questions/27324292/
