python - What is gensim's 'docvecs'?

Tags: python, nlp, gensim, doc2vec

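The figure above is from Distributed Representations of Sentences and Documents, the paper that introduced Doc2Vec. I am using Gensim's implementations of Word2Vec and Doc2Vec, which are great, but I would like clarification on a few points.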

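For a given doc2vec model dvm, what is dvm.docvecs? My impression is that it is the averaged or concatenated vector that combines all the word embeddings with the paragraph vector d. Is that correct or not?
Supposing dvm.docvecs is not d, can d be accessed on its own? How?
As a bonus, how is d computed? The paper only says: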

In our Paragraph Vector framework (see Figure 2), every paragraph is mapped to a unique vector, represented by a column in matrix D and every word is also mapped to a unique vector, represented by a column in matrix W.

Thanks for any leads!

Best answer

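The docvecs property of a Doc2Vec model holds all of the trained vectors for the "document tags" seen during training. (These are also called "doctags" in the source code.)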

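In the simplest case, which mirrors the Paragraph Vector paper, each text example (paragraph) has just one "tag": a serial integer ID, starting at 0. That ID is then the index into the docvecs object, and the model.docvecs.doctag_syn0 numpy array is essentially the same thing as the (capital) D in your excerpt from the Paragraph Vector paper. In other words, docvecs is not an average or concatenation of word vectors and d; each of its entries is a trained d.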

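(Gensim also supports using string tokens as document tags, giving a document more than one tag, and repeating a tag across many training documents. String tags, if any, are mapped by the dict model.docvecs.doctags to indexes near the end of docvecs.)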
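To make the "indexes near the end" remark concrete, here is a toy sketch in plain Python. It is not gensim's code: the class ToyDocvecs and its method names are hypothetical, and the layout (rows for plain-int tags first, string tags after them in order of first appearance) is an assumption based on the description above.

```python
class ToyDocvecs:
    """Toy model of the tag-to-row lookup; not gensim's actual implementation."""

    def __init__(self, tag_lists):
        # Largest plain-int tag seen; rows 0..max_rawint belong to int tags.
        self.max_rawint = -1
        # String tag -> offset among string tags, in order of first appearance.
        self.string_offsets = {}
        for tags in tag_lists:
            for tag in tags:
                if isinstance(tag, int):
                    self.max_rawint = max(self.max_rawint, tag)
                elif tag not in self.string_offsets:
                    self.string_offsets[tag] = len(self.string_offsets)

    def row_count(self):
        # Total number of rows the vector matrix would need.
        return self.max_rawint + 1 + len(self.string_offsets)

    def row_index(self, tag):
        # Int tags index the matrix directly; string tags sit after the int range.
        if isinstance(tag, int):
            return tag
        return self.max_rawint + 1 + self.string_offsets[tag]


# Three documents, each with an int tag and a (possibly repeated) string tag.
tag_lists = [[0, "sports"], [1, "politics"], [2, "sports"]]
toy = ToyDocvecs(tag_lists)

print(toy.row_index(1))           # 1
print(toy.row_index("sports"))    # 3
print(toy.row_index("politics"))  # 4
print(toy.row_count())            # 5
```

Because "sports" appears on two documents, both of them contribute to training the same row, which is what repeating a tag across documents amounts to.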

Source: the Stack Overflow question "What is gensim's 'docvecs'?", https://stackoverflow.com/questions/41709318/
