nlp - How does gensim compute doc2vec paragraph vectors?

Tags: nlp vectorization gensim word2vec doc2vec

I am reading this paper: http://cs.stanford.edu/~quocle/paragraph_vector.pdf

It states:

" Theparagraph vector and word vectors are averaged or concatenated to predict the next word in a context. In the experiments, we use concatenation as the method to combine the vectors."



How does concatenation or averaging work?

Example (suppose paragraph 1 contains word1 and word2):

word1 vector = [0.1, 0.2, 0.3]
word2 vector = [0.4, 0.5, 0.6]

Concatenation method:
does paragraph vector = [0.1+0.4, 0.2+0.5, 0.3+0.6] ?

Averaging method:
does paragraph vector = [(0.1+0.4)/2, (0.2+0.5)/2, (0.3+0.6)/2] ?

Also, regarding this figure, the paper says:

The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM).



Is the paragraph token equal to the paragraph vector, which is equal to "on"?

[Figure: the PV-DM model diagram from the paper]

Accepted Answer

How does concatenation or averaging work?



You have averaging right. Concatenation would be: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6].
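To make the difference concrete, here is a small sketch using the two toy word vectors from your question (numpy is used here only for illustration; inside doc2vec the combination happens on the model's internal weight arrays):

```python
import numpy as np

# The toy 3-dimensional word vectors from the question.
word1 = np.array([0.1, 0.2, 0.3])
word2 = np.array([0.4, 0.5, 0.6])

# Averaging keeps the dimensionality: the result is still 3-dimensional.
averaged = (word1 + word2) / 2
# -> [0.25, 0.35, 0.45]

# Concatenation stacks the vectors end to end: the result is 6-dimensional,
# so the model's prediction layer must grow with the context size.
concatenated = np.concatenate([word1, word2])
# -> [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

Note that concatenation preserves word order within the context window (word1's dimensions come first), while averaging does not; that is one reason the paper prefers it.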

Is the paragraph token equal to the paragraph vector which is equal to on?



The "paragraph token" is mapped to a vector called the "paragraph vector". It is not the same thing as the token "on", and it is also different from the word vector that the token "on" maps to.

For this question on how gensim computes doc2vec paragraph vectors, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40413866/
