python - What is the significance of the length of Word2vec vectors?

Tags: python nlp gensim word2vec

I'm using Word2vec through gensim, with Google's pre-trained vectors trained on Google News. I've noticed that the word vectors I can access via direct index lookups on the Word2Vec object are not unit vectors:

>>> import numpy
>>> from gensim.models import Word2Vec
>>> w2v = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
>>> king_vector = w2v['king']
>>> numpy.linalg.norm(king_vector)
2.9022589
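
As a side note (my own check, not part of the original question): in gensim versions of that era, this direct lookup simply returns the corresponding row of the raw .syn0 matrix:

>>> numpy.array_equal(king_vector, w2v.syn0[w2v.vocab['king'].index])
True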

However, the most_similar method does not use these non-unit vectors; instead, it uses normalised versions taken from the undocumented .syn0norm property, which contains only unit vectors:

>>> w2v.init_sims()
>>> unit_king_vector = w2v.syn0norm[w2v.vocab['king'].index]
>>> numpy.linalg.norm(unit_king_vector)
0.99999994
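
A quick sanity check (my own sketch, not from the original question): since cosine similarity of unit vectors reduces to a dot product, the dot product of two .syn0norm rows should match what w2v.similarity reports, up to float32 rounding:

>>> unit_queen_vector = w2v.syn0norm[w2v.vocab['queen'].index]
>>> numpy.allclose(numpy.dot(unit_king_vector, unit_queen_vector),
...                w2v.similarity('king', 'queen'))
True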

The larger vector is just a scaled-up version of the unit vector:

>>> king_vector - numpy.linalg.norm(king_vector) * unit_king_vector
array([  0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
         0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
        -7.45058060e-09,   0.00000000e+00,   3.72529030e-09,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        ... (some lines omitted) ...
        -1.86264515e-09,  -3.72529030e-09,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00], dtype=float32)
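
A more direct way to confirm the same relationship: normalising the raw vector reproduces the stored unit vector up to float32 rounding.

>>> numpy.allclose(king_vector / numpy.linalg.norm(king_vector),
...                unit_king_vector)
True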

Given that word similarity comparisons in Word2Vec are done by cosine similarity, it is not obvious to me what the lengths of the non-normalised vectors mean - though I assume they mean something, since gensim exposes them to me rather than exposing only the unit vectors in .syn0norm.

How are the lengths of these non-normalised Word2vec vectors generated, and what is their meaning? For which calculations does it make sense to use the normalised vectors, and when should I use the non-normalised ones?

Best Answer

I think the answer you're looking for is described in the 2015 paper Measuring Word Significance using Distributed Representations of Words, by Adriaan Schakel and Benjamin Wilson. The key points:

When a word appears in different contexts, its vector gets moved in different directions during updates. The final vector then represents some sort of weighted average over the various contexts. Averaging over vectors that point in different directions typically results in a vector that gets shorter with increasing number of different contexts in which the word appears. For words to be used in many different contexts, they must carry little meaning. Prime examples of such insignificant words are high-frequency stop words, which are indeed represented by short vectors despite their high term frequencies ...
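
You can see this effect directly in the model loaded in the question. A minimal sketch (the word choices are mine and purely illustrative; exact norms depend on the model, and some stop words are absent from the Google News vocabulary, hence the membership check):

>>> for word in ['the', 'of', 'king', 'electron']:
...     if word in w2v.vocab:
...         print(word, numpy.linalg.norm(w2v[word]))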


For given term frequency, the vector length is seen to take values only in a narrow interval. That interval initially shifts upwards with increasing frequency. Around a frequency of about 30, that trend reverses and the interval shifts downwards.

...

Both forces determining the length of a word vector are seen at work here. Small-frequency words tend to be used consistently, so that the more frequently such words appear, the longer their vectors. This tendency is reflected by the upwards trend in Fig. 3 at low frequencies. High-frequency words, on the other hand, tend to be used in many different contexts, the more so, the more frequently they occur. The averaging over an increasing number of different contexts shortens the vectors representing such words. This tendency is clearly reflected by the downwards trend in Fig. 3 at high frequencies, culminating in punctuation marks and stop words with short vectors at the very end.

...

[Figure 3 from the paper, showing the trend described in the excerpts above.]

Figure 3: Word vector length v versus term frequency tf of all words in the hep-th vocabulary. Note the logarithmic scale used on the frequency axis. The dark symbols denote bin means, with the kth bin containing the frequencies in the interval [2^(k−1), 2^k − 1] for k = 1, 2, 3, .... These means are included as a guide to the eye. The horizontal line indicates the length v = 1.37 of the mean vector.


4 Discussion

Most applications of distributed representations of words obtained through word2vec so far centered around semantics. A host of experiments have demonstrated the extent to which the direction of word vectors captures semantics. In this brief report, it was pointed out that not only the direction, but also the length of word vectors carries important information. Specifically, it was shown that word vector length furnishes, in combination with term frequency, a useful measure of word significance.
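
Building on that conclusion, here is a minimal sketch (my own, not from the paper) that ranks a handful of words by raw vector length as a crude significance score; per the excerpts above, content words should tend to come out ahead of stop words:

>>> words = ['king', 'molecule', 'although', 'the']
>>> lengths = {w: float(numpy.linalg.norm(w2v[w])) for w in words if w in w2v.vocab}
>>> sorted(lengths.items(), key=lambda kv: kv[1], reverse=True)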

Regarding "python - What is the significance of the length of Word2vec vectors?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36034454/
