python - Words missing from the vocabulary of a trained word2vec model

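I am currently using Python to train a Word2Vec model on sentences that I provide. I then save and load the model to get the word embedding of each word in the sentences used for training. However, I get the following error.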

KeyError: "word 'n1985_chicago_bears' not in vocabulary"

However, one of the sentences provided during training is the following.

sportsteam n1985_chicago_bears teamplaysincity city chicago

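So I would like to know why some words are missing from the vocabulary even though the model was trained on them in that sentence corpus.

Training the word2vec model on my own corpus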

import nltk
import numpy as np
from termcolor import colored
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA


#PREPARING DATA

fname = '../data/sentences.txt'

with open(fname) as f:
    content = f.readlines()

# remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]


#TOKENIZING SENTENCES

sentences = []

for x in content:
    nltk_tokens = nltk.word_tokenize(x)
    sentences.append(nltk_tokens)

#TRAINING THE WORD2VEC MODEL

model = Word2Vec(sentences)

words = list(model.wv.vocab)
model.wv.save_word2vec_format('model.bin')

Example sentences from sentences.txt

sportsteam hawks teamplaysincity city atlanta
stadiumoreventvenue honda_center stadiumlocatedincity city anaheim
sportsteam ducks teamplaysincity city anaheim
sportsteam n1985_chicago_bears teamplaysincity city chicago
stadiumoreventvenue philips_arena stadiumlocatedincity city atlanta
stadiumoreventvenue united_center stadiumlocatedincity city chicago
...

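The sentences.txt file contains 1860 lines like these, each with exactly 5 words and no stop words.

After saving the model, I tried to load it from a different Python file in the same directory as the saved model.bin, as shown below.

Loading the saved model.bin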

import nltk
import numpy as np
from gensim import models

w = models.KeyedVectors.load_word2vec_format('model.bin', binary=True)
print(w['n1985_chicago_bears'])

However, I end up with the following error.

KeyError: "word 'n1985_chicago_bears' not in vocabulary"

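Is there a way to get the word embedding of every word in the training sentence corpus using this same approach?

Any suggestions in this regard would be much appreciated.

Best answer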

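The gensim Word2Vec implementation defaults to min_count=5, so any token that occurs fewer than 5 times in the corpus is dropped from the vocabulary. It looks like the token you are looking for, n1985_chicago_bears, occurs fewer than 5 times in your corpus. Change min_count accordingly.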
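To see which tokens fall below the threshold before retraining, the frequencies can be counted directly. The following is a minimal sketch that uses only the six example sentences from the question and plain whitespace splitting instead of nltk.word_tokenize (an assumption that holds for these sentences, since their tokens are separated by single spaces). The names MIN_COUNT, lines, counts and rare are illustrative and not part of gensim.

```python
from collections import Counter

# Same value as gensim's default min_count
MIN_COUNT = 5

# The six example sentences shown in the question
lines = [
    "sportsteam hawks teamplaysincity city atlanta",
    "stadiumoreventvenue honda_center stadiumlocatedincity city anaheim",
    "sportsteam ducks teamplaysincity city anaheim",
    "sportsteam n1985_chicago_bears teamplaysincity city chicago",
    "stadiumoreventvenue philips_arena stadiumlocatedincity city atlanta",
    "stadiumoreventvenue united_center stadiumlocatedincity city chicago",
]

# Count how often each whitespace-separated token occurs
counts = Counter(token for line in lines for token in line.split())

# Tokens that a model trained with min_count=MIN_COUNT would discard
rare = sorted(token for token, n in counts.items() if n < MIN_COUNT)

print(counts["n1985_chicago_bears"])
print(rare)
```

On the full corpus, the same count could be run over the lines read from sentences.txt; any token listed in rare will be absent from the vocabulary unless min_count is lowered.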

Method signature:

class gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), max_final_vocab=None)

import nltk
from gensim.models import Word2Vec

content = [
    "sportsteam hawks teamplaysincity city atlanta",
    "stadiumoreventvenue honda_center stadiumlocatedincity city anaheim",
    "sportsteam ducks teamplaysincity city anaheim",
    "sportsteam n1985_chicago_bears teamplaysincity city chicago",
    "stadiumoreventvenue philips_arena stadiumlocatedincity city atlanta",
    "stadiumoreventvenue united_center stadiumlocatedincity city chicago"
]

sentences = []

for x in content:
    nltk_tokens = nltk.word_tokenize(x)
    sentences.append(nltk_tokens)

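# min_count=1 keeps every token, even those that occur only once
model = Word2Vec(sentences, min_count=1)
# Look up vectors through model.wv; indexing the model directly is deprecated
print(model.wv['n1985_chicago_bears'])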

Regarding python - Words missing from the vocabulary of a trained word2vec model, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56033651/
