python - ConceptNet Numberbatch (multilingual) OOV words

Tags: python word-embedding conceptnet

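I am working on a text classification problem (on a French corpus) and I am experimenting with different word embeddings. I was very interested in what ConceptNet has to offer, so I decided to give it a try.

I could not find a tutorial dedicated to my specific task, so I followed the advice from their blog: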

How do I use ConceptNet Numberbatch?

To make it as straightforward as possible:

Work through any tutorial on machine learning for NLP that uses semantic vectors. Get to the part where they tell you to use word2vec. (A particularly enlightened tutorial may tell you to use GloVe 1.2.)

Get the ConceptNet Numberbatch data, and use it instead. Get better results that also generalize to other languages.

You can find my approach below (note that 'numberbatch.txt' is the file containing the recommended multilingual version: ConceptNet Numberbatch 19.08):

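from numpy import asarray

embeddings_index = dict()

# Read as UTF-8: the multilingual file contains non-ASCII terms such as 'être'
with open('numberbatch.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # first field is the term, the rest are the vector
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))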

I started by testing whether a single word exists:

word = 'fille'
missingWords = 0
if word not in embeddings_index:
    missingWords += 1
print(missingWords)

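To my surprise, a simple word like 'fille' (French for "girl") was not found. I then wrote a function to print all the OOV words in my corpus. The results surprised me even more: more than 22k words were not found, including 'nous' (we) and 'être' (to be).

I also tried the approach for OOV words proposed on the GitHub page (same result):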

Out-of-vocabulary strategy

ConceptNet Numberbatch is evaluated with an out-of-vocabulary strategy that helps its performance in the presence of unfamiliar words. The strategy is implemented in the ConceptNet code base. It can be summarized as follows:

Given an unknown word whose language is not English, try looking up the equivalently-spelled word in the English embeddings (because English words tend to end up in text of all languages).

Given an unknown word, remove a letter from the end, and see if that is a prefix of known words. If so, average the embeddings of those known words.

If the prefix is still unknown, continue removing letters from the end until a known prefix is found. Give up when a single character remains.
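For reference, the three quoted steps can be sketched roughly as below. This is a minimal illustration, not the ConceptNet code base's implementation: the function name `oov_vector`, the two plain dictionaries, and the toy vectors are all assumptions made for the example, and a linear scan over the vocabulary would be far too slow for the real file.

```python
import numpy as np

def oov_vector(word, vectors, english_vectors, is_english=False):
    """Return a vector for `word`, or None if every fallback fails.

    `vectors` and `english_vectors` are hypothetical dicts mapping a
    term to a numpy array; they stand in for the real embedding tables.
    """
    if word in vectors:
        return vectors[word]
    # Step 1: a non-English unknown word may exist with the same spelling in English.
    if not is_english and word in english_vectors:
        return english_vectors[word]
    # Steps 2-3: drop letters from the end until the remainder is a prefix
    # of known words, then average those words' vectors.
    prefix = word[:-1]
    while len(prefix) > 1:  # give up when a single character remains
        matches = [vec for known, vec in vectors.items() if known.startswith(prefix)]
        if matches:
            return np.mean(matches, axis=0)
        prefix = prefix[:-1]
    return None

# Toy data, invented purely for illustration.
french = {
    "fillette": np.array([1.0, 0.0]),
    "filles": np.array([0.0, 1.0]),
    "nous": np.array([2.0, 2.0]),
}
english = {"weekend": np.array([5.0, 5.0])}

borrowed = oov_vector("weekend", french, english)   # found via the English table
averaged = oov_vector("fillex", french, english)    # prefix "fille" matches two words
unknown = oov_vector("zq", french, english)         # no usable prefix
```

Note that this fallback only helps if the lookup keys match the keys stored in the embedding table in the first place.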

Is there something wrong with my approach?

Best answer

Did you take the format of ConceptNet Numberbatch into account? As shown on the project's GitHub, it looks like this:

/c/en/absolute_value -0.0847 -0.1316 -0.0800 -0.0708 -0.2514 -0.1687 -...

/c/en/absolute_zero 0.0056 -0.0051 0.0332 -0.1525 -0.0955 -0.0902 0.07...

This format means that fille will not be found, but /c/fr/fille will.
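A minimal sketch of a lookup that builds the prefixed key before querying the dictionary is shown below. The helper names `numberbatch_key` and `lookup` are made up for this example, and the normalization (lowercasing, spaces to underscores) is a simplifying assumption inferred from the sample lines above; ConceptNet's own text normalization does more than this. The toy `embeddings_index` stands in for the dictionary loaded in the question.

```python
import numpy as np

def numberbatch_key(word, lang):
    """Build a term key of the form '/c/<lang>/<term>', e.g. '/c/fr/fille'."""
    # Assumption: lowercase and join multi-word terms with underscores,
    # mirroring sample keys such as '/c/en/absolute_value'.
    return '/c/%s/%s' % (lang, word.lower().replace(' ', '_'))

def lookup(word, embeddings_index, lang='fr'):
    """Return the vector stored under the prefixed key, or None if absent."""
    return embeddings_index.get(numberbatch_key(word, lang))

# Toy stand-in for the loaded embeddings; the values are invented.
embeddings_index = {
    '/c/fr/fille': np.array([0.1, 0.2], dtype='float32'),
    '/c/en/absolute_value': np.array([0.3, 0.4], dtype='float32'),
}

bare_hit = 'fille' in embeddings_index            # the check used in the question
prefixed_hit = lookup('fille', embeddings_index)  # the check with the language prefix
```

With keys built this way, the OOV count from the question should be recomputed before concluding that common French words are missing.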

Regarding python - ConceptNet Numberbatch (multilingual) OOV words, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64717185/
