machine-learning - What is the prototype vector of a phrase in a training set?

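I am trying to implement the method from a paper for disambiguating entities. The process consists of two steps: a training phase and a disambiguation phase. My question is about the training phase: I don't quite understand how to obtain the prototype vectors described in this paragraph: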

In the training phase, we compute, for each word or phrase that is linked at least 10 times to a particular entity, what we called a prototype vector: this is a tf.idf-weighted, normalized list of all terms which occur in one of the neighbourhoods (we consider 10 words to the left and right) of the respective links. Note that one and the same word or phrase can have several such prototype vectors, one for each entity linked from some occurrence of that word or phrase in the collection.

They apply this approach to Wikipedia, using Wikipedia's links as the training set.

Can someone give me an example of a prototype vector? I am a beginner in this field.

Best answer

Here is an outline of what the prototype vector is:

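The first thing to note is that a word in Wikipedia can be a hyperlink to a Wikipedia page (which we'll call an entity). This entity is related to the word in some way, but the same word can link to different entities.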

"for each word or phrase that is linked at least 10 times to a particular entity"

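Across Wikipedia, we count how many times word_A links to entity_B. If the count is at least 10, we continue, recording the pages (themselves entities) in which those links occur: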

[(wordA, entityA1), (wordA, entityA2),...]

Here wordA occurs in entityA1, where it links to entityB, and so on.
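The counting step can be sketched as follows. This is only an illustration: the `links` list is hypothetical toy data, and the function name `frequent_pairs` is an assumption, not something taken from the paper.

```python
from collections import Counter

# Hypothetical toy data: one tuple per link occurrence,
# (anchor word, target entity, page where the link occurs).
links = [("cat", "Cat", f"page{i}") for i in range(12)]
links += [("cat", "Catwoman", f"page{i}") for i in range(100, 103)]

MIN_LINKS = 10  # "linked at least 10 times" in the quoted paragraph


def frequent_pairs(links, min_links=MIN_LINKS):
    """Return the (word, entity) pairs linked at least min_links times."""
    counts = Counter((word, entity) for word, entity, _page in links)
    return {pair for pair, n in counts.items() if n >= min_links}


pairs = frequent_pairs(links)
```

With this toy data only ("cat", "Cat") passes the threshold, because "cat" links to "Catwoman" just 3 times.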

"list of all terms which occur in one of the neighbourhoods of the respective links"

In entityA1, wordA has 10 words to its left and 10 to its right (we show only 4 on each side):

are developed and the entity relationships between these data
                      wordA
                      link # (to entityB)

['are', 'developed', 'and', 'the', 'relationships', 'between', 'these', 'data']

Each pair (wordA, entityAi) gives us one such list; concatenate them all.
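A minimal sketch of extracting and concatenating the neighbourhoods, assuming the text has already been split into tokens and the position of the linked word is known. The helper names `neighbourhood` and `collect_terms` are made up for this example.

```python
def neighbourhood(tokens, index, width=10):
    """Return up to `width` tokens on each side of tokens[index], excluding it."""
    left = tokens[max(0, index - width):index]
    right = tokens[index + 1:index + 1 + width]
    return left + right


def collect_terms(occurrences, width=10):
    """Concatenate the neighbourhoods of all (tokens, index) link occurrences."""
    terms = []
    for tokens, index in occurrences:
        terms.extend(neighbourhood(tokens, index, width))
    return terms


tokens = "are developed and the entity relationships between these data".split()
# The linked word is 'entity'; width=4 mirrors the shortened example above.
window = neighbourhood(tokens, tokens.index("entity"), width=4)
```

Here `window` holds the same eight words as the list shown above; the paper's setting corresponds to the default `width=10`.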

"tf.idf-weighted, normalized list"

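Basically, tf.idf means you should give common words less "weight" than uncommon ones. For example, 'and' and 'the' are very common words, so we attach less significance to them (for appearing next to 'entity') than to 'relationships' or 'between'.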

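Normalized means, essentially, that we count how many times each word occurs across the neighbourhoods (the more often it occurs, the more strongly we associate it with wordA) and multiply that count by the weight above to get a score for sorting the list, putting words that occur often here but are rare in general at the top. In the usual vector-space sense, normalizing also means scaling the resulting scores so that the vector has unit length, which keeps prototype vectors built from different numbers of links comparable.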
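The weighting can be sketched like this. Several things are assumptions for illustration: the idf variant log(N / df) is just one common choice (the quoted paragraph does not say which one the paper uses), the document frequencies in `doc_freq` are invented, and normalization is taken to mean scaling to unit Euclidean length.

```python
import math
from collections import Counter


def prototype_vector(terms, doc_freq, n_docs):
    """Weight each term by tf * idf, then scale the vector to unit length.

    Assumes every term in `terms` has an entry in `doc_freq`.
    """
    tf = Counter(terms)
    weights = {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # all-zero vector: nothing to scale
    return {term: w / norm for term, w in weights.items()}


# Hypothetical statistics: 'the' occurs in every document, 'relationships' in few.
doc_freq = {"the": 1000, "relationships": 10}
vec = prototype_vector(["the", "the", "relationships"], doc_freq, n_docs=1000)
```

Although 'the' occurs twice, its idf is log(1000 / 1000) = 0, so all of the weight ends up on 'relationships'.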

"Note that one and the same word or phrase can have several such prototype vectors"

The vector depends not only on wordA but also on entityB; you can think of it as a mapping:

(wordA, entityB) -> tf.idf-weighted, normalized list (as described above)
(wordA, entityB2) -> a different tf.idf-weighted, normalized list

This captures the idea that a link from the word 'cat' to cats is less likely to have 'batman' as a neighbour than a link to cat woman.
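To make the mapping concrete, here is a small sketch with two hand-made prototype vectors for the word 'cat'. The numbers are invented, and comparing a context against each prototype by cosine similarity is an assumption about how such vectors might be used; the quoted paragraph covers only the training phase.

```python
import math

# Hypothetical prototype vectors, keyed by (word, entity).
prototypes = {
    ("cat", "Cat"): {"pet": 0.8, "fur": 0.6},
    ("cat", "Catwoman"): {"batman": 0.8, "gotham": 0.6},
}


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_entity(word, context, prototypes):
    """Pick the entity whose prototype for `word` is closest to `context`."""
    candidates = {e: vec for (w, e), vec in prototypes.items() if w == word}
    return max(candidates, key=lambda e: cosine(context, candidates[e]))


# Hypothetical context: terms seen around an unlinked occurrence of 'cat'.
context = {"batman": 1.0, "movie": 1.0}
choice = best_entity("cat", context, prototypes)
```

Because 'batman' appears only in the prototype for ("cat", "Catwoman"), that entity is chosen for this context.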

Regarding "machine-learning - What is the prototype vector of a phrase in a training set?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12621341/
