python - Creating a spaCy knowledge base for similar nouns

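The entity linking examples in spaCy's documentation are all based on named entities. Is it possible to create a knowledge base that links certain nouns with certain other nouns?

For example, linking "aeroplane" with "plane", or with a mistyped form of "aeroplane"? That way I could predefine the possible alternative terms that can be used for "aeroplane". Are there any concrete examples?

I tried this with the knowledge base: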

vocab = nlp.vocab
kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
kb.add_entity(entity="Aeroplane", freq=32, entity_vector=vector1)

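as described here: https://spacy.io/api/kb

But I don't know what to use as the entity_vector, which is supposed to be a pre-trained vector of the entity.

Another example I saw in the documentation looks like this: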

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load('en_core_web_sm')
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

# adding entities
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3])
kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2])

# adding aliases
kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

Can't we use anything other than wiki ids? And how do I get these vector lengths?

Best answer

Let me try to address your questions:

The entity linking examples in spaCy's documentation are all based on named entities. Is it possible to create a knowledge base such that it links certain nouns with certain nouns?

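With some tweaking, you can use the EL (entity linking) algorithm to link things that are not named entities. In theory, the underlying machine learning model really looks at sentence similarity and makes little use of whether the word or phrase is a named entity.

spaCy's internals do currently assume that you are running the EL algorithm on NER results. This means it will only try to link Span objects stored in doc.ents. As a workaround, you can make sure that the words you are trying to link are registered as named entities in doc.ents. You can either train a custom NER algorithm to recognize your specific terms, or run a rule-based matching strategy and set doc.ents from its results.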
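
The second workaround, setting doc.ents yourself, can be sketched as follows. This is a minimal sketch under a few assumptions: a blank English pipeline (tokenizer only), a hypothetical TERM label, and a hypothetical term list. It uses a plain token lookup rather than spaCy's matcher classes, so it only handles single-token terms.

```python
import spacy
from spacy.tokens import Span

# Hypothetical single-token terms that should be treated as linkable mentions.
TERMS = {"airplane", "aeroplane", "plane"}


def mark_terms_as_ents(doc, label="TERM"):
    """Register every token found in TERMS as an entity span on the doc."""
    spans = [
        Span(doc, tok.i, tok.i + 1, label=label)
        for tok in doc
        if tok.lower_ in TERMS
    ]
    doc.ents = spans  # overwrites any entities already set on the doc
    return doc


nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = mark_terms_as_ents(nlp("The aeroplane landed next to another plane."))
print([(ent.text, ent.label_) for ent in doc.ents])
```

In a real pipeline this function would have to run before the entity linker, so that the linker sees the spans in doc.ents.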

Can't we use anything else than wiki ids?

Sure - you can use anything you like, as long as the IDs are unique strings. Let's say you use the unique string "AIRPLANE" to represent the concept of an airplane.

but I don't know what to use as the entity_vector, which is supposed to be a pre-trained vector of the entity.

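The entity vector is an embedded representation of the concept. It is compared with the embedding of the sentence in which the alias occurs, to determine whether the two match semantically.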
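
To make that comparison concrete, here is a toy sketch in plain Python. It is not spaCy's implementation, which relies on a trained model; the three-dimensional vectors and the names ENTITY_VECTORS and cosine are made up purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up 3-d entity vectors (real ones would come from a pretrained model).
ENTITY_VECTORS = {
    "AIRPLANE": [0.9, 0.1, 0.0],
    "PLANE": [0.1, 0.9, 0.2],
}

# Made-up embedding of a sentence about geometry that contains the alias "plane".
sentence_vector = [0.2, 0.8, 0.1]

scores = {ent: cosine(vec, sentence_vector) for ent, vec in ENTITY_VECTORS.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

With these invented numbers the geometry sentence lies much closer to the PLANE vector than to the AIRPLANE vector, which is the kind of signal the linker relies on.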

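There is some more documentation here: https://spacy.io/usage/training#kb

First, make sure you have a model with pretrained vectors, typically the _md and _lg models.

Then you need some kind of description of the entities in your database. For Wikidata, we used the entity descriptions, such as "powered fixed-wing aircraft" from https://www.wikidata.org/wiki/Q197. You could also take the first sentence of the Wikipedia article, or anything else you like, as long as it provides some context about your concept.

Let me try to clarify all of the above with some example code: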

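import spacy
from spacy.kb import KnowledgeBase

model = "en_core_web_lg"  # any model with pretrained vectors, e.g. an _md or _lg package
nlp = spacy.load(model)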
vectors_dim = nlp.vocab.vectors.shape[1]
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=vectors_dim)

airplane_description = "An airplane or aeroplane (informally plane) is a powered, fixed-wing aircraft that is propelled forward by thrust from a jet engine, propeller or rocket engine."
airplane_vector = nlp(airplane_description).vector

plane_description = "In mathematics, a plane is a flat, two-dimensional surface that extends infinitely far."
plane_vector = nlp(plane_description).vector

# TODO: Deduce meaningful "freq" values from a corpus: see how often the concept "PLANE" occurs and how often the concept "AIRPLANE" occurs
kb.add_entity(entity="AIRPLANE", freq=666, entity_vector=airplane_vector)
kb.add_entity(entity="PLANE", freq=333, entity_vector=plane_vector)

# TODO: Deduce the prior probabilities from a corpus. Here we assume that the word "plane" most often refers to AIRPLANE (70% of the cases), and infrequently to PLANE (20% of cases)
kb.add_alias(alias="airplane", entities=["AIRPLANE"], probabilities=[0.99])
kb.add_alias(alias="aeroplane", entities=["AIRPLANE"], probabilities=[0.97])
kb.add_alias(alias="plane", entities=["AIRPLANE", "PLANE"], probabilities=[0.7, 0.2])
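
The two TODOs above could be addressed roughly like this. This is a sketch in plain Python, assuming you have a hand-annotated list of (alias, entity) mention pairs; the sample data and the helper name corpus_statistics are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical annotated mentions: (surface form, entity ID it refers to).
mentions = [
    ("plane", "AIRPLANE"), ("plane", "AIRPLANE"), ("plane", "AIRPLANE"),
    ("plane", "PLANE"),
    ("aeroplane", "AIRPLANE"),
    ("airplane", "AIRPLANE"), ("airplane", "AIRPLANE"),
]


def corpus_statistics(pairs):
    """Return entity frequencies and per-alias prior probabilities."""
    entity_freq = Counter(entity for _, entity in pairs)
    per_alias = defaultdict(Counter)
    for alias, entity in pairs:
        per_alias[alias][entity] += 1
    priors = {
        alias: {ent: n / sum(counts.values()) for ent, n in counts.items()}
        for alias, counts in per_alias.items()
    }
    return entity_freq, priors


entity_freq, priors = corpus_statistics(mentions)
print(dict(entity_freq))
print(priors["plane"])
```

The counts could then be passed as freq to kb.add_entity and the per-alias values as probabilities to kb.add_alias. The probabilities for a single alias should not sum to more than 1.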

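So, in theory, if the word "plane" appears in a mathematical context, the algorithm should learn that it matches the (embedded) description of the PLANE concept better than that of the AIRPLANE concept.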
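
To see how context can outweigh the prior, here is a toy calculation in plain Python. The similarity values are invented, and the combination rule (prior + sim - prior * sim) is just one simple choice assumed for illustration; it is not necessarily what spaCy computes internally.

```python
# Prior probabilities for the alias "plane", as set in the knowledge base above.
priors = {"AIRPLANE": 0.7, "PLANE": 0.2}

# Invented context similarities for a sentence about geometry.
similarities = {"AIRPLANE": 0.1, "PLANE": 0.9}


def combine(prior, sim):
    """Probabilistic-OR style combination of prior and context similarity."""
    return prior + sim - prior * sim


scores = {ent: combine(priors[ent], similarities[ent]) for ent in priors}
prediction = max(scores, key=scores.get)
print(prediction, scores)
```

Even though AIRPLANE has the higher prior for "plane", the strong context match pushes PLANE ahead in this made-up example.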

Hope that helps - happy to discuss further in the comments!

Regarding python - Creating a spaCy knowledge base for similar nouns, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64767231/
