python - NgramCollocationFinder in NLTK

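I have a list of n-gram terms, and I want to rank them using the association tests in the NLTK toolkit. But nltk.collocations only provides BigramCollocationFinder, TrigramCollocationFinder, and QuadgramCollocationFinder. What should I do if my term list contains 5-grams or 6-grams?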

Best answer

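To implement an NGramCollocationFinder, you need to get rid of the many i&x variables (such as n_iix and n_ixi in the existing finders). To do that, notice that the pattern behind them is simply all combinations of an n-item list. The next step is to replace the variables with a dictionary that uses each combination as its key.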

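Finally, you need to build some logic that updates each combination from the given w# variables (w1, w2, ...) whenever that word's index is present in the combination. This can be done, but I would suggest that beginners do it for n=3 or n=4 first, where the logic can be verified against the existing classes. Once those are correct, you can use it for larger n.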
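As a rough illustration of the dictionary idea, here is a minimal sketch in plain Python. It does not use NLTK: the function name `count_patterns` and the `(positions, tokens)` key layout are assumptions made for this example, and the counting is simplified (it only looks at full n-gram windows), so it may not reproduce the frequency distributions of the existing finders exactly.

```python
from collections import Counter
from itertools import combinations


def count_patterns(words, n):
    """Count every sub-pattern of every n-gram window in `words`.

    Hypothetical helper, not part of NLTK. Keys are (positions, tokens):
    `positions` is a tuple of 0-based indexes into the n-gram and
    `tokens` holds the words found at those positions.
    """
    counts = Counter()
    for start in range(len(words) - n + 1):
        ngram = tuple(words[start:start + n])
        # One dictionary entry per combination replaces one i&x variable.
        for r in range(1, n + 1):
            for positions in combinations(range(n), r):
                tokens = tuple(ngram[i] for i in positions)
                counts[positions, tokens] += 1
    return counts


counts = count_patterns("a b c a b d".split(), 3)
print(counts[(0, 1), ("a", "b")])  # trigrams starting with "a b"
```

Here `counts[(0, 1), ("a", "b")]` plays the role that an iix-style variable plays in a trigram finder, and the same loop works unchanged for n=5 or n=6.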

Finding combinations

The recipes section of the itertools documentation contains a powerset() generator that you can use to produce the combinations¹.

from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

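Here, for n=3, the (1,2) tuple corresponds to the iix variable and the (1,3) tuple to the ixi variable. In the same way, every i&x variable can be replaced by a tuple, based on the tuple's length and on which indexes it contains.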
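To make the correspondence concrete, the following small sketch renders each index tuple as its i/x string. The helper name `pattern_name` is made up for this example, and positions are 1-based to match the tuples shown above.

```python
from itertools import combinations


def pattern_name(positions, n):
    """Render 1-based positions as an i/x string, e.g. (1, 3) with n=3 gives 'ixi'."""
    return "".join("i" if k in positions else "x" for k in range(1, n + 1))


n = 3
names = {
    combo: pattern_name(combo, n)
    for r in range(1, n + 1)
    for combo in combinations(range(1, n + 1), r)
}

for combo, name in names.items():
    print(combo, name)
```

Changing `n` to 5 or 6 lists every pattern a larger finder would need, without writing any variable names by hand.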

Doing tuple logic

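The other tool you need is the ability to add to tuples. This is required to extend or replace the arguments in score_ngram(). Here is a very simple example of how to add to a tuple: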

a = (1, 2)
b = a + (3, )    # Notice the trailing comma to make it one element tuple
# b is now (1, 2, 3)
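The same tuple concatenation can be used to assemble grouped arguments for a scoring function. The sketch below is only an assumption about how that might look: `build_score_args` and `name` are hypothetical helpers, and the grouping (full n-gram count first, then one tuple per smaller combination size, then the total) is modeled on the n=3 case. The ordering inside each group for larger n would need to be checked against the existing classes.

```python
from itertools import combinations


def build_score_args(lookup, n, n_all):
    """Build (full, (size n-1 ...), ..., (size 1 ...), n_all) by tuple addition.

    `lookup` is any callable mapping a tuple of 0-based positions to a value,
    for example a count taken from a dictionary keyed by combinations.
    """
    args = ()
    for r in range(n, 0, -1):
        group = ()
        for positions in combinations(range(n), r):
            group = group + (lookup(positions),)  # trailing comma: one-element tuple
        if r == n:
            args = args + group      # the single full n-gram value, not nested
        else:
            args = args + (group,)   # one nested tuple per combination size
    return args + (n_all,)


def name(positions, n=3):
    """Stand-in lookup that returns the i/x variable name instead of a count."""
    return "n_" + "".join("i" if k in positions else "x" for k in range(n))


args = build_score_args(name, 3, "n_all")
print(args)
```

Using names instead of counts as the lookup makes it easy to compare the resulting layout with the variables in an existing n=3 implementation before plugging in real frequencies.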

As they say, the rest is left for you to implement. For some help on which parts need to be analyzed, see my answer to a related question: "Transform QuadgramCollationFinder into PentagramCollationFinder".

¹ Thanks to Cyphase for describing this in this answer.

Source: the original question on Stack Overflow, "python - NgramCollocationFinder in NLTK": https://stackoverflow.com/questions/33021298/
