python - How to use NLTK BigramAssocMeasures.chi_sq

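I have a list of words, and I want to measure how related two words are by looking at their co-occurrence. I found in a paper that this can be done with Pearson's chi-square test. I also found nltk.BigramAssocMeasures.chi_sq(), which computes a chi-square value.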

Can I use it for this purpose? How do I get the chi-square value with nltk?

Best answer

Take a look at this blog post from Streamhacker, which explains it well with code examples:

One of the best metrics for information gain is chi square. NLTK includes this in the BigramAssocMeasures class in the metrics package. To use it, first we need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score words with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our feature selection function to select only those words that appear in the set. Now each file is classified based on the presence of these high information words.
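To make the counts in that description concrete, here is a minimal sketch. It is not the blog's code. The helper `chi_sq` below is a plain-Python stand-in that takes its arguments in the same order as NLTK's `BigramAssocMeasures.chi_sq(n_ii, (n_ix, n_xi), n_xx)`, and `collections.Counter` stands in for `FreqDist` / `ConditionalFreqDist`. The toy documents, the cutoff of 2 words instead of 10000, and the names `labelled_docs`, `best_word_feats` and `pair_chi_sq` are assumptions made up for illustration.

```python
from collections import Counter


def chi_sq(n_ii, n_ix_xi, n_xx):
    """Pearson chi-square for a 2x2 table given as marginals.

    Stand-in with the same argument order as NLTK's
    BigramAssocMeasures.chi_sq(n_ii, (n_ix, n_xi), n_xx).
    """
    n_ix, n_xi = n_ix_xi
    n_io = n_ix - n_ii                 # first event without the second
    n_oi = n_xi - n_ii                 # second event without the first
    n_oo = n_xx - n_ii - n_io - n_oi   # neither event
    numerator = (n_ii * n_oo - n_io * n_oi) ** 2
    denominator = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return n_xx * numerator / denominator


# --- Word vs. class label (the feature selection described above) ---
# Hypothetical toy corpus: (label, tokens)
labelled_docs = [
    ("pos", "good great fun good".split()),
    ("pos", "great acting good plot".split()),
    ("neg", "bad boring plot".split()),
    ("neg", "bad acting bad fun".split()),
]

word_fd = Counter()      # overall word frequency
label_word_fd = {}       # word frequency within each class
for label, words in labelled_docs:
    word_fd.update(words)
    label_word_fd.setdefault(label, Counter()).update(words)

total = sum(word_fd.values())
label_totals = {label: sum(fd.values()) for label, fd in label_word_fd.items()}

# Score each word by summing its chi-square against every label.
word_scores = {
    word: sum(
        chi_sq(label_word_fd[label][word], (freq, label_totals[label]), total)
        for label in label_word_fd
    )
    for word, freq in word_fd.items()
}

# Keep the top-scoring words (2 here, 10000 in the quoted description).
best_words = set(sorted(word_scores, key=word_scores.get, reverse=True)[:2])


def best_word_feats(words):
    """Feature dict containing only the high-information words."""
    return {word: True for word in words if word in best_words}


# --- Word vs. word (co-occurrence as adjacent pairs, as in the question) ---
def pair_chi_sq(tokens, w1, w2):
    """Chi-square for w1 immediately followed by w2 in a token list."""
    bigrams = list(zip(tokens, tokens[1:]))
    n_ii = sum(1 for a, b in bigrams if a == w1 and b == w2)  # (w1, w2)
    n_ix = sum(1 for a, _ in bigrams if a == w1)              # (w1, *)
    n_xi = sum(1 for _, b in bigrams if b == w2)              # (*, w2)
    return chi_sq(n_ii, (n_ix, n_xi), len(bigrams))


tokens = "new york is big new york is old".split()
pair_score = pair_chi_sq(tokens, "new", "york")

print(sorted(best_words))
print(best_word_feats("good plot bad".split()))
print(pair_score)
```

With NLTK installed, the call `BigramAssocMeasures.chi_sq(n_ii, (n_ix, n_xi), n_xx)` from `nltk.metrics` should be usable in place of the stand-in helper. For the question's case, `n_ii` is the number of times the two words occur together, `n_ix` and `n_xi` are the counts for each word on its own, and `n_xx` is the total number of pairs. "Co-occurrence" here is assumed to mean adjacency; a wider window would need different counting.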

Regarding python - How to use NLTK BigramAssocMeasures.chi_sq, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15401497/
