python - How do I find the most frequent words across multiple separate texts?

A really simple question, but I can't seem to crack it. I have data formatted in the following way:

["category1",("data","data","data")]
["category2", ("data","data","data")]

I call the entries for the different categories "posts", and I want to get the most frequent words from the data part. So I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
   tokens = wordpunct_tokenize(text2)
   for token in tokens:
       if token in freq_dict:
           freq_dict[token] += 1
       else:
           freq_dict[token] = 1
   top = sorted(freq_dict, key=freq_dict.get, reverse=True)
   top = top[:50]
   print(top)

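However, this gives me the top words for each post separately.

What I need is one overall list of top words.
But if I move the print out of the for loop, it only gives me the results for the last post.
Does anyone have an idea?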

Best answer

Why not just use Counter?

In [30]: from collections import Counter

In [31]: data=["category1",("data","data","data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
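
To get a single overall list rather than one per post, the same idea can be extended by feeding every post into one shared Counter and ranking only after the loop has finished. The following is a minimal sketch, not the answerer's code: the sample posts, the tokenize helper, and the top_words function are hypothetical names made up for illustration, and the regex tokenizer is only a stand-in for nltk.tokenize.wordpunct_tokenize from the question.

```python
import re
from collections import Counter

# Hypothetical sample data in the shape described in the question:
# each post is [category, (text, text, ...)].
posts = [
    ["category1", ("the cat sat", "the dog ran", "a cat slept")],
    ["category2", ("the bird sang", "a dog barked", "the cat purred")],
]


def tokenize(text):
    # Simple stand-in tokenizer (an assumption, not part of the original);
    # swap in nltk.tokenize.wordpunct_tokenize if NLTK is installed.
    return re.findall(r"\w+", text.lower())


def top_words(posts, n=50):
    # One Counter shared by all posts, so counts accumulate globally.
    counts = Counter()
    for _category, texts in posts:
        for text in texts:
            counts.update(tokenize(text))
    # Rank once, after every post has been counted.
    return counts.most_common(n)


print(top_words(posts, 2))  # [('the', 4), ('cat', 3)]
```

Counter.update adds to the existing counts instead of replacing them, which is what makes the result global. Calling most_common(n) after the loop returns the n highest-count (word, count) pairs across all categories.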

Regarding "python - How do I find the most frequent words across multiple separate texts?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16375404/
