python - How to find the most common words using spaCy?

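I'm using spaCy with Python, and it works well for tagging each word, but I was wondering whether it is possible to find the most common words in a string. Is it also possible to get the most common nouns, verbs, adverbs, and so on?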

There is a count_by function included, but I can't seem to get it to run in any meaningful way.

Best answer

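I recently had to count the frequency of all the tokens in a text file. You can filter the tokens by their pos_ attribute to keep only the parts of speech you want. Here is a simple example: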

import spacy
from collections import Counter

# The 'en' shortcut no longer works in spaCy 3+; load a model package by name
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
doc = nlp('Your text here')

# all tokens that aren't stop words or punctuation
words = [token.text
         for token in doc
         if not token.is_stop and not token.is_punct]

# noun tokens that aren't stop words or punctuation
nouns = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "NOUN")]

# five most common tokens, as a list of (text, count) pairs
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens, as a list of (text, count) pairs
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)
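
Since the question also asks about verbs, adverbs and so on, the same idea can be generalized to collect counts for every part of speech in a single pass. The following is only a sketch: top_words_by_pos and demo are hypothetical helper names, not part of spaCy, and lowercasing the text before counting is an assumption that may not suit every use case. The function relies only on the token attributes already used above (text, pos_, is_stop, is_punct).

```python
from collections import Counter, defaultdict


def top_words_by_pos(tokens, n=5, lowercase=True):
    """Return {pos_tag: [(word, count), ...]} with the n most common words per tag.

    `tokens` is any iterable of objects exposing .text, .pos_, .is_stop and
    .is_punct, which is what iterating over a spaCy Doc yields.
    """
    counts = defaultdict(Counter)
    for token in tokens:
        # Skip stop words, punctuation and whitespace-only tokens
        if token.is_stop or token.is_punct or not token.text.strip():
            continue
        # Assumption: "Cat" and "cat" should count as the same word
        word = token.text.lower() if lowercase else token.text
        counts[token.pos_][word] += 1
    return {pos: freq.most_common(n) for pos, freq in counts.items()}


def demo(text="Your text here"):
    # Hypothetical usage helper; assumes spaCy and the en_core_web_sm
    # model package are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    result = top_words_by_pos(nlp(text))
    for pos in ("NOUN", "VERB", "ADV"):
        print(pos, result.get(pos, []))
```

Calling demo() with your own text prints the most common nouns, verbs and adverbs. Tags that never occur in the text are simply missing from the returned dictionary, which is why the loop uses result.get.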

This question and answer, "python - How to find the most common words using spaCy?", come from a similar question on Stack Overflow: https://stackoverflow.com/questions/37253326/
