python-3.x - Why are important, high-ranking (high-frequency) non-stop words missing from my word cloud?

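I uploaded a list of high-frequency words that includes stop words, and used a stop word list to remove the stop words. When I print the word cloud, several high-ranking non-stop words are not shown.

For example, "data" and "learning" occur more often than "distribution", "algorithm", and so on. Other words such as "training" and "regression" are also missing despite their high frequency. (No, none of these words are in the stop word list.) How can I make words such as "data", "learning", and "training" appear according to their frequency? (I have attached a screenshot of the words/frequencies.)

The code is as follows: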

from wordcloud import WordCloud, STOPWORDS
text = open("test.txt", mode="r", encoding="utf-8").read()

wc = WordCloud(background_color="white", stopwords=STOPWORDS, height=400, width=600)
wc.generate(text)
wc.to_file("my_first_word_cloud.png")

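Best answer

I created a solution that uses WordCloud's generate_from_frequencies() function, because I believe it gives you more control over how the words/frequencies are preprocessed before they are fed into the WordCloud object, and it can help with debugging your specific case.

Using the word/frequency values shown in the image you provided, I created a test.txt file whose first line is a header (i.e. Word,Frequency) and whose remaining lines contain comma-separated word/frequency pairs (see below). I assume you have a similar file.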

Word,Frequency
the,38732
and,11580
for,7682
...
than,729
follows,723
parameter,718
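
One possible reason the original wc.generate(text) call ignores these counts, assuming test.txt really is a word/frequency list like the one above: generate() counts how often each token occurs in the raw text, so a word that is listed once gets a count of one, and the numbers in the second column are not read as weights. The sketch below imitates that counting step with the standard library only. The regular expression and the count_tokens helper are simplified, hypothetical stand-ins, not WordCloud's actual tokenizer.

```python
import re
from collections import Counter

# A few rows in the same shape as test.txt (values taken from the sample above).
sample = "Word,Frequency\nthe,38732\nand,11580\nfor,7682\nthan,729\n"

def count_tokens(text):
    """Count alphabetic tokens the way a raw-text word counter would.

    Simplified stand-in for illustration; WordCloud's own tokenizer differs.
    """
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return Counter(tokens)

counts = count_tokens(sample)
# Every listed word occurs exactly once in the text, so the real
# frequencies (38732, 11580, ...) never reach the counter.
print(counts["the"], counts["than"])  # prints: 1 1
```

With all counts equal, a raw-text counter has no basis for ranking "the" above "than", which is why the frequencies are parsed explicitly in the solution.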

Solution

from wordcloud import WordCloud, STOPWORDS

# Generally it is best practice to use the built-in Python 
# context manager to handle files and let it manage closing/clean-up
with open("test.txt", mode="r", encoding="utf-8") as file:
    text = file.read()

# Convert the file contents to a dict with a dict comprehension,
# filtering out entries that are stop words (STOPWORDS is lowercase).
# Blank lines, such as a trailing newline at the end of the file, are skipped.
# Note: the "[1:]" skips over the header "Word,Frequency" in the
# test.txt file; remove it if there is no header in your file
textDict = {key: int(val) for key, val in
    (line.split(',') for line in text.splitlines()[1:] if line.strip())
    if key.lower() not in STOPWORDS}

# Optional: normalise the frequency values. generate_from_frequencies()
# rescales by the largest frequency itself, so this step is not required.
import numpy as np
# Use floats so that squaring large counts cannot overflow an integer dtype
valsArr = np.array(list(textDict.values()), dtype=float)
# Calculate the L1 (Manhattan) and L2 (Euclidean) norms
norm1 = np.abs(valsArr).sum(axis=0)
norm2 = np.sqrt((valsArr**2).sum(axis=0))
# Note: the L2 norm squares components, so outliers can skew the result.
# If outliers are present, divide by norm1 instead of norm2 below.
# Normalise frequency values
textDictNorm = {key: val / norm2 for key, val in textDict.items()}

# For debugging: uncomment to display dicts containing word:frequency pairs
# print(textDict)
# print(textDictNorm)

# Create word cloud object
wc = WordCloud(background_color="white", height=400, width=600)
# Generate word cloud from normalised word:frequency dict
wc.generate_from_frequencies(textDictNorm)
# Export word cloud to file
wc.to_file("my_first_word_cloud.png")
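
If a word you expect is still missing, it can help to check the parsed frequencies before drawing anything. The following is a minimal sketch using only the standard library. load_frequencies and report_expected are hypothetical helper names, and the stop word set passed in is a small example rather than the WordCloud STOPWORDS set. Keep in mind that WordCloud draws at most max_words words (200 by default), so a word's rank matters as well as its presence.

```python
import csv
import io

def load_frequencies(text, stopwords, has_header=True):
    """Parse 'word,frequency' rows into a dict, skipping stop words and blank rows."""
    rows = csv.reader(io.StringIO(text))
    if has_header:
        next(rows, None)  # skip the "Word,Frequency" header
    freqs = {}
    for row in rows:
        if len(row) != 2:
            continue  # blank or malformed line
        word, count = row[0].strip(), row[1].strip()
        if word and word.lower() not in stopwords:
            freqs[word] = int(count)
    return freqs

def report_expected(freqs, expected):
    """Return {word: rank} for each expected word, or None if it was filtered out."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    return {w: (ranked.index(w) + 1 if w in freqs else None) for w in expected}

# Rows taken from the sample test.txt above, plus a trailing blank line.
sample = (
    "Word,Frequency\nthe,38732\nand,11580\nfor,7682\n"
    "than,729\nfollows,723\nparameter,718\n\n"
)
freqs = load_frequencies(sample, stopwords={"the", "and", "for", "than"})
print(report_expected(freqs, ["follows", "parameter", "the"]))
```

A word reported as None was removed by the stop word filter or never parsed. A word with a rank beyond max_words is present but will not be drawn unless max_words is raised. The resulting freqs dict can be passed to wc.generate_from_frequencies() in the same way as textDictNorm.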

Output (the word cloud image is not reproduced here)

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/74133183/
