I uploaded a high-frequency word list that includes stop words, and I use a stop-word list to remove them. When I render the word cloud, several highly ranked non-stop words do not appear.
For example, "data" and "learning" occur more frequently than "distribution", "algorithm", etc. Despite their high frequencies, other words such as "training" and "regression" are missing as well. (No, none of these words appear in the stop-word list.) How can I make words like "data", "learning", and "training" appear according to their frequencies? (I have attached a screenshot of the words/frequencies.)
The code is as follows:
from wordcloud import WordCloud, STOPWORDS
text = open("test.txt", mode="r", encoding="utf-8").read()
wc = WordCloud(background_color="white", stopwords=STOPWORDS, height=400, width=600)
wc.generate(text)
wc.to_file("my_first_word_cloud.png")
Best Answer
I created a solution that uses WordCloud's generate_from_frequencies() function, since I believe this gives you finer control over how the words/frequencies fed into the WordCloud object are preprocessed, and it can help with debugging your specific case.
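One plausible cause worth ruling out first: generate() tokenizes its input as raw text, so if test.txt is actually a comma-separated word/frequency list rather than prose, each word appears once and the numbers in the second column are never read as frequencies. The sketch below illustrates the effect with stdlib stand-ins (a simple letters-only regex approximating a word-cloud tokenizer — not WordCloud's exact internal regex):

```python
import re
from collections import Counter

# Hypothetical file contents: a frequency list, not running prose
text = "Word,Frequency\nthe,38732\ndata,11580\nlearning,7682\n"

# A simple word regex as a rough stand-in for a word-cloud tokenizer
tokens = re.findall(r"[A-Za-z]+", text)
counts = Counter(tokens)

# Each word occurs only once in the file, so every word ends up with
# a count of 1 -- the digits in the second column are ignored entirely.
print(counts)
```

If your file looks like this, every word has equal weight 1, which would explain why frequency-based sizing seems broken; generate_from_frequencies() below sidesteps the tokenizer entirely.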
Using the word/frequency values shown in the image you provided, I created a test.txt file whose first line is a header (i.e. Word,Frequency) and whose subsequent lines contain comma-separated word/frequency pairs (see below). I assume you have a similar file.
Word,Frequency
the,38732
and,11580
for,7682
...
than,729
follows,723
parameter,718
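As an aside, the same header-plus-pairs layout can also be parsed with Python's built-in csv module, which consumes the header for you and skips blank lines. A minimal sketch, using an in-memory string in place of open("test.txt") and a made-up three-word stop list instead of WordCloud's STOPWORDS:

```python
import csv
import io

# Stand-in for the contents of test.txt (same layout as above)
data = "Word,Frequency\nthe,38732\ndata,11580\nlearning,7682\n"

# Hypothetical small stop-word set; use wordcloud's STOPWORDS in practice
stopwords = {"the", "and", "for"}

with io.StringIO(data) as file:
    reader = csv.DictReader(file)  # reads the header row itself
    freqs = {row["Word"]: int(row["Frequency"])
             for row in reader if row["Word"] not in stopwords}

print(freqs)  # {'data': 11580, 'learning': 7682}
```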
Solution
from wordcloud import WordCloud, STOPWORDS
import numpy as np

# Generally it is best practice to use the built-in Python
# context manager to handle files and let it manage closing/clean-up
with open("test.txt", mode="r", encoding="utf-8") as file:
    text = file.read()

# Convert the file into a dict using built-in Python functions
# and a dict comprehension, filtering out entries that are stop words.
# Note: the "[1:]" skips the header "Word,Frequency" in the
# test.txt file; remove it if your file has no header. Blank lines
# (e.g. a trailing newline) are skipped so that split(',') cannot fail.
textDict = {key: int(val)
            for key, val in (line.split(',')
                             for line in text.split('\n')[1:]
                             if line.strip())
            if key not in STOPWORDS}

# If you want to normalise the frequency values
valsArr = np.array(list(textDict.values()))

# Calc the L1 (Manhattan distance) and L2 (Euclidean) norms
norm1 = np.abs(valsArr).sum(axis=0)
norm2 = np.sqrt((valsArr**2).sum(axis=0))
# Note: the L2 norm squares each component, which means outliers can
# skew results. Thus, if outliers are present, use norm1 below instead.

# Normalise the frequency values
textDictNorm = {key: val / norm2 for key, val in textDict.items()}

# For debugging: uncomment to display the dicts of word:frequency pairs
# print(textDict)
# print(textDictNorm)

# Create the word cloud object
wc = WordCloud(background_color="white", height=400, width=600)
# Generate the word cloud from the normalised word:frequency dict
wc.generate_from_frequencies(textDictNorm)
# Export the word cloud to a file
wc.to_file("my_first_word_cloud.png")
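Note that the normalisation step only rescales the values; it never changes their relative order, so it cannot by itself make a word disappear. A small self-contained check with pure stdlib math and a made-up three-word dict:

```python
from math import sqrt, isclose

freqs = {"data": 300, "learning": 400, "training": 50}  # made-up values

# L1 (Manhattan) and L2 (Euclidean) norms of the frequency vector
norm1 = sum(abs(v) for v in freqs.values())
norm2 = sqrt(sum(v ** 2 for v in freqs.values()))

normed = {k: v / norm2 for k, v in freqs.items()}

# Ranking is preserved: the data/learning ratio survives the scaling
assert isclose(normed["data"] / normed["learning"], 300 / 400)
print(normed)
```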
Output
Regarding "python-3.x - Why are important, high-ranking (high-frequency) non-stop words missing from my word cloud?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/74133183/