python - How to count only the words in a dictionary, while returning the count under the dictionary key name

Tags: python pandas dictionary nltk data-science

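I want to do text mining on an Excel file. First, I have to concatenate all rows into one large block of text. Then I scan the text for words that appear in a dictionary. If a word is found, it should be counted under its dictionary key name. Finally, I want to return the counted words as a relational table [word, count]. I can count the words, but I can't get the dictionary part to work. My questions are: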

  1. Am I going about this the right way?
  2. Is this possible, and if so, how?

Code adapted from the internet:


import collections
import re
import matplotlib.pyplot as plt
import pandas as pd
#% matplotlib inline
#file = open('PrideAndPrejudice.txt', 'r')
#file = file.read()

''' Convert excel column/ rows into a string of words'''
#text_all = pd.read_excel('C:\Python_Projects\Rake\data_file.xlsx')
#df=pd.DataFrame(text_all)
#case_words= df['case_text']
#print(case_words)
#case_concat= case_words.str.cat(sep=' ')
#print (case_concat)
text_all = ("Billy was glad to see jack. Jack was estatic to play with Billy. Jack and Billy were lonely without eachother. Jack is tall and Billy is clever.")
''' done'''
import collections
import pandas as pd
import matplotlib.pyplot as plt
#% matplotlib inline
# Read input file, note the encoding is specified here 
# It may be different in your text file

# Startwords
startwords = {'happy':'glad','sad': 'lonely','big': 'tall', 'smart': 'clever'}
#startwords = startwords.union(set(['happy','sad','big','smart']))

# Instantiate a dictionary, and for every word in the file, 
# Add to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}
# To eliminate duplicates, remember to split by punctuation, and use case delimiters.
for word in text_all.lower().split():
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace(":","")
    word = word.replace("\"","")
    word = word.replace("!","")
    word = word.replace("“","")
    word = word.replace("‘","")
    word = word.replace("*","")
    if word  in startwords:
        if word  in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
# Print most common word
n_print = int(input("How many most common words to print: "))
print("\nOK. The {} most common words are as follows\n".format(n_print))
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(n_print):
    print(word, ": ", count)
# Close the file
#file.close()
# Create a data frame of the most common words 
# Draw a bar chart
lst = word_counter.most_common(n_print)
df = pd.DataFrame(lst, columns = ['Word', 'Count'])
df.plot.bar(x='Word',y='Count')

Error: Empty 'DataFrame': no numeric data to plot

Expected output:

  1. happy 1
  2. sad 1
  3. big 1
  4. smart 1

Best answer

The following approach works with the latest version of pandas (0.25.3 at the time of writing):

import pandas as pd

# Setup
df = pd.DataFrame({'case_text': ["Billy was glad to see jack. Jack was estatic to play with Billy. Jack and Billy were lonely without eachother. Jack is tall and Billy is clever."]})

startwords = {"happy":["glad","estatic"],
              "sad": ["depressed", "lonely"],
              "big": ["tall", "fat"],
              "smart": ["clever", "bright"]}

# First, invert the startwords dict so that each word maps to its key name
startwords_map = {w: k for k, v in startwords.items() for w in v}

# regex=True is passed explicitly: newer pandas versions treat the pattern
# as a literal string by default
(df['case_text'].str.lower()                 # converts to lower case
 .str.replace(r'[.,*!?:]', '', regex=True)   # removes punctuation and special characters
 .str.split()                                # splits the text on whitespace
 .explode()                                  # expands into a single pandas.Series of words
 .map(startwords_map)                        # maps each word to its key name (NaN if unmapped)
 .value_counts()                             # counts occurrences per key name, ignoring NaN
 .to_dict())                                 # outputs to dict

[Output]

{'happy': 2, 'big': 1, 'smart': 1, 'sad': 1}
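
For comparison, the same idea can be written as a plain loop without pandas, which is closer to the code in the question. In the question's loop, `word in startwords` tests the word against the dictionary keys ('happy', 'sad', ...), which never occur in the text, so `wordcount` stays empty and the plot fails; the two branches that set and increment the count are also swapped. The following is only a minimal sketch, assuming the list-valued `startwords` from the answer above; the names `word_to_key`, `tokens`, `key_counts`, and `count_rows` are made up for illustration.

```python
import re
from collections import Counter

text_all = ("Billy was glad to see jack. Jack was estatic to play with Billy. "
            "Jack and Billy were lonely without eachother. "
            "Jack is tall and Billy is clever.")

# Same shape as the answer's dictionary: key name -> words that count toward it.
startwords = {"happy": ["glad", "estatic"],
              "sad": ["depressed", "lonely"],
              "big": ["tall", "fat"],
              "smart": ["clever", "bright"]}

# Invert the dictionary so that each word points at its key name.
word_to_key = {word: key for key, words in startwords.items() for word in words}

# Lower-case the text and keep only runs of letters, which drops punctuation.
tokens = re.findall(r"[a-z]+", text_all.lower())

# Count the key name for every token found in the inverted dictionary.
# Counter handles the "first time vs. increment" bookkeeping itself.
key_counts = Counter(word_to_key[t] for t in tokens if t in word_to_key)

# Relational [word, count] rows, most frequent first.
count_rows = [[key, count] for key, count in key_counts.most_common()]
print(count_rows)
```

For the sample sentence this counts 'happy' twice and 'sad', 'big', and 'smart' once each. The rows in `count_rows` can be passed to `pd.DataFrame(count_rows, columns=['Word', 'Count'])` to build the table and bar chart as in the question.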

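Regarding "python - How to count only the words in a dictionary, while returning the count under the dictionary key name", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59054314/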
