python - How do I count occurrences of a list of words (substrings) in a pandas DataFrame?

Tags: python pandas dataframe find-occurrences

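I have a pandas DataFrame with approximately 1.5 million rows. I want to find the number of occurrences of specific, selected words (all known in advance) in a certain column. This works for a single word: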

d = df["Content"].str.contains("word").value_counts()

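But I want to find the occurrences of multiple known words, such as "word1" and "word2", taken from a list. In addition, word2 may appear either as word2 or as wordtwo, and both spellings should be counted together, like so: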

word1           40
word2/wordtwo   120

How can I achieve this?

Best answer

IMO one of the most efficient approaches is to use sklearn.feature_extraction.text.CountVectorizer and pass it a vocabulary (the list of words you want to count).

Demo (with numpy imported as np and pandas imported as pd):

In [21]: text = """
    ...: I have a pandas data frame with approximately 1.5 million rows. I want to find the number of occurrences of specific, selected words in a certain colu
    ...: mn. This works for a single word. But I want to find out the occurrences of multiple, known words like "word1", "word2" from a list. Also word2 could
    ...: be word2 or wordtwo, like so"""

In [22]: df = pd.DataFrame(text.split('. '), columns=['Content'])

In [23]: df
Out[23]:
                                             Content
0  \nI have a pandas data frame with approximatel...
1  I want to find the number of occurrences of sp...
2                       This works for a single word
3  But I want to find out the occurrences of mult...
4      Also word2 could be word2 or wordtwo, like so

In [24]: from sklearn.feature_extraction.text import CountVectorizer

In [25]: vocab = ['word', 'words', 'word1', 'word2', 'wordtwo']

In [26]: vect = CountVectorizer(vocabulary=vocab)

In [27]: res = pd.Series(np.ravel((vect.fit_transform(df['Content']).sum(axis=0))),
                         index=vect.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

In [28]: res
Out[28]:
word       1
words      2
word1      1
word2      3
wordtwo    1
dtype: int64
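
The result above reports word2 and wordtwo separately, while the question asks for them as one entry. One way to merge the spellings is to count each group with a single regular expression. The following is a minimal pandas-only sketch, not part of the original answer: the helper name count_alias_groups and the aliases dict are hypothetical, and it assumes whole-word, case-insensitive matching is what is wanted.

```python
import re

import pandas as pd


def count_alias_groups(series, aliases):
    """Count whole-word occurrences of each alias group in a Series of strings.

    `aliases` (a hypothetical structure) maps an output label to the list of
    spellings that should be counted under that label.
    """
    counts = {}
    for label, spellings in aliases.items():
        # \b keeps "word2" from matching inside a longer token such as "word22".
        pattern = r"\b(?:" + "|".join(re.escape(s) for s in spellings) + r")\b"
        # str.count returns the number of matches per row; sum gives the total.
        per_row = series.str.count(pattern, flags=re.IGNORECASE)
        counts[label] = int(per_row.sum())
    return pd.Series(counts)


# Small illustrative frame built from sentences in the question.
df = pd.DataFrame({"Content": [
    "This works for a single word",
    'known words like "word1", "word2" from a list',
    "Also word2 could be word2 or wordtwo, like so",
]})

aliases = {"word1": ["word1"], "word2/wordtwo": ["word2", "wordtwo"]}
res = count_alias_groups(df["Content"], aliases)
print(res)
```

If the CountVectorizer result is already available, summing the relevant entries, for example res[['word2', 'wordtwo']].sum() on the Series from the demo, should give the same combined count without a second pass over the text.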

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/50187849/
