python - How to count the number of words in a string in a DataFrame?

Suppose we have a simple DataFrame:

import pandas as pd

df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']

How can I count how many entries contain each number of words, something like:

1 word: 2
2 words: 2
3 words: 1
4 words: 1

Best answer

You can do the following:

In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[89]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

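Here we use the vectorised str.split to split on whitespace, then apply len to get the number of elements in each resulting list, and finally call value_counts to aggregate the frequency counts.

We then rename the index and sort it to get the desired output.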
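The chain above can be unpacked one step at a time. The following is a minimal sketch, not the answer's exact code: the intermediate names `words`, `n_words` and `freq` are introduced here for illustration, and the singular/plural labels are an addition to match the output asked for in the question.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Step 1: split each string on whitespace, giving a list of words per row.
words = df['fruits'].str.split()

# Step 2: the length of each list is the number of words in that row.
n_words = words.apply(len)

# Step 3: count how many rows have each word count, sorted numerically.
freq = n_words.value_counts().sort_index()

# Step 4: build the labels, using "word" for 1 and "words" otherwise.
# (This pluralisation is an illustrative addition, not part of the answer.)
freq.index = [f"{n} word" if n == 1 else f"{n} words" for n in freq.index]

print(freq)
```

Sorting while the index is still numeric avoids a subtle issue: once the labels are strings they sort lexicographically, so a label such as "10 words" would come before "2 words".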

Update

This can also be done with str.len instead of apply, which should scale better:

In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[41]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64
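Apart from speed, apply(len) and str.len() also differ in how they treat missing values. The sketch below assumes a hypothetical Series `s` containing a None entry, which is not part of the original data, and the flag `apply_failed` is introduced only for illustration.

```python
import pandas as pd

# Hypothetical data with a missing entry (not from the original question).
s = pd.Series(['one apple', None, 'box of oranges'])

# .str.len() propagates the missing value as NaN instead of failing.
lengths = s.str.split().str.len()
print(lengths)

# apply(len) calls len() on the missing value itself and raises TypeError.
try:
    s.str.split().apply(len)
    apply_failed = False
except TypeError:
    apply_failed = True

print("apply(len) raised TypeError:", apply_failed)
```

Since value_counts drops NaN by default, the str.len() variant would simply leave such rows out of the frequency table.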

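Timings (the second statement omits value_counts(), so the two lines do not measure the same amount of work):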

In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

1000 loops, best of 3: 799 µs per loop
1000 loops, best of 3: 347 µs per loop

For a 6K-row df:

In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop
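To repeat the comparison outside IPython, something like the following could be used. It is only a sketch: the answer does not say how the 6K-row frame was built, so repeating the six sample strings 1000 times is an assumption, and the helper names `with_apply` and `with_str_len` are made up here. Both helpers include value_counts so that they do the same work. Absolute numbers will vary with the machine and pandas version.

```python
import timeit
import pandas as pd

base = ['one apple', 'banana', 'box of oranges',
        'pile of fruits outside', 'one banana', 'fruits']
# Assumption: a 6,000-row frame made by repeating the sample data.
big = pd.DataFrame({'fruits': base * 1000})

def with_apply():
    # Python-level len() called once per row via apply.
    return big['fruits'].str.split().apply(len).value_counts()

def with_str_len():
    # Same computation using the .str.len() accessor.
    return big['fruits'].str.split().str.len().value_counts()

# Sanity check: both approaches should produce the same counts.
assert with_apply().sort_index().tolist() == with_str_len().sort_index().tolist()

for fn in (with_apply, with_str_len):
    # Best of 3 runs, 10 calls per run, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```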

This page, python - How to count the number of words in a string in a DataFrame?, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/37483470/
