python - Pandas DataFrame frequency

I have this DataFrame:

   source target
0     ape    dog
1     ape   hous
2     dog   hous
3    hors    dog
4    hors    ape
5     dog    ape
6     ape   bird
7     ape   hous
8    bird   hous
9    bird   fist
10   bird    ape
11   fist    ape

I am trying to generate frequency counts with this code:

df_count = df.groupby(['source', 'target']).size().reset_index().sort_values(0, ascending=False)
df_count.columns = ['source', 'target', 'weight']

I get the following result:

   source target  weight
2     ape   hous       2
0     ape   bird       1
1     ape    dog       1
3    bird    ape       1
4    bird   fist       1
5    bird   hous       1
6     dog    ape       1
7     dog   hous       1
8    fist    ape       1
9    hors    ape       1
10   hors    dog       1

How can I modify the code so that direction does not matter, i.e. instead of ape bird 1 and bird ape 1, I get ape bird 2?

Best answer

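First, sort the values within each row, so that every pair is stored in the same order regardless of its original direction.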

In [31]: df
Out[31]:
   source target
0     ape    dog
1     ape   hous
2     dog   hous
3    hors    dog
4    hors    ape
5     dog    ape
6     ape   bird
7     ape   hous
8    bird   hous
9    bird   fist
10   bird    ape
11   fist    ape

In [32]: df.values.sort()

In [33]: df
Out[33]:
   source target
0     ape    dog
1     ape   hous
2     dog   hous
3     dog   hors
4     ape   hors
5     ape    dog
6     ape   bird
7     ape   hous
8    bird   hous
9    bird   fist
10    ape   bird
11    ape   fist
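The in-place call above works only when `df.values` hands back a writable view of the frame's data, which is not guaranteed across pandas versions or dtypes. As a minimal sketch of the same row-wise sort that does not touch the original frame, assuming NumPy is available and using a small three-row sample and the name `normalized` purely for illustration:

```python
import numpy as np
import pandas as pd

# Small illustrative sample (not the full data from the question).
df = pd.DataFrame({
    "source": ["ape", "hors", "bird"],
    "target": ["dog", "ape", "ape"],
})

# np.sort returns a sorted copy; axis=1 sorts the two values within each row.
pairs = np.sort(df[["source", "target"]].to_numpy(dtype=object), axis=1)

# Rebuild a frame with the same column names and index; df itself is unchanged.
normalized = pd.DataFrame(pairs, columns=["source", "target"], index=df.index)
print(normalized)
```

Because the result is a new frame, the original edge directions stay available in `df` if they are needed later.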

Then group by source and target, aggregate with size, and sort the result.

In [34]: df.groupby(['source', 'target']).size().sort_values(ascending=False)
    ...:   .reset_index(name='weight')
Out[34]:
  source target  weight
0    ape   hous       2
1    ape    dog       2
2    ape   bird       2
3    dog   hous       1
4    dog   hors       1
5   bird   hous       1
6   bird   fist       1
7    ape   hors       1
8    ape   fist       1

This Q&A on python - Pandas DataFrame frequency is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/41084113/
