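python - How do I analyze all duplicate entries in this Pandas DataFrame?

I would like to compute descriptive statistics on data in a Pandas DataFrame, but I only care about the duplicated entries. For example, say I have the DataFrame created by: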

import pandas as pd
data={'key1':[1,2,3,1,2,3,2,2],'key2':[2,2,1,2,2,4,2,2],'data':[5,6,2,6,1,6,2,8]}
frame=pd.DataFrame(data,columns=['key1','key2','data'])
print(frame)


     key1  key2  data
0     1     2     5
1     2     2     6
2     3     1     2
3     1     2     6
4     2     2     1
5     3     4     6
6     2     2     2
7     2     2     8

As you can see, rows 0, 1, 3, 4, 6, and 7 are all duplicates (using 'key1' and 'key2'). However, if I index this DataFrame like so:

frame[frame.duplicated(['key1','key2'])]

I get:

   key1  key2  data
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8

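(i.e., the first and second rows, index 0 and 1, do not show up because the duplicated method does not flag them as True).

That is my first problem. My second problem is how to extract descriptive statistics from this information. Forgetting the missing duplicates for the moment, say I want to compute .min() and .max() for the duplicate entries (so that I can get a range). I can use groupby and call these methods on the groupby object like so: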

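# `a` is not defined in the original post; the output below is consistent with
# it being the filtered frame from above (an assumption)
a = frame[frame.duplicated(['key1','key2'])]
a.groupby(['key1','key2']).min()

This gives: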

           key1  key2  data
key1 key2                  
1    2        1     2     6
2    2        2     2     1

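The data I want is clearly here, but what is the best way to extract it? How do I index the resulting object to get what I want (i.e., the key1, key2, and data information)?

Best answer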

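EDIT for Pandas 0.17 or later:

Since the take_last argument of the duplicated() method was deprecated in favor of the new keep argument as of Pandas 0.17, please refer to this answer for the correct approach:

Invoke the duplicated() method with keep=False, i.e. frame.duplicated(['key1', 'key2'], keep=False).

Therefore, to extract the data required for this specific question, the following suffices: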

In [81]: frame[frame.duplicated(['key1', 'key2'], keep=False)].groupby(['key1', 'key2']).min()
Out[81]: 
           data
key1 key2      
1    2        5
2    2        1

[2 rows x 1 columns]
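
Since the question asks for both .min() and .max() (to get a range) and for the keys as ordinary columns, here is a minimal sketch of one way to do that with keep=False. The names dupes, stats, flat and the 'range' column are illustrative and not part of the original answer; agg and reset_index are standard pandas calls.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# keep=False flags every member of a duplicated (key1, key2) group,
# including the first occurrence
dupes = frame[frame.duplicated(['key1', 'key2'], keep=False)]

# One row per key pair, with the min and max of 'data'
stats = dupes.groupby(['key1', 'key2'])['data'].agg(['min', 'max'])
stats['range'] = stats['max'] - stats['min']  # illustrative extra column

# reset_index() turns the group keys back into ordinary columns
flat = stats.reset_index()
print(flat)
```

After reset_index(), key1 and key2 can be selected like any other column, e.g. flat[['key1', 'key2', 'min']].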

Interestingly enough, this change in Pandas 0.17 may be partially attributed to this question, as referred to in this issue.

For versions preceding Pandas 0.17:

We can use the take_last argument of the duplicated() method:

take_last: boolean, default False

For a set of distinct duplicate rows, flag all but the last row as duplicated. Default is for all but the first row to be flagged.

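If we set take_last to True, we flag all but the last duplicate row. Combining this with the default value of False, which flags all but the first duplicate row, lets us flag every duplicated row: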

In [76]: frame.duplicated(['key1', 'key2'])
Out[76]: 
0    False
1    False
2    False
3     True
4     True
5    False
6     True
7     True
dtype: bool

In [77]: frame.duplicated(['key1', 'key2'], take_last=True)
Out[77]: 
0     True
1     True
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool

In [78]: frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])
Out[78]: 
0     True
1     True
2    False
3     True
4     True
5    False
6     True
7     True
dtype: bool

In [79]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])]
Out[79]: 
   key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8

[6 rows x 3 columns]
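
For readers on Pandas 0.17 or later, where take_last is no longer the supported spelling, the same combination can be sketched with the keep argument. This is an illustration rather than part of the original answer; the variable names are made up, and it assumes keep='first' and keep='last' correspond to take_last=False and take_last=True respectively.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# Flags all but the first row of each duplicated group (the default)
not_first = frame.duplicated(['key1', 'key2'], keep='first')
# Flags all but the last row of each duplicated group
not_last = frame.duplicated(['key1', 'key2'], keep='last')

# A row in a duplicated group is "not first" or "not last" (or both),
# so the union flags every member of the group
either = not_first | not_last

# keep=False produces that mask in a single call
all_dupes = frame.duplicated(['key1', 'key2'], keep=False)
print(either.equals(all_dupes))
```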

Now we just need to use the groupby and min methods, and I believe the output is in the required format:

In [81]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])].groupby(('key1', 'key2')).min()
Out[81]: 
           data
key1 key2      
1    2        5
2    2        1

[2 rows x 1 columns]
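
As an alternative that does not depend on either take_last or keep, the duplicated rows can also be found by counting group sizes. This is a different technique from the one in the answer, sketched here under the assumption that "duplicate" means "shares its (key1, key2) pair with at least one other row"; group_size, dupes and result are illustrative names.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# For each row, the number of rows sharing its (key1, key2) pair
group_size = frame.groupby(['key1', 'key2'])['data'].transform('size')

# Keep only rows whose key pair occurs more than once
dupes = frame[group_size > 1]

# as_index=False keeps key1 and key2 as ordinary columns in the result
result = dupes.groupby(['key1', 'key2'], as_index=False)['data'].min()
print(result)
```

Using as_index=False here also addresses the question of how to get key1, key2, and data back as plain columns instead of a MultiIndex.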

Regarding "python - How do I analyze all duplicate entries in this Pandas DataFrame?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26244309/
