Python pandas deduplication with complex criteria

I have the following DataFrame:

import pandas as pd
d = {'id': [1, 2, 3, 4, 4, 6, 1, 8, 9], 'cluster': [7, 2, 3, 3, 3, 6, 7, 8, 8]}
df = pd.DataFrame(data=d)
df = df.sort_values('cluster')

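I want to keep every row of a cluster when that cluster contains more than one distinct id, including rows that repeat an id, as long as the cluster has at least one differing id. The code I have been using is below; the only problem is that it drops too many rows for what I am looking for.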

df = (df.assign(counts=df.count(axis=1))
   .sort_values(['id', 'counts'])
   .drop_duplicates(['id','cluster'], keep='last')
   .drop('counts', axis=1))

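The output DataFrame I expect, which the code above does not produce, would drop the rows at DataFrame index 1, 5, 0 and 6 but keep the rows at index 2, 3, 4, 7 and 8. Essentially, the result of the following code: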

df = df.loc[[2, 3, 4, 7, 8]]

I have looked at many pandas deduplication posts on Stack Overflow but have not found this scenario. Any help would be greatly appreciated.

Best answer

I think we can do this with a boolean mask, using .groupby().nunique():

con1 = df.groupby('cluster')['id'].nunique() > 1

# Of these, we only want the clusters flagged True.

cluster
2    False
3     True
6    False
7    False
8     True


df.loc[(df['cluster'].isin(con1[con1].index))]

   id  cluster
2   3        3
3   4        3
4   4        3
7   8        8
8   9        8
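
As a possible alternative to the two-step approach above, the per-cluster count can be broadcast back onto the rows with .transform('nunique'), which gives a row-aligned mask without the intermediate isin lookup. This is only a sketch on the question's sample data; the names ids_per_cluster and result are illustrative and not part of the original answer.

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 4, 6, 1, 8, 9], 'cluster': [7, 2, 3, 3, 3, 6, 7, 8, 8]}
df = pd.DataFrame(data=d).sort_values('cluster')

# For each row, the number of distinct ids found in that row's cluster.
# transform() returns a Series aligned to df's index, so it can be used as a mask.
ids_per_cluster = df.groupby('cluster')['id'].transform('nunique')

# Keep every row whose cluster holds more than one distinct id,
# including rows that repeat an id within that cluster.
result = df[ids_per_cluster > 1]
print(result)
```

Both versions should select the same rows; the transform form mainly saves building and filtering the intermediate con1 Series.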

This answer to "Python pandas deduplication with complex criteria" comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/66297414/
