I created a DataFrame with duplicated rows, as shown below:
import numpy as np
import pandas as pd

df = pd.DataFrame({"Order Date": ["January 1, 2017", "March 15, 2017", "April 20, 2017", "June 23, 2017", "December 12, 2017", None, "April 20, 2017", "April 20, 2017"],
                   "Sales Person": ["John", "John", "Rick", "Mary", "Mary", "Rick", "Rick", "Rick"],
                   "Items Sold": [4, -999, 1, np.nan, 7, 3, 1, 1],
                   "Item Price": [4.99, 1.99, 9.99, 19.99, 0.99, 2.99, 9.99, 9.99]})
If I select the duplicates, it correctly shows two duplicated rows.
df[df.duplicated()]
Then I call drop_duplicates to remove the second duplicate and keep the first.
df.drop_duplicates()
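However, it looks like it removed both rows instead of keeping the first one. Am I missing something about the drop_duplicates method? The docstring says the keep parameter defaults to 'first', and the same thing happens even when I pass that argument explicitly.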
Best answer
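Your example contains three duplicated rows, not two. With the default keep='first', duplicated() flags only the later occurrences, so the first one is not shown. Use keep=False to see all of them: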
df[df.duplicated(keep=False)]
Out[661]:
   Item Price  Items Sold      Order Date Sales Person
2        9.99         1.0  April 20, 2017         Rick
6        9.99         1.0  April 20, 2017         Rick
7        9.99         1.0  April 20, 2017         Rick
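To make the difference between the two masks concrete, here is a minimal sketch that rebuilds the DataFrame from the question and compares which index labels each call flags. The variable names `default_mask` and `all_mask` are mine, not part of the original post.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "Order Date": ["January 1, 2017", "March 15, 2017", "April 20, 2017", "June 23, 2017",
                   "December 12, 2017", None, "April 20, 2017", "April 20, 2017"],
    "Sales Person": ["John", "John", "Rick", "Mary", "Mary", "Rick", "Rick", "Rick"],
    "Items Sold": [4, -999, 1, np.nan, 7, 3, 1, 1],
    "Item Price": [4.99, 1.99, 9.99, 19.99, 0.99, 2.99, 9.99, 9.99],
})

# Default keep="first": the first occurrence is not flagged, only later repeats are.
default_mask = df.duplicated()
# keep=False: every member of a duplicate group is flagged, including the first.
all_mask = df.duplicated(keep=False)

print(df.index[default_mask].tolist())  # [6, 7]
print(df.index[all_mask].tolist())      # [2, 6, 7]
```

The two rows shown by `df[df.duplicated()]` are therefore the second and third copies, which is why dropping both of them still leaves the first copy in place.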
Your drop_duplicates call then keeps only the first of them, the third row (index 2):
df.drop_duplicates()
Out[659]:
   Item Price  Items Sold         Order Date Sales Person
0        4.99         4.0    January 1, 2017         John
1        1.99      -999.0     March 15, 2017         John
2        9.99         1.0     April 20, 2017         Rick
3       19.99         NaN      June 23, 2017         Mary
4        0.99         7.0  December 12, 2017         Mary
5        2.99         3.0               None         Rick
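If you want to confirm this without reading the output by eye, a quick check along the following lines may help. It is only a sketch: the names `deduped`, `manual`, and `last_kept` are illustrative, and the DataFrame is rebuilt from the question so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "Order Date": ["January 1, 2017", "March 15, 2017", "April 20, 2017", "June 23, 2017",
                   "December 12, 2017", None, "April 20, 2017", "April 20, 2017"],
    "Sales Person": ["John", "John", "Rick", "Mary", "Mary", "Rick", "Rick", "Rick"],
    "Items Sold": [4, -999, 1, np.nan, 7, 3, 1, 1],
    "Item Price": [4.99, 1.99, 9.99, 19.99, 0.99, 2.99, 9.99, 9.99],
})

deduped = df.drop_duplicates()      # keep="first" is the default
manual = df[~df.duplicated()]       # drop only the rows flagged as later repeats

print(len(df), len(deduped))                 # 8 6
print(deduped.index.tolist())                # [0, 1, 2, 3, 4, 5]
print(deduped.index.equals(manual.index))    # True

# For comparison, keep="last" retains the final copy (index 7) instead.
last_kept = df.drop_duplicates(keep="last")
print(last_kept.index.tolist())              # [0, 1, 3, 4, 5, 7]
```

Eight rows go in and six come out: two of the three identical rows are removed, and index 2 survives.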
Source: the Stack Overflow question "Pandas `drop_duplicates` does not keep the first row": https://stackoverflow.com/questions/47659385/