I have the following DataFrame:
> import pandas as pd
> df = pd.DataFrame(columns=['Name','Change Date','Final Date'])
> df['Name'] = ['Alexandra','Alexandra','Alexandra','Alexandra','Bobby','Bobby']
> df['Change Date'] = ['2019-04-12','2019-04-28','2019-05-21','2019-05-30','2019-03-11','2019-03-27']
> df['Final Date'] = ['2019-04-15','2019-04-15','2019-05-27','2019-05-27','2019-03-20','2019-03-20']
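I want to drop all duplicates, keeping only the row whose Change Date is closest to each Final Date, so that I end up with the following DataFrame: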
> df = pd.DataFrame( columns = ['Name','Change Date','Final Date'])
> df['Name'] = ['Alexandra','Alexandra','Bobby']
> df['Change Date'] =['2019-04-12','2019-05-30','2019-03-27']
> df['Final Date'] =['2019-04-15','2019-05-27','2019-03-20']
Best answer
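Convert both columns to datetimes, subtract them with Series.sub, and take the absolute value with Series.abs. Then use DataFrameGroupBy.idxmin to get the index of the minimum value in each group, and select the original rows with DataFrame.loc: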
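# Parse both columns as datetimes so they can be subtracted
df['Final Date'] = pd.to_datetime(df['Final Date'])
df['Change Date'] = pd.to_datetime(df['Change Date'])
# Absolute gap between the two dates, as a timedelta
df['diff'] = df['Final Date'].sub(df['Change Date']).abs()
# Keep the row with the smallest gap in each (Name, Final Date) group
df1 = df.loc[df.groupby(['Name','Final Date'])['diff'].idxmin()]
print(df1)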
        Name Change Date Final Date   diff
0  Alexandra  2019-04-12 2019-04-15 3 days
3  Alexandra  2019-05-30 2019-05-27 3 days
5      Bobby  2019-03-27 2019-03-20 7 days
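The expected result in the question has no diff column, so it may be useful to avoid adding one. The following is a minimal sketch of the same idxmin approach, assuming the sample data from the question; the names gap, closest and result are chosen here for illustration and are not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alexandra', 'Alexandra', 'Alexandra', 'Alexandra', 'Bobby', 'Bobby'],
    'Change Date': pd.to_datetime(['2019-04-12', '2019-04-28', '2019-05-21',
                                   '2019-05-30', '2019-03-11', '2019-03-27']),
    'Final Date': pd.to_datetime(['2019-04-15', '2019-04-15', '2019-05-27',
                                  '2019-05-27', '2019-03-20', '2019-03-20']),
})

# Absolute gap kept as a standalone Series, so df gets no helper column
gap = df['Final Date'].sub(df['Change Date']).abs()

# Index label of the smallest gap within each (Name, Final Date) group
closest = gap.groupby([df['Name'], df['Final Date']]).idxmin()

# Select those rows and renumber them from 0, as in the expected result
result = df.loc[closest].reset_index(drop=True)
print(result)
```

Grouping the Series by the two columns of df works because they share the same index, so no temporary column is needed.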
If a group may contain several rows tied for the minimum and you want to keep all of them:
df1 = df[df.groupby(['Name','Final Date'])['diff'].transform('min').eq(df['diff'])]
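The difference between the two approaches only shows up when a group has a tie, which the question's data does not. Here is a small sketch with made-up data (the name Cara and its dates are hypothetical, invented purely to produce a tie) that compares them:

```python
import pandas as pd

# Hypothetical data: two Change Dates are both 2 days from the Final Date
tied = pd.DataFrame({
    'Name': ['Cara', 'Cara', 'Cara'],
    'Change Date': pd.to_datetime(['2019-06-08', '2019-06-12', '2019-06-20']),
    'Final Date': pd.to_datetime(['2019-06-10'] * 3),
})
tied['diff'] = tied['Final Date'].sub(tied['Change Date']).abs()

grouped = tied.groupby(['Name', 'Final Date'])['diff']

# idxmin returns only the first label holding the minimum in each group
first_only = tied.loc[grouped.idxmin()]

# transform('min') broadcasts the group minimum to every row, so all ties match
all_ties = tied[grouped.transform('min').eq(tied['diff'])]

print(len(first_only), len(all_ties))
```

With this data, first_only keeps just row 0, while all_ties keeps rows 0 and 1.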
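Alternatively, if you need to group by the Name column only and select both rows with the minimal value of 3 days, use GroupBy.transform with min to create a Series aligned with the original rows, compare it with the diff column, and filter by boolean indexing: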
df1 = df[df.groupby('Name')['diff'].transform('min').eq(df['diff'])]
print(df1)
        Name Change Date Final Date   diff
0  Alexandra  2019-04-12 2019-04-15 3 days
3  Alexandra  2019-05-30 2019-05-27 3 days
5      Bobby  2019-03-27 2019-03-20 7 days
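Since the question is phrased in terms of removing duplicates, a different technique from the one in the answer may also be worth noting: sort by the gap and let DataFrame.drop_duplicates keep the first row per group. This is only a sketch under the same sample data, not the method from the answer, and like idxmin it keeps a single row when there is a tie:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alexandra', 'Alexandra', 'Alexandra', 'Alexandra', 'Bobby', 'Bobby'],
    'Change Date': pd.to_datetime(['2019-04-12', '2019-04-28', '2019-05-21',
                                   '2019-05-30', '2019-03-11', '2019-03-27']),
    'Final Date': pd.to_datetime(['2019-04-15', '2019-04-15', '2019-05-27',
                                  '2019-05-27', '2019-03-20', '2019-03-20']),
})
df['diff'] = df['Final Date'].sub(df['Change Date']).abs()

df1 = (
    df.sort_values('diff', kind='mergesort')    # stable sort: smallest gap first
      .drop_duplicates(['Name', 'Final Date'])  # keep the first row per group
      .sort_index()                             # restore the original row order
      .drop(columns='diff')                     # remove the helper column
)
print(df1)
```

The stable sort means that, among tied rows, the one that appeared first in the original frame is the one kept.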
Source: the Stack Overflow question "python - How to remove duplicates in Python based on a date-proximity condition?": https://stackoverflow.com/questions/57321272/