python - Flag rolling duplicates by prior value (year) within a grouping variable

I am trying to find out whether any ID already appeared in an earlier year (the Duplicate column in dfo). If it did, I want to flag the row as a duplicate and record the year in which the ID first appeared (Year_Duplicate).

I do have working code.

Objective: I want to learn a better (more 'pythonic') way to solve this problem. If there is a more condensed way to do it, I'd appreciate any help. I'm not too familiar with all the features that numpy and pandas offer.

Sample input:

dfi.to_dict() = 
{'Year': {0: 2020,
  1: 2020,
  2: 2020,
  3: 2021,
  4: 2021,
  5: 2021,
  6: 2022,
  7: 2022,
  8: 2022},
 'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
 '$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3}}

Sample output:

dfo.to_dict()
{'Year': {0: 2020,
  1: 2020,
  2: 2020,
  3: 2021,
  4: 2021,
  5: 2021,
  6: 2022,
  7: 2022,
  8: 2022},
 'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
 '$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3},
 'Duplicate': {0: False,
  1: False,
  2: False,
  3: True,
  4: False,
  5: True,
  6: False,
  7: True,
  8: True},
 'Year_Duplicate': {0: nan,
  1: nan,
  2: nan,
  3: 2020.0,
  4: nan,
  5: 2020.0,
  6: nan,
  7: 2020.0,
  8: 2021.0}}

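Working code:

import pandas as pd
from numpy import nan as NA

# dfi and dfo start out as the dictionaries shown above
dfi = pd.DataFrame.from_dict(dfi)
dfo = pd.DataFrame.from_dict(dfo)

df_process = dfi.copy()
# Flag every occurrence of an ID after its first one
df_process['Duplicate'] = df_process['ID'].duplicated()

# Row labels of the earliest year for each ID (idxmin must be called)
indexes = df_process.groupby('ID')['Year'].idxmin()
df_min_year = df_process[['Year', 'ID']].loc[indexes]
df_min_year = df_min_year.rename(columns={"Year": "Year_Duplicate"})

# Attach the first year to every row, then blank it on the first occurrence itself
df_process = pd.merge(df_process, df_min_year, on=['ID'], how='left')
df_process.loc[df_process['Year_Duplicate'] == df_process['Year'], 'Year_Duplicate'] = NA

dfo.equals(df_process)  # returns True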

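I'm happy to answer any clarifying questions. Thank you for your help.

Clarifications in response to the comments below:

$ is just a number representing sales. It can be ignored or copied over as-is.
Year_Duplicate shows the first year in which the ID occurred. If the row is not a duplicate, Year_Duplicate is not needed, and in that case we leave it blank.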

Best answer

Use Series.duplicated with Series.where, and GroupBy.transform with GroupBy.first:

df['Year_Duplicated']=df.groupby('ID')['Year'].transform('first').where(df['ID'].duplicated())
print (df)
   Year  ID  $  Year_Duplicated
0  2020   1  1              NaN
1  2020   2  1              NaN
2  2020   3  1              NaN
3  2021   1  2           2020.0
4  2021   4  2              NaN
5  2021   2  2           2020.0
6  2022   5  3              NaN
7  2022   1  3           2020.0
8  2022   4  3           2021.0
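
The one-liner above only adds the year column. To reproduce the asker's full expected frame, including the boolean Duplicate column, the same calls can be combined as in the sketch below. It assumes the sample data from the question; the helper name flag_repeats is hypothetical and not part of the original answer.

```python
import pandas as pd

# Sample input from the question.
dfi = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    "ID": [1, 2, 3, 1, 4, 2, 5, 1, 4],
    "$": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})


def flag_repeats(df):
    """Hypothetical helper: add Duplicate and Year_Duplicate columns."""
    out = df.copy()
    # True for every occurrence of an ID after its first one.
    out["Duplicate"] = out["ID"].duplicated()
    # First Year seen per ID, kept only on the repeated rows (NaN elsewhere).
    first_year = out.groupby("ID")["Year"].transform("first")
    out["Year_Duplicate"] = first_year.where(out["Duplicate"])
    return out


dfo = flag_repeats(dfi)
print(dfo)
```

Because Series.where fills the unflagged rows with NaN, Year_Duplicate comes back as a float column, which matches the 2020.0 style values in the expected output.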

Details:

print (df.groupby('ID')['Year'].transform('first'))
0    2020
1    2020
2    2020
3    2020
4    2021
5    2020
6    2022
7    2020
8    2021
Name: Year, dtype: int64
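
Note that 'first' returns the first row of each group in order of appearance, so it equals the earliest year here only because the sample is sorted by Year. If the rows might arrive unsorted, a variant along the following lines may be safer. This is a sketch, not part of the original answer, and it assumes each ID appears at most once per year; the shuffled sample rows are made up for illustration.

```python
import pandas as pd

# Rows from the question's sample, deliberately shuffled so Year is unsorted.
df = pd.DataFrame({
    "Year": [2022, 2021, 2020, 2020, 2022, 2021, 2020, 2021, 2022],
    "ID": [1, 1, 1, 2, 4, 4, 3, 2, 5],
})

# Earliest year per ID, independent of row order.
min_year = df.groupby("ID")["Year"].transform("min")

# Assumption: an ID appears at most once per year, so any row later than
# the earliest year for its ID is a repeat.
df["Duplicate"] = df["Year"] > min_year
df["Year_Duplicate"] = min_year.where(df["Duplicate"])
print(df)
```

Comparing Year against the per-ID minimum mirrors the asker's own code, which blanks Year_Duplicate wherever it equals Year.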

Regarding "python - Flag rolling duplicates by prior value (year) within a grouping variable", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58062128/
