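I am trying to find out whether any ID already appeared in an earlier year (i.e. the Duplicate column in dfo). If it did, I want to flag that row as a duplicate and record the year in which the ID first appeared (i.e. Year_Duplicate).
I do have working code.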
Objective: I want to learn a better (more 'pythonic') way to solve this problem. If there is a more concise way to do it, I'd appreciate any help. I'm not too familiar with all the features that numpy and pandas offer.
Sample input:
dfi.to_dict() =
{'Year': {0: 2020,
1: 2020,
2: 2020,
3: 2021,
4: 2021,
5: 2021,
6: 2022,
7: 2022,
8: 2022},
'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
'$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3}}
Sample output:
dfo.to_dict()
{'Year': {0: 2020,
1: 2020,
2: 2020,
3: 2021,
4: 2021,
5: 2021,
6: 2022,
7: 2022,
8: 2022},
'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
'$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3},
'Duplicate': {0: False,
1: False,
2: False,
3: True,
4: False,
5: True,
6: False,
7: True,
8: True},
'Year_Duplicate': {0: nan,
1: nan,
2: nan,
3: 2020.0,
4: nan,
5: 2020.0,
6: nan,
7: 2020.0,
8: 2021.0}}
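Working code:
import pandas as pd
from numpy import nan as NA
# dfi and dfo start out as the dictionaries shown above
dfi = pd.DataFrame.from_dict(dfi)
dfo = pd.DataFrame.from_dict(dfo)
df_process = dfi.copy()
# Flag every occurrence of an ID after its first one
df_process['Duplicate'] = df_process['ID'].duplicated()
# Row label of the earliest year for each ID (idxmin must be called)
indexes = df_process.groupby('ID')['Year'].idxmin()
df_min_year = df_process[['Year', 'ID']].loc[indexes]
df_min_year = df_min_year.rename(columns={"Year": "Year_Duplicate"})
# Attach each ID's earliest year to all of its rows
df_process = pd.merge(df_process, df_min_year, on=['ID'], how='left')
# Blank out the year on rows that are themselves the first occurrence
df_process.loc[df_process['Year_Duplicate'] == df_process['Year'], 'Year_Duplicate'] = NA
dfo.equals(df_process)  # returns True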
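I am happy to answer any clarifying questions. Thank you for your help.
Clarification following the comments below:
$ is just a number representing sales. It can be ignored when checking for duplicates. Year_Duplicate shows the first year in which the ID occurred. If the row is not a duplicate, Year_Duplicate is not needed, and in that case we leave it blank.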
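A side note on "blank": in pandas a blank cell is NaN, and NaN forces the column to float, which is why the sample output shows 2020.0 rather than 2020. If whole-number years are preferred, one option is the nullable Int64 dtype. This is only a sketch and not part of the original requirement; the series below is made-up illustration data.

```python
import numpy as np
import pandas as pd

# Hypothetical Year_Duplicate values: NaN marks "not a duplicate".
year_duplicate = pd.Series([np.nan, 2020.0, np.nan, 2021.0])

# Nullable integer dtype: keeps missing values as <NA> and years as integers.
as_int = year_duplicate.astype('Int64')
```

Note that a frame converted this way would no longer compare equal to dfo with DataFrame.equals, because the dtypes differ.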
Best answer
Use Series.duplicated with Series.where, and GroupBy.transform with GroupBy.first:
df['Year_Duplicated']=df.groupby('ID')['Year'].transform('first').where(df['ID'].duplicated())
print(df)
   Year  ID  $  Year_Duplicated
0  2020   1  1              NaN
1  2020   2  1              NaN
2  2020   3  1              NaN
3  2021   1  2           2020.0
4  2021   4  2              NaN
5  2021   2  2           2020.0
6  2022   5  3              NaN
7  2022   1  3           2020.0
8  2022   4  3           2021.0
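The one-liner fills only the year column. As a minimal sketch, assuming the sample input from the question and the question's own column names (Duplicate and Year_Duplicate), both requested columns could be built like this:

```python
import pandas as pd

# Sample input copied from the question.
dfi = pd.DataFrame({
    'Year': [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    'ID':   [1, 2, 3, 1, 4, 2, 5, 1, 4],
    '$':    [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

df = dfi.copy()
# True for every occurrence of an ID after its first one (in row order).
df['Duplicate'] = df['ID'].duplicated()
# Broadcast each ID's first Year to all of its rows,
# then keep it only on the rows flagged as duplicates (NaN elsewhere).
first_year = df.groupby('ID')['Year'].transform('first')
df['Year_Duplicate'] = first_year.where(df['Duplicate'])
```

Reusing the Duplicate column as the mask avoids calling duplicated() twice and keeps the two columns consistent with each other.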
Details:
print(df.groupby('ID')['Year'].transform('first'))
0    2020
1    2020
2    2020
3    2020
4    2021
5    2020
6    2022
7    2020
8    2021
Name: Year, dtype: int64
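One caveat: transform('first') takes the first row of each group in the current row order, so it returns the earliest year only when the frame is already sorted by Year, as the sample is. For unsorted data, a hedged sketch could use transform('min') instead. It assumes that "duplicate" means "the ID also occurs in a strictly earlier year", which is one reading of the question rather than something it states outright. The shuffled frame below is illustration data built from the same (Year, ID) pairs.

```python
import pandas as pd

# Same (Year, ID) pairs as the question's sample, deliberately shuffled.
df = pd.DataFrame({
    'Year': [2022, 2020, 2021, 2020, 2022, 2021, 2020, 2021, 2022],
    'ID':   [1, 1, 1, 2, 4, 4, 3, 2, 5],
})

# Earliest year per ID, independent of row order.
min_year = df.groupby('ID')['Year'].transform('min')
# A row counts as a duplicate when its ID already exists in an earlier year.
df['Duplicate'] = df['Year'] > min_year
df['Year_Duplicate'] = min_year.where(df['Duplicate'])
```

Unlike duplicated(), this comparison does not flag an ID that appears twice within the same first year, so the two approaches agree only when each ID occurs at most once per year.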
Regarding "python - Flag rolling duplicates by previous value (year) within a grouping variable", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/58062128/