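I have code that does this, but it iterates over every row of the dataframe with iterrows(). With more than 6 million rows to check, processing takes quite a long time, and I would like to use vectorization to speed it up.
I looked into pd.Grouper with freq, but I keep getting stuck on how to perform this check across 2 dataframes.
Given the following 2 dataframes:
I want to take all rows in df1 (grouped by 'sid' and 'modtype'):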
df1:
   sid servid       date modtype service
0  123    881 2022-07-05      A1       z
1  456    879 2022-07-02      A2       z
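then find them in df2 and count how many times each group appears within 3 days of that group's date in df1, giving both the number of occurrences in the 3 days before and the number of occurrences in the 3 days after.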
df2:
    sid servid       date modtype
0   123   1234 2022-07-03      A1
1   123    881 2022-07-05      A1
2   123  65781 2022-07-06      A1
3   123   8552 2022-07-30      A1
4   123   3453 2022-07-04      A2
5   123   5681 2022-07-07      A2
6   456     78 2022-07-01      A1
7   456  26744 2022-05-05      A2
8   456  56166 2022-06-29      A2
9   456  56717 2022-06-30      A2
10  456    879 2022-07-02      A2
11  456     56 2022-07-25      A2
So essentially, for the sample set provided below, my output would end up being:
   sid servid       date modtype service  cnt_3day_before  cnt_3day_after
0  123    881 2022-07-05      A1       z                1               1
1  456    879 2022-07-02      A2       z                2               0
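To make the window boundaries concrete, here is a minimal sketch for the ('456', 'A2') group only. It assumes half-open windows that exclude the reference date itself: [date - 3 days, date) for "before" and (date, date + 3 days] for "after". The "before" bounds are inferred from the expected output (2022-06-29, exactly 3 days earlier, is counted); the "after" bounds are an assumption by symmetry, since the sample data does not exercise that edge.

```python
import pandas as pd

# Reference date of the (sid='456', modtype='A2') row in df1.
ref = pd.Timestamp("2022-07-02")

# Dates of the (sid='456', modtype='A2') rows in df2.
dates = pd.Series(pd.to_datetime(
    ["2022-05-05", "2022-06-29", "2022-06-30", "2022-07-02", "2022-07-25"]
))

window = pd.Timedelta(days=3)

# Before: [ref - 3 days, ref), so the reference date is not counted.
before = (dates >= ref - window) & (dates < ref)
# After: (ref, ref + 3 days], assumed by symmetry with the "before" window.
after = (dates > ref) & (dates <= ref + window)

cnt_before = int(before.sum())
cnt_after = int(after.sum())
print(cnt_before, cnt_after)  # prints: 2 0
```

The row dated 2022-07-02 in df2 is the df1 row itself, which is why both windows leave the reference date out.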
Sample set:
import pandas as pd
data1 = {
    'sid': ['123', '456'],
    'servid': ['881', '879'],
    'date': ['2022-07-05', '2022-07-02'],
    'modtype': ['A1', 'A2'],
    'service': ['z', 'z']}
df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.sort_values(by=['sid','modtype','date'], ascending=[True, True, True]).reset_index(drop=True)
data2 = {
    'sid': ['123', '123', '123', '123', '123', '123',
            '456', '456', '456', '456', '456', '456'],
    'servid': ['1234', '3453', '881', '65781', '5681', '8552',
               '26744', '56717', '879', '56166', '56', '78'],
    'date': ['2022-07-03', '2022-07-04', '2022-07-05', '2022-07-06', '2022-07-07', '2022-07-30',
             '2022-05-05', '2022-06-30', '2022-07-02', '2022-06-29', '2022-07-25', '2022-07-01'],
    'modtype': ['A1', 'A2', 'A1', 'A1', 'A2', 'A1',
                'A2', 'A2', 'A2', 'A2', 'A2', 'A1']}
df2 = pd.DataFrame(data2)
df2['date'] = pd.to_datetime(df2['date'])
df2 = df2.sort_values(by=['sid','modtype','date'], ascending=[True, True, True]).reset_index(drop=True)
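For comparison, the row-by-row approach mentioned at the top of the question might look like the sketch below. This is a hypothetical reconstruction, not the asker's actual code: the helper name `count_with_iterrows` and the half-open window bounds are assumptions chosen to reproduce the expected output. It is slow on millions of rows, but it is handy as a reference for checking a vectorized version.

```python
import pandas as pd

def count_with_iterrows(df1, df2, days=3):
    """Slow row-by-row baseline (hypothetical helper, for verification only)."""
    window = pd.Timedelta(days=days)
    before, after = [], []
    for _, row in df1.iterrows():
        # df2 rows in the same (sid, modtype) group as the current df1 row.
        same_group = (df2["sid"] == row["sid"]) & (df2["modtype"] == row["modtype"])
        d = df2.loc[same_group, "date"]
        # Assumed windows: [date - 3d, date) and (date, date + 3d].
        before.append(int(((d >= row["date"] - window) & (d < row["date"])).sum()))
        after.append(int(((d > row["date"]) & (d <= row["date"] + window)).sum()))
    return df1.assign(cnt_3day_before=before, cnt_3day_after=after)

# Same values as the sample set above, rebuilt so the block runs on its own.
df1 = pd.DataFrame({
    "sid": ["123", "456"],
    "servid": ["881", "879"],
    "date": pd.to_datetime(["2022-07-05", "2022-07-02"]),
    "modtype": ["A1", "A2"],
    "service": ["z", "z"],
})
df2 = pd.DataFrame({
    "sid": ["123"] * 6 + ["456"] * 6,
    "servid": ["1234", "3453", "881", "65781", "5681", "8552",
               "26744", "56717", "879", "56166", "56", "78"],
    "date": pd.to_datetime([
        "2022-07-03", "2022-07-04", "2022-07-05", "2022-07-06", "2022-07-07", "2022-07-30",
        "2022-05-05", "2022-06-30", "2022-07-02", "2022-06-29", "2022-07-25", "2022-07-01"]),
    "modtype": ["A1", "A2", "A1", "A1", "A2", "A1",
                "A2", "A2", "A2", "A2", "A2", "A1"],
})

baseline = count_with_iterrows(df1, df2)
print(baseline)
```

Each loop iteration scans all of df2, so the cost grows with len(df1) * len(df2), which is the reason a vectorized alternative is worth looking for.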
Best answer
Annotated code
# Merge the dataframes on sid and modtype
keys = ['sid', 'modtype']
s = df2.merge(df1[[*keys, 'date']], on=keys, suffixes=['', '_'])
# Create boolean conditions as per requirements:
# after  -> (date_, date_ + 3 days], before -> [date_ - 3 days, date_)
s['cnt_3day_after'] = s['date'].between(s['date_'], s['date_'] + pd.DateOffset(days=3), inclusive='right')
s['cnt_3day_before'] = s['date'].between(s['date_'] - pd.DateOffset(days=3), s['date_'], inclusive='left')
# group the boolean conditions by sid and modtype
# and aggregate with sum to count the number of True values
s = s.groupby(keys)[['cnt_3day_after', 'cnt_3day_before']].sum()
# Join the aggregated counts back with df1
df_out = df1.join(s, on=keys)
Result
print(df_out)
   sid servid       date modtype service  cnt_3day_after  cnt_3day_before
0  123    881 2022-07-05      A1       z               1                1
1  456    879 2022-07-02      A2       z               0                2
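The answer groups by ('sid', 'modtype'), which works for the sample because each pair appears once in df1 and has matches in df2. If df1 can contain the same pair with several dates, or pairs that never occur in df2, a variant that groups on df1's row position may be safer. The sketch below is one possible way to do that under those assumptions; the helper name `count_per_row` and the column names `row` and `ref_date` are hypothetical, not part of the original answer.

```python
import pandas as pd

def count_per_row(df1, df2, keys=("sid", "modtype"), days=3):
    """Count df2 matches per df1 row (hypothetical variant of the answer)."""
    keys = list(keys)
    window = pd.Timedelta(days=days)

    # Carry df1's row position through the merge so counts stay per row.
    left = df1[keys + ["date"]].rename(columns={"date": "ref_date"})
    left = left.reset_index(drop=True).rename_axis("row").reset_index()
    m = left.merge(df2[keys + ["date"]], on=keys, how="inner")

    # Same half-open windows as the answer: [ref - 3d, ref) and (ref, ref + 3d].
    m["cnt_3day_before"] = (m["date"] >= m["ref_date"] - window) & (m["date"] < m["ref_date"])
    m["cnt_3day_after"] = (m["date"] > m["ref_date"]) & (m["date"] <= m["ref_date"] + window)

    # Sum the booleans per df1 row; rows with no match in df2 get 0.
    counts = (
        m.groupby("row")[["cnt_3day_before", "cnt_3day_after"]]
        .sum()
        .reindex(range(len(df1)), fill_value=0)
    )
    return df1.reset_index(drop=True).join(counts)

# Illustrative data: a repeated group in df1 and a group missing from df2.
df1 = pd.DataFrame({
    "sid": ["123", "123", "999"],
    "modtype": ["A1", "A1", "A1"],
    "date": pd.to_datetime(["2022-07-05", "2022-07-06", "2022-07-01"]),
})
df2 = pd.DataFrame({
    "sid": ["123"] * 4,
    "modtype": ["A1"] * 4,
    "date": pd.to_datetime(["2022-07-03", "2022-07-05", "2022-07-06", "2022-07-30"]),
})

result = count_per_row(df1, df2)
print(result)
```

Note that the merge still pairs every df1 row with every df2 row in its group, so memory use depends on group sizes; with very large groups it may be worth processing df1 in chunks.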
Regarding python - efficiently counting occurrences of a value within a date range using groupby, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/73280289/