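I have a list of 5 million rows of customers, periods, and orders, and I need to append columns with 3-, 6-, and 12-month rolling-window lookups. Here is a sample of the data.
I have a DataFrame dftest1 (shown here in dict form):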
{'period': {0: 201810, 1: 201811, 2: 201812, 3: 201901, 4: 201902, 5: 201903, 6: 201904, 7: 201905, 8: 201906, 9: 201907, 10: 201908, 11: 201909, 12: 201910, 13: 201911, 14: 201912, 15: 202001, 16: 202002, 17: 202003, 18: 202004, 19: 202005, 20: 202006}, 'customer': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'ABC', 6: 'ABC', 7: 'ABC', 8: 'ABC', 9: 'ABC', 10: 'ABC', 11: 'ABC', 12: 'ABC', 13: 'ABC', 14: 'ABC', 15: 'ABC', 16: 'ABC', 17: 'ABC', 18: 'ABC', 19: 'ABC', 20: 'ABC'}, 'orders': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0, 10: 0.0, 11: 1.0, 12: 0.0, 13: 0.0, 14: 0.0, 15: 1.0, 16: 0.0, 17: 2.0, 18: 1.0, 19: 1.0, 20: 1.0}}
The following code runs and produces what I want, but it takes a very long time on my full dataset:
dftest1['countOfOrdersLast3months']=''
dftest1['countOfOrdersLast6months']=''
dftest1['countOfOrdersLast12months']=''
for x, y in dftest1.iterrows():
    customer = dftest1['customer'].values[x]
    currentDate = dftest1['period'].values[x]
    dftest1['countOfOrdersLast3months'].values[x] = dftest1[(dftest1['customer']==customer) & ((dftest1['period'] > currentDate-3) & (dftest1['period'] <= currentDate))]['orders'].sum()
    dftest1['countOfOrdersLast6months'].values[x] = dftest1[(dftest1['customer']==customer) & ((dftest1['period'] > currentDate-6) & (dftest1['period'] <= currentDate))]['orders'].sum()
    dftest1['countOfOrdersLast12months'].values[x] = dftest1[(dftest1['customer']==customer) & ((dftest1['period'] > currentDate-12) & (dftest1['period'] <= currentDate))]['orders'].sum()
dftest1.head()
Is there a better, faster way to do this?
Best answer
You could try something like this:
df['cumorders'] = df.groupby('customer')['orders'].cumsum()
df['countOfOrdersLast3months'] = df.groupby('customer')['cumorders'].diff(3)
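and similarly for the 6- and 12-month windows, using diff(6) and diff(12).
The first line computes the cumulative orders per customer. The second line takes the 3-step difference of cumorders within each customer, which is the number of orders over the last 3 rows, i.e. the last 3 months. This assumes the rows are sorted by period within each customer and that there is exactly one row per customer per month. The first three rows of each customer come out as NaN because there is no earlier value to subtract.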
This is what I get from your example:
    period  customer  orders  cumorders  countOfOrdersLast3months
 0  201810       ABC     0.0        0.0                       NaN
 1  201811       ABC     0.0        0.0                       NaN
 2  201812       ABC     0.0        0.0                       NaN
 3  201901       ABC     0.0        0.0                       0.0
 4  201902       ABC     0.0        0.0                       0.0
 5  201903       ABC     0.0        0.0                       0.0
 6  201904       ABC     0.0        0.0                       0.0
 7  201905       ABC     0.0        0.0                       0.0
 8  201906       ABC     0.0        0.0                       0.0
 9  201907       ABC     0.0        0.0                       0.0
10  201908       ABC     0.0        0.0                       0.0
11  201909       ABC     1.0        1.0                       1.0
12  201910       ABC     0.0        1.0                       1.0
13  201911       ABC     0.0        1.0                       1.0
14  201912       ABC     0.0        1.0                       0.0
15  202001       ABC     1.0        2.0                       1.0
16  202002       ABC     0.0        2.0                       1.0
17  202003       ABC     2.0        4.0                       3.0
18  202004       ABC     1.0        5.0                       3.0
19  202005       ABC     1.0        6.0                       4.0
20  202006       ABC     1.0        7.0                       3.0
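As a possible variation on the same idea, here is a minimal sketch that builds all three columns with a per-customer rolling sum instead of cumsum plus diff. It is not part of the answer above. The sample data and the second customer "XYZ" are made up for illustration, and the sketch assumes one row per customer per month, so that "last n rows" means "last n months". Passing min_periods=1 makes the first rows of each customer hold a partial sum rather than NaN. Note that a row-based window steps back over calendar months, whereas subtracting 3 from an integer YYYYMM period does not cross a year boundary correctly (201901 - 3 is 201898, not 201810).

```python
import pandas as pd

# Illustrative sample in the same shape as dftest1 (values are made up).
df = pd.DataFrame({
    "period":   [201911, 201912, 202001, 202002, 201912, 202001, 202002],
    "customer": ["ABC", "ABC", "ABC", "ABC", "XYZ", "XYZ", "XYZ"],
    "orders":   [1.0, 0.0, 2.0, 1.0, 1.0, 1.0, 1.0],
})

# Assumption: one row per customer per month, so a window of n rows
# covers n calendar months. Sort so each customer's rows are in period order.
df = df.sort_values(["customer", "period"]).reset_index(drop=True)

for n in (3, 6, 12):
    rolled = (
        df.groupby("customer")["orders"]
          .rolling(n, min_periods=1)   # partial sums at the start, no NaN
          .sum()
          .reset_index(level=0, drop=True)  # drop the customer level, keep row index
    )
    # Assignment aligns on the original row index.
    df[f"countOfOrdersLast{n}months"] = rolled

print(df)
```

If some customers have missing months, the row count no longer equals the month count, so the periods would need to be filled out to a complete monthly range per customer before applying either this or the cumsum/diff approach.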
Regarding "python - How to append columns to a pandas DataFrame based on a rolling-window lookup?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/65012962/