Suppose we have a df like this (a user may have several rows on the same date):
import pandas as pd

df = pd.DataFrame({
    "user_id": ["A"] * 5 + ["B"] * 5,
    "hour": [10] * 10,
    "date": ["2018-01-16", "2018-01-16", "2018-01-18", "2018-01-19", "2018-02-16",
             "2018-01-16", "2018-01-16", "2018-01-18", "2018-01-19", "2018-02-16"],
    "amount": [1] * 10,
})
df['date'] = pd.to_datetime(df['date'])
Output:
amount date hour user_id
0 1 2018-01-16 10 A
1 1 2018-01-16 10 A
2 1 2018-01-18 10 A
3 1 2018-01-19 10 A
4 1 2018-02-16 10 A
5 1 2018-01-16 10 B
6 1 2018-01-16 10 B
7 1 2018-01-18 10 B
8 1 2018-01-19 10 B
9 1 2018-02-16 10 B
I want to get aggregated rolling statistics of amount for each user_id and hour. Currently I do it like this:
def get_rolling_stats(df, rolling_interval=7):
    index_cols = ['user_id', 'hour', 'date']
    grp = df.groupby(by=['user_id', 'hour'], as_index=True, group_keys=False).rolling(window='%sD' % rolling_interval, on='date')

    def agg_grp(grp, func):
        res = grp.agg({'amount': func})
        res = res.reset_index()
        res.drop_duplicates(index_cols, inplace=True, keep='last')
        res.rename(columns={'amount': "amount_" + func}, inplace=True)
        return res

    grp1 = agg_grp(grp, "mean")
    grp2 = agg_grp(grp, "count")
    grp = grp1.merge(grp2, on=index_cols)
    return grp
So it outputs:
user_id hour date amount_mean amount_count
0 A 10 2018-01-16 1.0 1.0
1 A 10 2018-01-18 1.0 3.0
2 A 10 2018-01-19 1.0 4.0
3 A 10 2018-02-16 1.0 1.0
4 B 10 2018-01-16 1.0 1.0
5 B 10 2018-01-18 1.0 3.0
6 B 10 2018-01-19 1.0 4.0
7 B 10 2018-02-16 1.0 1.0
But I want to exclude the current date from the rolling window. So I want output like this:
user_id hour date amount_mean amount_count
0 A 10 2018-01-16 nan 0.0
1 A 10 2018-01-18 1.0 2.0
2 A 10 2018-01-19 1.0 3.0
3 A 10 2018-02-16 nan 0.0
4 B 10 2018-01-16 nan 0.0
5 B 10 2018-01-18 1.0 2.0
6 B 10 2018-01-19 1.0 3.0
7 B 10 2018-02-16 nan 0.0
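I read that the rolling method has a closed argument. But if I use it, it raises an error: ValueError: closed only implemented for datetimelike and offset based windows. I have not found any example of how to use it. Can someone shed some light on how to implement the get_rolling_stats function correctly?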
Best answer
It seems I found an example: https://pandas.pydata.org/pandas-docs/stable/computation.html#rolling-window-endpoints . All I had to do was replace:
grp = df.groupby(by=['user_id', 'hour'], as_index=True, group_keys=False).rolling(window='%sD' % rolling_interval, on='date')
with:
grp = (
    df.set_index('date')
    .groupby(by=['user_id', 'hour'], as_index=True, group_keys=False)
    .rolling(window='%sD' % rolling_interval, closed='neither')
)
With closed='neither' both window endpoints are excluded, so rows stamped with the current date no longer count toward that date's statistics.
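To make the effect of closed easier to see, here is a minimal sketch on made-up data (not the df from the question). The names s, events and per_user are illustrative only, and the amounts are chosen so that each window sum shows which rows were included. It assumes a reasonably recent pandas version; it is not a drop-in replacement for get_rolling_stats.

```python
import pandas as pd

# Illustrative data (an assumption, not the question's df):
# unique dates and distinct amounts so each sum reveals the window contents.
dates = pd.to_datetime(["2018-01-16", "2018-01-18", "2018-01-19", "2018-01-23"])
s = pd.Series([1.0, 2.0, 4.0, 8.0], index=dates)

right = s.rolling("7D").sum()                      # default closed="right": (t-7D, t]
left = s.rolling("7D", closed="left").sum()        # [t-7D, t)
neither = s.rolling("7D", closed="neither").sum()  # (t-7D, t)

# Same idea per group: the date column must be the index for offset windows here.
events = pd.DataFrame({
    "user_id": ["A"] * 4 + ["B"] * 4,
    "date": list(dates) * 2,
    "amount": [1.0, 2.0, 4.0, 8.0] * 2,
})
per_user = (
    events.set_index("date")
    .groupby("user_id")["amount"]
    .rolling("7D", closed="neither")
    .sum()
)

print(pd.DataFrame({"right": right, "left": left, "neither": neither}))
print(per_user)
```

Both closed='left' and closed='neither' drop the current timestamp. They differ only for a row lying exactly one window length earlier: on 2018-01-23, 'left' still includes 2018-01-16 while 'neither' does not. The first row of each group has an empty window, so its sum is NaN.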
Regarding "python - Pandas groupby with rolling open window", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49322597/