Python pandas: how to count rows based on a special time-range condition?

I have a data frame like this:

person  action_type        time
A           4           2014-11-10
A           4           2014-11-15
A           3           2014-11-16
A           1           2014-11-18
A           4           2014-11-19
B           4           2014-11-13
B           2           2014-11-15
B           4           2014-11-19

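I want to add a new column named "action_4" that holds, for each row, the number of rows for the same person with action_type equal to 4 in the past 7 days (not counting the row itself). The result should look like this: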

person  action_type        time      action_4
A           4           2014-11-10      0
A           4           2014-11-15      1
A           3           2014-11-16      2
A           1           2014-11-18      1
A           4           2014-11-19      1
B           4           2014-11-13      0
B           2           2014-11-15      1
B           4           2014-11-19      1

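Since the shape of my data frame is 21649900*3, please avoid for...in... loops.

Best answer

Here is my approach.

I think checking intervals based on time (e.g. 7 days) is always quite expensive, so it is better to rely on the number of observations. (Actually, recent pandas versions introduced "time-aware" rolling, but I have no experience with it...)

So my approach is to force a daily frequency for each person, and then simply count the number of rows with action_type 4 that occurred in the previous 7 days (excluding today). I have added comments in the code that should make it clear, but feel free to ask for more explanation.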

import pandas as pd
from io import StringIO

inp_str = u"""
person action_type time
A 4 2014-11-10
A 4 2014-11-15
A 3 2014-11-16
A 1 2014-11-18
A 4 2014-11-19
B 4 2014-11-13
B 2 2014-11-15
B 4 2014-11-19
"""

or_df = pd.read_csv(StringIO(inp_str), sep = " ").set_index('time')
or_df.index = pd.to_datetime(or_df.index)

# Find first and last date for each person
min_dates = or_df.groupby('person').apply(lambda x: x.index[0])
max_dates = or_df.groupby('person').apply(lambda x: x.index[-1])

# Resample each person to daily frequency so that 1 obs = 1 day
transf_df = pd.concat([el.reindex(pd.date_range(min_dates[pp], max_dates[pp], freq = 'D')) for pp, el in  or_df.groupby('person')])
# Forward fill person
transf_df.loc[:, 'person'] = transf_df['person'].ffill()
# Set a null value for action_type (possibly integer so you preserve the column type)
transf_df = transf_df.fillna({'action_type' : -1})

# For each person, count the rows with action_type 4 over the previous 7 days, excluding today
result = transf_df.groupby('person').transform(lambda x: x.rolling(7, 1).apply(lambda y: len(y[y==4])).shift(1).fillna(0))
result.columns = ['action_4']

# Bring back to the original (time, person) index and display the result
final = pd.concat([transf_df, result], axis = 1).set_index('person', append = True).loc[or_df.set_index('person', append = True).index, :]
print(final)

This gives the expected output:

                   action_type  action_4
time       person                       
2014-11-10 A               4.0       0.0
2014-11-15 A               4.0       1.0
2014-11-16 A               3.0       2.0
2014-11-18 A               1.0       1.0
2014-11-19 A               4.0       1.0
2014-11-13 B               4.0       0.0
2014-11-15 B               2.0       1.0
2014-11-19 B               4.0       1.0
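
For comparison, the "time-aware" rolling mentioned earlier in this answer can be sketched as follows. This is not the approach used above, only a minimal sketch under some assumptions: the helper column name is_4 is made up for illustration, the frame is sorted by person and time first, and the window '7D' with closed='left' is taken to mean "the 7 days before this row's date, excluding that date", which matches the daily-frequency logic above.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    "person": list("AAAAABBB"),
    "action_type": [4, 4, 3, 1, 4, 4, 2, 4],
    "time": pd.to_datetime([
        "2014-11-10", "2014-11-15", "2014-11-16", "2014-11-18", "2014-11-19",
        "2014-11-13", "2014-11-15", "2014-11-19",
    ]),
})

# Sort so that the grouped rolling result lines up with the row order
df = df.sort_values(["person", "time"]).reset_index(drop=True)

# Hypothetical helper column: 1 where action_type == 4, else 0
df["is_4"] = (df["action_type"] == 4).astype(int)

# Time-based window per person; closed='left' drops the current timestamp
counts = (
    df.set_index("time")
      .groupby("person")["is_4"]
      .rolling("7D", closed="left")
      .sum()
)

# An empty window yields NaN, so fill with 0 before casting back to int
df["action_4"] = counts.fillna(0).astype(int).to_numpy()
print(df[["person", "action_type", "time", "action_4"]])
```

Because closed='left' excludes the current timestamp entirely, several rows for the same person at the same timestamp would not count each other. Whether that is the desired behaviour depends on the data.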

The original question, "Python pandas: how to count rows based on a special time-range condition?", can be found on Stack Overflow: https://stackoverflow.com/questions/43589076/
