python - Pandas: concatenate a row after group by, then resample

I have the following dataframe:

df dataframe: 
      item   date_buy    date_sell     profit window
1     shoes  2009-12-04  2021-08-14    0.22     10
2     shoes  2009-12-05  2010-09-19    1.5      10
3     shoes  2015-05-05  2020-15-15    7.3      10
4     shoes  2009-12-09  2021-08-14    0.82     4
5     shoes  2009-12-10  2010-09-20    4.5      4
6     shoes  2015-05-11  2020-15-16    1.8      4
7     hat    2009-12-04  2021-08-14    1.2      10
8     hat    2009-12-05  2010-09-19    2.25     10
9     hat    2015-05-05  2020-15-15    4.3      10
10    hat    2009-12-09  2021-08-14    3.2      4
11    hat    2009-12-10  2010-09-20    9.4      4
12    hat    2015-05-11  2020-15-16    1.8      4

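What I need to do is resample the data up to today using date_buy as the key, keeping the data separated by item and window. What I did was group the data by item and window, and for each group append one extra row that is identical to the group's last row except that its date_buy field holds today's date, and then resample. However, this runs very slowly because I have thousands of records.

Here is my code: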

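    import pandas as pd

    def resample(df, today):
        # Append a copy of the group's last row, re-indexed to today's date
        last = df[df.index == df.index.max()]
        df = pd.concat([df, last.rename(index={df.index.max(): pd.to_datetime(today)})])
        # Upsample to business days, forward-filling the gaps
        df = df.asfreq('B', method='ffill')
        return df

    data = data.set_index(pd.to_datetime(data['date_buy']))
    resampled_data = data.groupby(['item', 'window']).apply(lambda x: resample(x, pd.Timestamp.now()))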

The result is correct and looks like this (and similarly for the hat item):

df dataframe: 
      item   date_buy    date_sell     profit window
1     shoes  2009-12-04  2021-08-14    0.22     10
2     shoes  2009-12-05  2010-09-19    1.5      10
.
2     shoes  2015-05-04  2010-09-19    1.5      10
3     shoes  2015-05-05  2020-15-15    7.3      10
.
.
3     shoes  2022-09-15  2020-15-15    7.3      10
4     shoes  2009-12-09  2021-08-14    0.82     4
5     shoes  2009-12-10  2010-09-20    4.5      4
.
5     shoes  2015-05-10  2010-09-20    4.5      4
6     shoes  2015-05-11  2020-15-16    1.8      4
.
.
6     shoes  2022-09-15  2020-15-16    1.8      4

This snippet takes about 30 seconds to run, and I would like to make it faster. Am I missing some pandas best practice that would speed it up?
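
For reference, the per-group step described in the question can be seen on a tiny example. This is only a sketch: the two-row group and the fixed "today" below are made up for illustration and are not taken from the asker's data.

```python
import pandas as pd

# Hypothetical single group: two purchases, indexed by date_buy
group = pd.DataFrame(
    {"profit": [0.22, 1.5]},
    index=pd.to_datetime(["2009-12-04", "2009-12-09"]),
)

# Assumed "today", fixed so the example is reproducible
today = pd.Timestamp("2009-12-11")

# Copy the last row and re-index it to today
last = group.iloc[[-1]].copy()
last.index = pd.DatetimeIndex([today])
extended = pd.concat([group, last])

# Upsample to business days; each gap takes the most recent earlier row
upsampled = extended.asfreq("B", method="ffill")
print(upsampled)
```

Doing this once per group inside .apply means one concat and one reindex per group, which is likely where most of the time goes when there are many groups.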

Best answer

I might have a solution that avoids .apply:

Step 1 - Create a dataframe end_data that contains the final date_buy entry for each item-window group:

today = pd.Timestamp.today().floor('D')
end_data = (
    data
    .groupby(['item', 'window'], as_index=False)
    .agg({'date_buy': lambda c: today})
)

For your example, it looks like this:

    item  window   date_buy
0    hat       4 2022-09-15
1    hat      10 2022-09-15
2  shoes       4 2022-09-15
3  shoes      10 2022-09-15
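
As a possible alternative for this step, the same kind of frame could be built without a lambda aggregation by taking the distinct item-window pairs and assigning today's date. This is a sketch rather than the code from the answer; the small sample frame and the fixed value of today are assumptions made for reproducibility.

```python
import pandas as pd

# Hypothetical sample with three distinct item-window groups
data = pd.DataFrame({
    "item": ["shoes", "shoes", "hat", "hat"],
    "window": [10, 4, 10, 10],
    "date_buy": pd.to_datetime(
        ["2009-12-04", "2009-12-09", "2009-12-04", "2009-12-05"]
    ),
})

# Fixed here for reproducibility; in practice pd.Timestamp.today().floor('D')
today = pd.Timestamp("2022-09-15")

# One row per item-window pair, each carrying today's date as date_buy
end_data = (
    data[["item", "window"]]
    .drop_duplicates()
    .assign(date_buy=today)
    .sort_values(["item", "window"])
    .reset_index(drop=True)
)
print(end_data)
```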

Step 2:

data['date_buy'] = pd.to_datetime(data['date_buy'])  # Just in case
data = (
    pd.concat([data, end_data])
    .set_index('date_buy', drop=True).sort_index()
    .groupby(['item', 'window'], as_index=False).resample('B').ffill()
    .ffill()  # equivalent to fillna(method='ffill'), which is deprecated in recent pandas
    .droplevel(0).reset_index()
)

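Convert the date_buy column to datetime (it may already be one).
Append end_data to the end of data.
Use the date_buy column as the index (dropping the column), then sort the index. Sorting is only needed if date_buy is not already in ascending order within each item-window block.
Now group the result by item-window, .resample('B') the groups to upsample as you require, and apply .ffill to the result.
Then fill the remaining NaN/NaT values by forward filling.
Finally, drop the first index level and reset the upsampled date_buy index to a column.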
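
The steps above can be sketched end to end on a small, made-up sample. Note two deliberate differences from the answer's code, both assumptions on my part: the value column is selected explicitly after the groupby, and the group keys are kept in the index and restored with reset_index. As far as I know, whether the grouping columns are kept in the output of groupby(...).resample(...) depends on the pandas version, and this variant avoids relying on that behaviour.

```python
import pandas as pd

# Hypothetical sample: two item-window groups
data = pd.DataFrame({
    "item": ["shoes", "shoes", "hat"],
    "window": [10, 10, 10],
    "date_buy": pd.to_datetime(["2009-12-04", "2009-12-09", "2009-12-07"]),
    "profit": [0.22, 1.5, 1.2],
})

# Assumed "today" (a Friday), fixed so the example is reproducible
today = pd.Timestamp("2009-12-11")

# Step 1: one end row per group, dated today (profit is NaN for now)
end_data = data[["item", "window"]].drop_duplicates().assign(date_buy=today)

# Step 2: append, index by date, upsample each group to business days
out = (
    pd.concat([data, end_data])
    .set_index("date_buy")
    .sort_index()
    .groupby(["item", "window"])[["profit"]]
    .resample("B")
    .ffill()
    .reset_index()
    .ffill()  # the end rows carry NaN; take the previous row's values
)
print(out)
```

The final .ffill() runs over the whole frame, as in the answer. That is safe here because the only NaN rows are the end rows, and each of them is preceded by a row from the same group.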

The result for your example looks like this:

        date_buy   item   date_sell  profit  window
0     2009-12-09    hat  2021-08-14     3.2       4
1     2009-12-10    hat  2010-09-20     9.4       4
2     2009-12-11    hat  2010-09-20     9.4       4
3     2009-12-14    hat  2010-09-20     9.4       4
4     2009-12-15    hat  2010-09-20     9.4       4
...          ...    ...         ...     ...     ...
13329 2022-09-09  shoes  2020-15-15     7.3      10
13330 2022-09-12  shoes  2020-15-15     7.3      10
13331 2022-09-13  shoes  2020-15-15     7.3      10
13332 2022-09-14  shoes  2020-15-15     7.3      10
13333 2022-09-15  shoes  2020-15-15     7.3      10

[13334 rows x 5 columns]

(The date_sell column contains invalid dates.)
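
One way to locate such values, sketched here on three of the date_sell strings from the example, is to parse with errors="coerce" and look at what became NaT. The variable names are illustrative only.

```python
import pandas as pd

# Three date_sell values from the example; the last has month 15
date_sell = pd.Series(["2021-08-14", "2010-09-19", "2020-15-15"])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(date_sell, format="%Y-%m-%d", errors="coerce")

# Original strings that failed to parse
invalid = date_sell[parsed.isna()]
print(invalid.tolist())
```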

For "python - Pandas: concatenate a row after group by, then resample", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/73727917/
