python - Pandas groupby with variable time intervals

I am struggling to find the right way to group a DataFrame under certain constraints. I have the following DataFrame:

           start_dt  machine     benchmark      value1  value2  value3
2021-06-07 07:32:01  A           bench1         0       0       0
2021-06-07 07:32:37  A           bench1         0       0       0
2021-06-07 07:33:13  A           bench1         0       0       0
2021-06-07 07:33:49  A           bench1         0       0       0
2021-06-07 07:34:26  A           bench1         0       0       0
2021-06-07 08:30:26  A           bench1         0       0       10
2021-06-07 11:12:21  A           bench1         0       0       6
2021-06-07 12:05:21  A           bench1         1       0       10
2021-06-17 12:28:57  A           bench2         0       0       0
2021-06-17 12:29:29  A           bench2         0       0       0
2021-06-17 12:33:09  A           bench2         3       0       1
2021-06-17 12:33:48  A           bench2         3       0       1
2021-06-17 12:35:17  A           bench2         0       0       0

I want to group by the machine, benchmark, and start_dt columns. However, there is a constraint on start_dt: its groups must fall on 1-hour blocks. I tried the following command:

df.groupby(["machine", "benchmark", pd.Grouper(key="start_dt", freq="1h", sort=True, origin="start")]).sum()

However, this bins the rows relative to the first datetime across all benchmarks, which is not what I want. I want something like the following, where end_dt is start_dt + 1h.

machine benchmark          start_dt               end_dt   value1  value2  value3              
A       bench1  2021-06-07 07:32:01  2021-06-07 08:32:01   0          0    10
                2021-06-07 11:12:21  2021-06-07 12:12:21   1          0    16
        bench2  2021-06-17 12:28:57  2021-06-17 13:28:57   6          0    2

For example, machine A with benchmark bench1 has at least two time intervals:

2021-06-07 07:32:01 2021-06-07 08:32:01

2021-06-07 11:12:21 2021-06-07 12:12:21

There is nothing in between them, so I want to keep the intervals as they appear in the column rather than the ones pandas Grouper gives me. Is that possible?

Edit:

The timestamps are unique.
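To make the mismatch concrete, here is a minimal sketch using four of the bench1 timestamps from the table above. It assumes pandas 1.1 or later (where `origin="start"` is available) and only illustrates how `pd.Grouper` lays a fixed 1-hour grid from the first timestamp. On that grid, 11:12:21 and 12:05:21 fall into separate bins (starting at 10:32:01 and 11:32:01), whereas the desired output keeps them together in a window that starts at 11:12:21.

```python
import pandas as pd

# Four of the bench1 rows from the question (start_dt and value3 only).
df = pd.DataFrame({
    "start_dt": pd.to_datetime([
        "2021-06-07 07:32:01",
        "2021-06-07 08:30:26",
        "2021-06-07 11:12:21",
        "2021-06-07 12:05:21",
    ]),
    "value3": [0, 10, 6, 10],
})

# origin="start" anchors a fixed 1h grid at the first timestamp (07:32:01);
# every later bin edge is that timestamp plus a whole number of hours.
sizes = df.groupby(pd.Grouper(key="start_dt", freq="1h", origin="start")).size()

# Keep only the bins that actually contain rows.
non_empty = sizes[sizes > 0]
print(non_empty)
```

The bin edges never restart at 11:12:21, which is why a fixed-frequency Grouper alone cannot produce the intervals asked for.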

Best answer

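Yes, it is possible. You just need a custom grouping function that handles the irregular intervals in your use case. In the solution below, I first create a new column, end_dt, which is later used as the innermost grouping key. To build it, get_end_times() is applied to the start_dt column of each group (each machine/benchmark combination) and calls the helper run_calc(). That helper takes the first start_dt in the slice it receives and sets the window end one hour later. It then checks which elements fall inside that window and returns their end_dt values, which are written back into the group. This repeats until every start_dt has been assigned an end_dt (checked via (~f).all()). The full implementation follows: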

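import numpy as np
import pandas as pd

# Assumes df['start_dt'] is datetime64 and sorted within each group.
def run_calc(x):
    # Rows that start more than one hour after the first row of the slice.
    later = (x - x.iloc[0]).dt.total_seconds() > 3600
    # Rows inside the window all share one end time: first start + 1 hour.
    x[~later] = x.iloc[0] + np.timedelta64(1, 'h')
    return x, later

def get_end_times(group):
    group = group.copy()  # do not modify the data passed in by transform
    # f marks rows that still need an end_dt; it always spans the full index.
    f = pd.Series(True, index=group.index)
    while True:
        new, later = run_calc(group[f].copy())
        group.loc[new.index] = new
        # Re-align the mask to the whole group so the next group[f] is valid
        # even when a group contains three or more windows.
        f = later.reindex(group.index, fill_value=False)
        if (~f).all():
            break
    return group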

df['end_dt'] = df.groupby(['machine','benchmark'])['start_dt'].transform(get_end_times)

df.groupby(['machine','benchmark','end_dt']).agg({'start_dt': 'first', 'value1': 'sum', 'value2': 'sum', 'value3': 'sum'}) \
    .reset_index().set_index(['machine','benchmark','start_dt','end_dt'])

Output:

                                                           value1  value2  \
machine benchmark start_dt            end_dt                                
A       bench1    2021-06-07 07:32:01 2021-06-07 08:32:01       0       0   
                  2021-06-07 11:12:21 2021-06-07 12:12:21       1       0   
        bench2    2021-06-17 12:28:57 2021-06-17 13:28:57       6       0   

                                                           value3  
machine benchmark start_dt            end_dt                       
A       bench1    2021-06-07 07:32:01 2021-06-07 08:32:01      10  
                  2021-06-07 11:12:21 2021-06-07 12:12:21      16  
        bench2    2021-06-17 12:28:57 2021-06-17 13:28:57       2
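The same windowing idea can also be sketched with a plain loop, which may be easier to follow than the mask-based version. This is an alternative illustration, not the answer's method: the helper name `assign_window_start` and the `window_start` column are hypothetical, and the data is a subset of the question's rows (only all-zero rows are left out, so the sums are unchanged). A new window opens at the first timestamp lying more than one hour after the current window's start.

```python
import pandas as pd

def assign_window_start(starts, width=pd.Timedelta(hours=1)):
    """Return, for each timestamp, the start of the window it belongs to."""
    ordered = starts.sort_values()
    window_start = None
    labels = []
    for ts in ordered:
        # Open a new window when ts lies more than `width` after the current start.
        if window_start is None or ts - window_start > width:
            window_start = ts
        labels.append(window_start)
    # Restore the caller's row order so transform can align the result.
    return pd.Series(labels, index=ordered.index).reindex(starts.index)

# Subset of the question's data (rows with all-zero values omitted).
df = pd.DataFrame({
    "start_dt": pd.to_datetime([
        "2021-06-07 07:32:01", "2021-06-07 08:30:26",
        "2021-06-07 11:12:21", "2021-06-07 12:05:21",
        "2021-06-17 12:28:57", "2021-06-17 12:33:09", "2021-06-17 12:33:48",
    ]),
    "machine": "A",
    "benchmark": ["bench1"] * 4 + ["bench2"] * 3,
    "value1": [0, 0, 0, 1, 0, 3, 3],
    "value2": [0, 0, 0, 0, 0, 0, 0],
    "value3": [0, 10, 6, 10, 0, 1, 1],
})

# Windows are computed independently for each machine/benchmark pair.
df["window_start"] = (
    df.groupby(["machine", "benchmark"])["start_dt"].transform(assign_window_start)
)
df["end_dt"] = df["window_start"] + pd.Timedelta(hours=1)

result = (
    df.groupby(["machine", "benchmark", "window_start", "end_dt"])
      [["value1", "value2", "value3"]]
      .sum()
)
print(result)
```

The loop is linear in the number of rows per group, and the window start is already available as a group key, so no 'first' aggregation is needed.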

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/68154495/
