python - Calculate the average number of days between two process steps

Tags: python pandas dataframe

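I have a problem. I want to calculate the time between two process steps (taskN+1_start - taskN_end) and average it. How can I do that, and how can I then combine the result with the average duration of each task?

I tried a few approaches to calculate the time between two process steps. Calculating the average duration of each task already works for me.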

import pandas as pd

d = {'id': [1, 1, 1, 1, 1, 1,
            2, 2, 2, 2, 2, 2],
     'step': ['Task1_Start', 'Task1_End1', 'Task2_Start', 'Task2_End', 'Task3_Start', 'Task3_End',
              'Task1_Start', 'Task1_End1', 'Task2_Start', 'Task2_End', 'Task3_Start', 'Task3_End'],
     'timestamp': ['2023-01-01', '2023-01-05', '2023-01-10', '2023-01-12', '2023-02-12', '2023-02-14',
                   '2023-01-01', '2023-01-05', '2023-01-10', '2023-01-12', '2023-01-15', '2023-02-16']}
df = pd.DataFrame(data=d)
print(df)

[OUT]
    id         step   timestamp
0    1  Task1_Start  2023-01-01
1    1   Task1_End1  2023-01-05
2    1  Task2_Start  2023-01-10
3    1    Task2_End  2023-01-12
4    1  Task3_Start  2023-02-12
5    1    Task3_End  2023-02-14
6    2  Task1_Start  2023-01-01
7    2   Task1_End1  2023-01-05
8    2  Task2_Start  2023-01-10
9    2    Task2_End  2023-01-12
10   2  Task3_Start  2023-01-15
11   2    Task3_End  2023-02-16

df['task'] = df['step'].str.split('_').str[0]
df['timestamp'] = pd.to_datetime(df['timestamp'])

df['duration'] = df.groupby('task')['timestamp'].diff().dt.days

avg_duration = df.groupby('task')['duration'].mean().reset_index()
print(avg_duration)
[OUT]
    task  duration
0  Task1  1.333333
1  Task2  0.666667
2  Task3  1.333333
df['inter_task_duration'] = df.groupby('id')['timestamp'].diff().dt.days

avg_inter_task_duration = df.groupby('step')['inter_task_duration'].mean().reset_index()
print(avg_inter_task_duration)

[OUT]
          step  inter_task_duration
0   Task1_End1                  4.0
1  Task1_Start                  NaN
2    Task2_End                  2.0
3  Task2_Start                  5.0
4    Task3_End                 17.0
5  Task3_Start                 17.0

# calculate the avg days between two process steps

What I want

    task          duration
0  Task1          1.333333
1  Task2          0.666667
2  Task3          1.333333
3  Task1_to_Task2 ...
4  Task2_to_Task3 ...

Best answer

I would change the logic slightly and do the following:

df['timestamp'] = pd.to_datetime(df['timestamp'])

s = df['step'].str.extract('([^_]+)_(Start|End)')

out = (df
   # the new column must be named step2 to match columns='step2' in pivot
   .assign(task=s[0], step2=s[1])
   .pivot(index=['id', 'task'], columns='step2', values='timestamp')
   .assign(task_duration=lambda d: d['End']-d['Start'],
           duration_from_previous_step=lambda d: d['Start']
                                                -d.groupby(level='id')['End'].shift())
   .rename_axis(columns=None).reset_index()
)

Output:

   id   task        End      Start task_duration duration_from_previous_step
0   1  Task1 2023-01-05 2023-01-01        4 days                         NaT
1   1  Task2 2023-01-12 2023-01-10        2 days                      5 days
2   1  Task3 2023-02-14 2023-02-12        2 days                     31 days
3   2  Task1 2023-01-05 2023-01-01        4 days                         NaT
4   2  Task2 2023-01-12 2023-01-10        2 days                      5 days
5   2  Task3 2023-02-16 2023-01-15       32 days                      3 days

You can then aggregate per task:

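# select the two duration columns before averaging so that
# id, Start and End are not aggregated as well
(out.groupby('task')[['task_duration', 'duration_from_previous_step']]
    .mean().stack()
)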

Output:

task                              
Task1  task_duration                  4 days
Task2  task_duration                  2 days
       duration_from_previous_step    5 days
Task3  task_duration                 17 days
       duration_from_previous_step   17 days
dtype: timedelta64[ns]
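
The averages above are timedeltas, while the question asks for plain day counts. A minimal sketch of the conversion follows, assuming the averaged values shown above; the names `avg` and `avg_days` are illustrative and not part of the original answer.

```python
import pandas as pd

# Averages copied from the output above (Task1 has no previous step).
avg = pd.Series(
    pd.to_timedelta([4, 2, 5, 17, 17], unit="D"),
    index=pd.MultiIndex.from_tuples([
        ("Task1", "task_duration"),
        ("Task2", "task_duration"),
        ("Task2", "duration_from_previous_step"),
        ("Task3", "task_duration"),
        ("Task3", "duration_from_previous_step"),
    ]),
)

# Dividing a timedelta Series by a one-day Timedelta yields float days,
# and keeps fractional days that .dt.days would truncate.
avg_days = avg / pd.Timedelta(days=1)
print(avg_days)
```

Dividing by `pd.Timedelta(days=1)` is used here instead of `.dt.days` because an average over several ids may not be a whole number of days.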

groupby.shift

If your rows are perfectly sorted, you can also use:

df['timestamp'] = pd.to_datetime(df['timestamp'])

g = df.groupby('id')

out = (df
    .assign(duration=df['timestamp'].sub(g['timestamp'].shift()),
            step=lambda d: (df['step']+'/'+g['step'].shift()).str.replace(
                 r'([^_]+)[^/]*/([^_]+)[^/]*',
                 lambda m: m.group(1) if m.group(1)==m.group(2) else f"{m.group(2)}_to_{m.group(1)}",
                 regex=True)
           )
   [['id', 'step', 'duration']].dropna(subset=['duration'])
)

Output:

    id            step duration
1    1           Task1   4 days
2    1  Task1_to_Task2   5 days
3    1           Task2   2 days
4    1  Task2_to_Task3  31 days
5    1           Task3   2 days
7    2           Task1   4 days
8    2  Task1_to_Task2   5 days
9    2           Task2   2 days
10   2  Task2_to_Task3   3 days
11   2           Task3  32 days
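
The table above still has one row per id, whereas the question asks for one average per label. A possible way to get there is sketched below, under the assumption that sorting by `id` and `timestamp` puts the steps in process order. It builds the labels with `Series.where` instead of the regex replacement used above, and the names `task`, `prev_task`, `label` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1] * 6 + [2] * 6,
    'step': ['Task1_Start', 'Task1_End1', 'Task2_Start',
             'Task2_End', 'Task3_Start', 'Task3_End'] * 2,
    'timestamp': pd.to_datetime([
        '2023-01-01', '2023-01-05', '2023-01-10', '2023-01-12', '2023-02-12', '2023-02-14',
        '2023-01-01', '2023-01-05', '2023-01-10', '2023-01-12', '2023-01-15', '2023-02-16']),
})

# Assumption: ordering by timestamp within each id gives the process order.
df = df.sort_values(['id', 'timestamp'])
g = df.groupby('id')

# Task name of the current row and of the previous row within the same id.
task = df['step'].str.split('_').str[0]
prev_task = g['step'].shift().str.split('_').str[0]

# Same task on both rows: a task duration. Otherwise: a transition.
label = task.where(task == prev_task, prev_task + '_to_' + task)

# Days elapsed since the previous row of the same id (NaN on each first row).
days = g['timestamp'].diff().dt.days

result = (
    pd.DataFrame({'task': label, 'duration': days})
    .dropna()
    .groupby('task', as_index=False)['duration'].mean()
)
print(result)
```

Because the differences are taken within each `id`, these averages (4, 5, 2, 17 and 17 days) differ from the 1.333333 and 0.666667 values in the question, which were computed with a diff grouped by task across both ids.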

Regarding "python - Calculate the average number of days between two process steps", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76253420/
