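I have a question. I want to calculate the time between two process steps (taskN+1_start - taskN_end) and take the average. How can I do that, and then combine the result with the average number of days per task?
I have tried a few things to calculate the time between two process steps. The average duration of each task works well for me.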
import pandas as pd
d = {'id': [1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2],
'step': ['Task1_Start', 'Task1_End1', 'Task2_Start', 'Task2_End', 'Task3_Start', 'Task3_End',
'Task1_Start', 'Task1_End1', 'Task2_Start', 'Task2_End', 'Task3_Start', 'Task3_End',],
'timestamp': ['2023-01-01', '2023-01-05', '2023-01-10', '2023-01-12', '2023-02-12', '2023-02-14',
'2023-01-01', '2023-01-05', '2023-01-10', '2023-01-12', '2023-01-15', '2023-02-16',]}
df = pd.DataFrame(data=d,)
[OUT]
id step timestamp
0 1 Task1_Start 2023-01-01
1 1 Task1_End1 2023-01-05
2 1 Task2_Start 2023-01-10
3 1 Task2_End 2023-01-12
4 1 Task3_Start 2023-02-12
5 1 Task3_End 2023-02-14
6 2 Task1_Start 2023-01-01
7 2 Task1_End1 2023-01-05
8 2 Task2_Start 2023-01-10
9 2 Task2_End 2023-01-12
10 2 Task3_Start 2023-01-15
11 2 Task3_End 2023-02-16
df['task'] = df['step'].str.split('_').str[0]
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['duration'] = df.groupby('task')['timestamp'].diff().dt.days
avg_duration = df.groupby('task')['duration'].mean().reset_index()
print(avg_duration)
[OUT]
task duration
0 Task1 1.333333
1 Task2 0.666667
2 Task3 1.333333
df['inter_task_duration'] = df.groupby('id')['timestamp'].diff().dt.days
avg_inter_task_duration = df.groupby('step')['inter_task_duration'].mean().reset_index()
print(avg_inter_task_duration)
[OUT]
step inter_task_duration
0 Task1_End1 4.0
1 Task1_Start NaN
2 Task2_End 2.0
3 Task2_Start 5.0
4 Task3_End 17.0
5 Task3_Start 17.0
# calculate the avg days between two process steps
What I want:
task duration
0 Task1 1.333333
1 Task2 0.666667
2 Task3 1.333333
3 Task1_to_Task2 ...
4 Task2_to_Task3 ...
Best answer
I would change the logic slightly and do the following:
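df['timestamp'] = pd.to_datetime(df['timestamp'])
# split e.g. "Task1_End1" into the task name ("Task1") and the phase ("End")
s = df['step'].str.extract('([^_]+)_(Start|End)')
out = (df
   .assign(task=s[0], step2=s[1])
   # one row per (id, task), with the Start and End timestamps as columns
   .pivot(index=['id', 'task'], columns='step2', values='timestamp')
   .assign(task_duration=lambda d: d['End']-d['Start'],
           # gap between this task's Start and the previous task's End, per id
           duration_from_previous_step=lambda d: d['Start']
                                       -d.groupby(level='id')['End'].shift())
   .rename_axis(columns=None).reset_index()
)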
Output:
id task End Start task_duration duration_from_previous_step
0 1 Task1 2023-01-05 2023-01-01 4 days NaT
1 1 Task2 2023-01-12 2023-01-10 2 days 5 days
2 1 Task3 2023-02-14 2023-02-12 2 days 31 days
3 2 Task1 2023-01-05 2023-01-01 4 days NaT
4 2 Task2 2023-01-12 2023-01-10 2 days 5 days
5 2 Task3 2023-02-16 2023-01-15 32 days 3 days
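As a small illustration of the `str.extract` step used above, the sketch below runs the same pattern on three of the step names from the question. The variable names `steps` and `parts` are only for this example. It shows that the irregular name `Task1_End1` is still reduced to the phase `End`, which is why the pivot ends up with exactly two columns.

```python
import pandas as pd

# Three step names taken from the question's sample data
steps = pd.Series(['Task1_Start', 'Task1_End1', 'Task2_Start'])

# Group 0: everything before the first underscore (the task name)
# Group 1: the literal phase, "Start" or "End" (trailing characters are ignored)
parts = steps.str.extract(r'([^_]+)_(Start|End)')
print(parts)
```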
You can then always compute the averages per task:
# select the two duration columns first so that only they are averaged
(out.groupby('task')[['task_duration', 'duration_from_previous_step']]
    .mean().stack()
)
Output:
task
Task1 task_duration 4 days
Task2 task_duration 2 days
duration_from_previous_step 5 days
Task3 task_duration 17 days
duration_from_previous_step 17 days
dtype: timedelta64[ns]
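The per-task averages above (4, 2 and 17 days) differ from the 1.333333 and 0.666667 reported in the question. The reason appears to be that `groupby('task')` alone lets `diff()` also subtract across different ids, which introduces negative differences into the mean. A minimal sketch, assuming the same sample data as in the question, that groups by both `id` and `task` so each difference stays within one process instance:

```python
import pandas as pd

d = {'id': [1] * 6 + [2] * 6,
     'step': ['Task1_Start', 'Task1_End1', 'Task2_Start',
              'Task2_End', 'Task3_Start', 'Task3_End'] * 2,
     'timestamp': ['2023-01-01', '2023-01-05', '2023-01-10',
                   '2023-01-12', '2023-02-12', '2023-02-14',
                   '2023-01-01', '2023-01-05', '2023-01-10',
                   '2023-01-12', '2023-01-15', '2023-02-16']}
df = pd.DataFrame(d)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['task'] = df['step'].str.split('_').str[0]

# Grouping by id AND task keeps diff() from subtracting across ids
df['duration'] = df.groupby(['id', 'task'])['timestamp'].diff().dt.days
avg_duration = df.groupby('task')['duration'].mean().reset_index()
print(avg_duration)
```

With this grouping the Start rows get NaN and only the End rows carry a duration, so the mean per task is taken over one value per id.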
With groupby.shift
If your rows are already sorted by id and in chronological order within each id, you can also use:
df['timestamp'] = pd.to_datetime(df['timestamp'])
g = df.groupby('id')
out = (df
   .assign(
       # time elapsed since the previous row of the same id
       duration=df['timestamp'].sub(g['timestamp'].shift()),
       # join the current and previous step names, then reduce the pair to
       # "TaskN" (same task) or "TaskN_to_TaskM" (transition between tasks)
       step=lambda d: (df['step']+'/'+g['step'].shift()).str.replace(
           r'([^_]+)[^/]*/([^_]+)[^/]*',
           lambda m: m.group(1) if m.group(1)==m.group(2) else f"{m.group(2)}_to_{m.group(1)}",
           regex=True)
   )
   [['id', 'step', 'duration']].dropna(subset=['duration'])
)
Output:
id step duration
1 1 Task1 4 days
2 1 Task1_to_Task2 5 days
3 1 Task2 2 days
4 1 Task2_to_Task3 31 days
5 1 Task3 2 days
7 2 Task1 4 days
8 2 Task1_to_Task2 5 days
9 2 Task2 2 days
10 2 Task2_to_Task3 3 days
11 2 Task3 32 days
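To get from these per-id rows to the single table asked for in the question, the durations can be averaged per label. The following is a sketch under a few assumptions rather than part of the original answer: it sorts the rows first (assuming chronological order within an id is the process order and that no two steps share a timestamp), it builds the labels with `Series.where` instead of the regex replacement, and the names `label`, `durations` and `summary` are illustrative.

```python
import pandas as pd

d = {'id': [1] * 6 + [2] * 6,
     'step': ['Task1_Start', 'Task1_End1', 'Task2_Start',
              'Task2_End', 'Task3_Start', 'Task3_End'] * 2,
     'timestamp': ['2023-01-01', '2023-01-05', '2023-01-10',
                   '2023-01-12', '2023-02-12', '2023-02-14',
                   '2023-01-01', '2023-01-05', '2023-01-10',
                   '2023-01-12', '2023-01-15', '2023-02-16']}
df = pd.DataFrame(d)
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Assumption: chronological order within an id is the process order
df = df.sort_values(['id', 'timestamp']).reset_index(drop=True)

g = df.groupby('id')
task = df['step'].str.split('_').str[0]
prev_task = g['step'].shift().str.split('_').str[0]

# Same task as the previous row -> "TaskN"; otherwise "TaskN_to_TaskM".
# The first row of each id has no predecessor and ends up as NaN.
label = task.where(task == prev_task, prev_task + '_to_' + task)

durations = pd.DataFrame({
    'step': label,
    'duration': (df['timestamp'] - g['timestamp'].shift()).dt.days,
}).dropna()

summary = durations.groupby('step')['duration'].mean().reset_index()

# List the plain tasks first and the transitions after them
summary = (summary
           .sort_values('step', key=lambda s: s.str.contains('_to_'),
                        kind='stable')
           .reset_index(drop=True))
print(summary)
```

The durations are converted to whole days with `.dt.days` so the result matches the float columns shown in the question. Keep the `timedelta64` values instead if sub-day precision matters.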
Regarding "python - calculate the average number of days between two process steps", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76253420/