python - How to combine rows in groupby with several conditions?

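I want to combine rows of a pandas df using the following logic:

The dataframe is grouped by user
Rows are sorted by start_at_min
Rows are combined when:

Case A: if start_at_min <= 200:

row2[start_at_min] - row1[stop_at_min] < 5
(e.g. 101 - 100 = 1 -> combine; 200 - 100 = 100 -> don't combine)

Case B: if 200 < start_at_min < 400:

the threshold changes to 3

Case C: if start_at_min > 400:

never combine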

Example df

   user  start_at_min  stop_at_min
0     1           100          150  
1     1           152          201 #row0 with row1 combine
2     1           205          260 #row1 with row2 NO -> start_at_min above 200 -> threshold = 3
3     2            65          100 #no
4     2           200          265 #no
5     2           300          451 #no
6     2           452          460 #no -> start_at_min above 400-> never combine

Expected output:

   user  start_at_min  stop_at_min
0     1           100          201 #row0 with row1 combined
2     1           205          260 #row1 with row2 NO -> start_at_min above 200 -> threshold = 3
3     2            65          100 #no
4     2           200          265 #no
5     2           300          451 #no
6     2           452          460 #no -> start_at_min above 400-> never combine

I have written the function combine_rows, which takes 2 Series and applies this logic

def combine_rows(s1: pd.Series, s2: pd.Series):
    # take 2 rows and combine them if start_at_min row2 - stop_at_min row1 < 5
    if s2['start_at_min'] - s1['stop_at_min'] < 5:
        return pd.Series({
            'user': s1['user'],
            'start_at_min': s1['start_at_min'],
            'stop_at_min': s2['stop_at_min']
        })
    else:
        return pd.concat([s1, s2], axis=1).T

However, I am unable to apply this function to the dataframe. This is my attempt:

df.groupby('user').sort_values(by=['start_at_min']).apply(combine_rows) # this does not work

Full code:

import pandas as pd 
import numpy as np


df = pd.DataFrame({
    "user"       :  [1, 1, 2,2],
    'start_at_min': [60, 101, 65, 200], 
    'stop_at_min' : [100, 135, 100, 265] 
})

def combine_rows(s1: pd.Series, s2: pd.Series):
    # take 2 rows and combine them if start_at_min row2 - stop_at_min row1 < 5
    if s2['start_at_min'] - s1['stop_at_min'] < 5:
        return pd.Series({
            'user': s1['user'],
            'start_at_min': s1['start_at_min'],
            'stop_at_min': s2['stop_at_min']
        })
    else:
        return pd.concat([s1, s2], axis=1).T

df.groupby('user').sort_values(by=['start_at_min']).apply(combine_rows) # this does not work
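
The attempt fails for two reasons: a DataFrameGroupBy object has no sort_values method, and groupby.apply hands each group to the function as a whole DataFrame, not as pairs of rows. One possible way around both, sketched below under the single threshold of 5 used in combine_rows, is to sort first and then walk each user's rows in order. The names merge_group and threshold are illustrative and not part of the original post; taking the max of the two stop values is also an assumption (combine_rows simply keeps the second row's stop).

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 2, 2],
    "start_at_min": [60, 101, 65, 200],
    "stop_at_min": [100, 135, 100, 265],
})

def merge_group(g: pd.DataFrame, threshold: int = 5) -> pd.DataFrame:
    """Merge consecutive rows of one user's group (rows must be sorted)."""
    merged = []  # list of [start, stop] intervals built so far
    for start, stop in zip(g["start_at_min"], g["stop_at_min"]):
        if merged and start - merged[-1][1] < threshold:
            # gap below threshold: extend the previous interval
            # (assumption: keep the larger stop value)
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            # first row of the group, or gap too large: start a new interval
            merged.append([start, stop])
    return pd.DataFrame(merged, columns=["start_at_min", "stop_at_min"])

# Sort the DataFrame itself, then group and process each group explicitly.
parts = []
for user, g in df.sort_values(["user", "start_at_min"]).groupby("user"):
    part = merge_group(g)
    part.insert(0, "user", user)
    parts.append(part)

looped = pd.concat(parts, ignore_index=True)
print(looped)
```

A plain loop like this is easy to reason about but scales poorly on large frames; the vectorized approach in the answer avoids the Python-level iteration.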

Best answer

Version 1: one condition

Perform a custom groupby.agg:

threshold = 5
# if the successive stop/start per group are above threshold
# start a new group
group = (df['start_at_min']
         .sub(df.groupby('user')['stop_at_min'].shift())
         .ge(threshold).cumsum()
        )

# groupby.agg
out = (df.groupby(['user', group], as_index=False)
         .agg({'start_at_min': 'min',
               'stop_at_min': 'max'
              })
      )

Output:

   user  start_at_min  stop_at_min
0     1            60          135
1     2            65          100
2     2           200          265

Intermediate step:

(df['start_at_min']
 .sub(df.groupby('user')['stop_at_min'].shift())
)

0      NaN
1      1.0  # below threshold, this will be merged
2      NaN
3    100.0  # above threshold, keep separate
dtype: float64
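
For reference, the pieces of Version 1 can be assembled into one self-contained script. This is a sketch rather than the answer's exact code: it sorts the frame explicitly and stores the group key in a temporary column, here called block (a name introduced for illustration), so the key is visible and can be dropped after aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 2, 2],
    "start_at_min": [60, 101, 65, 200],
    "stop_at_min": [100, 135, 100, 265],
}).sort_values(["user", "start_at_min"], ignore_index=True)

threshold = 5

# Gap between each row's start and the previous stop of the same user.
# The first row of every user has no predecessor, so its gap is NaN.
gap = df["start_at_min"] - df.groupby("user")["stop_at_min"].shift()

# True where a new block starts. NaN compares as False, which is harmless
# because "user" is part of the grouping key as well.
new_block = gap.ge(threshold)

out_v1 = (
    df.assign(block=new_block.cumsum())      # running block id
      .groupby(["user", "block"], as_index=False)
      .agg(start_at_min=("start_at_min", "min"),
           stop_at_min=("stop_at_min", "max"))
      .drop(columns="block")                 # helper column no longer needed
)
print(out_v1)
```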

Version 2: multiple conditions

# define variable threshold
threshold = np.where(df['start_at_min'].le(200), 5, 3)
# array([5, 5, 3, 5, 5, 3, 3])

# compute the new starts of group like in version 1
# but using the now variable threshold
m1 = (df['start_at_min']
         .sub(df.groupby('user')['stop_at_min'].shift())
         .ge(threshold)    
        )
# add a second restart condition (>400)
m2 = df['start_at_min'].gt(400)

# if either mask is True, start a new group
group = (m1|m2).cumsum()

# groupby.agg
out = (df.groupby(['user', group], as_index=False)
         .agg({'start_at_min': 'min',
               'stop_at_min': 'max'
              })
      )

Output:

   user  start_at_min  stop_at_min
0     1           100          201
1     1           205          260
2     2            65          100
3     2           200          265
4     2           300          451
5     2           452          460
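
The same steps could be wrapped in a reusable function, sketched here on the question's example data. The function name combine_by_gap and its parameters (low, high, switch_at, never_above) are hypothetical, chosen only to expose the numbers from the question. As in the code above, a start_at_min of exactly 400 falls under the threshold of 3; the question does not specify that boundary, so this is an assumption.

```python
import numpy as np
import pandas as pd

def combine_by_gap(df, low=5, high=3, switch_at=200, never_above=400):
    """Merge consecutive rows per user with a start-dependent gap threshold."""
    df = df.sort_values(["user", "start_at_min"], ignore_index=True)
    # The threshold depends on the start of the *current* row.
    threshold = np.where(df["start_at_min"].le(switch_at), low, high)
    # Gap to the previous stop of the same user (NaN on each first row).
    gap = df["start_at_min"] - df.groupby("user")["stop_at_min"].shift()
    # A new block starts on a large enough gap, or always above never_above.
    new_block = gap.ge(threshold) | df["start_at_min"].gt(never_above)
    return (
        df.assign(block=new_block.cumsum())
          .groupby(["user", "block"], as_index=False)
          .agg(start_at_min=("start_at_min", "min"),
               stop_at_min=("stop_at_min", "max"))
          .drop(columns="block")
    )

example = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 2, 2],
    "start_at_min": [100, 152, 205, 65, 200, 300, 452],
    "stop_at_min": [150, 201, 260, 100, 265, 451, 460],
})
result = combine_by_gap(example)
print(result)
```

Sorting inside the function means the caller does not have to guarantee row order, at the cost of resetting the original index.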

Regarding "python - How to combine rows in groupby with several conditions?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/74814386/
