python - How to combine rows in a groupby with multiple conditions?

Tags: python pandas dataframe

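I want to combine rows of a pandas df using the following logic:

  • the dataframe is grouped by user
  • rows are sorted by start_at_min
  • consecutive rows are combined when:

Case A: if start_at_min <= 200:

  • row2[start_at_min] - row1[stop_at_min] < 5
  • (e.g. 101 - 100 = 1 -> combine; 200 - 100 = 100 -> do not combine)

Case B: if 200 < start_at_min < 400:

  • the threshold changes to 3

Case C: if start_at_min > 400:

  • never combine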
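The three cases above can be written down as a small scalar helper, which makes the rule easy to check by hand. This is only a sketch: the names `merge_threshold` and `should_merge` are hypothetical, the threshold is taken from the later row's start_at_min (as the comments in the example df suggest), and a start_at_min of exactly 400, which the cases leave unspecified, is assumed to fall under Case B.

```python
from typing import Optional


def merge_threshold(start_at_min: int) -> Optional[int]:
    """Return the gap threshold for a row, or None if it must never be merged."""
    if start_at_min <= 200:   # Case A
        return 5
    if start_at_min <= 400:   # Case B (400 itself is an assumption)
        return 3
    return None               # Case C: never combine


def should_merge(prev_stop: int, start: int) -> bool:
    """True if a row starting at `start` merges with a previous row ending at `prev_stop`."""
    threshold = merge_threshold(start)
    return threshold is not None and (start - prev_stop) < threshold


print(should_merge(150, 152))  # gap of 2 under threshold 5
print(should_merge(201, 205))  # gap of 4 under threshold 3
print(should_merge(451, 452))  # start above 400
```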

Example df:

   user  start_at_min  stop_at_min
0     1           100          150
1     1           152          201  # row 0 and row 1 combine (152 - 150 = 2 < 5)
2     1           205          260  # row 1 and row 2 do NOT combine -> start_at_min above 200 -> threshold = 3 (205 - 201 = 4)
3     2            65          100  # no
4     2           200          265  # no
5     2           300          451  # no
6     2           452          460  # no -> start_at_min above 400 -> never combine

Expected output:

   user  start_at_min  stop_at_min
0     1           100          201  # rows 0 and 1 combined
2     1           205          260  # not combined with the row above -> start_at_min above 200 -> threshold = 3
3     2            65          100  # no
4     2           200          265  # no
5     2           300          451  # no
6     2           452          460  # no -> start_at_min above 400 -> never combine

I have written a combine_rows function that takes two Series and applies this logic (for now with a single threshold of 5):

def combine_rows(s1: pd.Series, s2: pd.Series):
    # take 2 rows and combine them if start_at_min of row 2 - stop_at_min of row 1 < 5
    if s2['start_at_min'] - s1['stop_at_min'] < 5:
        return pd.Series({
            'user': s1['user'],
            'start_at_min': s1['start_at_min'],
            'stop_at_min': s2['stop_at_min'],
        })
    else:
        # otherwise return both rows unchanged as a two-row DataFrame
        return pd.concat([s1, s2], axis=1).T

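However, I cannot work out how to apply this function to the dataframe. This is my attempt:

df.groupby('user').sort_values(by=['start_at_min']).apply(combine_rows)  # this does not work: the GroupBy object has no sort_values method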

The complete code is below:

import pandas as pd 
import numpy as np


df = pd.DataFrame({
    "user"       :  [1, 1, 2,2],
    'start_at_min': [60, 101, 65, 200], 
    'stop_at_min' : [100, 135, 100, 265] 
})

def combine_rows(s1: pd.Series, s2: pd.Series):
    # take 2 rows and combine them if start_at_min of row 2 - stop_at_min of row 1 < 5
    if s2['start_at_min'] - s1['stop_at_min'] < 5:
        return pd.Series({
            'user': s1['user'],
            'start_at_min': s1['start_at_min'],
            'stop_at_min': s2['stop_at_min'],
        })
    else:
        # otherwise return both rows unchanged as a two-row DataFrame
        return pd.concat([s1, s2], axis=1).T

df.groupby('user').sort_values(by=['start_at_min']).apply(combine_rows)  # this does not work: the GroupBy object has no sort_values method
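Two things stand in the way of the attempt above: the object returned by groupby has no sort_values method, so the frame has to be sorted before grouping, and apply hands the function one whole sub-DataFrame per user rather than pairs of rows. A minimal loop-based sketch of the pairwise idea, using the single threshold of 5, might look like the following; `merge_group` and `out_loop` are hypothetical names, and this is an illustration of the intent, not the approach taken in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 2, 2],
    "start_at_min": [60, 101, 65, 200],
    "stop_at_min": [100, 135, 100, 265],
})


def merge_group(g: pd.DataFrame, threshold: int = 5) -> pd.DataFrame:
    """Fold the (already sorted) rows of one user into merged intervals."""
    merged = []
    for row in g.itertuples(index=False):
        # extend the previous interval when the gap is below the threshold
        if merged and row.start_at_min - merged[-1]["stop_at_min"] < threshold:
            merged[-1]["stop_at_min"] = row.stop_at_min
        else:
            merged.append(row._asdict())
    return pd.DataFrame(merged)


# sort first, then walk each user's rows in order
ordered = df.sort_values(["user", "start_at_min"])
out_loop = pd.concat(
    [merge_group(g) for _, g in ordered.groupby("user")],
    ignore_index=True,
)
print(out_loop)
```

The explicit loop is easy to follow but runs in Python row by row, so on large frames a vectorized formulation is usually preferable.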

Best answer

Version 1: a single condition

Use a custom groupby.agg:

threshold = 5
# if the successive stop/start per group are above threshold
# start a new group
group = (df['start_at_min']
         .sub(df.groupby('user')['stop_at_min'].shift())
         .ge(threshold).cumsum()
        )

# groupby.agg
out = (df.groupby(['user', group], as_index=False)
         .agg({'start_at_min': 'min',
               'stop_at_min': 'max'
              })
      )

Output:

   user  start_at_min  stop_at_min
0     1            60          135
1     2            65          100
2     2           200          265

Intermediate step (the gap between each start and the previous stop of the same user):

(df['start_at_min']
 .sub(df.groupby('user')['stop_at_min'].shift())
)

0      NaN
1      1.0  # below threshold, this will be merged
2      NaN
3    100.0  # above threshold, keep separate
dtype: float64
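To see how these gaps turn into group labels, the comparison and the cumulative sum can be laid out side by side. The sketch below reuses the four-row df from the question; the names `gap`, `new_group` and `steps` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 2, 2],
    "start_at_min": [60, 101, 65, 200],
    "stop_at_min": [100, 135, 100, 265],
})

# gap to the previous stop within the same user (NaN for each user's first row)
gap = df["start_at_min"].sub(df.groupby("user")["stop_at_min"].shift())
# a gap at or above the threshold starts a new group; NaN compares as False
new_group = gap.ge(5)
# running count of "new group" flags gives one label per merged block
group = new_group.cumsum()

steps = pd.DataFrame({"gap": gap, "new_group": new_group, "group": group})
print(steps)
```

Note that the label is a running count over the whole frame, not per user: the first row of user 2 carries the same label as the rows of user 1. That is why the aggregation groups by both user and group rather than by group alone.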

Version 2: multiple conditions

# define a variable threshold: 5 where start_at_min <= 200, otherwise 3
threshold = np.where(df['start_at_min'].le(200), 5, 3)
# for the seven-row example df: array([5, 5, 3, 5, 5, 3, 3])

# compute the new starts of group like in version 1
# but using the now variable threshold
m1 = (df['start_at_min']
         .sub(df.groupby('user')['stop_at_min'].shift())
         .ge(threshold)    
        )
# add a second restart condition (>400)
m2 = df['start_at_min'].gt(400)

# if either mask is True, start a new group
group = (m1|m2).cumsum()

# groupby.agg
out = (df.groupby(['user', group], as_index=False)
         .agg({'start_at_min': 'min',
               'stop_at_min': 'max'
              })
      )

Output:

   user  start_at_min  stop_at_min
0     1           100          201
1     1           205          260
2     2            65          100
3     2           200          265
4     2           300          451
5     2           452          460
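For reference, the version 2 steps can be run end to end on the seven-row example df from the question. This is a sketch under a few assumptions: the wrapper name `combine_intervals` is hypothetical, the explicit sort is added so that unsorted input also works, and the group label is stored as a temporary column (then dropped) so the result does not depend on how a given pandas version treats a grouping key that is not a column of the frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 2, 2],
    "start_at_min": [100, 152, 205, 65, 200, 300, 452],
    "stop_at_min": [150, 201, 260, 100, 265, 451, 460],
})


def combine_intervals(df: pd.DataFrame) -> pd.DataFrame:
    """Merge consecutive rows per user with a start-dependent gap threshold."""
    df = df.sort_values(["user", "start_at_min"]).reset_index(drop=True)
    # threshold 5 up to a start of 200, otherwise 3
    threshold = np.where(df["start_at_min"].le(200), 5, 3)
    # gap to the previous stop of the same user
    gap = df["start_at_min"].sub(df.groupby("user")["stop_at_min"].shift())
    # a large gap or a start above 400 opens a new group
    new_group = gap.ge(threshold) | df["start_at_min"].gt(400)
    return (
        df.assign(group=new_group.cumsum())
          .groupby(["user", "group"], as_index=False)
          .agg(start_at_min=("start_at_min", "min"),
               stop_at_min=("stop_at_min", "max"))
          .drop(columns="group")
    )


result = combine_intervals(df)
print(result)
```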

For python - How to combine rows in a groupby with multiple conditions?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/74814386/
