python - How to get the minimum value per group per day based on an hour criterion

Tags: python python-3.x pandas vectorization pandas-groupby

I have provided two data frames below for you to test with.

import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 17:00:00', '2173-04-03 20:00:00',
               '2173-04-04 11:00:00', '2173-04-04 11:30:00', '2173-04-04 12:00:00',
               '2173-04-05 16:00:00', '2173-04-05 22:00:00', '2173-04-06 04:00:00',
               '2173-04-06 04:30:00', '2173-04-06 06:30:00'],
    'val': [5, 5, 5, 10, 5, 10, 5, 8, 3, 8, 10]
})


df1 = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-03 12:59:00',
               '2173-04-03 13:14:00', '2173-04-03 13:37:00', '2173-04-04 11:30:00',
               '2173-04-05 16:00:00', '2173-04-05 22:00:00', '2173-04-06 04:00:00',
               '2173-04-06 04:30:00', '2173-04-06 08:00:00'],
    'val': [5, 5, 5, 5, 10, 5, 5, 8, 3, 4, 6]
})

What I want to do is:

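1) For each subject_id and each day, find all values (from the val column) that stay the same for more than 1 hour, and take the minimum of those values.

Note that values can also be captured at 15-minute intervals, so you may need to consider 5 records to see the > 1 hour case. See the sample screenshot below.

2) If no value stays the same for more than 1 hour within a day, just take the minimum value for that subject_id on that day.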
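To make the note about 15-minute sampling concrete, here is a minimal sketch that measures how long a run of identical readings lasts from its timestamps rather than from its row count. The five readings are hypothetical illustration data, and the question does not say whether the one-hour threshold is strict or inclusive, so both comparisons are shown.

```python
import pandas as pd

# Five hypothetical readings of the same value, taken every 15 minutes.
readings = pd.DataFrame({
    "time_1": pd.date_range("2173-04-03 12:00:00", periods=5, freq="15min"),
    "val": [5, 5, 5, 5, 5],
})

# Span covered by the run: last timestamp minus first timestamp.
span = readings["time_1"].max() - readings["time_1"].min()
one_hour = pd.Timedelta(hours=1)

print(span)               # 0 days 01:00:00
print(span >= one_hour)   # True
print(span > one_hour)    # False
```

Five readings spaced exactly 15 minutes apart cover exactly one hour from first to last, so a strict `>` comparison rejects them while `>=` accepts them. Whichever is intended, comparing timestamps is safer than counting records, because the sampling interval is not constant in the test data.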

The screenshot below, for one subject, should help you understand. The code I tried is given after it.


Here is what I tried:

df['time_1'] = pd.to_datetime(df['time_1'])
df['time_2'] = df['time_1'].shift(-1)
df['tdiff'] = (df['time_2'] - df['time_1']).dt.total_seconds() / 3600
df['reading_day'] = pd.DatetimeIndex(df['time_1']).day

# don't know how to apply if else condition here to check for 1 hr criteria
t1 = df.groupby(['subject_id', 'reading_day', 'tdiff'])['val'].min()

Since I have to apply this to millions of records, any elegant and efficient solution would be helpful.

Best answer

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 17:00:00', '2173-04-03 20:00:00',
               '2173-04-04 11:00:00', '2173-04-04 11:30:00', '2173-04-04 12:00:00',
               '2173-04-04 16:00:00', '2173-04-04 22:00:00', '2173-04-05 04:00:00',
               '2173-04-05 06:30:00'],
    'val': [5, 5, 5, 10, 5, 10, 5, 8, 8, 10]
})

# Separate Date and time
df['time_1'] = pd.to_datetime(df['time_1'])
df['new_date'] = df['time_1'].dt.date
df['new_time'] = df['time_1'].dt.time


# find time diff in group with the first element to check > 1 hr
df['shift_val'] = df['val'].shift()
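# transform('first') returns a result aligned to the original index, so the subtraction works row by row
df1 = df.assign(time_diff=df['time_1'] - df.groupby(['subject_id', 'new_date'])['time_1'].transform('first'))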

# Verify if time diff > 1 and value is not changed
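# first_val is the value of the first row on the earliest date; .iloc[0] makes the positional lookup explicit
first_val = df1.groupby('new_date').first().val.iloc[0]
df2 = df1.loc[(df1['time_diff'] / np.timedelta64(1, 'h') >= 1) & (df1.val == first_val)]
df3 = df1.loc[(df1['time_diff'] / np.timedelta64(1, 'h') <= 1) & (df1.val == df1.shift_val)]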

# Get the minimum within the group
# DataFrame.append was removed in pandas 2.0, so the two frames are combined with pd.concat
df4 = pd.concat([df2, df3]).groupby('new_date', sort=False).min()

# drop unwanted columns
df4.drop(['new_time','shift_val','time_diff'],axis=1, inplace=True)

df4

Output

            subject_id               time_1  val
new_date
2173-04-03           1  2173-04-03 17:00:00    5
2173-04-04           1  2173-04-04 16:00:00    5
2173-04-05           1  2173-04-05 04:00:00    8
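For comparison, the two rules from the question can also be expressed as a run-length computation. The sketch below is not the answer's method. It rests on an assumed reading of the criterion: a "run" is a block of consecutive readings with the same val for the same subject_id on the same day, and its duration is the time from its first reading to its last. The function name `daily_min_with_hour_rule`, the helper columns, and the demo data are hypothetical.

```python
import pandas as pd


def daily_min_with_hour_rule(frame, min_span=pd.Timedelta(hours=1)):
    """Per subject and day: minimum val among runs lasting longer than
    min_span, falling back to the plain daily minimum (assumed reading)."""
    out = frame.copy()
    out["time_1"] = pd.to_datetime(out["time_1"])
    out = out.sort_values(["subject_id", "time_1"]).reset_index(drop=True)
    out["day"] = out["time_1"].dt.normalize()

    # A new run starts whenever the subject, the day, or the value changes.
    changed = (
        out["subject_id"].ne(out["subject_id"].shift())
        | out["day"].ne(out["day"].shift())
        | out["val"].ne(out["val"].shift())
    )
    out["run_id"] = changed.cumsum()

    # Duration of each run: last timestamp minus first timestamp.
    run_times = out.groupby("run_id")["time_1"]
    run_span = run_times.transform("max") - run_times.transform("min")

    # Keep val only for rows in runs that last longer than min_span.
    out["qualifying_val"] = out["val"].where(run_span > min_span)

    per_day = out.groupby(["subject_id", "day"]).agg(
        qualifying_min=("qualifying_val", "min"),
        day_min=("val", "min"),
    )
    # Rule 1 where a qualifying run exists, rule 2 (daily minimum) otherwise.
    filled = per_day["qualifying_min"].fillna(per_day["day_min"])
    per_day["val"] = filled.astype(per_day["day_min"].dtype)
    return per_day[["val"]].reset_index()


# Hypothetical demo: on day 1 the value 10 persists for 1h15m, on day 2 nothing persists.
demo = pd.DataFrame({
    "subject_id": [1, 1, 1, 1, 1, 1],
    "time_1": ["2173-04-03 09:00:00", "2173-04-03 09:30:00", "2173-04-03 10:15:00",
               "2173-04-03 11:00:00", "2173-04-04 11:00:00", "2173-04-04 11:30:00"],
    "val": [10, 10, 10, 3, 7, 5],
})
result = daily_min_with_hour_rule(demo)
print(result)  # val is 10 for 2173-04-03 and 5 for 2173-04-04
```

Every step is a vectorized shift, cumsum, or groupby aggregation with no Python-level loop over rows, which is the property that matters at millions of records. If a run's duration should instead extend to the next reading, or the threshold should be inclusive, only the `run_span` line and the comparison need to change.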

Regarding python - How to get the minimum value per group per day based on an hour criterion, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57703423/
