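I have data sorted by timestamp in which each row carries a mark, so consecutive rows with the same mark form a group. I want to reduce each group to its start timestamp and last timestamp, together with the mean of the values for that mark within the group. Example starting DataFrame: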
             timestamp     value  mark
1  2016-11-07 11:00:00  0.781726     1
2  2016-11-07 11:03:00  0.812757     2
3  2016-11-07 11:05:00  0.845348     2
4  2016-11-07 11:07:00  0.817394     2
5  2016-11-07 11:11:00  0.760787     1
6  2016-11-07 11:13:00  0.807892     1
7  2016-11-07 11:15:00  0.812965     1
8  2016-11-07 11:18:00  0.822001     1
What I want to achieve:
       start_timestamp        end_timestamp  (mean_)value  mark
1  2016-11-07 11:00:00  2016-11-07 11:00:00      0.781726     1
2  2016-11-07 11:03:00  2016-11-07 11:07:00      0.825166     2
3  2016-11-07 11:11:00  2016-11-07 11:18:00      0.800911     1
Any idea what the best way to do this is? Should I first label each batch with a unique mark?

Best answer
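You need groupby by a Series that assigns a unique group number to each run of consecutive identical values in column mark, then aggregate with first, last and mean: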
print ((df.mark != df.mark.shift()).cumsum())
1 1
2 2
3 2
4 2
5 3
6 3
7 3
8 3
Name: mark, dtype: int32
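To see why that expression numbers the runs, it helps to split it into its two steps. The following is a minimal sketch that rebuilds only the mark column from the question; the names starts_new_run and run_id are illustrative and do not appear in the original answer.

```python
import pandas as pd

# The mark column from the question's example data (index 1..8).
mark = pd.Series([1, 2, 2, 2, 1, 1, 1, 1], index=range(1, 9), name="mark")

# True wherever a row's mark differs from the previous row's mark.
# The first row is compared against NaN, so it always starts a new run.
starts_new_run = mark != mark.shift()

# A running count of the True values gives each consecutive run its own id,
# even when the same mark value reappears later (as mark 1 does here).
run_id = starts_new_run.cumsum()

print(starts_new_run.tolist())
print(run_id.tolist())
```

Grouping by mark alone would merge row 1 with rows 5 to 8, which is why the run number is used as the grouping key instead.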
df1 = df.groupby((df.mark != df.mark.shift()).cumsum()) \
        .agg({'timestamp': ['first', 'last'], 'value': 'mean', 'mark': 'first'})
# flatten the MultiIndex in the columns
df1.columns = ['_'.join(col) for col in df1.columns]
# rename the columns if necessary
df1 = df1.rename(columns={'timestamp_first': 'start_timestamp',
                          'timestamp_last': 'end_timestamp',
                          'mark_first': 'mark',
                          'value_mean': '(mean_)value'}) \
         .rename_axis(None)
print (df1)
       start_timestamp        end_timestamp  mark  (mean_)value
1  2016-11-07 11:00:00  2016-11-07 11:00:00     1      0.781726
2  2016-11-07 11:03:00  2016-11-07 11:07:00     2      0.825166
3  2016-11-07 11:11:00  2016-11-07 11:18:00     1      0.800911
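As a variation on the same idea, and not part of the original answer, newer pandas versions (0.25 and later) support named aggregation, which produces the final column names directly and avoids the flatten-and-rename step. The sketch below rebuilds the question's data so it can be run on its own; the column name mean_value and the variable names run_id and result are assumptions chosen for illustration.

```python
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2016-11-07 11:00:00", "2016-11-07 11:03:00",
                "2016-11-07 11:05:00", "2016-11-07 11:07:00",
                "2016-11-07 11:11:00", "2016-11-07 11:13:00",
                "2016-11-07 11:15:00", "2016-11-07 11:18:00",
            ]
        ),
        "value": [0.781726, 0.812757, 0.845348, 0.817394,
                  0.760787, 0.807892, 0.812965, 0.822001],
        "mark": [1, 2, 2, 2, 1, 1, 1, 1],
    },
    index=range(1, 9),
)

# Number each run of consecutive identical marks (same trick as in the answer).
# The Series is renamed so the group key does not share a name with a column.
run_id = (df["mark"] != df["mark"].shift()).cumsum().rename("run_id")

# Named aggregation: output_column=(input_column, aggregation function).
result = (
    df.groupby(run_id)
    .agg(
        start_timestamp=("timestamp", "first"),
        end_timestamp=("timestamp", "last"),
        mean_value=("value", "mean"),
        mark=("mark", "first"),
    )
    .rename_axis(None)  # drop the "run_id" label from the index
)

print(result)
```

The columns come out in the order they are listed in agg, so this form also makes the column order explicit.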
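On the topic of python - reducing a DataFrame by a specific column to rows containing the first and last timestamp and the mean of the values, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40508309/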