I have the following dataframe:
Blast_hole Tag detector ID Detection_location Detection Time
190 385189144 CV23 24/02/2019 2:15:09 PM
148 385522358 CV23 24/02/2019 2:23:58 PM
136 385321882 CV23 24/02/2019 2:25:07 PM
238 385433175 CV23 25/02/2019 5:44:37 PM
89 385381344 CV23 25/02/2019 6:19:32 PM
177 385391526 CV23 25/02/2019 6:42:49 PM
138 385732572 CV23 3/03/2019 8:52:38 PM
145 385861350 CV23 3/03/2019 9:02:50 PM
196 385599574 CV23 3/03/2019 9:31:24 PM
I want to group the rows by detection time wherever three detections occur within one hour of each other.
Code:
df1['Detection Date & Time'] = pd.to_datetime(df1['Detection Date & Time'], errors = 'coerce')
s = df1.resample('H',on='Detection Date & Time')['Detection_Location'].transform('size')
df1 = df1[s.sort_index() >= 3]
df1 = df1.sort_values(by =['Detection Date & Time'])
df1['Date and Time'] = pd.to_datetime(df1['Date and Time'])
df1['Detection Date & Time'] = pd.to_datetime(df1['Detection Date & Time'])
f = lambda x: ','.join(x.astype(str))
df2 = (df1.groupby([pd.Grouper(key='Detection Date & Time', freq='H'),
                    df1.Detection_Location])
          .agg({'Blast Hole': f,
                'East Coordinate': f,
                'North Coordinate': f,
                'Tag Detector ID': f,
                'Collar': f,
                'Detection Date & Time': ['first', 'last', 'size']})
          .reset_index()
          .rename(columns={'Detection Date & Time': '', '<lambda>': ''}))
The problem is that this code works clock hour by clock hour, checking whether 3 detections occurred within each fixed hour. It does detect the 3 detections between 2:15 and 2:25 PM on 24/02/2019, but it misses 3 detections that fall within one hour of each other while straddling a clock-hour boundary. For example, there are 3 entries between 5:44 PM and 6:42 PM on 25/02/2019, but they cross a clock-hour boundary (the 5 PM and 6 PM bins), so they are not detected.
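The mismatch can be reproduced in isolation. A minimal sketch (using three of the sample timestamps, and a bare resample rather than the full pipeline) showing that fixed clock-hour bins split detections that are only 58 minutes apart:

```python
import pandas as pd

# Three detections between 17:44 and 18:42 -- within one hour of each
# other, but straddling the 17:00/18:00 clock-hour boundary.
times = pd.to_datetime([
    '2019-02-25 17:44:37',
    '2019-02-25 18:19:32',
    '2019-02-25 18:42:49',
])
s = pd.Series(1, index=times)

# resample bins on fixed clock hours, so the three rows are split
# across two bins and neither bin reaches size 3.
counts = s.resample('h').size()
print(counts.max())  # 2 -- no single bin contains all three detections
```

This is exactly why any fixed-frequency Grouper or resample cannot satisfy the requirement: the window has to slide with the data, not with the clock.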
Current result:
Detection_Location Blast Hole Tag Detector ID Detection Start Time Detection end time Tags
CV23 190,148,136 385189144,385522358,385321882 2019-02-24 14:15:09 2019-02-24 14:25:07 3
Expected result:
Detection_Location Blast Hole Tag Detector ID Detection Start Time Detection end time Tags
CV23 190,148,136 385189144,385522358,385321882 2019-02-24 14:15:09 2019-02-24 14:25:07 3
CV23 238,89,177 385433175,385381344,385391526 2019-02-25 17:44:09 2019-02-25 18:42:09 3
CV23 138,145,196 385732572,385861350,385599574 2019-03-03 20:52:09 2019-03-03 21:31:09 3
Best Answer
This is a multi-step process:
- Create the dataframe
- Clean the dataframe
- Remove all rows where Detection Date & Time is NaN
# get the data in
df = pd.read_excel('file.xls')
# rename as needed
df.rename(columns={'Detection_location': 'location',
'Detection Time': 'date_time',
'Tag detector ID': 'id'}, inplace=True)
df['date_time'] = pd.to_datetime(df['date_time'], errors = 'coerce')
df.dropna(inplace=True)
Blast_hole id location date_time
190 385189144 CV23 2019-02-24 14:15:09
148 385522358 CV23 2019-02-24 14:23:58
136 385321882 CV23 2019-02-24 14:25:07
238 385433175 CV23 2019-02-25 17:44:37
89 385381344 CV23 2019-02-25 18:19:32
177 385391526 CV23 2019-02-25 18:42:49
138 385732572 CV23 2019-03-03 20:52:38
145 385861350 CV23 2019-03-03 21:02:50
196 385599574 CV23 2019-03-03 21:31:24
- Split the dataframe into separate dataframes by location
- This assumes there can be multiple locations
- Sort by date_time
- Use diff with periods=2 to compute the time delta; a delta of one hour or less satisfies the condition of finding 3 values within the time window
- Find the relevant data rows
- Find runs of 3 consecutive rows occurring within 1 hour
- Because periods=2, grab the rows where diff <= '1 hours'; for each qualifying row, also take the 2 preceding rows
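The periods=2 trick can be sketched on its own; the gap between a row and the row two positions earlier being at most one hour means that row plus its two predecessors form three detections inside a one-hour window (timestamps below are from the sample data plus one illustrative outlier):

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    '2019-02-25 17:44:37',
    '2019-02-25 18:19:32',
    '2019-02-25 18:42:49',   # 58 min after the first row -> qualifies
    '2019-02-25 21:00:00',   # more than an hour after the row two back
]))

# diff(periods=2) measures row[i] - row[i-2]; the first two entries are NaT.
gaps = times.diff(periods=2)
hits = gaps <= pd.Timedelta('1 hours')
print(hits.tolist())  # [False, False, True, False]
```

Note that NaT compares as False, so the first two rows can never spuriously qualify.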
locs = df.location.unique().tolist()
# create a dict of dataframes keyed by location
df_locs = {loc: df[df.location == loc].copy() for loc in locs}
# organize data in the dict, create diff
combined_vals = list()
for loc in df_locs.keys():
    df_locs[loc].sort_values('date_time', inplace=True)
    df_locs[loc]['diff'] = df_locs[loc]['date_time'].diff(periods=2)
    df_locs[loc].reset_index(inplace=True, drop=True)
    relevant_row_indices = df_locs[loc][df_locs[loc]['diff'] <= '1 hours'].index.values
    # create a list of lists containing the values to combine
    for row in relevant_row_indices:
        x = df_locs[loc].iloc[row-2:row+1, :-1].to_string(header=False,
                                                          index=False,
                                                          index_names=False).split('\n')
        # in order to split everything properly, an extra space is needed after the location
        x = [y.replace(loc, f'{loc} ') for y in x]
        vals = [ele.strip().split('  ') for ele in x]
        vals = list(zip(*vals))
        vals = [','.join(x.strip() for x in y) for y in vals]
        combined_vals.append(vals)
Combined values:
[['190,148,136',
'385189144,385522358,385321882',
'CV23,CV23,CV23',
'2019-02-24 14:15:09,2019-02-24 14:23:58,2019-02-24 14:25:07'],
['238,89,177',
'385433175,385381344,385391526',
'CV23,CV23,CV23',
'2019-02-25 17:44:37,2019-02-25 18:19:32,2019-02-25 18:42:49'],
['138,145,196',
'385732572,385861350,385599574',
'CV23,CV23,CV23',
'2019-03-03 20:52:38,2019-03-03 21:02:50,2019-03-03 21:31:24']]
Create a new dataframe from the combined values:
df_combined = pd.DataFrame(combined_vals, columns=df.columns)
# Remove repetitious location values
df_combined['location'] = df_combined['location'].apply(lambda x: list(set(x.split(',')))[0])
# Create start time column
df_combined['start'] = df_combined['date_time'].str.split(',', expand=True)[0]
# Create end time column
df_combined['end'] = df_combined['date_time'].str.split(',', expand=True)[2]
df_combined.drop(columns='date_time', inplace=True)
Final result:
Blast_hole id location start end
190,148,136 385189144,385522358,385321882 CV23 2019-02-24 14:15:09 2019-02-24 14:25:07
238,89,177 385433175,385381344,385391526 CV23 2019-02-25 17:44:37 2019-02-25 18:42:49
138,145,196 385732572,385861350,385599574 CV23 2019-03-03 20:52:38 2019-03-03 21:31:24
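For reference, the to_string/split round-trip in the loop above is fragile (it depends on column spacing and on the location string not appearing elsewhere). A minimal sketch of an equivalent approach that slices each qualifying window and joins the column values directly; the small sample df here is illustrative, reconstructed from two of the detection windows:

```python
import pandas as pd

df = pd.DataFrame({
    'Blast_hole': [190, 148, 136, 238, 89, 177],
    'id': [385189144, 385522358, 385321882,
           385433175, 385381344, 385391526],
    'location': ['CV23'] * 6,
    'date_time': pd.to_datetime([
        '2019-02-24 14:15:09', '2019-02-24 14:23:58', '2019-02-24 14:25:07',
        '2019-02-25 17:44:37', '2019-02-25 18:19:32', '2019-02-25 18:42:49',
    ]),
})

rows = []
for loc, grp in df.sort_values('date_time').groupby('location'):
    grp = grp.reset_index(drop=True)
    # same periods=2 test as above: row i and the two rows before it
    hits = grp['date_time'].diff(periods=2) <= pd.Timedelta('1 hours')
    for row in grp.index[hits]:
        window = grp.iloc[row - 2:row + 1]
        rows.append({
            'Blast_hole': ','.join(window['Blast_hole'].astype(str)),
            'id': ','.join(window['id'].astype(str)),
            'location': loc,
            'start': window['date_time'].iloc[0],
            'end': window['date_time'].iloc[-1],
        })

df_combined = pd.DataFrame(rows)
print(df_combined)
```

Joining typed columns with astype(str) avoids the extra-space workaround entirely and keeps start/end as real timestamps.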
- Tags can easily be added as a column, but it is not included here:
df_combined['Tags'] = 3
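If the window size may ever vary, the count could be derived from the joined values instead of hard-coded; a sketch, assuming the df_combined built above (only one column reproduced here for illustration):

```python
import pandas as pd

df_combined = pd.DataFrame({'Blast_hole': ['190,148,136', '238,89,177']})

# number of joined values = number of commas + 1
df_combined['Tags'] = df_combined['Blast_hole'].str.count(',') + 1
print(df_combined['Tags'].tolist())  # [3, 3]
```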
Regarding "python - Groupby cells that occur within an hour of each other", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57863690/