python - Group cells that occur within one hour - Groupby

Tags: python pandas group-by

I have the following dataframe:

Blast_hole  Tag detector ID  Detection_location  Detection Time
190         385189144        CV23                24/02/2019 2:15:09 PM
148         385522358        CV23                24/02/2019 2:23:58 PM
136         385321882        CV23                24/02/2019 2:25:07 PM
238         385433175        CV23                25/02/2019 5:44:37 PM
89          385381344        CV23                25/02/2019 6:19:32 PM
177         385391526        CV23                25/02/2019 6:42:49 PM
138         385732572        CV23                3/03/2019 8:52:38 PM
145         385861350        CV23                3/03/2019 9:02:50 PM
196         385599574        CV23                3/03/2019 9:31:24 PM
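
For reference, the sample frame above can be reconstructed like this (a minimal sketch; in practice the data is read from a file):

import pandas as pd

df1 = pd.DataFrame({
    'Blast_hole': [190, 148, 136, 238, 89, 177, 138, 145, 196],
    'Tag detector ID': [385189144, 385522358, 385321882,
                        385433175, 385381344, 385391526,
                        385732572, 385861350, 385599574],
    'Detection_location': ['CV23'] * 9,
    'Detection Time': ['24/02/2019 2:15:09 PM', '24/02/2019 2:23:58 PM',
                       '24/02/2019 2:25:07 PM', '25/02/2019 5:44:37 PM',
                       '25/02/2019 6:19:32 PM', '25/02/2019 6:42:49 PM',
                       '3/03/2019 8:52:38 PM', '3/03/2019 9:02:50 PM',
                       '3/03/2019 9:31:24 PM'],
})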

I want to group the rows by detection time, where three detections occurred within one hour.

Code:

# parse the timestamp column; unparseable values become NaT
df1['Detection Date & Time'] = pd.to_datetime(df1['Detection Date & Time'], errors='coerce')

# count detections per fixed calendar hour and keep hours with >= 3 rows
s = df1.resample('H', on='Detection Date & Time')['Detection_Location'].transform('size')
df1 = df1[s.sort_index() >= 3]

df1 = df1.sort_values(by=['Detection Date & Time'])


f = lambda x: ','.join(x.astype(str))
df2 = (df1.groupby([pd.Grouper(key='Detection Date & Time', freq='H'),
                    df1['Detection_Location']])
          .agg({'Blast Hole': f,
                'East Coordinate': f,
                'North Coordinate': f,
                'Tag Detector ID': f,
                'Collar': f,
                'Detection Date & Time': ['first', 'last', 'size']})
          .reset_index()
          .rename(columns={'Detection Date & Time': '', '<lambda>': ''}))

The problem is that this code works hour by hour on fixed clock-hour buckets when checking whether there were 3 detections within an hour. It finds the 3 detections between 2:15 PM and 2:25 PM on 24/02/2019 because they fall inside the same clock hour, but it misses detections that are within one hour of each other yet straddle an hour boundary: for example, there are 3 entries between 5:44 PM and 6:42 PM on 25/02/2019, but they span the 5 PM and 6 PM buckets, so they are not detected.
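
To see the mismatch concretely: a fixed-hour grouper assigns each row to its calendar hour, so the 25/02 detections split across two buckets, while a trailing one-hour window counts all three. A minimal sketch using just the three timestamps from the sample:

import pandas as pd

times = pd.to_datetime(['25/02/2019 5:44:37 PM',
                        '25/02/2019 6:19:32 PM',
                        '25/02/2019 6:42:49 PM'], dayfirst=True)

# fixed calendar-hour buckets: one row lands in 17:00, two in 18:00,
# so no single bucket ever reaches a count of 3
print(times.floor('H'))

# a trailing one-hour window counts rows within the hour ending at each
# row, regardless of clock-hour boundaries: the last row sees all 3
print(pd.Series(1, index=times).rolling('1H').count())   # 1.0, 2.0, 3.0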

Current result:

Detection_Location   Blast Hole                Tag Detector ID Detection Start Time   Detection end time Tags
              CV23  190,148,136  385189144,385522358,385321882  2019-02-24 14:15:09  2019-02-24 14:25:07    3

Expected result:

Detection_Location   Blast Hole                Tag Detector ID Detection Start Time   Detection end time Tags
              CV23  190,148,136  385189144,385522358,385321882  2019-02-24 14:15:09  2019-02-24 14:25:07    3
              CV23   238,89,177  385433175,385381344,385391526  2019-02-25 17:44:37  2019-02-25 18:42:49    3
              CV23  138,145,196  385732572,385861350,385599574  2019-03-03 20:52:38  2019-03-03 21:31:24    3

Best Answer

This is a multi-step process:

  • Create the dataframe
  • Clean the dataframe
    • Drop every row where Detection Date & Time is NaN
# get the data in
df = pd.read_excel('file.xls')

# rename as needed
df.rename(columns={'Detection_location': 'location',
                   'Detection Time': 'date_time',
                   'Tag detector ID': 'id'}, inplace=True)

# parse timestamps; unparseable values become NaT
df['date_time'] = pd.to_datetime(df['date_time'], errors='coerce')

# drop rows containing NaN/NaT (e.g. failed date parses)
df.dropna(inplace=True)


 Blast_hole         id location           date_time
        190  385189144     CV23 2019-02-24 14:15:09
        148  385522358     CV23 2019-02-24 14:23:58
        136  385321882     CV23 2019-02-24 14:25:07
        238  385433175     CV23 2019-02-25 17:44:37
         89  385381344     CV23 2019-02-25 18:19:32
        177  385391526     CV23 2019-02-25 18:42:49
        138  385732572     CV23 2019-03-03 20:52:38
        145  385861350     CV23 2019-03-03 21:02:50
        196  385599574     CV23 2019-03-03 21:31:24
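
One detail to be aware of: dropna() with no arguments removes rows that have NaN in any column, not only rows where the timestamp failed to parse. If other columns may legitimately contain NaN, restricting the drop is safer, for example:

df.dropna(subset=['date_time'], inplace=True)   # drop only failed date parses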
  • Split the dataframe into separate dataframes by location
    • This assumes there may be more than one location
    • Sort by date_time
    • Use diff to compute the time delta
    • Using periods=2 captures the condition of finding 3 values within the given time window (see the sketch after this list).
  • Find the relevant rows
    • Find 3 consecutive rows that occur within 1 hour
    • Because of periods=2, grab the rows where diff <= '1 hours'
    • For each qualifying row, also take the 2 preceding rows
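
To illustrate the periods=2 trick before the full implementation: diff(periods=2) at row i returns date_time[i] - date_time[i-2], i.e. the time span covered by 3 consecutive rows, so a result of at most one hour means 3 detections within the hour. A small sketch on the 25/02 timestamps:

import pandas as pd

t = pd.Series(pd.to_datetime(['2019-02-25 17:44:37',
                              '2019-02-25 18:19:32',
                              '2019-02-25 18:42:49']))

# date_time[i] - date_time[i-2]: NaT, NaT, then 0 days 00:58:12
print(t.diff(periods=2))

# 58 minutes <= 1 hour, so the last row and its 2 predecessors qualify
print(t.diff(periods=2) <= '1 hours')   # False, False, True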
locs = df.location.unique().tolist()

# create a dict of dataframe based upon location
df_locs = {loc: df[df.location == loc].copy() for loc in locs}

# organize data in the dict, create diff
combined_vals = list()
for loc in df_locs.keys():
    df_locs[loc].sort_values('date_time', inplace=True)
    df_locs[loc]['diff'] = df_locs[loc]['date_time'].diff(periods=2)
    df_locs[loc].reset_index(inplace=True, drop=True)
    relevant_row_indices = df_locs[loc][df_locs[loc]['diff'] <= '1 hours'].index.values

    # create a list of lists containing the values to combine
    for row in relevant_row_indices:
        x = df_locs[loc].iloc[row-2:row+1, :-1].to_string(header=False,
                                                          index=False,
                                                          index_names=False).split('\n')

        # in order to properly split everything, an extra space is needed after the location
        x = [y.replace(loc, f'{loc} ') for y in x]
        vals = [ele.strip().split('  ') for ele in x]
        vals = list(zip(*vals))
        vals = [','.join(x.strip() for x in y) for y in vals]
        combined_vals.append(vals)

Combined values:

[['190,148,136',
  '385189144,385522358,385321882',
  'CV23,CV23,CV23',
  '2019-02-24 14:15:09,2019-02-24 14:23:58,2019-02-24 14:25:07'],
 ['238,89,177',
  '385433175,385381344,385391526',
  'CV23,CV23,CV23',
  '2019-02-25 17:44:37,2019-02-25 18:19:32,2019-02-25 18:42:49'],
 ['138,145,196',
  '385732572,385861350,385599574',
  'CV23,CV23,CV23',
  '2019-03-03 20:52:38,2019-03-03 21:02:50,2019-03-03 21:31:24']]
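
The to_string()/split round-trip above is fragile: it depends on column spacing and on the location string appearing verbatim in each line. An alternative sketch of the inner loop that joins the underlying values directly:

# a drop-in replacement for the body of `for row in relevant_row_indices:`
for row in relevant_row_indices:
    window = df_locs[loc].iloc[row - 2:row + 1, :-1]   # 3 qualifying rows, 'diff' dropped
    combined_vals.append([','.join(window[col].astype(str))
                          for col in window.columns])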

Create a new dataframe from the combined values:

df_combined = pd.DataFrame(combined_vals, columns=df.columns)

# Remove repetitious location values
df_combined['location'] = df_combined['location'].apply(lambda x: list(set(x.split(',')))[0])

# Create start time column
df_combined['start'] = df_combined['date_time'].str.split(',', expand=True)[0]

# Create end time column
df_combined['end'] = df_combined['date_time'].str.split(',', expand=True)[2]  
df_combined.drop(columns='date_time', inplace=True)

Finally:

  Blast_hole                             id location                start                  end
 190,148,136  385189144,385522358,385321882     CV23  2019-02-24 14:15:09  2019-02-24 14:25:07
  238,89,177  385433175,385381344,385391526     CV23  2019-02-25 17:44:37  2019-02-25 18:42:49
 138,145,196  385732572,385861350,385599574     CV23  2019-03-03 20:52:38  2019-03-03 21:31:24
  • Tags can easily be added as a column, but I have not added it here
    • df_combined['Tags'] = 3
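
If the count should not be hard-coded, it can also be derived from one of the joined columns, for example:

df_combined['Tags'] = df_combined['id'].str.count(',') + 1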

A similar question about python - Group cells that occur within one hour - Groupby can be found on Stack Overflow: https://stackoverflow.com/questions/57863690/
