python - Split a column into N groups by value difference (timestamps)

Tags: python pandas pandas-groupby

Sample data in .csv format:

| No. | IP       | Unix_time  |    # integer Unix time
| 1   | 1.1.1.1  | 1563552000 |    # equivalent to 12:00:00 AM
| 2   | 1.1.1.1  | 1563552030 |    # equivalent to 12:00:30 AM
| 3   | 1.1.1.1  | 1563552100 |    # equivalent to 12:01:40 AM
| 4   | 1.1.1.1  | 1563552110 |    # equivalent to 12:01:50 AM
| 5   | 1.1.1.1  | 1563552180 |    # equivalent to 12:03:00 AM
| 6   | 1.2.3.10 | 1563552120 |

Here is my current working code, which uses the pandas groupby() and get_group() functions:

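import pandas as pd

# some_path is the path to the .csv file shown above
data = pd.read_csv(some_path, header=0)
root = data.groupby('IP')

for a in root.groups.keys():
    t = root.get_group(a)['Unix_time']
    # Use an f-string: concatenating str and int with + raises TypeError
    print(f'{a} has {t.count()} record')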

This produces the following output:

1.1.1.1 has 5 record
1.2.3.10 has 1 record
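
The same per-IP counts can also be obtained without the explicit get_group() loop. The following is a minimal sketch, assuming the six sample rows are built in memory instead of being read from a file; the in-memory `data` frame is a stand-in for the CSV contents.

```python
import pandas as pd

# Stand-in for the CSV file: the six sample rows from the table above.
data = pd.DataFrame({
    "IP": ["1.1.1.1"] * 5 + ["1.2.3.10"],
    "Unix_time": [1563552000, 1563552030, 1563552100,
                  1563552110, 1563552180, 1563552120],
})

# size() counts the rows in each group and returns a Series indexed by IP.
counts = data.groupby("IP").size()

for ip, n in counts.items():
    print(f"{ip} has {n} record")
```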

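Now I want to improve on the code above.

For rows with the same IP value (e.g. 1.1.1.1), I want to create further subgroups based on a maximum time interval (e.g. 60 seconds) and count the number of elements in each subgroup. For example, in the sample data above:

Starting from row 1: the Unix_time value of row 2 is within 60 seconds, but row 3 is more than 60 seconds later.

So rows 1-2 form one group, rows 3-4 form a separate group, and row 5 forms a group of its own. In other words, the group "1.1.1.1" now has 3 subgroups. The result should be: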

1.1.1.1 start time 1563552000 has 2 record within 60 secs
1.1.1.1 start time 1563552100 has 2 record within 60 secs
1.1.1.1 start time 1563552150 has 1 record within 60 secs
1.2.3.10 start time 1563552120 has 1 record within 60 secs

How can I do this?

Best answer

You can use pd.Grouper:

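# df is the DataFrame read from the CSV (named data in the question)
df['datetime'] = pd.to_datetime(df['Unix_time'], unit='s')
# Group by IP and by fixed 60-second windows of the datetime column
for n, g in df.groupby(['IP', pd.Grouper(freq='60s', key='datetime')]):
    print(f'{n[0]} start time {g.iloc[0, g.columns.get_loc("Unix_time")]} has {len(g)} records within 60 secs')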

Output:

1.1.1.1  start time 1563552000 has 2 records within 60 secs
1.1.1.1  start time 1563552100 has 2 records within 60 secs
1.1.1.1  start time 1563552150 has 1 records within 60 secs
1.2.3.10 start time 1563552120 has 1 records within 60 secs
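
Note that pd.Grouper(freq='60s') assigns rows to fixed 60-second windows aligned to whole minutes, not to windows that begin at the first record of each subgroup, so two records only 20 seconds apart can still land in different windows if a minute boundary falls between them. The sketch below makes the window boundaries visible by flooring each timestamp to its minute. It assumes the six sample rows from the question, built in memory; the `window_start` column name is introduced here for illustration only.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "IP": ["1.1.1.1"] * 5 + ["1.2.3.10"],
    "Unix_time": [1563552000, 1563552030, 1563552100,
                  1563552110, 1563552180, 1563552120],
})
df["datetime"] = pd.to_datetime(df["Unix_time"], unit="s")

# Floor each timestamp to the start of the whole minute it falls in.
df["window_start"] = df["datetime"].dt.floor("60s")

# Count the rows per (IP, window_start) pair.
sizes = df.groupby(["IP", "window_start"]).size()
print(sizes)
```

Printing `window_start` next to `Unix_time` shows which fixed window each record was assigned to, which helps when checking whether fixed bins match the grouping you actually want.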

Using root and integer division:

# Integer division by 60 maps each timestamp to its fixed 60-second bin
root = df.groupby(['IP', df['Unix_time'] // 60])

for n, g in root:
    print(f'{n[0]} start time {g.iloc[0, g.columns.get_loc("Unix_time")]} has {len(g)} records within 60 secs')

Output:

1.1.1.1  start time 1563552000 has 2 records within 60 secs
1.1.1.1  start time 1563552100 has 2 records within 60 secs
1.1.1.1  start time 1563552150 has 1 records within 60 secs
1.2.3.10 start time 1563552120 has 1 records within 60 secs
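
Integer division by 60 produces fixed 60-second bins, just as pd.Grouper does. If each subgroup should instead start at its first record and cover the 60 seconds that follow, as the wording of the question suggests, a plain loop per IP is one option. This is only a sketch under stated assumptions: the helper name `anchored_windows` is hypothetical and not part of pandas or of the original answer, and a record exactly 60 seconds after a window's start is assumed to open a new window.

```python
import pandas as pd

def anchored_windows(times, max_span=60):
    """Return [start_time, count] pairs; each window begins at its first record."""
    windows = []
    for t in sorted(times):
        # Open a new window when t is max_span seconds or more past the current start.
        if not windows or t - windows[-1][0] >= max_span:
            windows.append([t, 1])
        else:
            windows[-1][1] += 1
    return windows

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "IP": ["1.1.1.1"] * 5 + ["1.2.3.10"],
    "Unix_time": [1563552000, 1563552030, 1563552100,
                  1563552110, 1563552180, 1563552120],
})

rows = []
for ip, g in df.groupby("IP"):
    for start, count in anchored_windows(g["Unix_time"].tolist()):
        rows.append((ip, start, count))
        print(f"{ip} start time {start} has {count} records within 60 secs")
```

Unlike the fixed-bin versions, the result here does not depend on where minute boundaries fall, only on the distance of each record from the first record of its window.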

For more on python - Split a column into N groups by value difference (timestamps), see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/57518545/
