python - Grouping rows by time range in a Pandas DataFrame

Tags: python pandas

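I have a large DataFrame indexed by timestamp, and I want to assign its rows to groups based on a time range.

For example, in the following data each row is grouped with the rows that fall within 1 millisecond of the first entry in its group.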

                           groupid
1999-12-31 23:59:59.000107       1
1999-12-31 23:59:59.000385       1
1999-12-31 23:59:59.000404       1
1999-12-31 23:59:59.000704       1
1999-12-31 23:59:59.001281       2
1999-12-31 23:59:59.002211       2
1999-12-31 23:59:59.002367       3

I have working code that does this by iterating over the rows and slicing the DataFrame starting at the current row:

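from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# Random, sorted timestamps within a single second. The set drops duplicate
# timestamps, which would make df.loc[label] return several rows.
dts = sorted({datetime(1999, 12, 31, 23, 59, 59, x) for
              x in np.random.randint(1, 999999, 1000)})
df = pd.DataFrame({'groupid': None}, dts)

print(df.head(20))

groupid = 1
for dt, row in df.iterrows():
    # Skip rows that an earlier group has already claimed.
    if df.loc[row.name].groupid:
        continue
    end = dt + timedelta(milliseconds=1)
    group = df.loc[dt:end]  # label slicing includes both endpoints
    df.loc[group.index, 'groupid'] = groupid
    groupid += 1

print(df.head(20))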

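However, as is usual with iterrows, this is slow on large DataFrames. I have made various attempts with apply and groupby, but without success. Is switching to itertuples the best I can do to improve performance (I am about to try it)? Any suggestions?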

Best answer

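OK, I think the following is what you want. It builds a Timedelta from your index by subtracting the first value from every value. We then access the microseconds component, divide it by 1000, and cast the Series dtype to int: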

In [86]:

df['groupid'] = ((df.index.to_series() - df.index[0]).dt.microseconds / 1000).astype(np.int32)
df
Out[86]:
                            groupid
1999-12-31 23:59:59.000133        0
1999-12-31 23:59:59.000584        0
1999-12-31 23:59:59.003544        3
1999-12-31 23:59:59.009193        9
1999-12-31 23:59:59.010220       10
1999-12-31 23:59:59.010632       10
1999-12-31 23:59:59.010716       10
1999-12-31 23:59:59.011387       11
1999-12-31 23:59:59.011837       11
1999-12-31 23:59:59.013277       13
1999-12-31 23:59:59.013305       13
1999-12-31 23:59:59.014754       14
1999-12-31 23:59:59.016015       15
1999-12-31 23:59:59.016067       15
1999-12-31 23:59:59.017788       17
1999-12-31 23:59:59.018236       18
1999-12-31 23:59:59.021281       21
1999-12-31 23:59:59.021772       21
1999-12-31 23:59:59.021927       21
1999-12-31 23:59:59.022200       22
1999-12-31 23:59:59.023104       22
1999-12-31 23:59:59.023375       23
1999-12-31 23:59:59.023688       23
1999-12-31 23:59:59.023726       23
1999-12-31 23:59:59.025397       25
1999-12-31 23:59:59.026407       26
1999-12-31 23:59:59.026480       26
1999-12-31 23:59:59.027825       27
1999-12-31 23:59:59.028793       28
1999-12-31 23:59:59.030716       30
...                             ...
1999-12-31 23:59:59.975432      975
1999-12-31 23:59:59.976699      976
1999-12-31 23:59:59.977177      977
1999-12-31 23:59:59.979475      979
1999-12-31 23:59:59.980282      980
1999-12-31 23:59:59.980672      980
1999-12-31 23:59:59.983202      983
1999-12-31 23:59:59.984214      984
1999-12-31 23:59:59.984674      984
1999-12-31 23:59:59.984933      984
1999-12-31 23:59:59.985664      985
1999-12-31 23:59:59.985779      985
1999-12-31 23:59:59.988812      988
1999-12-31 23:59:59.989324      989
1999-12-31 23:59:59.990386      990
1999-12-31 23:59:59.990485      990
1999-12-31 23:59:59.990969      990
1999-12-31 23:59:59.991255      991
1999-12-31 23:59:59.991739      991
1999-12-31 23:59:59.993979      993
1999-12-31 23:59:59.994705      994
1999-12-31 23:59:59.994874      994
1999-12-31 23:59:59.995397      995
1999-12-31 23:59:59.995753      995
1999-12-31 23:59:59.995863      995
1999-12-31 23:59:59.996574      996
1999-12-31 23:59:59.998139      998
1999-12-31 23:59:59.998533      998
1999-12-31 23:59:59.998778      998
1999-12-31 23:59:59.999915      999
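
One caveat with the expression above: `.dt.microseconds` returns only the microseconds component of each timedelta (0 to 999999), so it equals the elapsed time only while the whole index spans less than one second. Below is a minimal sketch of a variant that floors the full elapsed time to whole milliseconds instead. The sample timestamps and the helper name `group_by_ms` are made up for illustration and are not part of the original answer.

```python
import pandas as pd


def group_by_ms(index, width_ms=1):
    """Return fixed-width bin numbers counted from the first timestamp.

    Hypothetical helper: bin k covers [t0 + k*width, t0 + (k+1)*width),
    where t0 is the first timestamp in `index`.
    """
    elapsed = index - index[0]                    # TimedeltaIndex
    width = pd.Timedelta(milliseconds=width_ms)
    # Floor division by a Timedelta yields plain integers.
    return (elapsed // width).to_numpy()


idx = pd.DatetimeIndex([
    "1999-12-31 23:59:59.000107",
    "1999-12-31 23:59:59.000704",
    "1999-12-31 23:59:59.001281",
    "2000-01-01 00:00:00.500107",   # 1.5 s after the first row
])
df = pd.DataFrame(index=idx)
df["groupid"] = group_by_ms(df.index)
print(df)   # groupid column: 0, 0, 1, 1500
```

With the microseconds component alone, the last row would land in bin 500 rather than 1500, because the whole second is dropped.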

Thanks to @Jeff for pointing out a cleaner way:

In [96]:
df['groupid'] = (df.index-df.index[0]).astype('timedelta64[ms]')
df

Out[96]:
                            groupid
1999-12-31 23:59:59.000884        0
1999-12-31 23:59:59.001175        0
1999-12-31 23:59:59.001262        0
1999-12-31 23:59:59.001540        0
1999-12-31 23:59:59.001769        0
1999-12-31 23:59:59.002478        1
1999-12-31 23:59:59.005001        4
1999-12-31 23:59:59.005497        4
1999-12-31 23:59:59.006908        6
1999-12-31 23:59:59.008860        7
1999-12-31 23:59:59.009257        8
1999-12-31 23:59:59.010012        9
1999-12-31 23:59:59.011451       10
1999-12-31 23:59:59.013177       12
1999-12-31 23:59:59.014138       13
1999-12-31 23:59:59.015795       14
1999-12-31 23:59:59.015865       14
1999-12-31 23:59:59.016069       15
1999-12-31 23:59:59.016666       15
1999-12-31 23:59:59.016718       15
1999-12-31 23:59:59.019058       18
1999-12-31 23:59:59.019675       18
1999-12-31 23:59:59.020747       19
1999-12-31 23:59:59.021856       20
1999-12-31 23:59:59.022959       22
1999-12-31 23:59:59.023812       22
1999-12-31 23:59:59.023938       23
1999-12-31 23:59:59.024122       23
1999-12-31 23:59:59.025332       24
1999-12-31 23:59:59.025397       24
...                             ...
1999-12-31 23:59:59.959725      958
1999-12-31 23:59:59.959742      958
1999-12-31 23:59:59.959892      959
1999-12-31 23:59:59.960345      959
1999-12-31 23:59:59.960800      959
1999-12-31 23:59:59.961054      960
1999-12-31 23:59:59.962749      961
1999-12-31 23:59:59.965681      964
1999-12-31 23:59:59.966409      965
1999-12-31 23:59:59.966558      965
1999-12-31 23:59:59.967357      966
1999-12-31 23:59:59.967842      966
1999-12-31 23:59:59.970465      969
1999-12-31 23:59:59.974022      973
1999-12-31 23:59:59.974734      973
1999-12-31 23:59:59.975879      974
1999-12-31 23:59:59.978291      977
1999-12-31 23:59:59.980483      979
1999-12-31 23:59:59.980868      979
1999-12-31 23:59:59.981417      980
1999-12-31 23:59:59.984208      983
1999-12-31 23:59:59.984639      983
1999-12-31 23:59:59.985533      984
1999-12-31 23:59:59.986785      985
1999-12-31 23:59:59.987502      986
1999-12-31 23:59:59.987914      987
1999-12-31 23:59:59.988406      987
1999-12-31 23:59:59.989436      988
1999-12-31 23:59:59.994449      993
1999-12-31 23:59:59.996657      995
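
Note that these fixed-width bins are counted from the first timestamp in the frame. That is not quite the grouping shown in the question, where each group starts at its own first row and extends 1 ms from there. Below is a minimal sketch of that anchored grouping, assuming a sorted DatetimeIndex. It loops once per group rather than once per row and uses `numpy.searchsorted` to jump to the start of the next group. The function name `assign_anchored_groups` and its `width_us` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def assign_anchored_groups(index, width_us=1000):
    """Number groups 1, 2, 3, ... so that each group holds every row within
    `width_us` microseconds (inclusive) of the group's first row.

    Assumes `index` is a sorted DatetimeIndex.
    """
    ts = index.to_numpy(dtype="datetime64[ns]")
    width = np.timedelta64(width_us, "us")
    groupid = np.empty(len(ts), dtype=np.int64)
    start, gid = 0, 1
    while start < len(ts):
        # First position whose timestamp is strictly later than start + width.
        stop = np.searchsorted(ts, ts[start] + width, side="right")
        groupid[start:stop] = gid
        start, gid = stop, gid + 1
    return groupid


# The sample rows from the question.
idx = pd.DatetimeIndex([
    "1999-12-31 23:59:59.000107",
    "1999-12-31 23:59:59.000385",
    "1999-12-31 23:59:59.000404",
    "1999-12-31 23:59:59.000704",
    "1999-12-31 23:59:59.001281",
    "1999-12-31 23:59:59.002211",
    "1999-12-31 23:59:59.002367",
])
df = pd.DataFrame(index=idx)
df["groupid"] = assign_anchored_groups(df.index)
print(df)   # groupid column: 1, 1, 1, 1, 2, 2, 3
```

The inclusive upper bound (`side="right"`) mirrors the `df.loc[dt:end]` slice in the question's loop, which also includes rows exactly 1 ms after the group's first row.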

Regarding python - Grouping rows by time range in a Pandas DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32089394/
