python - Get the rows whose timestamps fall within a sliding time window in Pandas (time series)

Tags: python pandas date datetime time-series

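I have a dataframe like this:

import time
import numpy as np
import pandas as pd

now = int(time.time())  # randint expects integer bounds
i = pd.to_datetime(np.random.randint(now, now + 5000, 10), unit='ms').sort_values()
df = pd.DataFrame({'A': range(10), 'B': range(10, 30, 2), 'C': range(10, 40, 3)}, index=i)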

df
                         A   B   C
1970-01-19 04:28:30.030  0  10  10
1970-01-19 04:28:30.374  1  12  13
1970-01-19 04:28:31.055  2  14  16
1970-01-19 04:28:32.026  3  16  19
1970-01-19 04:28:32.234  4  18  22
1970-01-19 04:28:32.569  5  20  25
1970-01-19 04:28:32.595  6  22  28
1970-01-19 04:28:33.520  7  24  31
1970-01-19 04:28:33.882  8  26  34
1970-01-19 04:28:34.019  9  28  37

What I want is, for each index, the last row that falls within a 1-second interval after that index:

df2
                                              ix    A    B    C
1970-01-19 04:28:30.030  1970-01-19 04:28:30.374    1   12   13
1970-01-19 04:28:30.374  1970-01-19 04:28:31.055    2   14   16
1970-01-19 04:28:31.055  1970-01-19 04:28:32.026    3   16   19
1970-01-19 04:28:32.026  1970-01-19 04:28:32.595    6   22   28
1970-01-19 04:28:32.234  1970-01-19 04:28:32.595    6   22   28
1970-01-19 04:28:32.569  1970-01-19 04:28:33.520    7   24   31
1970-01-19 04:28:32.595  1970-01-19 04:28:33.520    7   24   31
1970-01-19 04:28:33.520  1970-01-19 04:28:34.019    9   28   37
1970-01-19 04:28:33.882  1970-01-19 04:28:34.019    9   28   37
1970-01-19 04:28:34.019                      nan  nan  nan  nan

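I am currently doing this with a loop. At each index I use df.between_time to get all the rows inside the time interval and then pick the last one. As expected, this is really slow. I need something like df.shift, but for time. I looked at tshift and shift(periods=1, freq='S'), but they do not behave like a positional shift; instead they add the specified amount of time to every index value. Can someone help me achieve this? Thanks.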
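To make the behaviour described above concrete, here is a small sketch of what shift does when a freq is given. The three-row frame `small` is made up for illustration, and the frequency is passed as a pd.Timedelta rather than the 'S' alias, since alias spellings differ between pandas versions.

```python
import pandas as pd

# Hypothetical three-row frame, only for illustration
idx = pd.to_datetime([
    "1970-01-19 04:28:30.030",
    "1970-01-19 04:28:30.374",
    "1970-01-19 04:28:31.055",
])
small = pd.DataFrame({"A": [0, 1, 2]}, index=idx)

# With freq set, shift moves the index labels by one second;
# the values stay attached to their original rows.
shifted = small.shift(periods=1, freq=pd.Timedelta(seconds=1))
print(shifted)
```

Every timestamp simply moves forward by one second, so this does not look up "the last row within the next second" for each row.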

Note: the ix column in the desired output is optional.

PS: It would be great if a min_periods parameter could be used (as in DataFrame.rolling)!


Edit:

For the starting df:

                         A   B   C
1970-01-19 04:28:34.883  0  10  10
1970-01-19 04:28:34.900  1  12  13
1970-01-19 04:28:35.531  2  14  16
1970-01-19 04:28:36.845  3  16  19
1970-01-19 04:28:37.664  4  18  22
1970-01-19 04:28:38.332  5  20  25
1970-01-19 04:28:38.444  6  22  28
1970-01-19 04:28:38.724  7  24  31
1970-01-19 04:28:38.787  8  26  34
1970-01-19 04:28:38.951  9  28  37

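import datetime

df['time'] = df.index
def last_time(time):
    time = str(time)
    # Parse only the time-of-day part (the characters after the date)
    start_time = datetime.datetime.strptime(time[11:], '%H:%M:%S.%f')
    end_time = start_time + datetime.timedelta(0, 1)  # one second later
    # The [11:-7] slices drop the fractional seconds, so between_time only sees whole seconds
    return df.between_time(start_time=str(start_time)[11:-7],
                           end_time=str(end_time)[11:-7]).iloc[-1]
df.apply(lambda x: last_time(x['time']), axis=1)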

# Output:
                         A   B   C                    time
1970-01-19 04:28:34.883  1  12  13 1970-01-19 04:28:34.900
1970-01-19 04:28:34.900  1  12  13 1970-01-19 04:28:34.900
1970-01-19 04:28:35.531  2  14  16 1970-01-19 04:28:35.531
1970-01-19 04:28:36.845  3  16  19 1970-01-19 04:28:36.845
1970-01-19 04:28:37.664  4  18  22 1970-01-19 04:28:37.664
1970-01-19 04:28:38.332  9  28  37 1970-01-19 04:28:38.951
1970-01-19 04:28:38.444  9  28  37 1970-01-19 04:28:38.951
1970-01-19 04:28:38.724  9  28  37 1970-01-19 04:28:38.951

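But as you can see, I only get second-level precision: the window is taken to be from 34 to 35, so 35.531 is missed even though it lies within one second of both 34.883 and 34.900.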
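One way to keep sub-second precision while staying close to the loop approach is to compare the index against Timestamp bounds directly instead of passing truncated strings to between_time. The sketch below assumes the window is (t, t + 1 second], which reproduces the desired df2 for the first dataframe in the question. The helper name last_row_after is hypothetical, and the loop is still quadratic, so it addresses correctness only, not speed.

```python
import numpy as np
import pandas as pd

def last_row_after(df, window=pd.Timedelta(seconds=1)):
    """For each index value t, return the last row with timestamp in (t, t + window].

    Rows with no later timestamp inside the window are filled with NaN.
    Hypothetical helper; assumes a sorted DatetimeIndex.
    """
    idx = df.index
    out = pd.DataFrame(np.nan, index=idx, columns=df.columns)
    for k, t in enumerate(idx):
        # Compare full-precision Timestamps, so milliseconds are not lost
        inside = np.flatnonzero((idx > t) & (idx <= t + window))
        if inside.size:
            out.iloc[k] = df.iloc[inside[-1]].to_numpy()
    return out

# The first dataframe from the question
stamps = pd.to_datetime([
    "1970-01-19 04:28:30.030", "1970-01-19 04:28:30.374",
    "1970-01-19 04:28:31.055", "1970-01-19 04:28:32.026",
    "1970-01-19 04:28:32.234", "1970-01-19 04:28:32.569",
    "1970-01-19 04:28:32.595", "1970-01-19 04:28:33.520",
    "1970-01-19 04:28:33.882", "1970-01-19 04:28:34.019",
])
demo = pd.DataFrame(
    {"A": range(10), "B": range(10, 30, 2), "C": range(10, 40, 3)}, index=stamps
)
result = last_row_after(demo)
print(result)
```

Because the result holds NaN for the final row, the columns come back as floats rather than integers.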

Best answer

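Assuming your times are sorted, the row matched for row 2 can never come before the row matched for row 1. For example, if row 6 is the match for row 1, then for row 2 you only need to search the rows >= 6.

With this in mind, we only need to walk the index once (linear complexity, O(n)):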

import pandas as pd

def time_compare(t1, t2):
    # True if t1 is less than 1 second after t2.
    # pd.Timestamp accepts both strings and Timestamp objects.
    return (pd.Timestamp(t1) - pd.Timestamp(t2)).total_seconds() < 1

index_j = []
cursor = 0
tmp = list(df.index)
for i in tmp:
    # Advance the cursor while the row it points at is still inside i's window;
    # the cursor never moves backwards, so the index is scanned only once.
    while cursor < len(tmp) and time_compare(tmp[cursor], i):
        cursor += 1
    # The row just before the cursor is the last one inside the window
    index_j.append(cursor - 1)

Using this df:

>>> df
                         A   B   C
1970-01-19 04:28:34.883  0  10  10
1970-01-19 04:28:34.900  1  12  13
1970-01-19 04:28:35.531  2  14  16
1970-01-19 04:28:36.845  3  16  19
1970-01-19 04:28:37.664  4  18  22
1970-01-19 04:28:38.332  5  20  25
1970-01-19 04:28:38.444  6  22  28
1970-01-19 04:28:38.724  7  24  31
1970-01-19 04:28:38.787  8  26  34
1970-01-19 04:28:38.951  9  28  37



>>> index_j
[2, 2, 2, 4, 6, 9, 9, 9, 9, 9]

Use the positions to look up the matching timestamps:

>>> [tmp[i] for i in index_j]
['1970-01-19 04:28:35.531', '1970-01-19 04:28:35.531', '1970-01-19 04:28:35.531', '1970-01-19 04:28:37.664', '1970-01-19 04:28:38.444', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951']
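As an alternative to the explicit cursor loop, the same positions can be computed with a binary search via numpy.searchsorted. This is a different technique from the answer above (O(n log n) rather than a single pass), sketched under the same assumptions: a sorted index and a window of [t, t + 1 second). The function name last_pos_in_window and its min_periods argument are hypothetical; min_periods here only mimics the idea from DataFrame.rolling by returning -1 when too few rows fall in the window.

```python
import numpy as np
import pandas as pd

def last_pos_in_window(index, window="1s", min_periods=1):
    """Position of the last row with timestamp < t + window, for each t in index.

    Returns -1 where fewer than min_periods rows (counting the row itself)
    fall in [t, t + window). Hypothetical helper; assumes a sorted DatetimeIndex.
    """
    values = index.values  # numpy datetime64 array
    upper = values + pd.Timedelta(window).to_timedelta64()
    # First position whose timestamp is >= t + window, minus one
    last = np.searchsorted(values, upper, side="left") - 1
    counts = last - np.arange(len(values)) + 1
    return np.where(counts >= min_periods, last, -1)

# The dataframe used in the answer
stamps = pd.to_datetime([
    "1970-01-19 04:28:34.883", "1970-01-19 04:28:34.900",
    "1970-01-19 04:28:35.531", "1970-01-19 04:28:36.845",
    "1970-01-19 04:28:37.664", "1970-01-19 04:28:38.332",
    "1970-01-19 04:28:38.444", "1970-01-19 04:28:38.724",
    "1970-01-19 04:28:38.787", "1970-01-19 04:28:38.951",
])
demo = pd.DataFrame(
    {"A": range(10), "B": range(10, 30, 2), "C": range(10, 40, 3)}, index=stamps
)

pos = last_pos_in_window(demo.index)
print(pos.tolist())  # [2, 2, 2, 4, 6, 9, 9, 9, 9, 9]

# Build the result frame: matched rows, re-labelled with the original index
result = demo.iloc[pos].copy()
result.insert(0, "ix", result.index)  # optional column with the matched timestamp
result.index = demo.index
print(result)
```

With min_periods greater than 1, positions equal to -1 must be masked out before indexing, since iloc[-1] would otherwise select the last row.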

Regarding "python - Get the rows whose timestamps fall within a sliding time window in Pandas (time series)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58431223/
