python - Return a 2-second cumulative sum using a dataset of ranges

I am fairly new to Python and data science.

I have these two dataframes. The df dataframe:

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})

                  Date   A
0  2014-11-21 11:00:00   1
1  2014-11-21 11:00:03   2
2  2014-11-21 11:00:04   5
3  2014-11-21 11:00:05   3
4  2014-11-21 11:00:07   9
5  2014-11-21 11:00:08   6
6  2014-11-21 11:00:10   3
7  2014-11-21 11:00:11   0
8  2014-10-24 10:00:55   8
9  2014-10-24 10:00:59  10

The info dataframe, which contains the datetime ranges that my final df should cover:

info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})

                 Start                 Stop
0  2014-11-21 11:00:00  2014-11-21 11:00:07
1  2014-11-21 11:00:08  2014-11-21 11:00:11
2  2014-10-24 10:00:55  2014-10-24 10:00:59

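The goal is to compute a cumulative sum over df using a two-second window, counting a row only if it falls within the range given by one of the rows of info. For example, the cumulative sum for the row dated 2014-11-21 11:00:08 should be 0, because it is at the start of a range. As another example, the row dated 2014-11-21 11:00:07 should have a sum of 12 (9+3).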
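To make the window arithmetic concrete, here is a minimal sketch that works out the value for the 11:00:07 row by hand. It is an illustration only, not part of the code I have so far. It assumes the window includes both endpoints, so it covers every row from 2 seconds before the current row up to the row itself, and it only looks at rows inside the first range of info. The names rows, t, in_window and window_sum are made up for the example.

```python
import pandas as pd

# Rows of df that fall inside the first range of info (11:00:00 to 11:00:07)
rows = pd.DataFrame({
    "Date": pd.to_datetime(['2014-11-21 11:00:00', '2014-11-21 11:00:03',
                            '2014-11-21 11:00:04', '2014-11-21 11:00:05',
                            '2014-11-21 11:00:07']),
    "A": [1, 2, 5, 3, 9],
})

t = pd.Timestamp('2014-11-21 11:00:07')

# Assumption: the window is closed on both ends, [t - 2s, t]
in_window = (rows["Date"] >= t - pd.Timedelta(seconds=2)) & (rows["Date"] <= t)
window_sum = int(rows.loc[in_window, "A"].sum())

print(window_sum)  # 12, i.e. 3 (11:00:05) + 9 (11:00:07)
```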

Here is what I have so far:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})
#info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05'], "Stop": ['2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:06', '2014-11-21 11:00:07']})
info['groupnum']=info.index
info.Start=pd.to_datetime(info.Start)
info.Stop=pd.to_datetime(info.Stop)
cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']
df['groupnum']=pd.to_datetime(df.Date).map(cinfo)
df['cum'] = df.groupby('groupnum').A.cumsum()
print(df)

Expected result:

                  Date   A  groupnum  cum
0  2014-11-21 11:00:00   1         0    1
1  2014-11-21 11:00:03   2         0    2
2  2014-11-21 11:00:04   5         0    7
3  2014-11-21 11:00:05   3         0   10
4  2014-11-21 11:00:07   9         0   12
5  2014-11-21 11:00:08   6         1    6
6  2014-11-21 11:00:10   3         1    9
7  2014-11-21 11:00:11   0         1    3
8  2014-10-24 10:00:55   8         2    8
9  2014-10-24 10:00:59  10         2   10

Actual result:

                  Date   A  groupnum  cum
0  2014-11-21 11:00:00   1         0    1
1  2014-11-21 11:00:03   2         0    3
2  2014-11-21 11:00:04   5         0    8
3  2014-11-21 11:00:05   3         0   11
4  2014-11-21 11:00:07   9         0   20
5  2014-11-21 11:00:08   6         1    6
6  2014-11-21 11:00:10   3         1    9
7  2014-11-21 11:00:11   0         1    9
8  2014-10-24 10:00:55   8         2    8
9  2014-10-24 10:00:59  10         2   18

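But this computes the cumulative sum over the whole groupnum group, and I cannot restrict it to only the last 2 seconds.

Is there a proper way to achieve this? Any help would be appreciated.

My English is not very good; I hope I explained it correctly.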

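Best answer

This method might not work with a dataframe of 100M rows, because the outer comparison builds a boolean array with one entry for every (row of df, row of info) pair.

To create the groupnum column, you can use ufunc.outer with greater_equal and less_equal to compare each time in df against each start and stop in info, then use argmax to get the position of the True value row-wise. You can then groupby on this column and do a rolling sum over 2s.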

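# Date must be datetime64 for the comparisons and the time-based rolling below
# (info.Start and info.Stop are assumed already converted, as in the question)
df['Date'] = pd.to_datetime(df['Date'])

# create a boolean array to find which range each row falls in
arr_bool = ( np.greater_equal.outer(df.Date.to_numpy(), info.Start.to_numpy())
             & np.less_equal.outer(df.Date.to_numpy(), info.Stop.to_numpy()))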

# use argmax to find the position of the first True row-wise
df['groupnum'] = arr_bool.argmax(axis=1)

# select only rows within ranges, use set_index for later rolling and index alignment
df = df.loc[arr_bool.any(axis=1), :].set_index('Date')

# groupby groupnum, do the sum for a closed interval of 2s
df['cum'] = df.groupby('groupnum').rolling('2s', closed = 'both').A.sum()\
              .reset_index(level=0, drop=True) # for index alignment

df = df.reset_index() # get back date as a column
print (df)
                 Date   A  groupnum   cum
0 2014-11-21 11:00:00   1         0   1.0
1 2014-11-21 11:00:03   2         0   2.0
2 2014-11-21 11:00:04   5         0   7.0
3 2014-11-21 11:00:05   3         0  10.0
4 2014-11-21 11:00:07   9         0  12.0
5 2014-11-21 11:00:08   6         1   6.0
6 2014-11-21 11:00:10   3         1   9.0
7 2014-11-21 11:00:11   0         1   3.0
8 2014-10-24 10:00:55   8         2   8.0
9 2014-10-24 10:00:59  10         2  10.0
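
If the outer comparison is hard to picture, the following small sketch shows what the boolean array and argmax produce on toy data. The arrays times, starts and stops are hypothetical values chosen for illustration, not the question's data. It also shows why the rows are filtered with any(axis=1): argmax returns 0 for a row that is all False, which would otherwise look like membership in the first range.

```python
import numpy as np

# Hypothetical toy data: three times and two closed ranges
times = np.array(['2014-11-21T11:00:03', '2014-11-21T11:00:09',
                  '2014-11-21T11:00:20'], dtype='datetime64[s]')
starts = np.array(['2014-11-21T11:00:00', '2014-11-21T11:00:08'],
                  dtype='datetime64[s]')
stops = np.array(['2014-11-21T11:00:07', '2014-11-21T11:00:11'],
                 dtype='datetime64[s]')

# mask[i, j] is True when times[i] lies inside range j (both ends included)
mask = np.greater_equal.outer(times, starts) & np.less_equal.outer(times, stops)

group = mask.argmax(axis=1)  # position of the first True in each row
in_any = mask.any(axis=1)    # False for times that fall in no range

print(mask.shape)       # (3, 2): one row per time, one column per range
print(group.tolist())   # [0, 1, 0]; the last 0 is just argmax on an all-False row
print(in_any.tolist())  # [True, True, False]
```

The shape of mask is also why memory can become a problem: it grows as len(df) times len(info).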

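EDIT: if arr_bool cannot be created this way, you can try iterating over the rows of info and checking each range independently, i.e. whether the date is at or above start and at or below stop:

# extract the dates as an array once (should be faster)
arr_date = df.Date.to_numpy()

# build the group number by summing (i * mask) over the ranges; rows in no range end up at -1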
df['groupnum'] = np.sum([i* (np.greater_equal(arr_date, start)&np.less_equal(arr_date, stop)) 
                         for i, (start, stop) in enumerate(zip(info.Start.to_numpy(), info.Stop.to_numpy()), 1)], axis=0) - 1

# remove the rows that are not in any range
df = df.loc[df['groupnum'].ge(0), :].set_index('Date')

# then compute the cum column the same way as above
df['cum'] = ...
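
For reference, here is one way the loop variant could look end to end. This is only a sketch under a few assumptions: the ranges in info do not overlap, Date values are unique (needed for the index alignment), and each group's dates are already in increasing order. Instead of summing i * mask over a list of arrays as above, it writes the range position into a preallocated array (a swapped-in detail of mine, so that only one array of len(df) is kept in memory). The rolling step is the same as in the first version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03',
                            '2014-11-21 11:00:04', '2014-11-21 11:00:05',
                            '2014-11-21 11:00:07', '2014-11-21 11:00:08',
                            '2014-11-21 11:00:10', '2014-11-21 11:00:11',
                            '2014-10-24 10:00:55', '2014-10-24 10:00:59'],
                   "A": [1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08',
                               '2014-10-24 10:00:55'],
                     "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11',
                              '2014-10-24 10:00:59']})
df["Date"] = pd.to_datetime(df["Date"])
info["Start"] = pd.to_datetime(info["Start"])
info["Stop"] = pd.to_datetime(info["Stop"])

arr_date = df["Date"].to_numpy()

# Assumption: ranges do not overlap, so each row matches at most one range.
# Start at -1 ("no range") and overwrite with the range position on a match.
groupnum = np.full(len(df), -1)
for i, (start, stop) in enumerate(zip(info["Start"].to_numpy(),
                                      info["Stop"].to_numpy())):
    groupnum[(arr_date >= start) & (arr_date <= stop)] = i
df["groupnum"] = groupnum

# keep only rows that fall in some range; Date becomes the index for rolling
df = df.loc[df["groupnum"].ge(0), :].set_index("Date")

# 2-second window closed on both ends, computed separately per range
df["cum"] = (df.groupby("groupnum")["A"]
               .rolling("2s", closed="both")
               .sum()
               .reset_index(level=0, drop=True))  # drop groupnum level to align on Date

df = df.reset_index()  # get Date back as a column
print(df)
```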

This question, "python - Return a 2-second cumulative sum using a dataset of ranges", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/59734481/
