I am fairly new to Python and data science.
I have these two dataframes. The df dataframe:
df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
                  Date   A
0  2014-11-21 11:00:00   1
1  2014-11-21 11:00:03   2
2  2014-11-21 11:00:04   5
3  2014-11-21 11:00:05   3
4  2014-11-21 11:00:07   9
5  2014-11-21 11:00:08   6
6  2014-11-21 11:00:10   3
7  2014-11-21 11:00:11   0
8  2014-10-24 10:00:55   8
9  2014-10-24 10:00:59  10
The info dataframe, which contains the datetime ranges that my final df should cover:
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})
                 Start                 Stop
0  2014-11-21 11:00:00  2014-11-21 11:00:07
1  2014-11-21 11:00:08  2014-11-21 11:00:11
2  2014-10-24 10:00:55  2014-10-24 10:00:59
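The goal is to compute a cumulative sum in df over a two-second window, if and only if the current row of df falls within the range of one of the rows of info. For example, the sum for the row dated 2014-11-21 11:00:08 should be just its own value, 6, because that row sits at the start of a range, so nothing earlier is added to it. Another example is the row dated 2014-11-21 11:00:07, whose sum should be 12 (9+3).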
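To make the target concrete, here is a minimal brute-force sketch of the rule described above. The function name windowed_sum_reference is hypothetical, the window is assumed to be closed on both ends, and the loop is only meant as a readable reference for small data, not as an efficient solution.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04',
             '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08',
             '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55',
             '2014-10-24 10:00:59'],
    "A": [1, 2, 5, 3, 9, 6, 3, 0, 8, 10],
})
info = pd.DataFrame({
    "Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'],
    "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59'],
})


def windowed_sum_reference(df, info, window="2s"):
    """Hypothetical helper: for each row, sum A over the rows of the same
    info range whose Date lies in [Date - window, Date] (both ends included)."""
    dates = pd.to_datetime(df["Date"])
    starts = pd.to_datetime(info["Start"])
    stops = pd.to_datetime(info["Stop"])
    width = pd.Timedelta(window)
    sums = []
    for t in dates:
        in_range = (starts <= t) & (t <= stops)
        if not in_range.any():
            # the row is outside every range, so it gets no sum
            sums.append(float("nan"))
            continue
        range_start = starts[in_range].iloc[0]
        # never look back past the start of the range the row belongs to
        lower = max(range_start, t - width)
        mask = (dates >= lower) & (dates <= t)
        sums.append(df.loc[mask, "A"].sum())
    return sums


reference = windowed_sum_reference(df, info)
print(reference)
```

With the sample data, the fifth value (the row dated 11:00:07) only adds the rows at 11:00:05 and 11:00:07, which is the 9+3 case from the description.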
This is what I have so far:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})
#info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05'], "Stop": ['2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:06', '2014-11-21 11:00:07']})
info['groupnum']=info.index
info.Start=pd.to_datetime(info.Start)
info.Stop=pd.to_datetime(info.Stop)
cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']
df['groupnum']=pd.to_datetime(df.Date).map(cinfo)
df['cum'] = df.groupby('groupnum').A.cumsum()
print(df)
Expected result:
                  Date   A  groupnum  cum
0  2014-11-21 11:00:00   1         0    1
1  2014-11-21 11:00:03   2         0    2
2  2014-11-21 11:00:04   5         0    7
3  2014-11-21 11:00:05   3         0   10
4  2014-11-21 11:00:07   9         0   12
5  2014-11-21 11:00:08   6         1    6
6  2014-11-21 11:00:10   3         1    9
7  2014-11-21 11:00:11   0         1    3
8  2014-10-24 10:00:55   8         2    8
9  2014-10-24 10:00:59  10         2   10
Actual result:
                  Date   A  groupnum  cum
0  2014-11-21 11:00:00   1         0    1
1  2014-11-21 11:00:03   2         0    3
2  2014-11-21 11:00:04   5         0    8
3  2014-11-21 11:00:05   3         0   11
4  2014-11-21 11:00:07   9         0   20
5  2014-11-21 11:00:08   6         1    6
6  2014-11-21 11:00:10   3         1    9
7  2014-11-21 11:00:11   0         1    9
8  2014-10-24 10:00:55   8         2    8
9  2014-10-24 10:00:59  10         2   18
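But this computes the cumulative sum over each whole groupnum, and I cannot find a way to limit it to the last 2 seconds.
So is there a proper way to achieve this? I would be very grateful for any help.
My English is not very good, so I hope I have explained this clearly.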
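Best answer
This approach may not scale to a dataframe with 100M rows, because the outer comparison builds a boolean array of shape (len(df), len(info)).
To create the groupnum column, you can use ufunc.outer with greater_equal and less_equal to compare each time in df with every start and stop in info, and then use argmax to get the position of the first True in each row. You can then groupby on this column and apply a rolling sum over 2s.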
# Date must be datetime64 for the comparisons and for the time-based rolling window
# (info.Start and info.Stop were already converted in the question's code)
df['Date'] = pd.to_datetime(df['Date'])
# create a boolean array of shape (len(df), len(info)) telling which range each row is in
arr_bool = ( np.greater_equal.outer(df.Date.to_numpy(), info.Start.to_numpy())
           & np.less_equal.outer(df.Date.to_numpy(), info.Stop.to_numpy()))
# use argmax to find the position of the first True row-wise
df['groupnum'] = arr_bool.argmax(axis=1)
# select only rows within ranges, use set_index for later rolling and index alignment
df = df.loc[arr_bool.any(axis=1), :].set_index('Date')
# groupby groupnum, do the sum for a closed interval of 2s
df['cum'] = (df.groupby('groupnum').rolling('2s', closed='both').A.sum()
               .reset_index(level=0, drop=True))  # drop groupnum level for index alignment
df = df.reset_index()  # get back Date as a column
print(df)
                  Date   A  groupnum   cum
0  2014-11-21 11:00:00   1         0   1.0
1  2014-11-21 11:00:03   2         0   2.0
2  2014-11-21 11:00:04   5         0   7.0
3  2014-11-21 11:00:05   3         0  10.0
4  2014-11-21 11:00:07   9         0  12.0
5  2014-11-21 11:00:08   6         1   6.0
6  2014-11-21 11:00:10   3         1   9.0
7  2014-11-21 11:00:11   0         1   3.0
8  2014-10-24 10:00:55   8         2   8.0
9  2014-10-24 10:00:59  10         2  10.0
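For readers unfamiliar with ufunc.outer, the following small sketch shows the same comparison on plain integers instead of timestamps. The arrays times, starts and stops are made-up values for illustration only. It also shows why the any(axis=1) filter is needed: argmax returns 0 for a row with no True at all, so such a row would otherwise look like it belongs to the first range.

```python
import numpy as np

# made-up integer "times" and ranges, standing in for the timestamps
times = np.array([0, 3, 8, 20])
starts = np.array([0, 8])
stops = np.array([7, 11])

# inside[i, j] is True when starts[j] <= times[i] <= stops[j]
inside = np.greater_equal.outer(times, starts) & np.less_equal.outer(times, stops)

# position of the first True in each row (0 when the row has no True)
groupnum = inside.argmax(axis=1)
# True only for rows that fall in at least one range
in_any = inside.any(axis=1)

print(inside)
print(groupnum)
print(in_any)
```

Here the last time, 20, lies outside both ranges, yet argmax still reports 0 for it, so only in_any can tell it apart from a genuine member of the first range.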
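Edit: if arr_bool cannot be created with the outer comparison (for example because it does not fit in memory), you can try iterating over the rows of info and checking, one range at a time, whether each date is at or after start and at or before stop: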
# assumes df.Date, info.Start and info.Stop are already datetime64
# get the array of all dates once (should be faster)
arr_date = df.Date.to_numpy()
# number the ranges from 1 so that rows in no range sum to 0, then subtract 1:
# rows in range i get groupnum i, rows in no range get -1 (assumes ranges do not overlap)
df['groupnum'] = np.sum(
    [i * (np.greater_equal(arr_date, start) & np.less_equal(arr_date, stop))
     for i, (start, stop) in enumerate(zip(info.Start.to_numpy(), info.Stop.to_numpy()), 1)],
    axis=0) - 1
# remove the rows that are not in any range
df = df.loc[df['groupnum'].ge(0), :].set_index('Date')
# then same for the column cum
df['cum'] = ...
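If info itself has many rows, looping over every range can also become slow. One possible alternative, which is not part of the original answer, is a sorted lookup with np.searchsorted. The sketch below assumes the ranges in info do not overlap; the helper name assign_groups is hypothetical. The rolling step afterwards is the same groupby and 2s rolling sum as in the answer above.

```python
import numpy as np
import pandas as pd


def assign_groups(dates, starts, stops):
    """Hypothetical helper: position of the closed [start, stop] range that
    contains each date, or -1 if none does. Assumes non-overlapping ranges."""
    order = np.argsort(starts)
    starts_sorted, stops_sorted = starts[order], stops[order]
    # index of the last range whose start is <= date (-1 if there is none)
    pos = np.searchsorted(starts_sorted, dates, side="right") - 1
    safe = np.clip(pos, 0, None)  # avoid indexing with -1
    inside = (pos >= 0) & (dates <= stops_sorted[safe])
    # map back to the original row positions of info
    return np.where(inside, order[safe], -1)


df = pd.DataFrame({
    "Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04',
             '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08',
             '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55',
             '2014-10-24 10:00:59'],
    "A": [1, 2, 5, 3, 9, 6, 3, 0, 8, 10],
})
info = pd.DataFrame({
    "Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'],
    "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59'],
})
df["Date"] = pd.to_datetime(df["Date"])
info["Start"] = pd.to_datetime(info["Start"])
info["Stop"] = pd.to_datetime(info["Stop"])

df["groupnum"] = assign_groups(df["Date"].to_numpy(),
                               info["Start"].to_numpy(),
                               info["Stop"].to_numpy())
# keep only rows inside some range, index by Date for the time-based window
df = df.loc[df["groupnum"].ge(0)].set_index("Date")
# same step as in the answer: 2s rolling sum per group, closed on both ends
df["cum"] = (df.groupby("groupnum")["A"]
               .rolling("2s", closed="both").sum()
               .reset_index(level=0, drop=True))
df = df.reset_index()
print(df)
```

The lookup needs memory proportional to len(df) rather than len(df) * len(info), at the cost of requiring non-overlapping ranges.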
Regarding "python - returning a 2-second cumulative sum using a range dataset", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59734481/