Given the following time series (for illustration purposes):
From | Till | Precipitation
2022-01-01 06:00:00 | 2022-01-02 06:00:00 | 0.5
2022-01-02 06:00:00 | 2022-01-03 06:00:00 | 1.2
2022-01-03 06:00:00 | 2022-01-04 06:00:00 | 0.0
2022-01-04 06:00:00 | 2022-01-05 06:00:00 | 1.3
2022-01-05 06:00:00 | 2022-01-06 06:00:00 | 9.8
2022-01-06 06:00:00 | 2022-01-07 06:00:00 | 0.1
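I would like to estimate the daily precipitation between 2022-01-02 00:00:00 and 2022-01-06 00:00:00. We can assume that the precipitation rate is constant within each interval given in the table.
Doing it by hand, I would assume something like: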
2022-01-02 00:00:00 | 2022-01-03 00:00:00 | 0.25 * 0.5 + 0.75 * 1.2
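The weights 0.25 and 0.75 in the row above are the fractions of each source interval that fall inside the target day. A minimal sketch of that arithmetic, assuming a constant rate within each interval (the helper name `overlap_fraction` is hypothetical, not part of the question):

```python
import pandas as pd

def overlap_fraction(row_from, row_till, win_start, win_end):
    """Fraction of the interval [row_from, row_till) that lies inside the window."""
    overlap = min(row_till, win_end) - max(row_from, win_start)
    overlap = max(overlap, pd.Timedelta(0))  # disjoint intervals overlap by zero
    return overlap / (row_till - row_from)

win_start = pd.Timestamp("2022-01-02 00:00:00")
win_end = pd.Timestamp("2022-01-03 00:00:00")

# first two rows of the table in the question
f1 = overlap_fraction(pd.Timestamp("2022-01-01 06:00:00"),
                      pd.Timestamp("2022-01-02 06:00:00"), win_start, win_end)
f2 = overlap_fraction(pd.Timestamp("2022-01-02 06:00:00"),
                      pd.Timestamp("2022-01-03 06:00:00"), win_start, win_end)
estimate = f1 * 0.5 + f2 * 1.2
print(f1, f2, estimate)
```

The first interval contributes 6 of its 24 hours to the target day and the second contributes 18 of 24, which gives the two weights.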
Note: real-world data will most likely look less regular, somewhat like the following (missing intervals can be assumed to be 0.0):
From | Till | Precipitation
2022-01-01 05:45:12 | 2022-01-02 02:11:20 | 0.8
2022-01-03 02:01:59 | 2022-01-04 12:01:00 | 5.4
2022-01-04 06:00:00 | 2022-01-05 06:00:00 | 1.3
2022-01-05 07:10:00 | 2022-01-06 07:10:00 | 9.2
2022-01-06 02:54:00 | 2022-01-07 02:53:59 | 0.1
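- Is there perhaps a library that provides a general and efficient solution?
- If there is no such library, what is the most efficient way to compute the resampled time series?
Best answer
Just compute the overlap of the periods... I think this will be fast.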
import pandas as pd
import numpy as np


def create_test_data():
    # helper that builds a small test DataFrame of contiguous daily intervals
    from_dates = pd.date_range(start='2022-01-01 06:00:00', freq='D', periods=6)
    till_dates = pd.date_range(start='2022-01-02 06:00:00', freq='D', periods=6)
    precip_amounts = [0.5, 1.2, 1, 2, 3, 0.5]
    return pd.DataFrame({'From': from_dates, 'Till': till_dates, 'Precip': precip_amounts})


def get_between(df, start_datetime, end_datetime):
    # all rows that end (Till) after the start of the period
    # and start (From) before its end
    mask1 = df['Till'] > start_datetime
    mask2 = df['From'] < end_datetime
    return df[mask1 & mask2]


def get_ratio_values(df, start_datetime, end_datetime, debug=True):
    # weight each row's value by the fraction of the row inside the period
    df2 = get_between(df, start_datetime, end_datetime)  # only the rows of interest
    precip_values = df2['Precip']  # take values from the filtered rows, not from df
    # overlap from the start of our period of interest to the end time of the row
    overlap_period1 = df2['Till'] - start_datetime
    # overlap from the start time of the row to the end of our period of interest
    overlap_period2 = end_datetime - df2['From']
    # the smaller of the two is the "best" overlap for each row
    best_overlap = np.minimum(overlap_period1, overlap_period2)
    # duration of each row
    window_durations = df2['Till'] - df2['From']
    # ratio of overlap to duration (cannot be greater than 1)
    ratios = np.minimum(1.0, best_overlap / window_durations)
    # value * ratio
    ratio_values = ratios * precip_values
    if debug:
        # just some prints for verification
        print("Ratio * value = result")
        print("----------------------")
        print("\n".join(f"{x:0.3f} * {y:0.2f} = {z}" for x, y, z in zip(ratios, precip_values, ratio_values)))
        print("----------------------")
    return ratio_values


start = pd.to_datetime('2022-01-02 00:00:00')
end = pd.to_datetime('2022-01-04 00:00:00')
ratio_vals = get_ratio_values(create_test_data(), start, end)
total_precip = ratio_vals.sum()
print("SUM RESULT =", total_precip)
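The question asks for one value per day rather than a single sum, and its real-world sample has gaps and overlapping rows. The following is a sketch of one way to extend the overlap idea to all daily windows at once with NumPy broadcasting; it is not part of the original answer. The function name `resample_daily` is hypothetical, and summing the contributions of overlapping rows is an assumption. The overlap is computed as the earlier of the two ends minus the later of the two starts, clipped at zero, which also covers a row that fully contains a window.

```python
import numpy as np
import pandas as pd

def resample_daily(df, start, end):
    """Estimate precipitation per daily window in [start, end),
    assuming a constant rate within each source row."""
    bins = pd.date_range(start=start, end=end, freq="D")
    win_start = bins[:-1].to_numpy()[:, None]   # shape (n_windows, 1)
    win_end = bins[1:].to_numpy()[:, None]
    row_from = df["From"].to_numpy()[None, :]   # shape (1, n_rows)
    row_till = df["Till"].to_numpy()[None, :]
    # exact overlap: earlier end minus later start, never negative
    overlap = np.minimum(win_end, row_till) - np.maximum(win_start, row_from)
    overlap = np.maximum(overlap, np.timedelta64(0, "ns"))
    ratios = overlap / (row_till - row_from)    # float matrix (n_windows, n_rows)
    # assumption: contributions of overlapping rows are simply added
    values = ratios @ df["Precip"].to_numpy(dtype=float)
    return pd.DataFrame({"From": bins[:-1], "Till": bins[1:], "Precip": values})

# the regular table from the question
question_df = pd.DataFrame({
    "From": pd.date_range("2022-01-01 06:00:00", freq="D", periods=6),
    "Till": pd.date_range("2022-01-02 06:00:00", freq="D", periods=6),
    "Precip": [0.5, 1.2, 0.0, 1.3, 9.8, 0.1],
})
daily = resample_daily(question_df, "2022-01-02", "2022-01-06")
print(daily)
```

The first output row corresponds to the hand calculation 0.25 * 0.5 + 0.75 * 1.2 from the question. Periods not covered by any row contribute nothing, which matches the stated assumption that missing intervals are 0.0. The ratio matrix has one entry per window and row pair, so for very long series a sorted, interval-based approach may be preferable.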
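You could also compute the ratios for only the first and last entries, since for sorted, gap-free rows everything in between will always have a ratio of 1 (this is probably both simpler and faster):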
def get_ratio_values(df, start_datetime, end_datetime, debug=True):
    # shortcut: only the first and last rows can be partial
    # assumes sorted rows and at least two rows overlapping the period (debug is unused here)
    df2 = get_between(df, start_datetime, end_datetime)  # only the rows of interest
    precip_values = df2['Precip']
    first, last = df2.iloc[0], df2.iloc[-1]  # positional access to the first and last rows
    # overlap with first row and duration of first row
    overlap_start = first['Till'] - start_datetime
    duration_start = first['Till'] - first['From']
    # overlap with last row and duration of last row
    overlap_end = end_datetime - last['From']
    duration_end = last['Till'] - last['From']
    ratios = np.ones(len(df2))
    # cap at 1 in case a boundary row lies entirely inside the period
    ratios[0] = min(1.0, overlap_start / duration_start)
    ratios[-1] = min(1.0, overlap_end / duration_end)
    return ratios * precip_values
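The first-and-last shortcut only holds when the rows are sorted and each row starts exactly where the previous one ends. A small guard like the following can decide whether the shortcut applies before using it; this is a sketch, and the helper name `is_contiguous` is hypothetical.

```python
import pandas as pd

def is_contiguous(df):
    """True if rows are sorted by 'From' and every row begins
    exactly at the previous row's 'Till'."""
    sorted_ok = df["From"].is_monotonic_increasing
    # compare each row's start with the previous row's end
    starts = df["From"].iloc[1:].to_numpy()
    ends = df["Till"].iloc[:-1].to_numpy()
    return bool(sorted_ok and (starts == ends).all())

# regular data in the style of the first table in the question
regular_from = pd.date_range("2022-01-01 06:00:00", freq="D", periods=3)
regular = pd.DataFrame({"From": regular_from,
                        "Till": regular_from + pd.Timedelta(days=1)})

# first two rows of the irregular table in the question
irregular = pd.DataFrame({
    "From": pd.to_datetime(["2022-01-01 05:45:12", "2022-01-03 02:01:59"]),
    "Till": pd.to_datetime(["2022-01-02 02:11:20", "2022-01-04 12:01:00"]),
})

print(is_contiguous(regular), is_contiguous(irregular))
```

For data that fails this check, such as the irregular sample with gaps and overlaps, the per-row overlap calculation is the safer choice.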
Regarding python - resampling shifted intervals with Pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/72412448/