python - Resampling with origin='end_day'

Tags: python pandas pandas-resample

I don't understand what origin='end_day' does.

The docs give an example:

>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts 
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7T, dtype: int32
>>> ts.resample('17min', origin='end_day').sum()
2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17T, dtype: int32

The docs explain origin='end_day' as follows:

‘end_day’: origin is the ceiling midnight of the last day

As I understand it, this line

ts.resample('17min', origin='end_day').sum()

should be equivalent to

ts.resample('17min', origin=ts.index.max().ceil('1d')).sum()

However, passing the timestamp ts.index.max().ceil('1d') produces a different result:

>>> ts.resample('17min', origin=ts.index.max().ceil('1d')).sum() 
2000-10-01 23:21:00     3
2000-10-01 23:38:00    15
2000-10-01 23:55:00    27
2000-10-02 00:12:00    63

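I am looking for an explanation of this discrepancy, and perhaps a better general description of the 'end_day' argument than the one the docs provide.

Edit: I am using pandas 1.3.5.

Best answer

What origin='end_day' is actually equivalent to is: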

>>> ts.resample('17min', origin=ts.index.max().ceil('D'), 
                closed='right', label='right').sum()

2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17T, dtype: int64
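
As a quick check of the equivalence above, here is a minimal sketch that rebuilds the series from the docs example and compares the two calls. The variable names (end_of_last_day, by_keyword, by_timestamp, defaults_only) are only illustrative and do not come from pandas.

```python
import numpy as np
import pandas as pd

# Rebuild the series from the docs example.
start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
rng = pd.date_range(start, end, freq='7min')
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# "Ceiling midnight of the last day": the midnight following the last timestamp.
end_of_last_day = ts.index.max().ceil('D')

# The keyword form, and the explicit form with right-closed, right-labelled bins.
by_keyword = ts.resample('17min', origin='end_day').sum()
by_timestamp = ts.resample('17min', origin=end_of_last_day,
                           closed='right', label='right').sum()

# The same explicit origin, but with the usual left/left defaults.
defaults_only = ts.resample('17min', origin=end_of_last_day).sum()

print(by_keyword.equals(by_timestamp))  # True
print(defaults_only.tolist())           # [3, 15, 27, 63]
```

All three calls share the same bin boundaries (23:21, 23:38, 23:55, 00:12, 00:29). The only difference is which side of each bin is closed and which edge is used as the label.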

Update 1:

  1. What if I use origin='end_day' but also explicitly pass in closed and label not being 'right'? Where's the behavior defined for this?

From the source code of resample:

            # The backward resample sets ``closed`` to ``'right'`` by default
            # since the last value should be considered as the edge point for
            # the last bin. When origin in "end" or "end_day", the value for a
            # specific ``Timestamp`` index stands for the resample result from
            # the current ``Timestamp`` minus ``freq`` to the current
            # ``Timestamp`` with a right close.
            if origin in ["end", "end_day"]:
                if closed is None:
                    closed = "right"
                if label is None:
                    label = "right"
            else:
                if closed is None:
                    closed = "left"
                if label is None:
                    label = "left"
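
To make the rule in this excerpt concrete, here is a small standalone sketch. resolve_closed_label is a hypothetical helper written for illustration, not a pandas function, and it mirrors only the lines quoted above. The surrounding pandas code may contain other branches that are not shown here.

```python
from typing import Optional, Tuple


def resolve_closed_label(origin: str,
                         closed: Optional[str] = None,
                         label: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper: apply the defaults from the quoted excerpt."""
    # "end" and "end_day" resample backwards, so both defaults become "right".
    default = "right" if origin in ("end", "end_day") else "left"
    # An explicitly passed value is never overridden; only None is filled in.
    resolved_closed = closed if closed is not None else default
    resolved_label = label if label is not None else default
    return resolved_closed, resolved_label


print(resolve_closed_label("end_day"))                 # ('right', 'right')
print(resolve_closed_label("end_day", closed="left"))  # ('left', 'right')
print(resolve_closed_label("start_day"))               # ('left', 'left')
```

So explicitly passing closed or label together with origin='end_day' simply wins over the default. Note that, per the excerpt, the two parameters are filled in independently: passing closed='left' alone still leaves label at 'right'.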

Update 2a:

  1. Consider df = pd.DataFrame(index=pd.date_range(start='2021-04-22 01:00:00', end='2021-04-28 01:00', freq='1d'), data=range(7)). Now df.resample(rule='7d', origin='end_day') crashes with a ValueError.

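If you do not set the closed parameter explicitly, resample sets it to 'right' because origin='end_day' (see above). The origin is then '2021-04-29', and the first bin edge, '2021-04-22', is excluded, so you can end up with values that fall before the first bin. Passing closed='left', as on the marked line below, is one way around it: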

df = pd.DataFrame(index=pd.date_range(start='2021-04-22 01:00:00', end='2021-04-28 01:00', freq='1d'), data=range(7))
df.resample(rule='7d', origin='end_day', closed='left')  # <- HERE

Update 2b:

If '2021-04-22' is the first bin, which timestamp does fall outside of it? '2021-04-22 01:00:00' is later, right?

df = pd.DataFrame(index=pd.date_range(start='2021-04-21 01:00:00', end='2021-04-28 01:00', freq='1d'), data=range(8))
print(df)

# Output:
                     0
2021-04-21 01:00:00  0
2021-04-22 01:00:00  1
2021-04-23 01:00:00  2
2021-04-24 01:00:00  3
2021-04-25 01:00:00  4
2021-04-26 01:00:00  5
2021-04-27 01:00:00  6
2021-04-28 01:00:00  7

This example should make it clearer:

# closed='right' (default)
>>> df.resample(rule='7d', origin='end_day').sum()
             0
2021-04-22   1  # ('2021-04-15', '2021-04-22']
2021-04-29  27  # ('2021-04-22', '2021-04-29']

# closed='left'
>>> df.resample(rule='7d', origin='end_day', closed='left').sum()
             0
2021-04-22   0  # ['2021-04-15', '2021-04-22')
2021-04-29  28  # ['2021-04-22', '2021-04-29')

bin_edges

The bin_edges values are:

# closed='right' (default)
>>> bin_edges
[1618531199999999999 1619135999999999999 1619740799999999999]

# after conversion
DatetimeIndex(['2021-04-15 23:59:59.999999999',
               '2021-04-22 23:59:59.999999999',
               '2021-04-29 23:59:59.999999999'],
              dtype='datetime64[ns]', freq=None)


# closed='left'
>>> bin_edges
[1618444800000000000 1619049600000000000 1619654400000000000]

# after conversion
DatetimeIndex(['2021-04-15',
               '2021-04-22',
               '2021-04-29'],
              dtype='datetime64[ns]', freq=None)
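
To see how these edges produce the sums shown earlier, here is a rough sketch that bins the eight-row example by hand. It uses np.searchsorted as a stand-in for pandas' internal binning routine, and bin_sums is a hypothetical helper, so treat it as an illustration of the idea rather than what resample actually executes.

```python
import numpy as np
import pandas as pd

# The eight-row example from Update 2b.
index = pd.date_range(start="2021-04-21 01:00:00", end="2021-04-28 01:00", freq="D")
data = np.arange(len(index))

# The integer bin edges listed above, converted from nanoseconds since the epoch.
right_edges = pd.to_datetime([1618531199999999999, 1619135999999999999, 1619740799999999999])
left_edges = pd.to_datetime([1618444800000000000, 1619049600000000000, 1619654400000000000])


def bin_sums(index, data, edges, closed):
    """Hypothetical helper: sum data per bin for the given edges."""
    stamps = index.to_numpy(dtype="datetime64[ns]")
    bounds = edges.to_numpy(dtype="datetime64[ns]")
    # side="left" gives bounds[i-1] < t <= bounds[i], i.e. right-closed bins;
    # side="right" gives bounds[i-1] <= t < bounds[i], i.e. left-closed bins.
    side = "left" if closed == "right" else "right"
    pos = np.searchsorted(bounds, stamps, side=side)
    # Position 0 means "before the first edge"; keep only the real bins.
    totals = np.bincount(pos, weights=data, minlength=len(bounds) + 1)
    return [int(t) for t in totals[1:len(bounds)]]


print(right_edges[0])                                  # 2021-04-15 23:59:59.999999999
print(bin_sums(index, data, right_edges, "right"))     # [1, 27]
print(bin_sums(index, data, left_edges, "left"))       # [0, 28]
```

With the right-closed edges pushed to the last nanosecond of the day, '2021-04-22 01:00:00' still belongs to the first bin, which is why that bin sums to 1 rather than 0.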

Regarding python - Resampling with origin='end_day', a similar question was found on Stack Overflow: https://stackoverflow.com/questions/70420267/
