python - Pandas rolling while ignoring rows that contain NaN in the count

Tags: python pandas

Sample data

                                   id  val       date
id           date                                    
SE0000191827 2018-02-28  SE0000191827    8 2018-02-16
             2018-03-31           NaN  NaN        NaT
             2018-04-30  SE0000191827    7 2018-04-20
             2018-05-31           NaN  NaN        NaT
             2018-06-30           NaN  NaN        NaT
             2018-07-31  SE0000191827    6 2018-07-11
             2018-08-31           NaN  NaN        NaT
             2018-09-30           NaN  NaN        NaT
             2018-10-31  SE0000191827    5 2018-10-19
             2018-11-30           NaN  NaN        NaT
             2018-12-31  SE0000191827    9 2018-12-29
SE0000195570 2014-01-31  SE0000195570    4 2014-01-31
             2014-02-28           NaN  NaN        NaT
             2014-03-31           NaN  NaN        NaT
             2014-04-30  SE0000195570    3 2014-04-29
             2014-05-31           NaN  NaN        NaT
             2014-06-30           NaN  NaN        NaT
             2014-07-31  SE0000195570    2 2014-07-16
             2014-08-31           NaN  NaN        NaT
             2014-09-30           NaN  NaN        NaT
             2014-10-31  SE0000195570    1 2014-10-23

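(For convenience, this pastebin can be used to create the data: https://pastebin.com/wMU3esEh)

I want to apply a rolling function with a window of 4 to the val column, but count only the rows where val is not NaN. I cannot simply use dropna, because the rows that contain NaN also need to receive a value in the new column. The output I expect is shown below.

Expected output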

                                   id  val       date  calc
id           date                                          
SE0000191827 2018-02-28  SE0000191827    8 2018-02-16  26.0
             2018-03-31           NaN  NaN        NaT  27.0
             2018-04-30  SE0000191827    7 2018-04-20  27.0
             2018-05-31           NaN  NaN        NaT   NaN
             2018-06-30           NaN  NaN        NaT   NaN
             2018-07-31  SE0000191827    6 2018-07-11   NaN
             2018-08-31           NaN  NaN        NaT   NaN
             2018-09-30           NaN  NaN        NaT   NaN
             2018-10-31  SE0000191827    5 2018-10-19   NaN
             2018-11-30           NaN  NaN        NaT   NaN
             2018-12-31  SE0000191827    9 2018-12-29   NaN
SE0000195570 2014-01-31  SE0000195570    4 2014-01-31  10.0
             2014-02-28           NaN  NaN        NaT   NaN
             2014-03-31           NaN  NaN        NaT   NaN
             2014-04-30  SE0000195570    3 2014-04-29   NaN
             2014-05-31           NaN  NaN        NaT   NaN
             2014-06-30           NaN  NaN        NaT   NaN
             2014-07-31  SE0000195570    2 2014-07-16   NaN
             2014-08-31           NaN  NaN        NaT   NaN
             2014-09-30           NaN  NaN        NaT   NaN
             2014-10-31  SE0000195570    1 2014-10-23   NaN

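Note that the row (SE0000191827, 2018-03-31) should also get the value 27.0. The reason is that there are four val values below that row, so I want them to be counted for it.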

One attempt is as follows:

(Pdb) df2.assign(calc=(df2.dropna()['val'].groupby(level=0).rolling(4).sum().shift(-3).reset_index(0, drop=True)))
                                   id  val       date  calc
id           date                                          
SE0000191827 2018-02-28  SE0000191827    8 2018-02-16  26.0
             2018-03-31           NaN  NaN        NaT   NaN
             2018-04-30  SE0000191827    7 2018-04-20  27.0
             2018-05-31           NaN  NaN        NaT   NaN
             2018-06-30           NaN  NaN        NaT   NaN
             2018-07-31  SE0000191827    6 2018-07-11   NaN
             2018-08-31           NaN  NaN        NaT   NaN
             2018-09-30           NaN  NaN        NaT   NaN
             2018-10-31  SE0000191827    5 2018-10-19   NaN
             2018-11-30           NaN  NaN        NaT   NaN
             2018-12-31  SE0000191827    9 2018-12-29   NaN
SE0000195570 2014-01-31  SE0000195570    4 2014-01-31  10.0
             2014-02-28           NaN  NaN        NaT   NaN
             2014-03-31           NaN  NaN        NaT   NaN
             2014-04-30  SE0000195570    3 2014-04-29   NaN
             2014-05-31           NaN  NaN        NaT   NaN
             2014-06-30           NaN  NaN        NaT   NaN
             2014-07-31  SE0000195570    2 2014-07-16   NaN
             2014-08-31           NaN  NaN        NaT   NaN
             2014-09-30           NaN  NaN        NaT   NaN
             2014-10-31  SE0000195570    1 2014-10-23   NaN

However, this yields no value for the row (SE0000191827, 2018-03-31), because that row is removed by dropna.

As far as I know, rolling has no way to skip rows that contain NaN. Any help?

Best answer

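I suggest keeping your groupby (dropping the nulls first), then using df.reindex(index=<#put original index here>) to push the original time steps back into the index, and then filling the gaps from what has already been calculated, as df.fillna would. The dates that have no value in calc can be imputed with FOCB (first observation carried backward). In pandas lingo this is bfill, the backward counterpart of ffill.

(Basically, add .reindex(df2.index).groupby(level=0).bfill() to the end of the expression passed to assign.)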
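
The suggestion above can be sketched end to end as follows. This is a minimal illustration under some assumptions, not the exact code from the answer: the frame is rebuilt by hand with only the val column (the id and date columns are omitted), build_group is a hypothetical helper, and the forward-looking sum is computed per group with transform instead of the original rolling(4).sum().shift(-3).reset_index(0, drop=True) chain.

```python
import numpy as np
import pandas as pd


def build_group(sec_id, first_month_end, vals):
    """Hypothetical helper: one id with consecutive month-end dates."""
    dates = pd.date_range(first_month_end, periods=len(vals),
                          freq=pd.offsets.MonthEnd())
    index = pd.MultiIndex.from_product([[sec_id], dates], names=["id", "date"])
    return pd.DataFrame({"val": vals}, index=index)


nan = np.nan
# Simplified copy of the sample data (val column only).
df2 = pd.concat([
    build_group("SE0000191827", "2018-02-28",
                [8, nan, 7, nan, nan, 6, nan, nan, 5, nan, 9]),
    build_group("SE0000195570", "2014-01-31",
                [4, nan, nan, 3, nan, nan, 2, nan, nan, 1]),
])

# 1. Drop the NaN rows so the window of 4 counts only real observations.
valid = df2["val"].dropna()

# 2. Per id: sum each observation with the next three non-NaN observations.
forward_sum = valid.groupby(level="id").transform(
    lambda s: s.rolling(4).sum().shift(-3)
)

# 3. Restore the original index, then back-fill within each id so that a
#    NaN row takes the value of the next computed row below it.
calc = forward_sum.reindex(df2.index).groupby(level="id").bfill()

result = df2.assign(calc=calc)
print(result)
```

With this data the row (SE0000191827, 2018-03-31) receives 27.0, carried backward from 2018-04-30, while the rows with fewer than four observations remaining below them stay NaN. Grouping before bfill keeps a value from one id from leaking into the last rows of the previous id.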

Regarding "python - Pandas rolling while ignoring rows that contain NaN in the count", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56967190/
