python - Filter a pandas groupby by comparing with the consecutive group

I have a pandas DataFrame like this:

In [5]: df
Out[5]:
       date1      date2
0 2015-01-01 2014-12-11
1 2015-01-01 2014-12-30
2 2015-01-01 2015-01-01
3 2015-01-02 2014-12-30
4 2015-01-02 2015-01-01
5 2015-01-02 2015-01-02
6 2015-01-03 2015-01-01
7 2015-01-03 2015-01-02
8 2015-01-03 2015-01-03

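I want to group this DataFrame on date1, then filter each group so that only the records with date2 >= the previous group's date1 remain (records in the group with the smallest date1 are never filtered out). My end goal is to count the number of items left in each group after the filter is applied.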

The filter would leave the following rows:

       date1    date2
0 2015-01-01  2014-12-11
1 2015-01-01  2014-12-30
2 2015-01-01  2015-01-01
4 2015-01-02  2015-01-01
5 2015-01-02  2015-01-02
7 2015-01-03  2015-01-02
8 2015-01-03  2015-01-03

The counts would then be:

    date1    count
0 2015-01-01 3
1 2015-01-02 2
2 2015-01-03 2

I can get the groups as follows:

groups = df.sort_values('date1', ascending=False).groupby('date1')

But I can't figure out a way to do the filtering and counting so that consecutive groups are compared.

Best answer

A one-liner using pd.merge_asof

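# Assumes `import pandas as pd`. Rows in the earliest group get d_ = NaT,
# so fill d_ with the earliest representable timestamp to keep them all.
pd.merge_asof(
    df, df[['date1']].assign(d_=df.date1),
    on='date1', allow_exact_matches=False
).fillna({'d_': pd.Timestamp.min}).query('date2 >= d_').groupby('date1').size()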

date1
2015-01-01    3
2015-01-02    2
2015-01-03    2
dtype: int64

Explanation

From the docs:

For each row in the left DataFrame, we select the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key. Both DataFrames must be sorted by the key.

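So I merge df with itself on date1, with the allow_exact_matches argument set to False. Because exact matches are excluded, each row is paired with the latest date1 that is strictly earlier than its own, which gives me easy access to the "previous group".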
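To make the "previous group" lookup concrete, here is a minimal sketch of the intermediate merge result. It assumes row 3's date2 is 2014-12-30 (which is what the expected output in the question implies) and passes on='date1' explicitly; the names right and merged are only illustrative.

```python
import pandas as pd

# Sample data from the question (row 3's date2 assumed to be 2014-12-30).
df = pd.DataFrame({
    "date1": pd.to_datetime(
        ["2015-01-01"] * 3 + ["2015-01-02"] * 3 + ["2015-01-03"] * 3
    ),
    "date2": pd.to_datetime([
        "2014-12-11", "2014-12-30", "2015-01-01",
        "2014-12-30", "2015-01-01", "2015-01-02",
        "2015-01-01", "2015-01-02", "2015-01-03",
    ]),
})

# Right-hand frame: date1 is the merge key, d_ is a copy that survives the merge.
right = df[["date1"]].assign(d_=df["date1"])

# With exact matches disallowed, each left row picks up the last right row
# whose date1 is strictly earlier, i.e. the previous group's date1.
merged = pd.merge_asof(df, right, on="date1", allow_exact_matches=False)
print(merged)
```

The d_ column should be NaT for the earliest group, since there is no earlier date1 to match, and hold the previous group's date1 everywhere else.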

From there, it is a query to filter, then groupby + size to get the counts.
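The filter and count steps can also be written out one at a time. This is a sketch under the same assumption about the sample data (row 3's date2 taken as 2014-12-30); it uses an explicit isna() mask in place of filling the missing values, which is a variation on the one-liner and not part of the original answer. The names keep, filtered and counts are illustrative.

```python
import pandas as pd

# Sample data from the question (row 3's date2 assumed to be 2014-12-30).
df = pd.DataFrame({
    "date1": pd.to_datetime(
        ["2015-01-01"] * 3 + ["2015-01-02"] * 3 + ["2015-01-03"] * 3
    ),
    "date2": pd.to_datetime([
        "2014-12-11", "2014-12-30", "2015-01-01",
        "2014-12-30", "2015-01-01", "2015-01-02",
        "2015-01-01", "2015-01-02", "2015-01-03",
    ]),
})

# Step 1: attach the previous group's date1 to every row as d_.
merged = pd.merge_asof(
    df,
    df[["date1"]].assign(d_=df["date1"]),
    on="date1",
    allow_exact_matches=False,
)

# Step 2: keep rows with date2 >= d_. Rows in the earliest group have no
# previous group (d_ is NaT), so they are kept unconditionally.
keep = merged["d_"].isna() | (merged["date2"] >= merged["d_"])
filtered = merged.loc[keep, ["date1", "date2"]]

# Step 3: count the surviving rows per date1 group.
counts = filtered.groupby("date1").size()
print(filtered)
print(counts)
```

Handling the NaT rows with a mask avoids depending on how fillna treats datetime columns, at the cost of one extra line.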

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/41510099/
