python - How to write long Pandas aggregations well?

TL;DR

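How do you write long aggregations that chain many operations such as groupby(), unstack(), or apply()?

Example

Suppose you have a DataFrame with n_sales = 1000 ticket sales for n_events = 10 different events, for example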

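import pandas as pd
import numpy as np

# Sizes stated in the text above
n_sales = 1000
n_events = 10

# One row per ticket sold: which event it was for and when it was sold
sales = pd.DataFrame({
    'Event': np.random.choice(range(n_events), n_sales),
    'Time': np.random.rand(n_sales)})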

and you want to plot, over the course of the evening, how many events had sold at least n = [50, 100] tickets:

[Figure: # of events over time where at least x/y tickets were sold]

This is how I do it:

accumulation_of_sales = sales.groupby(['Time', 'Event']).size().unstack().fillna(0).cumsum()
events_with_n_sales = accumulation_of_sales.apply(lambda x: x.value_counts(), axis=1).fillna(0)
events_with_geq_n_sales = events_with_n_sales[events_with_n_sales.columns[::-1]].cumsum(axis=1)

events_with_geq_n_sales[n].plot()
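As a cross-check on what these lines are meant to compute, here is a minimal sketch that counts, for each threshold, the events whose running total has reached it. It swaps in a direct comparison instead of the value_counts() approach above. The fixed seed and the name `threshold` are assumptions added for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd

# Assumed setup mirroring the question; the seed is only for reproducibility.
rng = np.random.default_rng(0)
n_sales, n_events = 1000, 10
sales = pd.DataFrame({
    'Event': rng.integers(0, n_events, n_sales),
    'Time': rng.random(n_sales),
})
n = [50, 100]

# Running total of tickets sold per event, indexed by time of sale.
accumulation_of_sales = (
    sales.groupby(['Time', 'Event'])
         .size()
         .unstack()
         .fillna(0)
         .cumsum()
)

# For each threshold, count the events whose running total has reached it.
events_with_geq_n_sales = pd.DataFrame({
    threshold: (accumulation_of_sales >= threshold).sum(axis=1)
    for threshold in n
})
```

Calling events_with_geq_n_sales.plot() on the result should draw one curve per threshold, assuming matplotlib is installed.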

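This seems hard to read to me, and in principle these lines are too long (see PEP). So,

What is the best way to write this specific operation, and similar ones?
Are there tutorials or style guides for beginners? Perhaps not specifically for Pandas, but for similar languages?

Best answer

One way to write a multi-line pandas query is to end each line with a backslash: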

accumulation_of_sales = sales.groupby(['Time', 'Event'])\
                             .size()\
                             .unstack()\
                             .fillna(0)\
                             .cumsum()

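Sometimes I prefer to wrap them in parentheses instead.

However, if you do this a lot, there is an easier way to do some of these things. For example, whenever you see "groupby + unstack", you should think "pivot_table":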
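A minimal sketch of the parenthesized form, for comparison with the backslash version. The sample data and the fixed seed are assumptions that mirror the question's setup.

```python
import numpy as np
import pandas as pd

# Assumed setup mirroring the question; the seed is only for reproducibility.
rng = np.random.default_rng(0)
n_sales, n_events = 1000, 10
sales = pd.DataFrame({
    'Event': rng.integers(0, n_events, n_sales),
    'Time': rng.random(n_sales),
})

# Parentheses let the chain span several lines without backslashes.
accumulation_of_sales = (
    sales.groupby(['Time', 'Event'])
         .size()
         .unstack()
         .fillna(0)
         .cumsum()
)
```

One practical advantage of this form is that a step can be commented out or annotated with a trailing comment, which a backslash continuation does not allow.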

sales.pivot_table(columns='Event', index='Time', aggfunc=len, fill_value=0).cumsum()

(This is equivalent, more efficient, and more readable.)

Regarding "python - How to write long Pandas aggregations well?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29700700/
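Because `sales` has no column left to aggregate once Event and Time are used as keys, it may be worth checking the reshaped result against the groupby version on your own pandas version. The sketch below does that with pd.crosstab, a related pandas function that counts rows directly. It is swapped in here for illustration and is not part of the original answer; the sample data and seed are also assumptions.

```python
import numpy as np
import pandas as pd

# Assumed setup mirroring the question; the seed is only for reproducibility.
rng = np.random.default_rng(0)
n_sales, n_events = 1000, 10
sales = pd.DataFrame({
    'Event': rng.integers(0, n_events, n_sales),
    'Time': rng.random(n_sales),
})

# The chained version from the question.
via_groupby = (
    sales.groupby(['Time', 'Event'])
         .size()
         .unstack()
         .fillna(0)
         .cumsum()
)

# crosstab counts rows per (Time, Event) pair and fills missing pairs with 0.
via_crosstab = pd.crosstab(sales['Time'], sales['Event']).cumsum()

# Compare values only; the groupby version holds floats, crosstab holds ints.
same = np.array_equal(via_groupby.to_numpy(), via_crosstab.to_numpy())
print(same)
```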
