python - SQL-like summary report using Python Pandas

I often use dplyr in R to produce summary reports in a single statement, like this:

a <- group_by(data,x) 
b <- summarise(a, 
                 # count distinct y where value is not missing
                 y_distinct = n_distinct(y[is.na(y) == F]),
                 # count distinct z where value is not missing
                 z_distinct = n_distinct(z[is.na(z) == F]),
                 # count total number of values
                 total = n(),
                 # count y where value not missing
                 y_not_missing = length(y[is.na(y) == F]),
                 # count y where value is missing
                 y_missing = length(y[is.na(y) == T]))

This is similar to how I would produce it in SQL:

select
    count(distinct(case when y is not null then y end)) as y_distinct,
    count(distinct(case when z is not null then z end)) as z_distinct,
    count(1) as total,
    count(case when y is not null then 1 end) as y_not_missing,
    count(case when y is null then 1 end) as y_missing
from data group by x

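However, I am new to Python and cannot find the pandas equivalent, and I got lost in the documentation. I can produce each aggregate with a separate groupby -> agg statement, but I need help producing the report in a single DataFrame (preferably with a single statement).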

Best answer

Try something like this:

In [18]: df
Out[18]:
   x    y    z
0  1  2.0  NaN
1  1  3.0  NaN
2  2  NaN  1.0
3  2  NaN  2.0
4  3  4.0  5.0

In [19]: def nulls(s):
    ...:     return s.isnull().sum()
    ...:

In [23]: r = df.groupby('x').agg(['nunique','size',nulls])

In [24]: r
Out[24]:
        y                  z
  nunique size nulls nunique size nulls
x
1       2    2   0.0       0    2   2.0
2       0    2   2.0       2    2   0.0
3       1    1   0.0       1    1   0.0

To flatten the columns:

In [25]: r.columns = r.columns.map('_'.join)

In [26]: r
Out[26]:
   y_nunique  y_size  y_nulls  z_nunique  z_size  z_nulls
x
1          2       2      0.0          0       2      2.0
2          0       2      2.0          2       2      0.0
3          1       1      0.0          1       1      0.0
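
The result above has one column per function and input column, which is not quite the five columns of the dplyr and SQL reports. As a minimal sketch, assuming pandas 0.25 or later (where named aggregation is available), the same report can be built in one statement with output names chosen to match the question. The sample data is copied from the session above; the name `report` is only illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer's session.
df = pd.DataFrame({
    "x": [1, 1, 2, 2, 3],
    "y": [2.0, 3.0, np.nan, np.nan, 4.0],
    "z": [np.nan, np.nan, 1.0, 2.0, 5.0],
})

# Named aggregation: output_column=(input_column, function).
report = df.groupby("x").agg(
    y_distinct=("y", "nunique"),               # distinct non-missing y (NaN is excluded by default)
    z_distinct=("z", "nunique"),               # distinct non-missing z
    total=("y", "size"),                       # rows per group, missing values included
    y_not_missing=("y", "count"),              # non-missing y
    y_missing=("y", lambda s: s.isna().sum()), # missing y
)

print(report)
```

Here `size` counts every row in the group while `count` skips missing values, which is the distinction between `n()` and `length(y[is.na(y) == F])` in the R version.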

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48317641/
