I often use dplyr in R to generate single-statement summary reports like this:
library(dplyr)

a <- group_by(data, x)
b <- summarise(a,
  # count distinct y where value is not missing
  y_distinct = n_distinct(y[!is.na(y)]),
  # count distinct z where value is not missing
  z_distinct = n_distinct(z[!is.na(z)]),
  # count total number of values
  total = n(),
  # count y where value not missing
  y_not_missing = sum(!is.na(y)),
  # count y where value is missing
  y_missing = sum(is.na(y)))
This is similar to how I would produce it in SQL:
select
  count(distinct(case when y is not null then y end)) as y_distinct,
  count(distinct(case when z is not null then z end)) as z_distinct,
  count(1) as total,
  count(case when y is not null then 1 end) as y_not_missing,
  count(case when y is null then 1 end) as y_missing
from data
group by x
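However, I am new to Python and cannot find the pandas equivalent, and I got lost in the documentation. I can produce each aggregate with separate groupby -> agg statements, but I need help producing the report in a single DataFrame (preferably with a single statement).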
Best answer
Try something like this:
In [18]: df
Out[18]:
   x    y    z
0  1  2.0  NaN
1  1  3.0  NaN
2  2  NaN  1.0
3  2  NaN  2.0
4  3  4.0  5.0
In [19]: def nulls(s):
    ...:     return s.isnull().sum()
    ...:
In [23]: r = df.groupby('x').agg(['nunique','size',nulls])
In [24]: r
Out[24]:
        y                  z
  nunique size nulls nunique size nulls
x
1       2    2   0.0       0    2   2.0
2       0    2   2.0       2    2   0.0
3       1    1   0.0       1    1   0.0
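The session above does not show how df was built. As a minimal sketch for reproducing it as a script, assuming the frame holds exactly the values printed in Out[18] (the construction with pd.DataFrame and np.nan is my assumption, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Assumed construction: same values as the frame printed in Out[18]
df = pd.DataFrame({
    "x": [1, 1, 2, 2, 3],
    "y": [2.0, 3.0, np.nan, np.nan, 4.0],
    "z": [np.nan, np.nan, 1.0, 2.0, 5.0],
})


def nulls(s):
    """Count the missing values in a Series."""
    return s.isnull().sum()


# One statement: apply all three aggregations to every non-key column.
# 'nunique' skips NaN, 'size' counts every row in the group.
r = df.groupby("x").agg(["nunique", "size", nulls])
print(r)
```

The result has two-level columns, with the source column (y, z) on top and the aggregation name underneath, so a single value is addressed as r[("y", "nunique")].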
To flatten the columns:
In [25]: r.columns = r.columns.map('_'.join)
In [26]: r
Out[26]:
   y_nunique  y_size  y_nulls  z_nunique  z_size  z_nulls
x
1          2       2      0.0          0       2      2.0
2          0       2      2.0          2       2      0.0
3          1       1      0.0          1       1      0.0
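If you want exactly the column names from the question, one alternative is pandas named aggregation, which yields flat columns directly. This is a sketch rather than part of the original answer: it assumes pandas 0.25 or later (where named aggregation is available), rebuilds the sample frame from the values shown above, and uses report as a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Assumed sample data: same values as the frame shown in the answer
df = pd.DataFrame({
    "x": [1, 1, 2, 2, 3],
    "y": [2.0, 3.0, np.nan, np.nan, 4.0],
    "z": [np.nan, np.nan, 1.0, 2.0, 5.0],
})

# Each keyword is an output column: name=(source column, aggregation)
report = df.groupby("x").agg(
    y_distinct=("y", "nunique"),    # distinct non-missing y (nunique skips NaN)
    z_distinct=("z", "nunique"),    # distinct non-missing z
    total=("y", "size"),            # rows per group, NaN included
    y_not_missing=("y", "count"),   # count skips NaN
    y_missing=("y", lambda s: s.isna().sum()),  # number of NaN in y
)
print(report)
```

Here "size" plays the role of SQL count(1), while "count" matches count(case when y is not null then 1 end), because it ignores missing values.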
Source: the Stack Overflow question on producing a SQL-like summary report with Python pandas: https://stackoverflow.com/questions/48317641/