给定一个包含许多变量和最终类列的 Julia 数据框:
julia> df
5×3 DataFrame
Row │ v1 v2 cl
│ Int64? Int64 Int64
─────┼───────────────────────
1 │ 10 1 2
2 │ 20 2 2
3 │ 300 10 1
4 │ 400 20 1
5 │ missing 30 1
我想获得一个分组的 df,其中包含按类别的汇总统计信息,其中类别按列,变量按行,例如:
julia> dfByCl
11×3 DataFrame
Row │ var cl_1 cl_2
│ String Float64? Float64?
─────┼───────────────────────────────
1 │ nrow 3.0 2.0
2 │ v1_mean 350.0 15.0
3 │ v1_std 70.7107 7.07107
4 │ v1_lb 252.002 5.20018
5 │ v1_ub 447.998 24.7998
6 │ v1_nm 2.0 2.0
7 │ v2_mean 20.0 1.5
8 │ v2_std 10.0 0.707107
9 │ v2_lb 8.68414 0.520018
10 │ v2_ub 31.3159 2.47998
11 │ v2_nm 3.0 2.0
而且我不想显式命名所有变量。 有没有比上面的代码更简单/更优雅的东西?
using Statistics, DataFrames, Distributions
meansk(data) = mean(skipmissing(data))
stdsk(data) = std(skipmissing(data))
nm(data) = sum(.! ismissing.(data))
ci(data::AbstractVector,α=0.05) = meansk(data) - quantile(Normal(),1-α/2)*stdsk(data)/sqrt(nm(data)), meansk(data) + quantile(Normal(),1-α/2)*stdsk(data)/sqrt(nm(data))
cilb(data) = ci(data)[1]
ciub(data) = ci(data)[2]
df = DataFrame(v1=[10,20,300,400,missing],v2=[1,2,10,20,30],cl=[2,2,1,1,1])
dfByCl_w = combine(groupby(df,["cl"]),
nrow,
names(df) .=> meansk .=> col -> col * "_mean",
names(df) .=> stdsk .=> col -> col * "_std",
names(df) .=> cilb .=> col -> col * "_lb",
names(df) .=> ciub .=> col -> col * "_ub",
names(df) .=> nm .=> col -> col * "_nm",
)
orderedNames = vcat("cl","nrow",[ ["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_w = dfByCl_w[:, orderedNames]
toStack = vcat("nrow",[ ["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_l = stack(dfByCl_w,toStack)
dfByCl = unstack(dfByCl_l,"cl","value")
rename!(dfByCl,vcat("var",["cl_$(c)" for c in unique(dfByCl_w.cl)]))
最佳答案
在这种情况下我通常会这样做:
julia> cilb(data,α=0.05) = mean(data) - quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
cilb (generic function with 2 methods)
julia> ciub(data,α=0.05) = mean(data) + quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
ciub (generic function with 2 methods)
julia> combine(groupby(df, :cl),
nrow,
sdf -> describe(sdf, :mean, :std, cilb => :lb, ciub => :ub, :nmissing, cols=r"v"))
4×8 DataFrame
Row │ cl nrow variable mean std lb ub nmissing
│ Int64 Int64 Symbol Float64 Float64 Float64 Float64 Int64
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ 1 3 v1 350.0 70.7107 252.002 447.998 1
2 │ 1 3 v2 20.0 10.0 8.68414 31.3159 0
3 │ 2 2 v1 15.0 7.07107 5.20018 24.7998 0
4 │ 2 2 v2 1.5 0.707107 0.520018 2.47998 0
稍后您可以根据需要重新调整它的形状,但也许这个布局是您真正想要的?
关于dataframe - 如何在不显式命名变量的情况下生成分组汇总统计数据,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/72827468/