julia - How to do fast group-by operations in Julia?

Tags: julia

Specifically, I want something like R's data.table syntax, d[, function(...), by = key]. Using an answer to another Stack Overflow question (Julia Dataframe group by and pivot tables functions), I came up with this solution:

using DataFrames

df = DataFrame(Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
               Class = ["H", "L", "H", "L", "L", "H", "H", "L", "L", "M"],
               Address = ["12 Silver", "10 Fak", "12 Silver", "1 North", "10 Fak", "2 Fake", "1 Red", "1 Dog", "2 Fake", "1 White"],
               Score = ["4", "5", "3", "2", "1", "5", "4", "3", "2", "1"])


julia> by(df, :Location, d -> DataFrame(count=nrow(d)))
4x2 DataFrames.DataFrame
| Row | Location | count |
|-----|----------|-------|
| 1   | "DC"     | 1     |
| 2   | "NY"     | 3     |
| 3   | "SF"     | 3     |
| 4   | "TX"     | 3     |

This works, but it turns out to be very slow on large datasets. Is there a faster solution?

Best answer

For counting, the following solution is faster but less readable:

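using StatsBase  # countmap is defined in StatsBase

# Build a Dict of value => number of occurrences, then turn it into a DataFrame
cmap = countmap(df[:Location])
res = DataFrame(Location=collect(keys(cmap)), count=collect(values(cmap)))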
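
To illustrate why a counting map is cheap, here is a minimal Base-only sketch of the idea: one pass over the column with a Dict. The name count_by_key is hypothetical and is not part of StatsBase or DataFrames; the actual countmap implementation may differ in its details.

```julia
# Hypothetical helper (not part of StatsBase or DataFrames): count how often
# each value occurs in a vector using a single pass and a Dict.
function count_by_key(keys_col::AbstractVector{K}) where {K}
    counts = Dict{K,Int}()
    for k in keys_col
        # get returns the current count, or 0 the first time k is seen
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

# Same Location values as in the question
locations = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"]
counts = count_by_key(locations)
```

The result is a Dict mapping each location to its count, which can be converted to a DataFrame in the same way as cmap above.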

Or, more generally (again for counting):
countdf(df::DataFrame, fld) = 
  ( h = countmap(df[fld]) ; DataFrame(collect.([keys(h),values(h)]),[fld,:count]) )

which gives:
julia> countdf(df,:Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ count │
├─────┼──────────┼───────┤
│ 1   │ "DC"     │ 1     │
│ 2   │ "SF"     │ 3     │
│ 3   │ "NY"     │ 3     │
│ 4   │ "TX"     │ 3     │
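
For comparison, recent DataFrames.jl releases no longer provide by; the grouped count from the question is written with groupby and combine instead. The sketch below assumes DataFrames.jl 1.x is installed and makes no claim about how its speed compares with the countmap approach.

```julia
using DataFrames

df = DataFrame(Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
               Score = ["4", "5", "3", "2", "1", "5", "4", "3", "2", "1"])

# Assumes DataFrames.jl 1.x: split df by Location, then count the rows in each group.
# nrow => :count names the resulting column "count".
counts = combine(groupby(df, :Location), nrow => :count)
```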

For other aggregation functions (those that can be computed sequentially, one row at a time), we can define these functions:
foldmap(op, v0, df, col) = 
  foldl((x,y)->setindex!(x,op(get(x,y[col],v0),y),y[col]),
  Dict{eltype(df[col]),typeof(v0)}(), eachrow(df))
folddf(op, v0, df, col) = 
  (h = foldmap(op, v0, df, col) ; 
   DataFrame(collect.([keys(h),values(h)]),[col,:res]) )

inc1(x,y) = x+1
sumScore(x,y) = x+y[:Score]
maxScore(x,y) = max(x,y[:Score])

With these definitions:
julia> eltype(df[:Score])<:Real || ( df[:Score] = parse.(Float64, df[:Score]) );

julia> foldmap(inc1, 0, df, :Location)
Dict{String,Int64} with 4 entries:
  "DC" => 1
  "SF" => 3
  "NY" => 3
  "TX" => 3

julia> folddf(sumScore, 0.0, df, :Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ res  │
├─────┼──────────┼──────┤
│ 1   │ "DC"     │ 1.0  │
│ 2   │ "SF"     │ 11.0 │
│ 3   │ "NY"     │ 9.0  │
│ 4   │ "TX"     │ 9.0  │

julia> folddf(maxScore, 0.0, df, :Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ res │
├─────┼──────────┼─────┤
│ 1   │ "DC"     │ 1.0 │
│ 2   │ "SF"     │ 5.0 │
│ 3   │ "NY"     │ 4.0 │
│ 4   │ "TX"     │ 4.0 │
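
The same sequential fold can also be sketched on plain column vectors instead of eachrow(df). This is an illustrative variant, not the answer's code: fold_by_key is a hypothetical name, and here op receives the value itself rather than a whole row. Passing the columns as vectors lets the compiler see their concrete element types, which may help performance, although that is not measured here.

```julia
# Hypothetical helper: fold `op` over `vals`, keeping one accumulator per key.
# `v0` is the initial accumulator for a key that has not been seen yet.
function fold_by_key(op, v0::A, keys_col::AbstractVector{K}, vals::AbstractVector) where {A,K}
    acc = Dict{K,A}()
    for (k, v) in zip(keys_col, vals)
        # Combine the accumulator stored for k (or v0) with the new value
        acc[k] = op(get(acc, k, v0), v)
    end
    return acc
end

# Same Location and (already parsed) Score values as in the example above
locations = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"]
scores = [4.0, 5.0, 3.0, 2.0, 1.0, 5.0, 4.0, 3.0, 2.0, 1.0]

sums = fold_by_key(+, 0.0, locations, scores)                # sum of scores per location
maxs = fold_by_key(max, -Inf, locations, scores)             # maximum score per location
counts = fold_by_key((n, _) -> n + 1, 0, locations, scores)  # number of rows per location
```

Seeding the maximum with -Inf instead of 0.0 keeps the result correct if a column contains negative values; with the non-negative scores used here both seeds give the same answer.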

Source: a similar question on Stack Overflow about fast group-by operations in Julia: https://stackoverflow.com/questions/47420359/
