julia - Is there a better equivalent of Pandas value_counts for Julia data frames?

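I am looking for an equivalent of the very convenient Pandas value_counts for a series (column) of a data frame in Julia.
Unfortunately I could not find anything, so my current solution for value counts on a Julia data frame is shown below. However, I don't like it much, because it is less convenient than simply calling .value_counts() in Pandas. So my question is: are there other (more convenient) options?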

using DataFrames
jdf = DataFrame(rand(Int8, (1000000, 3)))  # on DataFrames.jl 1.x, pass :auto as a second argument

This gives me:

│ Row     │ x1   │ x2   │ x3   │
│         │ Int8 │ Int8 │ Int8 │
├─────────┼──────┼──────┼──────┤
│ 1       │ -97  │ 98   │ 79   │
│ 2       │ -77  │ -118 │ -19  │
⋮
│ 999998  │ -115 │ 17   │ 107  │
│ 999999  │ -43  │ -64  │ 72   │
│ 1000000 │ 40   │ -11  │ 31   │

The value counts for the first column are:

combine(nrow,groupby(jdf,:x1))

│ Row │ x1   │ nrow  │
│     │ Int8 │ Int64 │
├─────┼──────┼───────┤
│ 1   │ -97  │ 3942  │
│ 2   │ -77  │ 3986  │
⋮
│ 254 │ 12   │ 3899  │
│ 255 │ -92  │ 3973  │
│ 256 │ -49  │ 3952  │

Best answer

In DataFrames.jl this is the way to get the desired result. In general, DataFrames.jl aims to keep its API minimal. If you use combine(nrow,groupby(jdf,:x1)) often, you can define:

value_counts(df, col) = combine(groupby(df, col), nrow)

in your script.
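Pandas' value_counts also sorts its result by count in descending order by default. A minimal sketch of a helper that mimics this, assuming DataFrames.jl 1.x; the name value_counts_sorted and the output column name :count are illustrative and not part of any package:

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): count how often each
# distinct value of `col` occurs and sort by count, largest first.
function value_counts_sorted(df::AbstractDataFrame, col)
    counts = combine(groupby(df, col), nrow => :count)
    return sort(counts, :count, rev=true)
end

# Small stand-in data frame instead of the 1,000,000-row jdf.
df = DataFrame(x1 = [1, 2, 2, 3, 3, 3])
vc = value_counts_sorted(df, :x1)
```

With jdf from the question, the call would be value_counts_sorted(jdf, :x1).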
Alternative ways to get what you want use FreqTables.jl (freqtable) or StatsBase.jl (countmap):

julia> freqtable(jdf, :x1)
256-element Named Array{Int64,1}
x1   │
─────┼─────
-128 │ 3875
-127 │ 3931
-126 │ 3924
⋮         ⋮
125  │ 3873
126  │ 3917
127  │ 3975

julia> countmap(jdf.x1)
Dict{Int8,Int64} with 256 entries:
  -98  => 3925
  -74  => 4054
  11   => 3798
  -56  => 3853
  29   => 3765
  -105 => 3918
  ⋮    => ⋮

(the difference between them is the type of the output)
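Since countmap returns a Dict rather than a data frame, a small conversion step is needed if the counts should end up in a data frame again. A minimal sketch, assuming both StatsBase.jl and DataFrames.jl are installed; the column names value and count are arbitrary choices:

```julia
using DataFrames, StatsBase

x = Int8[1, 2, 2, 3, 3, 3]   # small stand-in for jdf.x1
cm = countmap(x)             # Dict mapping each value to its count

# keys(cm) and values(cm) iterate in the same order, so the columns line up.
counts = DataFrame(value = collect(keys(cm)), count = collect(values(cm)))

# Dict iteration order is arbitrary, so sort to get a stable result.
sort!(counts, :value)
```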
In terms of performance, countmap is the fastest and combine is the slowest:

julia> using BenchmarkTools

julia> @benchmark countmap($jdf.x1)
BenchmarkTools.Trial:
  memory estimate:  16.80 KiB
  allocs estimate:  14
  --------------
  minimum time:     436.000 μs (0.00% GC)
  median time:      443.200 μs (0.00% GC)
  mean time:        455.244 μs (0.22% GC)
  maximum time:     5.362 ms (91.59% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark freqtable($jdf, :x1)
BenchmarkTools.Trial:
  memory estimate:  37.22 KiB
  allocs estimate:  86
  --------------
  minimum time:     7.972 ms (0.00% GC)
  median time:      8.089 ms (0.00% GC)
  mean time:        8.158 ms (0.00% GC)
  maximum time:     10.016 ms (0.00% GC)
  --------------
  samples:          613
  evals/sample:     1

julia> @benchmark combine(groupby($jdf,:x1), nrow)
BenchmarkTools.Trial:
  memory estimate:  23.28 MiB
  allocs estimate:  183
  --------------
  minimum time:     12.679 ms (0.00% GC)
  median time:      14.572 ms (8.68% GC)
  mean time:        15.239 ms (14.50% GC)
  maximum time:     20.385 ms (21.83% GC)
  --------------
  samples:          328
  evals/sample:     1

Note that most of the cost in the combine call comes from the grouping, so if you already have a GroupedDataFrame object, combine is fairly fast:

julia> gdf = groupby(jdf,:x1);

julia> @benchmark combine($gdf, nrow)
BenchmarkTools.Trial:
  memory estimate:  16.16 KiB
  allocs estimate:  152
  --------------
  minimum time:     680.801 μs (0.00% GC)
  median time:      714.800 μs (0.00% GC)
  mean time:        737.568 μs (0.15% GC)
  maximum time:     4.561 ms (83.47% GC)
  --------------
  samples:          6766
  evals/sample:     1

Edit
If you want a sorted dictionary, load DataStructures.jl and then run:

sort!(OrderedDict(countmap(jdf.x1)))

or

sort!(OrderedDict(countmap(jdf.x1)), byvalue=true)

depending on whether you want the dictionary sorted by key or by value.
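If an extra dependency is not wanted, a sorted vector of pairs may be enough. This is a sketch of an alternative, not the approach from the answer: it assumes only StatsBase.jl plus the sort function from Base, and the variable names are illustrative:

```julia
using StatsBase

x = Int8[1, 2, 2, 3, 3, 3]   # small stand-in for jdf.x1
cm = countmap(x)

# collect(cm) turns the Dict into a Vector of value => count pairs.
by_key   = sort(collect(cm), by = first)               # ascending by value
by_count = sort(collect(cm), by = last, rev = true)    # descending by count
```

The result is a plain Vector of pairs, so lookups by key are no longer constant time; it is mainly useful for display or for iterating in order.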

This question, "Is there a better equivalent of Pandas value_counts for Julia data frames?", comes from Stack Overflow: https://stackoverflow.com/questions/63100620/
