r - Summarise before collecting in arrow using a column name given as a string

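Suppose I want to summarise a column of an arrow table before collecting (because the real dataset is larger than memory). I can do something like this: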

library(arrow)
library(dplyr)

arrow_table(mtcars) %>% 
  summarise(mean(mpg)) %>% 
  collect()

# A tibble: 1 × 1
#     `mean(mpg)`
#           <dbl>
#   1        20.1

Now suppose I want to do this programmatically, with the column name supplied as a string. In regular (i.e. non-arrow) dplyr, I can use across and all_of, like this:

foo_regular <- function(x){
  mtcars %>% 
    summarise(across(all_of(x), mean)) %>% 
    collect()
}

foo_regular("mpg")

#        mpg
# 1 20.09062

But how do I do this in arrow?

foo_arrow <- function(x){
  arrow_table(mtcars) %>%
    summarise(across(all_of(x), mean)) %>%
    collect()
}

foo_arrow("mpg")

# Warning: Error in summarize_eval(names(exprs)[i], exprs[[i]], ctx, length(.data$group_by_vars) >  : 
# Expression across(all_of(x), mean) is not an aggregate expression or is not supported in Arrow; pulling data into R
# Error:
#   ! Problem while computing `..1 = across(all_of(x), mean)`.
# Caused by error in `across()`:
#   ! Can't subset columns that don't exist.
# ✖ Column `mpg` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.

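Clearly the mean of that column can be computed in arrow before collecting, since my first code block does exactly that, but how do I specify the column name with a string? As I said, the real dataset is very large, so pulling the data into R first is not an option.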

Best answer

[Edited to add: the suggestions below are no longer needed; arrow 10.0.0 has now been released and supports across()]

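In the latest released version of arrow (9.0.0.1), across() is not yet implemented, but it has been implemented in the latest development version, so it should be available in the upcoming release (10.0.0).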
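If the same code has to run on machines with different arrow versions, one option is to check the installed version before relying on across(). The following is only a sketch under that assumption; the function name summarise_cols_arrow is hypothetical and not part of arrow or dplyr.

```r
library(arrow)
library(dplyr)

# Hypothetical helper: refuse to run on arrow releases that predate
# across() support (10.0.0), so the failure is explicit rather than
# the confusing "Column `mpg` doesn't exist" error.
summarise_cols_arrow <- function(cols) {
  if (utils::packageVersion("arrow") < "10.0.0") {
    stop("across() on arrow tables needs arrow >= 10.0.0")
  }
  arrow_table(mtcars) %>%
    summarise(across(all_of(cols), mean)) %>%
    collect()
}

res <- summarise_cols_arrow("mpg")
```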

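For now, you can either install the nightly build of arrow via arrow::install_arrow(nightly = TRUE), which runs your code example successfully, or manually specify the columns/functions to run in summarise() without using across().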
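The second option can still take the column name as a string. The answer does not spell out how, so the following is one possible sketch rather than the answer's own method: it converts the string to a symbol with rlang::sym() and injects it with !!, so that summarise() receives the same expression as a hand-written mean(mpg). The function name foo_arrow_sym is hypothetical.

```r
library(arrow)
library(dplyr)
library(rlang)

# Hypothetical helper: build mean(<column>) from a string without across().
foo_arrow_sym <- function(x) {
  col <- sym(x)  # "mpg" becomes the symbol mpg
  arrow_table(mtcars) %>%
    # !!col is substituted when the expression is captured, and
    # !!x := names the result column after the input string.
    summarise(!!x := mean(!!col)) %>%
    collect()
}

res_sym <- foo_arrow_sym("mpg")
```

Because the injection happens when the expression is captured, the expression handed to arrow is the same mean(mpg) that the question's first code block evaluates before collect().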

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/73897223/
