r - dplyr: conditionally filter by group using if else

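After using group_by in dplyr, I would like to keep all rows in groups that have fewer than x rows, and subsample a specific number of rows from groups that have more than x rows. I'll illustrate with the diamonds dataset, grouped by clarity.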

library(dplyr)
library(ggplot2)  # provides the diamonds dataset

diamonds %>%
    group_by(clarity) %>%
    summarise(count = n())
# A tibble: 8 x 2
  clarity count
  <ord>   <int>
1 I1        741
2 SI2      9194
3 SI1     13065
4 VS2     12258
5 VS1      8171
6 VVS2     5066
7 VVS1     3655
8 IF       1790

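Using this example, I'd like to keep all rows in clarity groups that have 5066 or fewer rows, while in groups with more than 5066 rows I'd like to randomly sample 5000 rows using sample_n without replacement. sample_n without replacement only works when size is less than or equal to the number of rows in the smallest group. I'm stuck after trying many things, but here is an example of my thought process.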

diamonds %>%
  group_by(clarity) %>%
  if_else(n() > 5066, sample_n(size = 5000, replace = F), filter())

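I'm fairly new to dplyr, though I'm reasonably familiar with R in general. I'm sure this is relatively easy, but I haven't seen a clear solution posted. Thanks in advance!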

Edit:

I essentially want the output of the following code, but in a single line of code.

# groups below or equal to 5066
low_sample_groups <- diamonds %>%
  group_by(clarity) %>% 
  filter( n() <= 5066)

# groups above 5066
high_sample_groups <- diamonds %>%
  group_by(clarity) %>% 
  filter( n() > 5066) %>%
  sample_n(size = 5000, replace = F)

desired_result <- full_join(low_sample_groups, high_sample_groups)

Edit, round 2

I found the answer I was looking for here: custom grouped dplyr function (sample_n)

Essentially, this is the solution, using an if statement

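threshold <- 5066    # groups with this many rows or fewer are kept whole
sample_size <- 5000  # rows drawn from each larger group
desired_result <- diamonds %>%
  group_by(clarity) %>% 
  sample_n(if (n() <= threshold) n() else sample_size)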
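To see why the size has to be computed per group, here is a minimal sketch on a made-up data frame. The names `toy`, `grp`, `val`, and `cap` are illustrative assumptions, not part of the original question. It shows a fixed size failing without replacement when one group is too small, and the conditional size avoiding that error.

```r
library(dplyr)

# Hypothetical toy data: group "a" has 3 rows, group "b" has 10
toy <- data.frame(grp = rep(c("a", "b"), times = c(3, 10)), val = 1:13)

# A fixed size larger than the smallest group raises an error
# when sampling without replacement
fixed <- tryCatch(
  toy %>% group_by(grp) %>% sample_n(size = 5, replace = FALSE),
  error = function(e) "error"
)

# Capping the size at each group's own row count avoids the error
cap <- 5
capped <- toy %>%
  group_by(grp) %>%
  sample_n(if (n() < cap) n() else cap)

# Rows per group after sampling
count(capped, grp)
```

Inside sample_n, n() is evaluated separately for each group, so the if expression can return a different size per group.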

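Best answer

We can first split the data frame by the desired variable (the "grouping" step), then map a conditional sampling function over the pieces, choosing the sample size from the number of observations in each group.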

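library(purrr)  # provides map()

diamonds %>%
  split(.$clarity) %>%
  map(function(x) {
    # keep every row of small groups; draw 5000 rows from larger ones
    if (nrow(x) <= 5066) sample_n(x, size = nrow(x), replace = FALSE)
    else sample_n(x, size = 5000, replace = FALSE)
  }) %>%
  bind_rows()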

More concisely

Sample_FUN <- function(x){
              # groups with 5066 rows or fewer keep every row; larger groups get 5000
              if (nrow(x) <= 5066) sample_n(x, size = nrow(x), replace = FALSE)
              else sample_n(x, size = 5000, replace = FALSE)
              } 

diamonds %>% split(.$clarity) %>% map(Sample_FUN) %>% bind_rows()
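As a possible alternative that stays within a grouped pipeline, the same rule could be sketched with group_modify() and slice_sample(), both of which exist in dplyr 1.0 and later. This is not the answer's method, only an illustration under assumptions: `toy`, `grp`, `val`, `threshold`, and `sample_size` are made-up names, and the small numbers stand in for the 5066 and 5000 used in the question.

```r
library(dplyr)

# Hypothetical toy data: "a" has 3 rows, "b" has 10, "c" has 6
toy <- data.frame(
  grp = rep(c("a", "b", "c"), times = c(3, 10, 6)),
  val = seq_len(19)
)

threshold   <- 6  # groups with this many rows or fewer are kept whole
sample_size <- 5  # rows drawn without replacement from larger groups

set.seed(1)  # only to make the random draw repeatable
result <- toy %>%
  group_by(grp) %>%
  # .x is the data for one group, without the grouping column
  group_modify(~ if (nrow(.x) <= threshold) .x else slice_sample(.x, n = sample_size)) %>%
  ungroup()

# Rows per group after sampling
count(result, grp)
```

Unlike the split/map version, small groups are returned unshuffled here, which should not matter when only the set of rows is of interest.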

For more on r - dplyr: conditionally filter by group using if else, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/50938473/
