r - Applying group_by and summarise(sum) but keeping columns with non-relevant conflicting data?

Tags: r group-by tidyverse mutate summarize

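My question is very similar to "Applying group_by and summarise on data while keeping all the columns' info",
but I would like to keep the columns that get excluded because their values conflict after grouping.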

Label <- c("203c","203c","204a","204a","204a","204a","204a","204a","204a","204a")
Type <- c("wholefish","flesh","flesh","fleshdelip","formula","formuladelip",
          "formula","formuladelip","wholefish", "wholefishdelip")
Proportion <- c(1,1,0.67714,0.67714,0.32285,0.32285,0.32285, 
                0.32285, 0.67714,0.67714)
N <- (1:10)
C <- (1:10)
Code <- c("c","a","a","b","a","b","c","d","c","d")

df <- data.frame(Label,Type, Proportion, N, C, Code)
df

   Label           Type Proportion  N  C Code
1   203c      wholefish     1.0000  1  1    c
2   203c          flesh     1.0000  2  2    a
3   204a          flesh     0.6771  3  3    a
4   204a     fleshdelip     0.6771  4  4    b
5   204a        formula     0.3228  5  5    a
6   204a   formuladelip     0.3228  6  6    b
7   204a        formula     0.3228  7  7    c
8   204a   formuladelip     0.3228  8  8    d
9   204a      wholefish     0.6771  9  9    c
10  204a wholefishdelip     0.6771 10 10    d

library(dplyr)

total <- df %>% 
  #where the Label and Code are the same the Proportion, N and C 
  #should be added together respectively
  group_by(Label, Code) %>% 
  #total proportion should add up to 1 
  #my way of checking that the correct task has been completed
  summarise_if(is.numeric, sum)

# A tibble: 6 x 5
# Groups:   Label [?]
   Label   Code Proportion     N     C
  <fctr> <fctr>      <dbl> <int> <int>
1   203c      a    1.00000     2     2
2   203c      c    1.00000     1     1
3   204a      a    0.99999     8     8
4   204a      b    0.99999    10    10
5   204a      c    0.99999    16    16
6   204a      d    0.99999    18    18

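Up to this point I get what I want. Now I would like to include the Type column, which was dropped because its values conflict within each group. This is the result I would like to obtain: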
# A tibble: 6 x 6
# Groups:   Label [?]
   Label   Code Proportion     N     C                        Type
  <fctr> <fctr>      <dbl> <int> <int>                      <fctr>
1   203c      a    1.00000     2     2                       flesh
2   203c      c    1.00000     1     1                   wholefish
3   204a      a    0.99999     8     8               flesh_formula
4   204a      b    0.99999    10    10     fleshdelip_formuladelip
5   204a      c    0.99999    16    16           wholefish_formula
6   204a      d    0.99999    18    18 wholefishdelip_formuladelip

I have tried ungroup() as well as some variations of mutate and unite, but to no avail. Any suggestions would be greatly appreciated.

Best answer

Here is a data.table solution. I am assuming you want the mean() of Proportion, since the proportions within these groups may not be additive.

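library(data.table)

# setDT() converts df to a data.table by reference (no copy is made)
setDT(df)

# The second [ must follow the first ] directly; starting a new line
# with [order(Label)] would be parsed as a separate, invalid statement
df[, .(Type = paste(Type, collapse = "_"),
       Proportion = mean(Proportion), N = sum(N), C = sum(C)),
   by = .(Label, Code)][order(Label)]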

   Label Code                        Type Proportion  N  C
1:  203c    c                   wholefish   1.000000  1  1
2:  203c    a                       flesh   1.000000  2  2
3:  204a    a               flesh_formula   0.499995  8  8
4:  204a    b     fleshdelip_formuladelip   0.499995 10 10
5:  204a    c           formula_wholefish   0.499995 16 16
6:  204a    d formuladelip_wholefishdelip   0.499995 18 18
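
The difference between the 0.99999 in the question and the 0.499995 here comes down to sum versus mean. As a small sketch of that arithmetic, the snippet below computes both for the 204a / a group by hand, using only the two Proportion values from the question's data. The variable names p, total_p and mean_p are illustrative only, and which aggregate is appropriate depends on what Proportion represents.

```r
# Proportion values for Label "204a", Code "a" (rows 3 and 5 of df)
p <- c(0.67714, 0.32285)

# What summarise_if(is.numeric, sum) reports for this group
total_p <- sum(p)

# What the data.table answer reports for this group
mean_p <- mean(p)

print(total_p)
print(mean_p)
```

If the proportions in each group are meant to add up to 1, as the question's comments suggest, swapping mean(Proportion) for sum(Proportion) in the answer would reproduce the asker's totals.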

I am not sure whether this is the cleanest dplyr solution, but it works:
df %>% group_by(Label, Code) %>% 
  mutate(Type = paste(Type,collapse="_")) %>% 
  group_by(Label,Type,Code) %>% 
  summarise(N=sum(N),C=sum(C),Proportion=mean(Proportion))

Note that the key here is to regroup after creating the combined Type column:
   Label                        Type   Code     N     C Proportion
  <fctr>                       <chr> <fctr> <int> <int>      <dbl>
1   203c                       flesh      a     2     2   1.000000
2   203c                   wholefish      c     1     1   1.000000
3   204a               flesh_formula      a     8     8   0.499995
4   204a     fleshdelip_formuladelip      b    10    10   0.499995
5   204a           formula_wholefish      c    16    16   0.499995
6   204a formuladelip_wholefishdelip      d    18    18   0.499995
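
As a possible alternative to the mutate-then-regroup approach, the combined Type can also be built inside a single summarise() call, since paste(..., collapse = "_") already returns one value per group. This is a minimal sketch, not the answerer's method: it assumes dplyr 1.0.0 or later (for the .groups argument), rebuilds the question's data so it runs on its own, and uses sum() for Proportion as in the question; the name result is illustrative.

```r
library(dplyr)

# Rebuild the question's data so the example is self-contained
df <- data.frame(
  Label = c("203c", "203c", rep("204a", 8)),
  Type = c("wholefish", "flesh", "flesh", "fleshdelip", "formula",
           "formuladelip", "formula", "formuladelip", "wholefish",
           "wholefishdelip"),
  Proportion = c(1, 1, 0.67714, 0.67714, 0.32285, 0.32285, 0.32285,
                 0.32285, 0.67714, 0.67714),
  N = 1:10,
  C = 1:10,
  Code = c("c", "a", "a", "b", "a", "b", "c", "d", "c", "d")
)

# One summarise() call: collapse the conflicting Type values into a
# single string per group and sum the numeric columns alongside it
result <- df %>%
  group_by(Label, Code) %>%
  summarise(
    Type = paste(Type, collapse = "_"),
    Proportion = sum(Proportion),
    N = sum(N),
    C = sum(C),
    .groups = "drop"  # assumption: dplyr >= 1.0.0
  )

print(result)
```

Replacing sum(Proportion) with mean(Proportion) would give the same Proportion values as the answers above.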

This post is based on a similar question found on Stack Overflow, "Applying group_by and summarise(sum) but keep columns with non-relevant conflicting data?": https://stackoverflow.com/questions/46553514/
