r - Comparing means by group (ANOVA) in R using dplyr

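I have aggregated results (N, mean, sd) of survey questions for different subgroups (e.g. by course, age group, gender). I would like to identify the subgroups in which statistically significant differences exist, so that I can investigate those results further. Ideally, this should all happen while preparing the data for a report in R Markdown using tidyverse/dplyr.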

My data looks like this:

> head(demo, 11)
# A tibble: 11 x 7
# Groups:   qid, subgroup [3]
     qid question subgroup name       N  mean    sd
   <int> <chr>    <chr>    <chr>  <dbl> <dbl> <dbl>
 1     1 noise     NA       total   214  3.65 1.03
 2     1 noise     course   A       11  4     0.77
 3     1 noise     course   B       47  3.55  1.16
 4     1 noise     course   C       31  3.29  1.24
 5     1 noise     course   D       40  3.8   0.85
 6     1 noise     course   E       16  3.38  1.09
 7     1 noise     course   F       11  3.55  1.13
 8     1 noise     course   G       25  4.12  0.73
 9     1 noise     course   H       25  3.68  0.85
10     1 noise     gender   f       120 3.65  1.07
11     1 noise     gender   m       93  3.67  0.98

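What I would like is a new column that indicates TRUE if there is a statistically significant difference within a subgroup for a given question, and FALSE otherwise. Like sigdiff below: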

     qid question subgroup name       N  mean    sd     sigdiff     
   <int> <chr>    <chr>    <chr>  <dbl> <dbl> <dbl>       <lgl>
 2     1 noise     course   A       11  4     0.77        FALSE
 3     1 noise     course   B       47  3.55  1.16        FALSE 
 4     1 noise     course   C       31  3.29  1.24        FALSE 
 5     1 noise     course   D       40  3.8   0.85        FALSE 
 6     1 noise     course   E       16  3.38  1.09        FALSE 
 7     1 noise     course   F       11  3.55  1.13        FALSE 
 8     1 noise     course   G       25  4.12  0.73        FALSE 
 9     1 noise     course   H       25  3.68  0.85        FALSE

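Now, a very neat way to solve this seems to be adapting this approach, which is based on the rpsychi package, to determine whether there is a significant difference between any of the groups.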
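For reference, the F test behind this kind of check can be computed from N, mean and sd alone. The following is a minimal base-R sketch of the standard one-way ANOVA formulas, not rpsychi's actual code; the function name `anova_from_summary` is hypothetical, and the input is the course subgroup from the data above.

```r
# One-way ANOVA F test from per-group summary statistics (base R only).
# n, m, s: vectors of group sizes, group means and group standard deviations.
# NOTE: anova_from_summary is a hypothetical helper, not part of rpsychi.
anova_from_summary <- function(n, m, s, sig.level = 0.05) {
  k <- length(n)                              # number of groups
  grand_mean <- sum(n * m) / sum(n)           # weighted grand mean
  ss_between <- sum(n * (m - grand_mean)^2)   # between-group sum of squares
  ss_within  <- sum((n - 1) * s^2)            # within-group sum of squares
  df1 <- k - 1
  df2 <- sum(n) - k
  f_stat  <- (ss_between / df1) / (ss_within / df2)
  p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
  list(F = f_stat, df = c(df1, df2), p.value = p_value,
       significant = p_value < sig.level)
}

# Course subgroup of question 1 ("noise"), taken from the data above
course <- data.frame(
  N    = c(11, 47, 31, 40, 16, 11, 25, 25),
  mean = c(4, 3.55, 3.29, 3.8, 3.38, 3.55, 4.12, 3.68),
  sd   = c(0.77, 1.16, 1.24, 0.85, 1.09, 1.13, 0.73, 0.85)
)
res <- anova_from_summary(course$N, course$mean, course$sd)
res$significant
```

Comparing the p-value with `sig.level` is equivalent to comparing the observed F value with the critical value from `qf(1 - sig.level, df1, df2)`.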

However, I failed to apply this approach to my grouped tibble. My (failed) attempt was to simply call a function that performs the ANOVA through dplyr's newish group_map:

if(!require(rpsychi)){install.packages("rpsychi")}
library(rpsychi)
if(!require(tidyverse)){install.packages("tidyverse")}
library(tidyverse)

#' function establishing significant difference
#' between survey answers within subgroups

anovagrptest <- function(grpsum) {
  anovaresult <- ind.oneway.second(grpsum$mean, grpsum$sd, grpsum$N, sig.level = 0.05)

  # compare the observed F value against the critical F value
  fcrit <- qf(.95, anovaresult$anova.table$df[1], anovaresult$anova.table$df[2])
  anovaresult$anova.table$F[1] > fcrit
}

#' pass the subset of the data for the group to the function which 
#' "returns a list of results from calling .f on each group"

relquestions <- demo %>% 
  group_by(qid, subgroup) %>% 
  group_map(~ anovagrptest(.x))

The code aborts with "Error in delta.upper + dfb : non-numeric argument to binary operator". Thank you very much for your ideas.

Best answer

I think the row with NA is causing your problem. First of all: I don't think you need that mapping function (but to be honest, I'm not 100% sure).

demo %>% 
  select(-id) %>%
  group_by(qid, subgroup) %>%
  mutate(new_column = ind.oneway.second(mean, sd, N, sig.level = 0.05) %>%
           {qf(.95, .[["anova.table"]][["df"]][1], .[["anova.table"]][["df"]][2]) < .[["anova.table"]][["F"]][1]})

results in

Error: Problem with `mutate()` input `new_column`.
x non-numeric argument to binary operator
i Input `new_column` is ``%>%`(...)`.
i The error occured in group 3: qid = 1, subgroup = NA.
Run `rlang::last_error()` to see where the error occurred.

When I drop the rows containing NA

demo %>% 
  select(-id) %>%
  group_by(qid, subgroup) %>%
  drop_na() %>%
  mutate(new_column = ind.oneway.second(mean, sd, N, sig.level = 0.05) %>%
           {qf(.95, .[["anova.table"]][["df"]][1], .[["anova.table"]][["df"]][2]) < .[["anova.table"]][["F"]][1]})

I get

# A tibble: 10 x 8
# Groups:   qid, subgroup [2]
     qid question subgroup name      N  mean    sd new_column
   <dbl> <chr>    <chr>    <chr> <dbl> <dbl> <dbl> <lgl>  
 1     1 noise    course   A        11  4     0.77 FALSE  
 2     1 noise    course   B        47  3.55  1.16 FALSE  
 3     1 noise    course   C        31  3.29  1.24 FALSE  
 4     1 noise    course   D        40  3.8   0.85 FALSE  
 5     1 noise    course   E        16  3.38  1.09 FALSE  
 6     1 noise    course   F        11  3.55  1.13 FALSE  
 7     1 noise    course   G        25  4.12  0.73 FALSE  
 8     1 noise    course   H        25  3.68  0.85 FALSE  
 9     1 noise    gender   f       120  3.65  1.07 FALSE  
10     1 noise    gender   m        93  3.67  0.98 FALSE
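If the total row should stay in the table, an alternative to dropping it is to guard against groups that contain fewer than two rows. The following is a minimal sketch under that assumption: `sig_from_summary` is a hypothetical helper that uses the standard one-way ANOVA formulas in base R instead of rpsychi, and the column name `sigdiff` is taken from the question.

```r
library(dplyr)

# Hypothetical helper (not from rpsychi): TRUE if the one-way ANOVA F test
# computed from summary statistics is significant, NA if there is only one row.
sig_from_summary <- function(n, m, s, sig.level = 0.05) {
  k <- length(n)
  if (k < 2) return(NA)                       # a single row cannot be compared
  grand_mean <- sum(n * m) / sum(n)
  ms_between <- sum(n * (m - grand_mean)^2) / (k - 1)
  ms_within  <- sum((n - 1) * s^2) / (sum(n) - k)
  pf(ms_between / ms_within, k - 1, sum(n) - k, lower.tail = FALSE) < sig.level
}

# The example data from the question
demo <- tibble::tribble(
  ~qid, ~question, ~subgroup,     ~name,   ~N,  ~mean, ~sd,
  1,    "noise",   NA_character_, "total", 214, 3.65,  1.03,
  1,    "noise",   "course",      "A",     11,  4,     0.77,
  1,    "noise",   "course",      "B",     47,  3.55,  1.16,
  1,    "noise",   "course",      "C",     31,  3.29,  1.24,
  1,    "noise",   "course",      "D",     40,  3.8,   0.85,
  1,    "noise",   "course",      "E",     16,  3.38,  1.09,
  1,    "noise",   "course",      "F",     11,  3.55,  1.13,
  1,    "noise",   "course",      "G",     25,  4.12,  0.73,
  1,    "noise",   "course",      "H",     25,  3.68,  0.85,
  1,    "noise",   "gender",      "f",     120, 3.65,  1.07,
  1,    "noise",   "gender",      "m",     93,  3.67,  0.98
)

result <- demo %>%
  group_by(qid, subgroup) %>%
  # the helper returns one value per group; mutate() recycles it to every row
  mutate(sigdiff = sig_from_summary(N, mean, sd)) %>%
  ungroup()
```

With this guard the total row is kept and gets `NA` in `sigdiff`, while the course and gender rows receive the result of their group's test.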

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/62491357/
