r - User-defined function to analyze multiple CSVs, create variables, and count NAs by group in R

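I have about 100 datasets with different variables (and different numbers of variables), but each dataset has a household ID (hh_ID) as an identifier. The variables represent survey questions. Each CSV represents a different type of survey. I want to write a custom function that counts how many times a household was asked a question and how many times it skipped the question (NA). The trouble I'm running into is renaming the variables and doing the counts across the CSVs.

Suppose two data frames look like this: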

hh_ID <- c(1,1,2,2,2)
question1 <- c(NA,1,0,0,0)
question2 <- c(1,1,NA,0,0)
df1 <- data.frame(hh_ID, question1, question2)

hh_ID <- c(1,1,1,2,2)
question3 <- c(NA,NA,0,0,0)
question4 <- c(1,1,1,NA,NA)
df2 <- data.frame(hh_ID, question3, question4)

## > df1
##   hh_ID question1 question2
## 1     1        NA         1
## 2     1         1         1
## 3     2         0        NA
## 4     2         0         0
## 5     2         0         0
## > df2
##   hh_ID question3 question4
## 1     1        NA         1
## 2     1        NA         1
## 3     1         0         1
## 4     2         0        NA
## 5     2         0        NA

I need the final data frame to look like this:

question1_count <- c(2,3)
question1_NAs   <- c(1,0)
question2_count <- c(2,3)
question2_NAs   <- c(0,1)
question3_count <- c(3,2)
question3_NAs   <- c(2,0)
question4_count <- c(3,2)
question4_NAs <- c(0,2)
finaldf <- data.frame(unique(hh_ID),question1_count, question1_NAs,question2_count,question2_NAs,question3_count,question3_NAs, question4_count,question4_NAs) 

## > finaldf
##   unique.hh_ID. question1_count question1_NAs question2_count question2_NAs question3_count question3_NAs question4_count question4_NAs
## 1             1               2             1               2             0               3             2               3             0
## 2             2               3             0               3             1               2             0               2             2

Here is what I have so far:

# read in each dta file
filenames <- list.files(path=mydirectory, pattern=".*dta")
for (i in 1:length(filenames)){
assign(filenames[i], read_dta(paste("", filenames[i], sep=''))
)}

variable_NA_count <- function(dataset, col_name){
temp <- dataset %>% group_by(hh_ID) %>% summarise(question_count = n()) 
temp1 <- aggregate(col_name ~ hh_ID, data=dataset, function(x) {sum(is.na(x))}, na.action = NULL)
final <- merge(temp, temp1, by = "hh_ID")
return(final)}

frequency <- function(dataset, col_name){
temp <- variable_NA_count(dataset, col_name)
temp <- temp %>% select(question1_count = question_count,
                        question1_NAs = col_name)}

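The problem is that I want each variable name to end in "_count" and "_NAs" without explicitly writing "question1_count = question_count". My CSVs contain hundreds of variables, so I need a function that reads each CSV, reads every column name, and counts how many times a household was asked a question and how many times it did not answer. I have tried various approaches, such as the paste function, but keep hitting a wall.

Thanks!

Best answer

You can take advantage of dplyr's summarize_all function:

It summarizes every non-grouping column of the data frame with one or more given functions and builds sensible column names (the original column name followed by the function name).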

library(dplyr)

df1 %>%
  group_by(hh_ID) %>% 
  summarize_all(.funs = list(count = ~n(), NAs = ~sum(is.na(.))))
#> # A tibble: 2 x 5
#>   hh_ID question1_count question2_count question1_NAs question2_NAs
#>   <dbl>           <int>           <int>         <int>         <int>
#> 1     1               2               2             1             0
#> 2     2               3               3             0             1

Created on 2020-04-01 by the reprex package (v0.3.0)
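In dplyr 1.0.0 and later, summarize_all is marked as superseded in favor of across(). The following is a minimal sketch of the same summary written with across(), assuming a dplyr version that provides it; it is an alternative spelling, not what the answer above ran. One visible difference is the column order: across() groups the output by original column, so each `_count` column is followed directly by its `_NAs` column.

```r
library(dplyr)

# Same example data as df1 in the question
df1 <- data.frame(
  hh_ID     = c(1, 1, 2, 2, 2),
  question1 = c(NA, 1, 0, 0, 0),
  question2 = c(1, 1, NA, 0, 0)
)

# everything() skips the grouping column hh_ID;
# the names of the list become the column suffixes
summary1 <- df1 %>%
  group_by(hh_ID) %>%
  summarise(across(
    everything(),
    list(count = ~n(), NAs = ~sum(is.na(.x)))
  ))

names(summary1)
```

The default naming scheme of across() is `{.col}_{.fn}`, which is what produces the `question1_count` and `question1_NAs` style names without any manual renaming.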

We can use purrr's map function to apply the same operation to a list of data frames:

library(dplyr)
library(purrr)

list(df1, df2) %>% 
  map(~{
    .x %>%
      group_by(hh_ID) %>% 
      summarize_all(.funs = list(count = ~n(), NAs = ~sum(is.na(.))))
  }) %>% 
  reduce(full_join)
#> Joining, by = "hh_ID"
#> # A tibble: 2 x 9
#>   hh_ID question1_count question2_count question1_NAs question2_NAs
#>   <dbl>           <int>           <int>         <int>         <int>
#> 1     1               2               2             1             0
#> 2     2               3               3             0             1
#> # … with 4 more variables: question3_count <int>, question4_count <int>,
#> #   question3_NAs <int>, question4_NAs <int>


map returns a list of data frames, but we want them joined into one, so we combine them with full_join (or any other *_join you see fit).
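Leaving the join key implicit works here because hh_ID is the only column the summaries share. The sketch below names the key explicitly and adds a third, made-up data frame (`df3`, not part of the question) to show what full_join does when a household appears in only some surveys: its columns from the other surveys are filled with NA. The helper name `count_by_household` is also hypothetical.

```r
library(dplyr)
library(purrr)

df1 <- data.frame(hh_ID = c(1, 1, 2, 2, 2),
                  question1 = c(NA, 1, 0, 0, 0),
                  question2 = c(1, 1, NA, 0, 0))
df2 <- data.frame(hh_ID = c(1, 1, 1, 2, 2),
                  question3 = c(NA, NA, 0, 0, 0),
                  question4 = c(1, 1, 1, NA, NA))
# Hypothetical third survey: household 1 is absent, household 3 is new
df3 <- data.frame(hh_ID = c(2, 3), question5 = c(1, NA))

# Per-household row count and NA count for every question column
count_by_household <- function(df) {
  df %>%
    group_by(hh_ID) %>%
    summarize_all(.funs = list(count = ~n(), NAs = ~sum(is.na(.))))
}

joined <- list(df1, df2, df3) %>%
  map(count_by_household) %>%
  # Explicit key: avoids the "Joining, by = ..." message and accidental
  # joins on other columns that happen to share a name
  reduce(function(x, y) full_join(x, y, by = "hh_ID"))

joined
```

If only households present in every survey are wanted, inner_join can be swapped in for full_join in the same place.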

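Finally, we can glue this together with reading the files: list.files() returns a character vector of file paths, and we can apply map to it. Note that full.names = TRUE is needed so that the paths include mydirectory; without it, read_dta only finds the files when mydirectory is the working directory.

For each file, read it, summarize it, and join the results:

library(dplyr)
library(purrr)
library(haven)

# "\\.dta$" matches only names that end in ".dta"
list.files(path = mydirectory, pattern = "\\.dta$", full.names = TRUE) %>% 
  map(~{
    read_dta(.x) %>%
      group_by(hh_ID) %>% 
      summarize_all(.funs = list(count = ~n(), NAs = ~sum(is.na(.))))
  }) %>% 
  reduce(full_join)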


(Output not shown, since I don't have a directory containing *.dta files.)
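To try the file-based pipeline without real survey data, one option is to write the two example data frames to a temporary directory first. This is only a sketch under stated assumptions: the directory name `surveys_demo`, the file names, and the helper `summarise_survey` are all made up for illustration, and haven's write_dta is used solely to create test input.

```r
library(dplyr)
library(purrr)
library(haven)

df1 <- data.frame(hh_ID = c(1, 1, 2, 2, 2),
                  question1 = c(NA, 1, 0, 0, 0),
                  question2 = c(1, 1, NA, 0, 0))
df2 <- data.frame(hh_ID = c(1, 1, 1, 2, 2),
                  question3 = c(NA, NA, 0, 0, 0),
                  question4 = c(1, 1, 1, NA, NA))

# Hypothetical scratch directory holding one .dta file per survey
survey_dir <- file.path(tempdir(), "surveys_demo")
dir.create(survey_dir, showWarnings = FALSE)
write_dta(df1, file.path(survey_dir, "survey1.dta"))
write_dta(df2, file.path(survey_dir, "survey2.dta"))

# Read one file and summarize it per household
summarise_survey <- function(path) {
  read_dta(path) %>%
    group_by(hh_ID) %>%
    summarize_all(.funs = list(count = ~n(), NAs = ~sum(is.na(.))))
}

result <- list.files(survey_dir, pattern = "\\.dta$", full.names = TRUE) %>%
  map(summarise_survey) %>%
  reduce(function(x, y) full_join(x, y, by = "hh_ID"))

result
```

Pulling the per-file work into a named function keeps the pipeline readable and makes it easy to test the summary on a single file before running it over all of them.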

A similar question about "r - User-defined function to analyze multiple CSVs, create variables, and count NAs by group in R" can be found on Stack Overflow: https://stackoverflow.com/questions/60971251/
