我怀疑需要/我计划使用但无法工作的包
#Load packages
if(!("pacman" %in% .packages(all.available = T))){
install.packages("pacman")
library("pacman")
}else if(!("pacman" %in% (.packages()))){
library("pacman")
}
p_load(magrittr, plyr, dplyr,
rlang, tibble, tidyr,
purrr)
为此示例生成一些数据:
#For reproducability
set.seed(1)
tib <- tibble(
ID = letters,
A_1 = runif(26),
A_2 = runif(26),
B_1 = runif(26),
B_2 = runif(26),
B_3 = runif(26),
C_1 = runif(26),
C_2 = runif(26),
C_3 = runif(26),
C_4 = runif(26)
)
#Remove some datapoint
for(i in 2:9){
pick_rows <- sample(1:nrow(tib[i]), nrow(tib[i])*.25)
tib[pick_rows, i] <- NA
}
那么我想做的事情的想法如下:
对于每个类别(为每个类别添加一个新列)和行 (ID),检查并标记以下内容:
(a) 所有值都是 NA 吗?标记为“MNAR”
(b) 是否缺少一些但不是全部值?标记为“MAR/MCAR”
(c) 没有缺失值吗?标记为“未丢失”
对我来说,这部分的计算成本应该很低,但在我当前的方法中,这是我的代码中的主要瓶颈。
这是我目前的方法:
for (i in tib %>%
#Only numeric columns contain relevant data
keep(is.numeric) %>%
#Get unique identifiers
colnames() %>% gsub('[0-9]$', '', .) %>% unique()
) {
#Generate a new column
tib[[paste0(i, 'missing')]] <- tib %>%
#Select the conditions columns
select(contains(i)) %>%
#For each row
apply(1, function(x) x %>%
#Check if
{case_when(
#no values, (the most common event)
all(!is.na(.)) ~ 'Not missing',
#all values, (the least most common event)
all(is.na(.)) ~ 'MNAR',
#or any values (the second most common event)
any(is.na(.)) ~ 'MAR/MCAR'
#are missing
)}
)
}
我正在尝试开发的方法,因为我认为它会提供更好的速度:
categories <- tib %>%
keep(is.numeric) %>%
colnames() %>%
gsub('[0-9]$', '', .) %>%
unique()
tib %>%
mutate_at(
vars(syms(grep(paste0(categories, collapse = '|'),
colnames(tib),
value = T))),
funs(missing = case_when(
#no values
all(!is.na(.)) ~ 'Not missing',
#or all values
all(is.na(.)) ~ 'MNAR',
#any values
any(is.na(.)) ~ 'MAR/MCAR'
#are missing
)
)
)
这显然不起作用,但我认为这是一些适合我正在尝试的伪代码。派对它需要从 purrr 调用 map,但此时我什至无法 mutate 来识别正确的列组(我一直在为此使用更原始的代码)。
在 StackOverflow 中搜索我发现了以下线程:
dplyr - mutate formula based on similarities in column names
Conditionally mutate columns based on column class
dplyr mutate multiple columns based on names in vectors
Mutate multiple columns in a dataframe
我不能说其中任何内容与我的问题相关。
编辑:
期望的输出:
> tib
# A tibble: 26 x 13
ID A_1 A_2 B_1 B_2 B_3 C_1 C_2 C_3 C_4 A_missing B_missing C_missing
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 a 0.266 0.0134 0.438 0.777 0.633 0.575 0.530 NA 0.256 Not missi~ Not missi~ MAR/MCAR
2 b 0.372 0.382 0.245 0.961 0.213 NA NA 0.503 0.718 Not missi~ Not missi~ MAR/MCAR
3 c 0.573 0.870 0.0707 NA 0.129 0.0355 NA 0.877 0.961 Not missi~ MAR/MCAR MAR/MCAR
4 d 0.908 NA NA 0.713 0.478 NA NA 0.189 0.100 MAR/MCAR MAR/MCAR MAR/MCAR
5 e 0.202 NA 0.316 0.400 0.924 NA NA NA 0.763 MAR/MCAR Not missi~ MAR/MCAR
6 f 0.898 0.600 0.519 NA 0.599 0.598 0.895 0.724 0.948 Not missi~ MAR/MCAR Not missi~
7 g 0.945 0.494 0.662 0.757 NA 0.561 NA NA 0.819 Not missi~ MAR/MCAR MAR/MCAR
8 h 0.661 NA 0.407 0.203 NA 0.526 0.780 0.548 0.308 MAR/MCAR MAR/MCAR Not missi~
9 i 0.629 0.827 0.913 0.711 0.357 0.985 0.881 0.712 0.650 Not missi~ Not missi~ Not missi~
10 j NA NA 0.294 0.122 NA 0.508 NA 0.389 0.953 MNAR MAR/MCAR MAR/MCAR
# ... with 16 more rows
最佳答案
一个选项是split
,然后使用map/pmap
library(tidyverse)
f1 <- function(x) case_when(all(!is.na(x)) ~ "Not missing",
all(is.na(x)) ~ "MNAR",
any(is.na(x)) ~ "MAR/MCAR")
tib %>%
keep(is.numeric) %>%
split.default(str_remove(names(.), '_\\d+')) %>%
map_df(~ .x %>%
pmap_chr(~ f1(c(...)))) %>%
rename_all(~ paste0(., '_missing')) %>%
bind_cols(tib, .)
# A tibble: 26 x 13
# ID A_1 A_2 B_1 B_2 B_3 C_1 C_2 C_3 C_4 A_missing B_missing C_missing
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
# 1 a 0.266 0.0134 0.438 0.777 0.633 0.575 0.530 NA 0.256 Not missing Not missing MAR/MCAR
# 2 b 0.372 0.382 0.245 0.961 0.213 NA NA 0.503 0.718 Not missing Not missing MAR/MCAR
# 3 c 0.573 0.870 0.0707 NA 0.129 0.0355 NA 0.877 0.961 Not missing MAR/MCAR MAR/MCAR
# 4 d 0.908 NA NA 0.713 0.478 NA NA 0.189 0.100 MAR/MCAR MAR/MCAR MAR/MCAR
# 5 e 0.202 NA 0.316 0.400 0.924 NA NA NA 0.763 MAR/MCAR Not missing MAR/MCAR
# 6 f 0.898 0.600 0.519 NA 0.599 0.598 0.895 0.724 0.948 Not missing MAR/MCAR Not missing
# 7 g 0.945 0.494 0.662 0.757 NA 0.561 NA NA 0.819 Not missing MAR/MCAR MAR/MCAR
# 8 h 0.661 NA 0.407 0.203 NA 0.526 0.780 0.548 0.308 MAR/MCAR MAR/MCAR Not missing
# 9 i 0.629 0.827 0.913 0.711 0.357 0.985 0.881 0.712 0.650 Not missing Not missing Not missing
#10 j NA NA 0.294 0.122 NA 0.508 NA 0.389 0.953 MNAR MAR/MCAR MAR/MCAR
# ... with 16 more rows
或者另一种选择是聚集
为“长”格式,然后在应用函数f1
创建新列后将其传播
回来
tib %>%
gather(key, val, -ID) %>%
separate(key, into = c('key1', 'key2')) %>%
group_by(ID, key1) %>%
mutate(missing = f1(val)) %>%
select(-val, -key2) %>%
distinct() %>%
spread(key1, missing) %>%
rename_at(vars(A:C), ~ paste0(., '_missing')) %>%
left_join(tib, .)
关于r - 根据列名称的共性进行变异,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/53366159/