我有两个数据框,一个是调查数据(data.csv),另一个是标签数据(label.csv)。这是示例数据(我的原始数据大约有 150 个变量)
#sample data
df <- tibble::tribble(
~id, ~House_member, ~dob, ~age_quota, ~work, ~sex, ~pss,
1L, 4L, 1983L, 2L, 2L, 1, 1,
2L, 1L, 1940L, 7L, 2L, 1, 2,
3L, 2L, 1951L, 5L, 6L, 1, 1,
4L, 4L, 1965L, 2L, 2L, 1, 4,
5L, 3L, 1965L, 2L, 3L, 1, 1,
6L, 1L, 1951L, 3L, 1L, 1, 3,
7L, 1L, 1955L, 1L, 1L, 1, 3,
8L, 4L, 1982L, 2L, 2L, 2, 5,
9L, 2L, 1990L, 2L, 4L, 2, 3,
10L, 2L, 1953L, 3L, 2L, 2, 4
)
#sample label data
label <- tibble::tribble(
~variable, ~value, ~label,
"House_member", NA, "How many people live with you?",
"House_member", 1L, "1 person",
"House_member", 2L, "2 persons",
"House_member", 3L, "3 persons",
"House_member", 4L, "4 persons",
"House_member", 5L, "5 persons",
"House_member", 6L, "6 persons",
"House_member", 7L, "7 persons",
"House_member", 8L, "8 persons",
"House_member", 9L, "9 persons",
"House_member", 10L, "10 or more",
"dob", NA, "date of brith",
"age_quota", NA, "age_quota",
"age_quota", 1L, "10-14",
"age_quota", 2L, "15-19",
"age_quota", 3L, "20-29",
"age_quota", 4L, "30-39",
"age_quota", 5L, "40-49",
"age_quota", 6L, "50-70",
"age_quota", 7L, "70 +",
"work", NA, "what is your occupation?",
"work", 1L, "full time",
"work", 2L, "part time",
"work", 3L, "retired",
"work", 4L, "student",
"work", 5L, "housewife",
"work", 6L, "unemployed",
"work", 7L, "other",
"work", 8L, "kid under 15",
"sex", NA, "gender?",
"sex", 1L, "Man",
"sex", 2L, "Woman",
"pss", NA, "How often do you use PS?",
"pss", 1L, "Daily",
"pss", 2L, "several times per week",
"pss", 3L, "once per week",
"pss", 4L, "several time per month",
"pss", 5L, "Rarly"
)
我想知道有什么方法可以将这些文件组合在一起以获得一个标记的数据框,例如 SPSS
的样式格式(dbl+lbl 格式)。我知道 labelled
可以向未标记的向量添加值标签的包,如下例所示:v <- labelled::labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, maybe = 2, no = 3))
我希望有一种比一个一个地为每个变量添加标签更好/更快的方法。
最佳答案
另一个 imap_dfc
解决方案:
library(tidyverse)
df %>% imap_dfc(~{
label[label$variable==.y,c('label','value')] %>%
deframe() %>% # to named vector
haven::labelled(.x,.)
})
# A tibble: 10 x 7
id House_member dob age_quota work sex pss
<int+lbl> <int+lbl> <int+lbl> <int+lbl> <int+lbl> <dbl+lbl> <dbl+lbl>
1 1 4 [4 persons] 1983 2 [15-19] 2 [part time] 1 [Man] 1 [Daily]
2 2 1 [1 person] 1940 7 [70 +] 2 [part time] 1 [Man] 2 [several times per week]
3 3 2 [2 persons] 1951 5 [40-49] 6 [unemployed] 1 [Man] 1 [Daily]
4 4 4 [4 persons] 1965 2 [15-19] 2 [part time] 1 [Man] 4 [several time per month]
5 5 3 [3 persons] 1965 2 [15-19] 3 [retired] 1 [Man] 1 [Daily]
6 6 1 [1 person] 1951 3 [20-29] 1 [full time] 1 [Man] 3 [once per week]
7 7 1 [1 person] 1955 1 [10-14] 1 [full time] 1 [Man] 3 [once per week]
8 8 4 [4 persons] 1982 2 [15-19] 2 [part time] 2 [Woman] 5 [Rarly]
9 9 2 [2 persons] 1990 2 [15-19] 4 [student] 2 [Woman] 3 [once per week]
10 10 2 [2 persons] 1953 3 [20-29] 2 [part time] 2 [Woman] 4 [several time per month]
二手 tibble::deframe
和 haven::labelled
包含在 tidyverse
中更换后速度对比
filter
/select
通过直接访问 label
:Waldi <- function() {
df %>% imap_dfc(~{
label[label$variable==.y,c('label','value')] %>%
deframe() %>% # to named vector
haven::labelled(.x,.)})}
Waldi_old <- function() {
df %>% imap_dfc(~{
label %>% filter(variable==.y) %>%
select(label, value) %>%
deframe() %>% # to named vector
haven::labelled(.x,.)
})}
#EDIT : Included TIC33() for-loop solution
microbenchmark::microbenchmark(TIC3(),Waldi(),Anil(),TIC1(),Waldi_old(),Sinh())
Unit: microseconds
expr min lq mean median uq max neval cld
TIC3() 688.0 871.80 982.280 920.95 1005.55 1801.6 100 a
Waldi() 1345.5 1543.60 1804.758 1635.45 1893.75 4306.8 100 b
Anil() 4006.8 4476.65 5188.519 4862.95 5439.10 10163.6 100 c
TIC1() 3898.2 4278.80 5009.927 4774.95 5277.05 12916.2 100 c
Waldi_old() 18712.3 20091.75 21756.140 20609.35 22169.75 33359.8 100 d
Sinh() 22730.9 24093.45 25931.412 24946.00 26614.00 38735.3 100 e
关于r - 将数据文件和标签文件组合在一起,在 R 中拥有一个单一的标签数据框,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/67504200/