r - Combining a data file and a label file into a single labelled data frame in R

Tags: r dataframe tidyverse purrr r-haven

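I have two data frames: one holds the survey data (data.csv) and the other holds the label data (label.csv). Here is some sample data (my real data has about 150 variables):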

#sample data

df <- tibble::tribble(
  ~id, ~House_member, ~dob, ~age_quota, ~work, ~sex, ~pss,
  1L,            4L,  1983L,  2L,        2L,     1,      1,
  2L,            1L,  1940L,  7L,        2L,     1,      2,
  3L,            2L,  1951L,  5L,        6L,     1,      1,
  4L,            4L,  1965L,  2L,        2L,     1,      4,
  5L,            3L,  1965L,  2L,        3L,     1,      1,
  6L,            1L,  1951L,  3L,        1L,     1,      3,
  7L,            1L,  1955L,  1L,        1L,     1,      3,
  8L,            4L,  1982L,  2L,        2L,     2,      5,
  9L,            2L,  1990L,  2L,        4L,     2,      3,
  10L,            2L,  1953L, 3L,        2L,     2,      4
)


#sample label data
label <- tibble::tribble(
                ~variable, ~value,                           ~label,
           "House_member",     NA, "How many people live with you?",
           "House_member",     1L,                       "1 person",
           "House_member",     2L,                      "2 persons",
           "House_member",     3L,                      "3 persons",
           "House_member",     4L,                      "4 persons",
           "House_member",     5L,                      "5 persons",
           "House_member",     6L,                      "6 persons",
           "House_member",     7L,                      "7 persons",
           "House_member",     8L,                      "8 persons",
           "House_member",     9L,                      "9 persons",
           "House_member",    10L,                     "10 or more",
                    "dob",     NA,                  "date of brith",
              "age_quota",     NA,                      "age_quota",
              "age_quota",     1L,                          "10-14",
              "age_quota",     2L,                          "15-19",
              "age_quota",     3L,                          "20-29",
              "age_quota",     4L,                          "30-39",
              "age_quota",     5L,                          "40-49",
              "age_quota",     6L,                          "50-70",
              "age_quota",     7L,                           "70 +",
                   "work",     NA,        "what is your occupation?",
                   "work",     1L,                      "full time",
                   "work",     2L,                      "part time",
                   "work",     3L,                        "retired",
                   "work",     4L,                        "student",
                   "work",     5L,                      "housewife",
                   "work",     6L,                     "unemployed",
                   "work",     7L,                          "other",
                   "work",     8L,                   "kid under 15",
                    "sex",     NA,                        "gender?",
                    "sex",     1L,                            "Man",
                    "sex",     2L,                          "Woman",
                    "pss",     NA,       "How often do you use PS?",
                    "pss",     1L,                          "Daily",
                    "pss",     2L,         "several times per week",
                    "pss",     3L,                  "once per week",
                    "pss",     4L,         "several time per month",
                    "pss",     5L,                          "Rarly"
           )
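I would like to know whether there is a way to combine these files into one labelled data frame, like the SPSS-style format (dbl+lbl). I know the labelled package can add value labels to an unlabelled vector, as in this example: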
v <- labelled::labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, maybe = 2, no = 3))
I am hoping there is a better and faster way than adding labels to each variable one by one.

Best answer

Another solution, this one using imap_dfc:

library(tidyverse)

df %>% imap_dfc(~{ 
                  label[label$variable==.y,c('label','value')] %>%
                  deframe() %>% # to named vector
                  haven::labelled(.x,.)
                 })

# A tibble: 10 x 7
          id  House_member       dob age_quota           work       sex                        pss
   <int+lbl>     <int+lbl> <int+lbl> <int+lbl>      <int+lbl> <dbl+lbl>                  <dbl+lbl>
 1         1 4 [4 persons]      1983 2 [15-19] 2 [part time]  1 [Man]   1 [Daily]                 
 2         2 1 [1 person]       1940 7 [70 +]  2 [part time]  1 [Man]   2 [several times per week]
 3         3 2 [2 persons]      1951 5 [40-49] 6 [unemployed] 1 [Man]   1 [Daily]                 
 4         4 4 [4 persons]      1965 2 [15-19] 2 [part time]  1 [Man]   4 [several time per month]
 5         5 3 [3 persons]      1965 2 [15-19] 3 [retired]    1 [Man]   1 [Daily]                 
 6         6 1 [1 person]       1951 3 [20-29] 1 [full time]  1 [Man]   3 [once per week]         
 7         7 1 [1 person]       1955 1 [10-14] 1 [full time]  1 [Man]   3 [once per week]         
 8         8 4 [4 persons]      1982 2 [15-19] 2 [part time]  2 [Woman] 5 [Rarly]                 
 9         9 2 [2 persons]      1990 2 [15-19] 4 [student]    2 [Woman] 3 [once per week]         
10        10 2 [2 persons]      1953 3 [20-29] 2 [part time]  2 [Woman] 4 [several time per month]
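This uses tibble::deframe and haven::labelled, both of which ship with the tidyverse (haven is installed with it but not attached by library(tidyverse), hence the haven:: prefix).
Speed comparison after replacing filter/select with direct access to label: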
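To make the deframe() step concrete, here is a minimal sketch of what it returns for a single variable. The sex_labels table is a hypothetical two-row excerpt of the label data, with the columns already in the (label, value) order the answer selects.

```r
library(tibble)

# Hypothetical excerpt of the label table for the "sex" variable,
# already reduced to the two columns the answer selects
sex_labels <- tribble(
  ~label,  ~value,
  "Man",       1L,
  "Woman",     2L
)

# deframe() turns a two-column tibble into a named vector:
# first column becomes the names, second column the values
named_values <- deframe(sex_labels)
named_values
#>   Man Woman
#>     1     2

# That named vector is the shape haven::labelled() expects for `labels`
haven::labelled(c(1, 1, 2), named_values)
```

This is why the column order c('label', 'value') matters in the answer: swapping the two would give a vector of label texts named by their codes, which is not what haven::labelled() needs.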
Waldi <- function() {
  df %>% imap_dfc(~{
    label[label$variable == .y, c('label', 'value')] %>%
      deframe() %>% # to named vector
      haven::labelled(.x, .)
  })
}

Waldi_old <- function() {
  df %>% imap_dfc(~{
    label %>% filter(variable == .y) %>%
      select(label, value) %>%
      deframe() %>% # to named vector
      haven::labelled(.x, .)
  })
}

#EDIT: included the TIC3() for-loop solution

microbenchmark::microbenchmark(TIC3(),Waldi(),Anil(),TIC1(),Waldi_old(),Sinh())
Unit: microseconds
        expr     min       lq      mean   median       uq     max neval   cld
      TIC3()   688.0   871.80   982.280   920.95  1005.55  1801.6   100 a    
     Waldi()  1345.5  1543.60  1804.758  1635.45  1893.75  4306.8   100  b   
      Anil()  4006.8  4476.65  5188.519  4862.95  5439.10 10163.6   100   c  
      TIC1()  3898.2  4278.80  5009.927  4774.95  5277.05 12916.2   100   c  
 Waldi_old() 18712.3 20091.75 21756.140 20609.35 22169.75 33359.8   100    d 
      Sinh() 22730.9 24093.45 25931.412 24946.00 26614.00 38735.3   100     e
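
One possible refinement, not part of the original answer: the rows of the label table whose value is NA hold the question text, and the code above passes them to haven::labelled() as if they were value labels. The sketch below stores them as the variable label instead, through the label argument of haven::labelled(). The helper add_labels() and the tables df_small and label_small are hypothetical stand-ins for illustration. It assumes each variable has at most one NA-valued row.

```r
library(tibble)
library(purrr)
library(haven)

# Small stand-ins for the question's `df` and `label` tables (assumed data)
df_small <- tribble(
  ~id, ~dob,  ~sex,
  1L,  1983L, 1,
  2L,  1982L, 2
)
label_small <- tribble(
  ~variable, ~value, ~label,
  "dob",     NA,     "date of birth",
  "sex",     NA,     "gender?",
  "sex",     1L,     "Man",
  "sex",     2L,     "Woman"
)

# add_labels() is a hypothetical helper, not a haven or purrr function
add_labels <- function(x, name) {
  rows <- label_small[label_small$variable == name, ]
  # Rows with a code are value labels; the NA-valued row is the question text
  value_rows <- rows[!is.na(rows$value), c("label", "value")]
  var_label <- rows$label[is.na(rows$value)]

  haven::labelled(
    x,
    labels = if (nrow(value_rows) > 0) deframe(value_rows) else NULL,
    label  = if (length(var_label) == 1) var_label else NULL
  )
}

# imap() passes each column and its name to add_labels()
labelled_small <- as_tibble(imap(df_small, add_labels))

attr(labelled_small$sex, "label", exact = TRUE)   # the question text
attr(labelled_small$sex, "labels", exact = TRUE)  # the value labels
```

Columns without any rows in the label table, such as id here, come back as labelled vectors with neither value labels nor a variable label.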

Source: this question on Stack Overflow: https://stackoverflow.com/questions/67504200/
