r - Programmatically rename data frame columns using abbreviations derived from labels

Tags: r tidyverse

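I have a labelled data frame with arbitrary column names, and I would like to rename the columns in a non-arbitrary way using the labels.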

Here is a simplified version of the data frame:

library(labelled)
library(tidyverse)

df <- tibble(id = "a", B101 = 1, B102 = 2, B103 = 3, B104 = .1)

For the columns to be renamed, each label has two or three components (separated by a colon plus a single space):

var_label(df) <-
  list(
    id = "ID",
    B101 = "Estimates: Less than $10,000: Less than 20.0 percent", 
    B102 = "Estimates: $10,000 to $19,999: 20.0 to 24.9 percent",
    B103 = "Estimates: $10,000 to $19,999",
    B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
  )

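So each column's label can have two components (e.g., B103) or three components (e.g., B102). If a label does not follow this component structure (e.g., id), the column does not need to be renamed.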
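To make the component structure concrete, here is a minimal sketch that splits a few of the labels on the colon-plus-space separator and counts the parts. The name `label_parts` is hypothetical, and only stringr is assumed to be installed.

```r
library(stringr)

# A few labels copied from the question above.
labels <- c(
  id   = "ID",
  B102 = "Estimates: $10,000 to $19,999: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999"
)

# Split on colon + single space; fixed() treats the separator literally.
label_parts <- str_split(labels, fixed(": "))
names(label_parts) <- names(labels)

# Number of components per column: id has 1, B102 has 3, B103 has 2.
lengths(label_parts)
```

A column whose label yields a single part, such as id, is one that needs no renaming.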

I would like to abbreviate the components of the labels as follows:

  • Component 1
    • "Estimates:" -> e
    • "Margins of error:" -> m
  • Component 2
    • "Less than $10,000:" or "Less than $10,000" -> i0to9
    • "$10,000 to $19,999:" or "$10,000 to $19,999" -> i10to19
  • Component 3
    • "Less than 20.0 percent" -> p0to19
    • "20.0 to 24.9 percent" -> p20to24

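I would then like to rename each variable by concatenating its abbreviated components, separated by underscores. Obviously, doing this manually in an ad hoc way is simple: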

df %>% 
  rename(e_i0to9_p0to19 = B101,
         e_i10to19_p20to24 = B102,
         e_i10to19 = B103, 
         m_i0to9_p0to19 = B104)

But how can I do this programmatically using tidyverse principles and packages?

Best answer

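This is a somewhat lengthy solution whose goal is to stay flexible when the structure or the values of the mapping change. If your problem is a one-off, I'd recommend the other good answers already given here. I describe the benefits of this solution at the end.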

First, define your mappings in a table. This makes it easy to change them in the future or to add new ones if necessary:

library(tidyverse)
labels = list(
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent", 
  B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999",
  B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
)

components = tribble(
  ~ id, ~ name, ~ new_name,
  1, "Estimates", "e",
  1, "Margins of error", "m",
  2, "Less than $10,000", "i0to9",
  2, "$10,000 to $19,999", "i10to19",
  3, "Less than 20.0 percent", "p0to19",
  3, "20.0 to 24.9 percent", "p20to24"
)

From this we can generate a regular expression:

component_regex = components %>%
  split(.$id) %>%
  # Fix dollar signs
  map(~ str_replace_all(.x$name, "\\$", "\\\\$")) %>%
  # Include a regex condition for the possibility of a trailing colon
  map(~ map_chr(.x, paste0, "[\\:]?")) %>%
  map_chr(paste, collapse = "|") %>%
  # Some components may not be present
  paste0("(", ., ")?") %>%
  # Spaces in between each component
  paste(collapse = "[ ]?")

Here is the regular expression:

component_regex
#> [1] "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"
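As a quick sanity check, the pattern printed above can be tried on a single label. This is only a sketch: the pattern is pasted in as a literal string rather than rebuilt from the components table, and `m` is a hypothetical name.

```r
library(stringr)

# The pattern printed above, copied verbatim as an R string literal.
component_regex <- paste0(
  "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?",
  "(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?",
  "(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"
)

# str_match() returns one row per input: the full match, then one column per group.
m <- str_match("Estimates: $10,000 to $19,999", component_regex)

m[1, 2]  # first component, trailing colon still attached
m[1, 3]  # second component
m[1, 4]  # third component is absent, so this group is NA
```

The trailing colon captured in the first group is why a later step strips ":" from the end of each extracted value.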

Now we extract the components from each label to create a data frame:

data_labels = labels %>% 
  map(str_match, pattern = component_regex) %>%
  map(as.data.frame) %>% 
  reduce(bind_rows) %>%
  select(-V1) %>%
  map_df(str_replace, pattern = ":$", replacement = "") %>%
  mutate(col_name = names(labels))

# A tibble: 4 x 4
  V2               V3                 V4                     col_name
  <chr>            <chr>              <chr>                  <chr>   
1 Estimates        Less than $10,000  Less than 20.0 percent B101    
2 Estimates        Less than $10,000  20.0 to 24.9 percent   B102    
3 Estimates        $10,000 to $19,999 NA                     B103    
4 Margins of error Less than $10,000  Less than 20.0 percent B104    

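Now we pivot this table so that we can join it to the components table from earlier and extract the new names. I'll show the partial result first so you can see what is happening: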

data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id
  mutate(id = as.numeric(str_extract(id, "[0-9]+")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name"))

# A tibble: 11 x 4
   col_name    id value                  new_name
   <chr>    <dbl> <chr>                  <chr>   
 1 B101         1 Estimates              e       
 2 B101         2 Less than $10,000      i0to9   
 3 B101         3 Less than 20.0 percent p0to19  
 4 B102         1 Estimates              e       
 5 B102         2 Less than $10,000      i0to9   
 6 B102         3 20.0 to 24.9 percent   p20to24 
 7 B103         1 Estimates              e       
 8 B103         2 $10,000 to $19,999     i10to19 
 9 B104         1 Margins of error       m       
10 B104         2 Less than $10,000      i0to9   
11 B104         3 Less than 20.0 percent p0to19  
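The effect of the join on a missing component can be seen in isolation. The sketch below uses a hypothetical miniature of the long table (`long`) and of the mapping (`lookup`), with B103's third component set to NA.

```r
library(dplyr)
library(tibble)

# Hypothetical long-format rows for B103, which has no third component.
long <- tribble(
  ~col_name, ~id, ~value,
  "B103", 1, "Estimates",
  "B103", 2, "$10,000 to $19,999",
  "B103", 3, NA_character_
)

# Hypothetical subset of the components mapping.
lookup <- tribble(
  ~id, ~name, ~new_name,
  1, "Estimates", "e",
  2, "$10,000 to $19,999", "i10to19",
  3, "Less than 20.0 percent", "p0to19"
)

# inner_join() keeps only rows with a match in both tables,
# so the NA row for the third component is dropped.
kept <- inner_join(long, lookup, by = c("id", "value" = "name"))
nrow(kept)
```

A left_join() would keep the unmatched row with an NA abbreviation, which would then have to be filtered out before pasting the names together.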

Note that inner_join() drops the cases where a component is missing, such as the third component of B103. Here is how to finish:

new_names = data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id
  mutate(id = as.numeric(str_extract(id, "[0-9]+")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name")) %>%
  group_by(col_name) %>%
  # order(id) gives row positions, so this also works when a leading component is missing
  summarise(final_name = paste(new_name[order(id)], collapse = "_"))

# A tibble: 4 x 2
  col_name final_name     
  <chr>    <chr>          
1 B101     e_i0to9_p0to19 
2 B102     e_i0to9_p20to24
3 B103     e_i10to19      
4 B104     m_i0to9_p0to19 
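The collapsing step can also be sketched on its own. The example below assumes a hypothetical table `parts` whose rows arrive in arbitrary order, and sorts by component id before pasting so the result does not depend on row order.

```r
library(dplyr)
library(tibble)

# Hypothetical abbreviations, deliberately shuffled within each column.
parts <- tribble(
  ~col_name, ~id, ~new_name,
  "B101", 3, "p0to19",
  "B101", 1, "e",
  "B101", 2, "i0to9",
  "B103", 2, "i10to19",
  "B103", 1, "e"
)

# Sort by component id, then join the abbreviations with underscores.
collapsed <- parts %>%
  arrange(col_name, id) %>%
  group_by(col_name) %>%
  summarise(final_name = paste(new_name, collapse = "_"))

collapsed$final_name
```

Sorting with arrange() first is one design choice among several; the point is that the ids decide the order of the parts, not the order in which rows happen to appear.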

Now we just need to replace the old names with the new ones:

old_names = intersect(names(df), new_names$col_name)
df %>% 
  rename_with(
    # Look up each old name explicitly instead of relying on matching row order
    ~ new_names$final_name[match(.x, new_names$col_name)], 
    .cols = all_of(old_names)
  )
# A tibble: 1 x 5
  id    e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
  <chr>          <dbl>           <dbl>     <dbl>          <dbl>
1 a                  1               2         3            0.1
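As an alternative sketch for this last step, dplyr's rename() also accepts a named character vector of the form new = old when it is wrapped in all_of() or any_of(). The small `df` and `new_names` below are hypothetical stand-ins for the objects built above, and `lookup` is a name introduced here for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the data frame and the name table.
df <- tibble(id = "a", B101 = 1, B103 = 3)
new_names <- tibble(
  col_name   = c("B101", "B103"),
  final_name = c("e_i0to9_p0to19", "e_i10to19")
)

# Names of the vector are the new names, values are the old names.
lookup <- setNames(new_names$col_name, new_names$final_name)

# any_of() silently skips old names that are not present in df.
renamed <- rename(df, any_of(lookup))
names(renamed)
```

Using all_of() instead of any_of() would raise an error when a listed column is missing, which may be preferable if a missing column indicates a mistake.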

This solution may look long, but it has some benefits:

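  • The mappings can be stored in a CSV file and modified outside of the code. That is, the code does not actually depend on your mappings.
  • You can add or remove entries for each component.
  • It works regardless of which components are missing.
  • It works with more than three components.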
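The first benefit can be sketched as a round trip through a CSV file. The temporary file path and the name `components_from_csv` are assumptions made for illustration; in practice the file would live alongside the project and be edited by hand.

```r
library(readr)
library(tibble)

# The mapping table from the answer above.
components <- tribble(
  ~id, ~name, ~new_name,
  1, "Estimates", "e",
  1, "Margins of error", "m",
  2, "Less than $10,000", "i0to9",
  2, "$10,000 to $19,999", "i10to19",
  3, "Less than 20.0 percent", "p0to19",
  3, "20.0 to 24.9 percent", "p20to24"
)

# Write the mapping to a temporary CSV; values containing commas are quoted.
path <- tempfile(fileext = ".csv")
write_csv(components, path)

# Read it back with explicit column types: double, character, character.
components_from_csv <- read_csv(path, col_types = "dcc")
```

Once the mapping is read from a file, adding an abbreviation means adding a row to the CSV rather than editing the code.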

Source: Stack Overflow question "Programmatically rename data frame columns using abbreviations derived from labels": https://stackoverflow.com/questions/70703733/
