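I have a labelled data frame with arbitrary column names, and I want to rename the columns in a non-arbitrary way using the labels.
Here is a simplified version of the data frame: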
library(labelled)
library(tidyverse)
df <- tibble(id = "a", B101 = 1, B102 = 2, B103 = 3, B104 = .1)
For the columns to be renamed, each label has two or three components (separated by a colon plus a single space):
var_label(df) <-
  list(
    id = "ID",
    B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
    B102 = "Estimates: $10,000 to $19,999: 20.0 to 24.9 percent",
    B103 = "Estimates: $10,000 to $19,999",
    B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
  )
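So each column's label may have two components (e.g., B103) or three components (e.g., B102). If a label has none of these components (e.g., id), the column does not need to be renamed.
I want to abbreviate the components of the labels as follows:
- Component 1
  - "Estimates:" -> e
  - "Margins of error:" -> m
- Component 2
  - "Less than $10,000:" or "Less than $10,000" -> i0to9
  - "$10,000 to $19,999:" or "$10,000 to $19,999" -> i10to19
- Component 3
  - "Less than 20.0 percent" -> p0to19
  - "20.0 to 24.9 percent" -> p20to24
Then I want to rename each variable by concatenating its components, separated by underscores. Obviously, it is easy to do this manually in an ad hoc way: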
df %>%
  rename(e_i0to9_p0to19 = B101,
         e_i10to19_p20to24 = B102,
         e_i10to19 = B103,
         m_i0to9_p0to19 = B104)
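But how can I accomplish this programmatically using tidyverse principles and packages?
Best answer
Here is a slightly verbose solution whose goal is to stay flexible in the face of any change to the structure or values of the mapping. If your problem is a one-off, I recommend the other good answers already given here. I describe the benefits of this solution at the end.
First, define your mappings in a table. This lets you easily change them in the future or add new ones when necessary: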
library(tidyverse)
labels = list(
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
  B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999",
  B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
)
components = tribble(
  ~ id, ~ name,                   ~ new_name,
  1,    "Estimates",              "e",
  1,    "Margins of error",       "m",
  2,    "Less than $10,000",      "i0to9",
  2,    "$10,000 to $19,999",     "i10to19",
  3,    "Less than 20.0 percent", "p0to19",
  3,    "20.0 to 24.9 percent",   "p20to24"
)
From this we can generate a regular expression:
component_regex = components %>%
  split(.$id) %>%
  # Escape dollar signs so they are matched literally
  map(~ str_replace_all(.x$name, "\\$", "\\\\$")) %>%
  # Include a regex condition for the possibility of there being a colon
  map(~ map_chr(.x, paste0, "[\\:]?")) %>%
  # Join the alternatives of each component with "|"
  map_chr(paste, collapse = "|") %>%
  # Some components may not be present
  paste0("(", ., ")?") %>%
  # Spaces in between each component
  paste(collapse = "[ ]?")
Here is the regular expression:
component_regex
#> [1] "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"
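To see what this pattern captures, here is a minimal sketch that applies it to two of the labels with stringr's str_match(). The pattern is copied by hand from the output above so the snippet runs on its own; the names three_parts and two_parts are only illustrative.

```r
library(stringr)

# The pattern printed above, written out by hand so this sketch is self-contained
component_regex <- paste0(
  "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?",
  "(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?",
  "(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"
)

# str_match() returns a matrix: column 1 is the full match,
# columns 2 to 4 are the three capture groups (one per component)
three_parts <- str_match(
  "Estimates: Less than $10,000: Less than 20.0 percent",
  component_regex
)
two_parts <- str_match("Estimates: $10,000 to $19,999", component_regex)

three_parts[1, 2:4]
# "Estimates:" "Less than $10,000:" "Less than 20.0 percent"
two_parts[1, 2:4]
# "Estimates:" "$10,000 to $19,999" NA
```

The captured groups still carry their trailing colon, and a component that is absent comes back as NA. Both of these are dealt with in the steps that follow.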
Now we extract the components from each label to create a data frame:
data_labels = labels %>%
  map(str_match, pattern = component_regex) %>%
  map(as.data.frame) %>%
  reduce(bind_rows) %>%
  select(-V1) %>%
  map_df(str_replace, pattern = ":$", replacement = "") %>%
  mutate(col_name = names(labels))
# A tibble: 4 x 4
  V2               V3                 V4                     col_name
  <chr>            <chr>              <chr>                  <chr>
1 Estimates        Less than $10,000  Less than 20.0 percent B101
2 Estimates        Less than $10,000  20.0 to 24.9 percent   B102
3 Estimates        $10,000 to $19,999 NA                     B103
4 Margins of error Less than $10,000  Less than 20.0 percent B104
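Now we reshape this table so that we can join it to the earlier components table and extract the new names. I'll show a partial result first so you can see what is happening: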
data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id: V2..V4 hold components 1..3
  mutate(id = as.numeric(str_extract(id, "[0-9]+")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name"))
# A tibble: 11 x 4
   col_name    id value                  new_name
   <chr>    <dbl> <chr>                  <chr>
 1 B101         1 Estimates              e
 2 B101         2 Less than $10,000      i0to9
 3 B101         3 Less than 20.0 percent p0to19
 4 B102         1 Estimates              e
 5 B102         2 Less than $10,000      i0to9
 6 B102         3 20.0 to 24.9 percent   p20to24
 7 B103         1 Estimates              e
 8 B103         2 $10,000 to $19,999     i10to19
 9 B104         1 Margins of error       m
10 B104         2 Less than $10,000      i0to9
11 B104         3 Less than 20.0 percent p0to19
Note that inner_join() drops the cases that have no third component from the data. Here is how to finish:
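As a small illustration of why inner_join() matters here, the sketch below compares it with left_join() on the pieces of B103, whose third component is NA. The tables pieces, kept_inner, and kept_left are hypothetical cut-down stand-ins, not objects from the code above.

```r
library(dplyr)
library(tibble)

# Cut-down mapping table with the same column layout as `components`
components <- tribble(
  ~id, ~name,                ~new_name,
  1,   "Estimates",          "e",
  2,   "$10,000 to $19,999", "i10to19"
)

# Long-format pieces for B103, whose third component is missing (NA)
pieces <- tibble(
  col_name = "B103",
  id       = c(1, 2, 3),
  value    = c("Estimates", "$10,000 to $19,999", NA)
)

# inner_join() keeps only rows that have a partner in the mapping table
kept_inner <- inner_join(pieces, components, by = c("id", "value" = "name"))

# left_join() keeps the unmatched row too, with new_name set to NA
kept_left <- left_join(pieces, components, by = c("id", "value" = "name"))

nrow(kept_inner)
nrow(kept_left)
```

With left_join() the NA would end up pasted into the final column name, which is why the inner join is the safer choice for this step.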
new_names = data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id: V2..V4 hold components 1..3
  mutate(id = as.numeric(str_extract(id, "[0-9]+")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name")) %>%
  group_by(col_name) %>%
  # order(id) gives positions, so this also works when a middle component is missing
  summarise(final_name = paste(new_name[order(id)], collapse = "_"))
# A tibble: 4 x 2
  col_name final_name
  <chr>    <chr>
1 B101     e_i0to9_p0to19
2 B102     e_i0to9_p20to24
3 B103     e_i10to19
4 B104     m_i0to9_p0to19
Now we only need to replace the old names with the new ones:
old_names = intersect(names(df), new_names$col_name)
df %>%
  rename_with(
    # Look each old name up by value, so the result does not depend on row order
    ~ new_names$final_name[match(.x, new_names$col_name)],
    .cols = all_of(old_names)
  )
# A tibble: 1 x 5
  id    e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
  <chr>          <dbl>           <dbl>     <dbl>          <dbl>
1 a                  1               2         3            0.1
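As an alternative for this last step, the renaming can also be done with a named character vector passed to rename() through all_of(). This is only a sketch: df and new_names are rebuilt by hand here so it runs on its own, and lookup and renamed are illustrative names.

```r
library(dplyr)
library(tibble)

df <- tibble(id = "a", B101 = 1, B102 = 2, B103 = 3, B104 = 0.1)

# Hand-written copy of the new_names table produced above
new_names <- tibble(
  col_name   = c("B101", "B102", "B103", "B104"),
  final_name = c("e_i0to9_p0to19", "e_i0to9_p20to24",
                 "e_i10to19", "m_i0to9_p0to19")
)

# Named character vector: names are the new column names, values the old ones
lookup <- setNames(new_names$col_name, new_names$final_name)

# all_of() makes rename() fail loudly if an old name is not present in df
renamed <- rename(df, all_of(lookup))
names(renamed)
```

If some of the old names might be absent from the data frame, any_of() can be used in place of all_of() to skip them silently.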
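This solution may look long, but it has some benefits:
- The mappings can be stored in a CSV file and modified outside the code. In other words, the code does not actually depend on your mappings.
- You can add or remove entries for each component.
- It works regardless of which components are missing.
- It works with more than three components.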
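To make the first point concrete, here is a minimal sketch of keeping the mapping in a CSV file and reading it back with readr. The file is a temporary one created only for this example, and mapping_file is a hypothetical name; in practice you would point read_csv() at your own file.

```r
library(readr)

# Hypothetical location; a temp file keeps the sketch self-contained
mapping_file <- tempfile(fileext = ".csv")

# Names containing a comma must be quoted in the CSV
write_lines(
  c(
    "id,name,new_name",
    "1,Estimates,e",
    "1,Margins of error,m",
    "2,\"Less than $10,000\",i0to9",
    "2,\"$10,000 to $19,999\",i10to19",
    "3,Less than 20.0 percent,p0to19",
    "3,20.0 to 24.9 percent,p20to24"
  ),
  mapping_file
)

# "dcc" = double, character, character, matching the tribble() version above
components <- read_csv(mapping_file, col_types = "dcc")
components
```

Reading id as a double keeps it compatible with the numeric component id that is generated before the join.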
Regarding r - programmatically renaming data frame columns using abbreviations derived from labels, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/70703733/