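I've run into something I haven't done before and would appreciate some help. I'm trying to join two datasets (simple enough), but the values in the course column of each dataset only partially match as strings. I tried using fuzzy_join but couldn't get it to work for me. Below is what I'm trying to do. I'd like to end up with the data frame called df_final. Any ideas?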
library(tibble)
df1 <- tribble(
~student_id, ~course, ~grade,
"001", "social studies grade", "A",
"001", "ela grade", "A",
"001", "math grade", "A",
"002", "social studies grade", "B",
"002", "ela grade", "B",
"002", "math grade", "B",
"003", "social studies grade", "C",
"003", "ela grade", "C",
"003", "math grade", "C",
"004", "social studies grade", "C",
"004", "ela grade", "C",
"004", "math grade", "C",
"005", "social studies grade", "C",
"005", "ela grade", "C",
"005", "math grade", "C",
)
df2 <- tribble(
~student_id, ~course,
"001", "5th Social Studies",
"001", "5th ELA",
"001", "5th Mathematics",
"002", "6th Social Studies",
"002", "6th ELA",
"002", "6th Mathematics",
"003", "8th Social Studies",
"003", "8th ELA",
"003", "8th Mathematics",
)
df_final <- tribble(
~student_id, ~course, ~grade,
"001", "5th Social Studies", "A",
"001", "5th ELA", "A",
"001", "5th Mathematics", "A",
"002", "6th Social Studies", "B",
"002", "6th ELA", "B",
"002", "6th Mathematics", "B",
"003", "8th Social Studies", "C",
"003", "8th ELA", "C",
"003", "8th Mathematics", "C"
)
Best answer
We can use fuzzyjoin. Do a regex_left_join after extracting a substring from the course column in both datasets (so that the two keys match more closely).
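library(fuzzyjoin)
library(dplyr)
library(stringr)
# Build a join key from each course column, then regex-join on student_id and that key
df2 %>%
  # strip the leading grade level ("5th ", "6th ", ...) and upper-case the rest
  mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%
  regex_left_join(
    df1 %>%
      # strip the trailing " grade"; drop course so that df2's course is the one kept
      mutate(grp = toupper(str_remove(course, "\\s+grade$")), course = NULL),
    by = c("student_id", "grp")
  ) %>%
  # both inputs have student_id, so the join suffixes it; keep the left-hand copy
  select(student_id = student_id.x, course, grade)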
-output
# A tibble: 9 x 3
student_id course grade
<chr> <chr> <chr>
1 001 5th Social Studies A
2 001 5th ELA A
3 001 5th Mathematics A
4 002 6th Social Studies B
5 002 6th ELA B
6 002 6th Mathematics B
7 003 8th Social Studies C
8 003 8th ELA C
9 003 8th Mathematics C
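To see why "math grade" ends up paired with "5th Mathematics", it may help to look at the two keys on their own. The following is only a sketch: the vector names x_keys and y_keys are made up for illustration, and it assumes, as the fuzzyjoin documentation describes, that the right-hand column is used as the regular expression and searched for in the left-hand column.

```r
library(stringr)

# Keys built from df2$course (left side): drop the leading "5th ", "6th ", ...
x_keys <- toupper(str_remove(
  c("5th Social Studies", "5th ELA", "5th Mathematics"),
  "^\\d+th\\s+"
))

# Keys built from df1$course (right side): drop the trailing " grade"
y_keys <- toupper(str_remove(
  c("social studies grade", "ela grade", "math grade"),
  "\\s+grade$"
))

x_keys
y_keys

# Element-wise check: is each right-hand key found inside the left-hand key?
# "MATH" is a substring of "MATHEMATICS", which is what makes that pair match.
matches <- str_detect(x_keys, y_keys)
matches
```

Note that the pattern "^\\d+th\\s+" only removes ordinals ending in "th"; values such as "1st", "2nd" or "3rd" would need a broader pattern.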
The OP's expected output is
df_final
# A tibble: 9 x 3
student_id course grade
<chr> <chr> <chr>
1 001 5th Social Studies A
2 001 5th ELA A
3 001 5th Mathematics A
4 002 6th Social Studies B
5 002 6th ELA B
6 002 6th Mathematics B
7 003 8th Social Studies C
8 003 8th ELA C
9 003 8th Mathematics C
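One thing to keep in mind with the regex join is that student_id is also matched as a regular expression, so an id such as "001" would match "1001" as well. If that matters, a possible alternative is to join on student_id exactly with plain dplyr and then keep only the rows where the course names agree. This is a sketch rather than the answer's method; df1_small, df2_small, key and result are hypothetical names, and the data is a cut-down copy of the question's.

```r
library(dplyr)
library(stringr)
library(tibble)

# Cut-down copies of the question's data (student "004" has no row in df2_small)
df1_small <- tribble(
  ~student_id, ~course,                ~grade,
  "001",       "social studies grade", "A",
  "001",       "math grade",           "A",
  "002",       "math grade",           "B",
  "004",       "math grade",           "C"
)
df2_small <- tribble(
  ~student_id, ~course,
  "001",       "5th Social Studies",
  "001",       "5th Mathematics",
  "002",       "6th Mathematics"
)

result <- df2_small %>%
  # exact match on student_id; every course of a student is paired with every grade row
  # (dplyr 1.1.0 and later warn about a many-to-many relationship here)
  inner_join(
    df1_small %>%
      mutate(key = toupper(str_remove(course, "\\s+grade$"))) %>%
      select(student_id, key, grade),
    by = "student_id"
  ) %>%
  # keep only the pairs where the key appears literally in the course name
  filter(str_detect(toupper(course), fixed(key))) %>%
  select(student_id, course, grade)

result
```

Unlike regex_left_join, this drops any df2 row that finds no matching grade, so it behaves like an inner join rather than a left join.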
Source: "r - Fuzzy join with partial string matches in R" on Stack Overflow: https://stackoverflow.com/questions/68182139/