r - Fuzzy join with partial string matching in R

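I've run into something I haven't done before and would appreciate some help. I'm trying to join two datasets (simple enough), but two of the columns only match on part of the string. I tried fuzzy_join but couldn't get it to work. Below is what I'm trying to do: I want to end up with the data frame called df_final. Any ideas?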

library(tibble)  # provides tribble()

df1 <- tribble(
  ~student_id, ~course, ~grade, 
  "001", "social studies grade", "A", 
  "001", "ela grade", "A", 
  "001", "math grade", "A", 
  "002", "social studies grade", "B", 
  "002", "ela grade", "B", 
  "002", "math grade", "B", 
  "003", "social studies grade", "C", 
  "003", "ela grade", "C", 
  "003", "math grade", "C", 
  "004", "social studies grade", "C", 
  "004", "ela grade", "C", 
  "004", "math grade", "C", 
  "005", "social studies grade", "C", 
  "005", "ela grade", "C", 
  "005", "math grade", "C", 
)

df2 <- tribble(
  ~student_id, ~course,
  "001", "5th Social Studies",
  "001", "5th ELA",
  "001", "5th Mathematics",
  "002", "6th Social Studies", 
  "002", "6th ELA",
  "002", "6th Mathematics",
  "003", "8th Social Studies",
  "003", "8th ELA",
  "003", "8th Mathematics",
)

df_final <- tribble(
  ~student_id, ~course, ~grade,
  "001", "5th Social Studies", "A",
  "001", "5th ELA", "A",
  "001", "5th Mathematics", "A",
  "002", "6th Social Studies", "B",
  "002", "6th ELA", "B",
  "002", "6th Mathematics", "B",
  "003", "8th Social Studies", "C",
  "003", "8th ELA", "C",
  "003", "8th Mathematics", "C"
)

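Best answer

We can use fuzzyjoin. First extract a substring from the 'course' column in both datasets (so that the values line up more closely), then do a regex_left_join on student_id and that substring.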

library(fuzzyjoin)
library(dplyr)
library(stringr)
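# Build a shared key (grp) on both sides, then join on student_id and grp.
# regex_left_join treats the right-hand values as regular expressions.
df2 %>%
  mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%  # drop "5th " etc.
  regex_left_join(
    df1 %>%
      mutate(grp = toupper(str_remove(course, "\\s+grade$")),   # drop " grade"
             course = NULL),
    by = c("student_id", "grp")
  ) %>%
  select(student_id = student_id.x, course, grade)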
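To see why this join works even though "Mathematics" and "math" are not equal, it can help to look at the keys on their own. The sketch below is only an illustration using stringr. The vectors left_course and right_course are hypothetical names holding the course labels from df2 and df1, and str_detect is used here to mimic how regex_left_join looks for the right-hand value, as a regular expression, inside the left-hand value.

```r
library(stringr)

# Course labels as they appear in df2 (left side) and df1 (right side).
# These vector names are illustrative only, not part of the answer.
left_course  <- c("5th Social Studies", "5th ELA", "5th Mathematics")
right_course <- c("social studies grade", "ela grade", "math grade")

# Strip the leading grade level and the trailing " grade", then upper-case
# so that the comparison is not affected by letter case.
left_key  <- toupper(str_remove(left_course,  "^\\d+th\\s+"))
right_key <- toupper(str_remove(right_course, "\\s+grade$"))

left_key   # "SOCIAL STUDIES" "ELA" "MATHEMATICS"
right_key  # "SOCIAL STUDIES" "ELA" "MATH"

# Pairwise check: is each right-hand key found inside the left-hand key?
# "MATH" is found inside "MATHEMATICS", which is what makes the join succeed.
matches <- str_detect(left_key, right_key)
matches    # TRUE TRUE TRUE
```

Because student_id is also listed in by, it is matched the same way. With IDs where one is a substring of another (say a hypothetical "1001" next to "001"), that could produce extra matches, so it may be worth checking the IDs in real data.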

-output

# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C

The OP's expected output is

 df_final
# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C
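Rather than comparing the two printouts by eye, the match can also be checked in code. The following is a minimal sketch on a reduced subset of the data above (student 001 plus one unmatched row for student 004). The names df1_small, df2_small, expected and all_match are made up for this illustration, and the column-by-column comparison is just one simple way to do the check.

```r
library(tibble)
library(dplyr)
library(stringr)
library(fuzzyjoin)

# Reduced copies of df1 and df2 (subset of the rows in the question).
df1_small <- tribble(
  ~student_id, ~course, ~grade,
  "001", "social studies grade", "A",
  "001", "ela grade", "A",
  "001", "math grade", "A",
  "004", "social studies grade", "C"
)

df2_small <- tribble(
  ~student_id, ~course,
  "001", "5th Social Studies",
  "001", "5th ELA",
  "001", "5th Mathematics"
)

# The corresponding rows of df_final.
expected <- tribble(
  ~student_id, ~course, ~grade,
  "001", "5th Social Studies", "A",
  "001", "5th ELA", "A",
  "001", "5th Mathematics", "A"
)

# Same pipeline as in the answer, applied to the reduced data.
result <- df2_small %>%
  mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%
  regex_left_join(
    df1_small %>%
      mutate(grp = toupper(str_remove(course, "\\s+grade$")),
             course = NULL),
    by = c("student_id", "grp")
  ) %>%
  select(student_id = student_id.x, course, grade)

# Compare column by column to avoid depending on tibble attributes.
all_match <- identical(result$student_id, expected$student_id) &&
  identical(result$course, expected$course) &&
  identical(result$grade, expected$grade)
all_match
```

The student 004 row in df1_small has no counterpart in df2_small, so a left join from df2_small should leave it out, just as students 004 and 005 are absent from df_final.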

Regarding "r - Fuzzy join with partial string matching in R", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/68182139/
