r - Fuzzy join with partial string match in R

Tags: r tidyverse

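I've run into something I haven't done before and would appreciate some help. I'm trying to join two datasets (easy enough on its own), but in two of the columns the strings only partially match. I tried using fuzzy_join but couldn't get it to work for me. Below is what I'm trying to do. I want to end up with the data frame called df_final. Any ideas?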

df1 <- tribble(
  ~student_id, ~course, ~grade, 
  "001", "social studies grade", "A", 
  "001", "ela grade", "A", 
  "001", "math grade", "A", 
  "002", "social studies grade", "B", 
  "002", "ela grade", "B", 
  "002", "math grade", "B", 
  "003", "social studies grade", "C", 
  "003", "ela grade", "C", 
  "003", "math grade", "C", 
  "004", "social studies grade", "C", 
  "004", "ela grade", "C", 
  "004", "math grade", "C", 
  "005", "social studies grade", "C", 
  "005", "ela grade", "C", 
  "005", "math grade", "C", 
)

df2 <- tribble(
  ~student_id, ~course,
  "001", "5th Social Studies",
  "001", "5th ELA",
  "001", "5th Mathematics",
  "002", "6th Social Studies", 
  "002", "6th ELA",
  "002", "6th Mathematics",
  "003", "8th Social Studies",
  "003", "8th ELA",
  "003", "8th Mathematics",
)

df_final <- tribble(
  ~student_id, ~course, ~grade,
  "001", "5th Social Studies", "A",
  "001", "5th ELA", "A",
  "001", "5th Mathematics", "A",
  "002", "6th Social Studies", "B",
  "002", "6th ELA", "B",
  "002", "6th Mathematics", "B",
  "003", "8th Social Studies", "C",
  "003", "8th ELA", "C",
  "003", "8th Mathematics", "C"
)

Best answer

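We can use fuzzyjoin. First extract a substring from the course column in both datasets (so the values are easier to match), then do a regex_left_join. The key derived from df1 is used as a regular expression that is matched against the key derived from df2, so "MATH" matches "MATHEMATICS".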

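library(fuzzyjoin)
library(dplyr)
library(stringr)

df2 %>%
  # drop the leading grade level (e.g. "5th ") and upper-case the rest
  mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%
  regex_left_join(
    df1 %>%
      # drop the trailing " grade", upper-case, and remove course to avoid a name clash
      mutate(grp = toupper(str_remove(course, "\\s+grade$")), course = NULL),
    by = c("student_id", "grp")
  ) %>%
  # keep the student_id that came from df2 (suffixed .x by the join)
  select(student_id = student_id.x, course, grade)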

Output

# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C    
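
To see why the join pairs the rows this way, here is a minimal base R sketch of the key derivation and the regex test, using only the course labels from the question. The names key_x, key_y and match_matrix are made up for illustration, and grepl stands in for the pattern matching that regex_left_join performs internally, so treat it as an approximation rather than the package's actual implementation.

```r
# Course labels as they appear in df2 and df1 (copied from the question).
df2_course <- c("5th Social Studies", "5th ELA", "5th Mathematics")
df1_course <- c("social studies grade", "ela grade", "math grade")

# Strip the leading grade level ("5th ") or the trailing " grade",
# then upper-case so that case differences do not matter.
key_x <- toupper(sub("^\\d+th\\s+", "", df2_course, perl = TRUE))
key_y <- toupper(sub("\\s+grade$", "", df1_course, perl = TRUE))

# Test every df1 key (used as a regex pattern) against every df2 key.
# Rows are df2 keys, columns are df1 patterns.
match_matrix <- sapply(key_y, function(pattern) grepl(pattern, key_x, perl = TRUE))
dimnames(match_matrix) <- list(key_x, key_y)
print(match_matrix)
```

Each df2 key is matched by exactly one df1 pattern. "MATHEMATICS" is matched by "MATH" because the pattern only has to occur somewhere in the string, not equal it.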

The OP's expected output is

 df_final
# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C    
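
If fuzzyjoin is not available, a similar result can probably be reached in base R. The sketch below is not the answer's method. It swaps in a different technique: an exact merge on student_id followed by a literal prefix test with startsWith. It uses a small subset of the question's data rebuilt as plain data frames, and the column names pattern and key are hypothetical. Note that merge does an inner join here, so df2 rows without a match would be dropped instead of being kept with NA as in a left join.

```r
# Small subset of the question's data, rebuilt as plain data frames.
df1 <- data.frame(
  student_id = c("001", "001", "001", "004"),
  course = c("social studies grade", "ela grade", "math grade", "math grade"),
  grade = c("A", "A", "A", "C"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  student_id = c("001", "001", "001"),
  course = c("5th Social Studies", "5th ELA", "5th Mathematics"),
  stringsAsFactors = FALSE
)

# Derive comparable keys (hypothetical column names).
df1$pattern <- toupper(sub("\\s+grade$", "", df1$course, perl = TRUE))
df2$key <- toupper(sub("^\\d+th\\s+", "", df2$course, perl = TRUE))

# Pair every df2 row with every df1 row of the same student (inner join),
# then keep only the pairs where the df2 key starts with the df1 pattern.
pairs <- merge(df2, df1[c("student_id", "pattern", "grade")], by = "student_id")
keep <- startsWith(pairs$key, pairs$pattern)
result <- pairs[keep, c("student_id", "course", "grade")]
rownames(result) <- NULL
print(result)
```

The prefix test is stricter than a regex search: "MATH" still matches "MATHEMATICS", but a pattern that appears only in the middle of a key would not match. The row order of the result is whatever merge produces and is not guaranteed to follow df2.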

Regarding r - Fuzzy join with partial string match in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/68182139/
