r - Joining on multiple columns with stringdist_join

Tags: r join left-join

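I have two data frames. The x column may contain typos, while the y column is always correct.
I don't understand why a stringdist join on multiple columns gives me these pairs: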

library(dplyr)
library(fuzzyjoin)
a <- data.frame(x = c("season", "season", "season", "package", "package"), y = c("1","2", "3", "1","6"))

b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"), y = c("1","2", "3", "2","6"))

c <- a %>%
  stringdist_left_join(b, by = c("x", "y"), max_dist = c(1,0))

      x.x y.x     x.y  y.y
1  season   1  season    1
2  season   1   seson    2
3  season   1   seson    3
4  season   2   seson    2
5  season   3  season    1
6  season   3   seson    2
7  season   3   seson    3
8 package   1 package    2
9 package   6    <NA> <NA>
I would like to get:
      x.x y.x     x.y  y.y
1  season   1  season    1
2  season   2   seson    2
3  season   3   seson    3
4 package   1    <NA> <NA>
5 package   6 pakkage    6

Best answer

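One way to do this is to create a new column in both datasets that captures the similarity of the values in the 'x' column (here, their soundex code from phonetic()), and then perform a regular left_join on that column together with 'y':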

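library(stringdist)
library(dplyr)

# Add the soundex code of x as a join key (grp) in both data frames,
# keep a copy of b's y as y2, then join on grp and y and drop the key.
a %>%
  mutate(grp = phonetic(x)) %>%
  left_join(b %>% mutate(grp = phonetic(x), y2 = y), by = c("grp", "y")) %>%
  select(-grp)

Output: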
#      x.x y     x.y   y2
#1  season 1  season    1
#2  season 2   seson    2
#3  season 3   seson    3
#4 package 1    <NA> <NA>
#5 package 6 pakkage    6
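
To see why the grp column lines up the misspelled values with the correct ones, it can help to look at the codes directly. This is a small check, assuming the stringdist package is installed; the vector name words is only illustrative and not part of the original post. Keep in mind that sharing a soundex code is a coarser criterion than an edit distance of at most 1, so unrelated words that merely sound alike would be grouped together as well.

```r
library(stringdist)

# Illustrative vector: the distinct spellings found in the x columns of a and b.
words <- c("season", "seson", "package", "pakkage")

# phonetic() uses the soundex method by default.
codes <- phonetic(words)
setNames(codes, words)

# The correct and misspelled variants share a code,
# while the two word families get different codes.
codes[1] == codes[2]
codes[3] == codes[4]
codes[1] != codes[3]
```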

Another option is to change the method of stringdist_left_join from its default (osa: optimal string alignment, i.e. restricted Damerau-Levenshtein distance) to soundex (distance based on soundex encoding):
library(fuzzyjoin)
a %>%
   stringdist_left_join(b, by = c("x", "y"), max_dist = c(1,0), 
            method = "soundex")
#      x.x y.x     x.y  y.y
#1  season   1  season    1
#2  season   2   seson    2
#3  season   3   seson    3
#4 package   1    <NA> <NA>
#5 package   6 pakkage    6
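
The effect of switching the method can be checked on the strings themselves. The sketch below, which assumes the stringdist package is installed and uses the made-up names lhs, rhs, d_soundex and d_osa, compares the two distances on three pairs taken from the example data.

```r
library(stringdist)

# Illustrative pairs: two typo pairs and one pair of unrelated words.
lhs <- c("season", "package", "season")
rhs <- c("seson",  "pakkage", "package")

# Soundex distance is binary: 0 when the codes agree, 1 otherwise.
d_soundex <- stringdist(lhs, rhs, method = "soundex")

# OSA distance counts edit operations instead.
d_osa <- stringdist(lhs, rhs, method = "osa")

data.frame(lhs, rhs, d_soundex, d_osa)
```

With soundex, the two typo pairs are at distance 0 and the unrelated pair at distance 1, so the method only distinguishes "same code" from "different code".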
According to ?"stringdist-metrics":

For the soundex distance (method='soundex'), strings are translated to a soundex code (see phonetic for a specification). The distance between strings is 0 when they have the same soundex code, otherwise 1. Note that soundex recoding is only meaningful for characters in the ranges a-z and A-Z. A warning is emitted when non-printable or non-ascii characters are encountered.
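
As for why the original call returned unexpected pairs: stringdist_left_join documents max_dist as a single maximum distance, so a vector such as c(1, 0) does not appear to be applied column by column. If the intent is "edit distance of at most 1 on x, exact match on y", one possible sketch is fuzzy_left_join() with one match function per join column. The helper names close_enough and same_value are hypothetical and exist only for this illustration.

```r
library(fuzzyjoin)
library(stringdist)

# Same data as in the question.
a <- data.frame(x = c("season", "season", "season", "package", "package"),
                y = c("1", "2", "3", "1", "6"),
                stringsAsFactors = FALSE)
b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"),
                y = c("1", "2", "3", "2", "6"),
                stringsAsFactors = FALSE)

# Hypothetical helpers: one vectorized predicate per join column.
close_enough <- function(u, v) stringdist(u, v, method = "osa") <= 1  # for x
same_value   <- function(u, v) u == v                                 # for y

# match_fun accepts a list with one function per pair of columns in `by`.
res <- fuzzy_left_join(a, b, by = c("x", "y"),
                       match_fun = list(close_enough, same_value))
res
```

On the example data this should reproduce the five desired rows while keeping a true edit-distance threshold on x, rather than relying on phonetic similarity.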

Regarding "r - Joining on multiple columns with stringdist_join", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/65467009/
