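I have two data frames in which the x column may contain typos, while the y column is always correct.
I don't understand why a stringdist join on multiple columns produces these pairs: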
library(dplyr)
library(fuzzyjoin)
a <- data.frame(x = c("season", "season", "season", "package", "package"), y = c("1","2", "3", "1","6"))
b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"), y = c("1","2", "3", "2","6"))
c <- a %>%
  stringdist_left_join(b, by = c("x", "y"), max_dist = c(1, 0))
      x.x y.x     x.y  y.y
1  season   1  season    1
2  season   1   seson    2
3  season   1   seson    3
4  season   2   seson    2
5  season   3  season    1
6  season   3   seson    2
7  season   3   seson    3
8 package   1 package    2
9 package   6    <NA> <NA>
I would like to get:
      x.x y.x     x.y  y.y
1  season   1  season    1
2  season   2   seson    2
3  season   3   seson    3
4 package   1    <NA> <NA>
5 package   6 pakkage    6
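Best answer
We can do this by creating a new column in both data sets that groups similar values of the 'x' column, and then performing a left_join on that column together with 'y':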
library(stringdist)
library(dplyr)
a %>%
  mutate(grp = phonetic(x)) %>%
  left_join(b %>% mutate(grp = phonetic(x), y2 = y), by = c('grp', 'y')) %>%
  select(-grp)
Output:
#      x.x y     x.y   y2
#1  season 1  season    1
#2  season 2   seson    2
#3  season 3   seson    3
#4 package 1    <NA> <NA>
#5 package 6 pakkage    6
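To see why the grp column lines up the misspelled values, it can help to look at the codes that phonetic() assigns. The following is a minimal sketch using only the strings from the question; the variable names (codes, same_season, and so on) are illustrative and not part of the original answer.

```r
library(stringdist)

# The spellings that appear in the x columns of a and b
x <- c("season", "seson", "package", "pakkage")

# phonetic() translates each string to its soundex code
codes <- phonetic(x)
names(codes) <- x
print(codes)

# Correct and misspelled variants share a code,
# while the two different words do not
same_season  <- codes[["season"]]  == codes[["seson"]]
same_package <- codes[["package"]] == codes[["pakkage"]]
differ       <- codes[["season"]]  != codes[["package"]]
```

Because the join key is the code rather than the raw string, an exact left_join on grp and y is enough here. This only works when the typo leaves the soundex code unchanged, which happens to hold for this data.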
Another option is to change the method argument of stringdist_left_join from its default (osa: optimal string alignment, i.e. restricted Damerau-Levenshtein distance) to soundex (distance based on soundex encoding):
library(fuzzyjoin)
a %>%
  stringdist_left_join(b, by = c("x", "y"), max_dist = c(1, 0),
                       method = "soundex")
# x.x y.x x.y y.y
#1 season 1 season 1
#2 season 2 seson 2
#3 season 3 seson 3
#4 package 1 <NA> <NA>
#5 package 6 pakkage 6
According to ?"stringdist-metrics":
For the soundex distance (method='soundex'), strings are translated to a soundex code (see phonetic for a specification). The distance between strings is 0 when they have the same soundex code, otherwise 1. Note that soundex recoding is only meaningful for characters in the ranges a-z and A-Z. A warning is emitted when non-printable or non-ascii characters are encountered.
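The difference between the two methods can be checked directly with stringdist(). This is a small sketch on the strings from the question; the variable names are illustrative only.

```r
library(stringdist)

# Default method ("osa") counts edit operations:
# "seson" is "season" with one letter deleted
osa_typo <- stringdist("season", "seson")

# With method = "soundex" the distance is 0 when both strings
# have the same soundex code, and 1 otherwise
sdx_typo  <- stringdist("season", "seson",   method = "soundex")
sdx_other <- stringdist("season", "package", method = "soundex")

print(c(osa = osa_typo, soundex_typo = sdx_typo, soundex_other = sdx_other))
```

The soundex distance therefore only ever takes the values 0 and 1, so it treats the typo as an exact match, whereas osa reports it as one edit away.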
Regarding "r - Joining on multiple columns with stringdist_join", the original question can be found on Stack Overflow: https://stackoverflow.com/questions/65467009/