R fuzzy string matching: return specific columns based on the matched string

Tags: r merge data.table string-matching stringdist

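I have two large datasets, one with about 500,000 records and the other with about 70K. Both contain addresses. I want to check whether each address in the smaller dataset is present in the larger one. As you can imagine, addresses can be written in different ways, with different case, spelling and so on. In addition, an address can be duplicated if it is only written down to the building level, so different flats share the same address. I did some research and found the package stringdist, which can be used for this.

I did some work and managed to get the closest match based on distance. However, I am not able to return the corresponding columns for the rows where the address matches.

Below is some sample dummy data along with the code I created to explain the situation.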

library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr")
Year1 <- c(2001:2007)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)

df1 <- data.table(Address1,Year1)
df2 <- data.table(Address2,Year2)
df2[,unique_id := sprintf("%06d", 1:nrow(df2))]

fn_match = function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}

df1[!is.na(Address1)
    , address_match := 
      fn_match(Address1, df2$Address2,3)
    ]

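This returns the closest string match within a distance of 3. However, I also want the "Year" and "unique_id" columns from df2 in df1. This will help me to know with which row of data the string was matched from df2. So in the end, for each row in df1, I want to know the closest match from df2 within the specified distance, along with the "Year" and "unique_id" of the matched row from df2.

I guess it has something to do with merge (left join), but I am not sure how to merge while keeping the duplicates and ensuring the same number of rows as in df1 (the smaller dataset).

Any kind of solution would help!!

Best answer

You are 90% of the way there...

You say you want to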

know with which row of data the string was matched from df2

You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

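In other words, amatch gives you the index of the row in df2 (which is your table) that is the closest match for each address in df1 (which is your x). You are discarding that index too early by using it only to return the matched address.

Instead, retrieve the index itself and use it for a lookup, or retrieve the unique_id and use it for a left join (if you are confident that it really is a unique ID).

Illustration of both approaches: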
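To make the point about positions concrete, here is a minimal sketch on made-up data (the vectors `tbl` and `x` are hypothetical and not taken from the question). It only shows that amatch returns integer positions into the lookup vector, with NA where nothing lies within maxDist:

```r
library(stringdist)

# Hypothetical lookup table and query strings, for illustration only
tbl <- c("apple", "banana", "cherry")
x   <- c("appel", "bananna", "zzzzzz")

# Positions in tbl of the closest match for each element of x
pos <- amatch(x, tbl, method = "dl", maxDist = 2)

pos       # integer positions into tbl; NA where no match is within maxDist
tbl[pos]  # the matched strings, which is all the question's fn_match keeps
```

Because `pos` indexes `tbl`, it can index any other column stored alongside `tbl` in the same way.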

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
  amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) & nchar(Address1) > 0, # I would exclude not only NA_character_ but also empty strings, perhaps strings of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2,3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) & nchar(Address1) > 0, # same filter as in Option 1: skip NA and empty strings
    df2_pos := match_pos(Address1, df2$Address2,3) ] 
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]
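On the concern in the question about keeping duplicates and ending up with the same number of rows as df1, the following is a rough sketch on toy data. The tables `small` and `big`, their column names and the maxDist value are all assumptions made for illustration. The idea is that indexing the larger table by the positions from amatch yields exactly one row per input row, with NA rows where nothing matched:

```r
library(data.table)
library(stringdist)

# Hypothetical toy data, not the asker's addresses
small <- data.table(addr = c("12 main st", "12 main st", "9 oak ave", "unknown place"))
big   <- data.table(addr = c("12 main street", "9 oak avenue", "12 main street"),
                    year = 2001:2003)
big[, id := sprintf("%06d", .I)]

# Position in big$addr of the closest match for each row of small (NA if none within maxDist)
pos <- amatch(small$addr, big$addr, method = "dl", maxDist = 5)

# Indexing by position keeps one output row per input row, duplicates included
res <- cbind(small, big[pos, .(matched_addr = addr, year, id)])

# The row count of the smaller table is preserved
stopifnot(nrow(res) == nrow(small))
res
```

When several rows of the larger table are equally close, amatch returns the first one, so tied duplicates in the larger table all resolve to the same row.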

Regarding R fuzzy string matching to return specific columns based on the matched string, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42749447/
