r - Approximate string matching between two datasets in R

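I have the following dataset, which contains film titles and their genres, and another dataset containing plain text that may or may not mention those titles: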

dt1

   title                                        genre

   Secret in Their Eyes                         Dramas
   V for Vendetta                               Action & Adventure
   Bottersnikes & Gumbles                       Kids' TV
   ...                                          ...

and

dt2

id      Text
1       "I really liked V for Vendetta"
2       "Bottersnikes & Gumbles was a great film .... "
3       " In any case, in my opinion bottersnikes &gumbles was a great film ..."
4       "@thewitcher was an interesting series"
5       "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
... etc

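What I want is a function that takes the titles in dt1 and tries to find them in the texts in dt2:

If it finds any exact or approximate match, I would like a column in dt2 stating which titles are mentioned in the text. If more than one title is mentioned, I would like all of them, separated by commas.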

dt2

id      Text                                                                       mentions
1       "I really liked V for Vendetta"                                            "V for Vendetta"
2       "Bottersnikes & Gumbles was a great film .... "                            "Bottersnikes & Gumbles"
3       " In any case, in my opinion bottersnikes &gumbles was a great film ..."   "Bottersnikes & Gumbles"
4       "@thewitcher was an interesting series"                                    NA
5       "Secret in Their Eye is a terrible film! but I Like V per Vendetta"        "Secret in Their Eyes, V for Vendetta"
... etc

Best answer

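You can do the fuzzy matching with agrepl(). Here I use lapply() to apply it to each title, which gives one logical vector per title marking the texts that match it, and then apply() over the rows of the resulting data.frame of matches to build the vector of matched titles.

You can tune the max.distance value, but 0.3 works well on your example.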
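One way to pick max.distance is to look at the actual edit distances first. The sketch below is not part of the original answer: it uses base R's adist() with partial = TRUE, which reports the smallest edit distance between each pattern and any substring of each text. The helper name title_distances and the two sample texts are assumptions made for illustration. According to the agrep() documentation, a fractional max.distance is scaled by the pattern length, so longer titles tolerate more edits.

```r
# Hypothetical helper (not from the original answer): for every title/text
# pair, report the smallest edit distance between the title and any
# substring of the text, ignoring case.
title_distances <- function(texts, titles) {
  d <- adist(titles, texts, partial = TRUE, ignore.case = TRUE)
  dimnames(d) <- list(titles, NULL)  # rows are titles, columns are texts
  d
}

titles <- c("Secret in Their Eyes", "V for Vendetta")
texts <- c(
  "I really liked V for Vendetta",
  "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
)

d <- title_distances(texts, titles)
print(d)
```

In this sample, "V per Vendetta" is two substitutions away from "V for Vendetta" and "Secret in Their Eye" is one character short of its title, so the cut-off has to allow at least two edits for these rows to be caught.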

dt1 <- data.frame(
  title = c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"),
  genre = c("Dramas", "Action & Adventure", "Kids' TV"),
  stringsAsFactors = FALSE
)

dt2 <- data.frame(
  id = 1:5,
  Text = c(
    "I really liked V for Vendetta",
    "Bottersnikes & Gumbles was a great film .... ",
    "In any case, in my opinion bottersnikes &gumbles was a great film ...",
    "@thewitcher was an interesting series",
    "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
  ),
  stringsAsFactors = FALSE
)

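match_titles <- function(target, titles) {
  # For each title, get a logical vector marking the texts that fuzzily contain it
  matches <- lapply(titles, agrepl, target,
    max.distance = 0.3,
    ignore.case = TRUE, fixed = TRUE
  )
  # One row per text, one column per title: join the matched titles in each row
  matched_titles <- apply(
    data.frame(matches), 1,
    function(y) paste(titles[y], collapse = ",")
  )
  matched_titles
}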

dt2$titles <- match_titles(dt2$Text, dt1$title)
dt2
##   id                                                                  Text
## 1  1                                         I really liked V for Vendetta
## 2  2                         Bottersnikes & Gumbles was a great film .... 
## 3  3 In any case, in my opinion bottersnikes &gumbles was a great film ...
## 4  4                                 @thewitcher was an interesting series
## 5  5     Secret in Their Eye is a terrible film! but I Like V per Vendetta
##                                titles
## 1                      V for Vendetta
## 2              Bottersnikes & Gumbles
## 3              Bottersnikes & Gumbles
## 4                                    
## 5 Secret in Their Eyes,V for Vendetta
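
Note that the output above leaves an empty string in row 4 and joins titles with a bare comma, while the question asked for NA and ", ". A possible variant is sketched below; it is an adaptation rather than the original answer's code, and the function name match_titles_na and its max.distance argument are introduced here for illustration only.

```r
# Hypothetical variant of match_titles(): returns NA when no title matches
# and separates multiple titles with ", ".
match_titles_na <- function(target, titles, max.distance = 0.3) {
  vapply(target, function(text) {
    # Test every title against this single text
    hit <- vapply(titles, agrepl, logical(1),
      x = text, max.distance = max.distance,
      ignore.case = TRUE, fixed = TRUE
    )
    if (any(hit)) paste(titles[hit], collapse = ", ") else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

titles <- c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles")
texts <- c(
  "I really liked V for Vendetta",
  "Bottersnikes & Gumbles was a great film .... ",
  "In any case, in my opinion bottersnikes &gumbles was a great film ...",
  "@thewitcher was an interesting series",
  "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
)

mentions <- match_titles_na(texts, titles)
print(mentions)
```

Looping over texts rather than titles keeps the result a plain character vector of the same length as the input, so it can be assigned directly, for example with dt2$mentions <- match_titles_na(dt2$Text, dt1$title).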

Regarding "r - Approximate string matching between two datasets in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61269395/
