r - Approximate string matching between two datasets in R

Tags: r string-matching tm quanteda

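I have the following dataset containing movie titles and their corresponding genres, and another dataset containing plain text that may or may not mention these titles: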

dt1

   title                                        genre

   Secret in Their Eyes                         Dramas
   V for Vendetta                               Action & Adventure
   Bottersnikes & Gumbles                       Kids' TV
   ...                                          ...

dt2

id      Text
1       "I really liked V for Vendetta"
2       "Bottersnikes & Gumbles was a great film .... "
3       " In any case, in my opinion bottersnikes &gumbles was a great film ..."
4       "@thewitcher was an interesting series"
5       "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
... etc

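What I would like is a function that takes the titles in dt1 and tries to find them in the texts of dt2.

If it finds any exact or approximate match, I would like a column in dt2 stating which titles are mentioned in the text. If more than one title is mentioned, I would like all of them, separated by commas: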

dt2

id      Text                                                                       mentions
1       "I really liked V for Vendetta"                                            "V for Vendetta"
2       "Bottersnikes & Gumbles was a great film .... "                            "Bottersnikes & Gumbles"
3       " In any case, in my opinion bottersnikes &gumbles was a great film ..."   "Bottersnikes & Gumbles"
4       "@thewitcher was an interesting series"                                    NA
5       "Secret in Their Eye is a terrible film! but I Like V per Vendetta"        "Secret in Their Eyes, V for Vendetta"
... etc

Best answer

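You can do fuzzy matching with agrep() (the code uses its logical variant, agrepl()). Here I call it through lapply() once per title, which gives, for each title, a logical vector marking the texts that match. I then use apply() over a data.frame of those matches to build, for each text, a string of the matched titles.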

You can adjust the max.distance value, but 0.3 works well on your example.
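As an aside before the full code: according to the `agrep` documentation, a `max.distance` below 1 is read as a fraction of the pattern length (times the maximal transformation cost), rounded up to a whole number, so longer titles tolerate more edits. The sketch below is not part of the original answer. It uses `adist()` with `partial = TRUE` to inspect how far each title is from its closest substring in a text. The names `titles`, `texts`, `budget` and `d` are illustrative, and the `ceiling()` line is only my reading of the documented rule under default unit costs, not a call into `agrep` itself.

```r
# Illustrative data: two titles and two texts taken from the question.
titles <- c("V for Vendetta", "Secret in Their Eyes")
texts  <- c("I really liked V for Vendetta",
            "Secret in Their Eye is a terrible film! but I Like V per Vendetta")

# Rough edit budget implied by max.distance = 0.3 (assumes unit costs):
# the fraction of the pattern length, rounded up.
budget <- ceiling(0.3 * nchar(titles))
names(budget) <- titles

# Generalized Levenshtein distance from each title (rows) to its closest
# substring in each text (columns); partial = TRUE enables substring matching.
d <- adist(titles, texts, partial = TRUE, ignore.case = TRUE)
dimnames(d) <- list(titles, c("text1", "text2"))

print(budget)
print(d)
```

Comparing `d` against `budget` gives a feel for which title/text pairs fall inside the tolerance. Lowering `max.distance` tightens the matching at the cost of missing heavier misspellings, and `agrepl()` may still differ from this estimate in edge cases.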

dt1 <- data.frame(
  title = c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"),
  genre = c("Dramas", "Action & Adventure", "Kids' TV"),
  stringsAsFactors = FALSE
)

dt2 <- data.frame(
  id = 1:5,
  Text = c(
    "I really liked V for Vendetta",
    "Bottersnikes & Gumbles was a great film .... ",
    "In any case, in my opinion bottersnikes &gumbles was a great film ...",
    "@thewitcher was an interesting series",
    "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
  ),
  stringsAsFactors = FALSE
)

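match_titles <- function(target, titles) {
  # For each title, flag which texts contain an approximate match
  # (agrepl(pattern = title, x = target); one logical vector per title).
  matches <- lapply(titles, agrepl, target,
    max.distance = 0.3,
    ignore.case = TRUE, fixed = TRUE
  )
  # Rows are texts, columns are titles; collapse each row's matched titles
  # into one comma-separated string (empty string if nothing matched).
  matched_titles <- apply(
    data.frame(matches), 1,
    function(y) paste(titles[y], collapse = ",")
  )
  matched_titles
}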

dt2$titles <- match_titles(dt2$Text, dt1$title)
dt2
##   id                                                                  Text
## 1  1                                         I really liked V for Vendetta
## 2  2                         Bottersnikes & Gumbles was a great film .... 
## 3  3 In any case, in my opinion bottersnikes &gumbles was a great film ...
## 4  4                                 @thewitcher was an interesting series
## 5  5     Secret in Their Eye is a terrible film! but I Like V per Vendetta
##                                titles
## 1                      V for Vendetta
## 2              Bottersnikes & Gumbles
## 3              Bottersnikes & Gumbles
## 4                                    
## 5 Secret in Their Eyes,V for Vendetta
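
The output above leaves an empty string where no title matched and joins titles with a bare comma, whereas the question asked for `NA` and a comma-separated list. One possible variant, sketched here with the same `agrepl()` settings, is shown below. `match_titles_na` is a hypothetical name, not from the original answer, and the two sample texts are made up for illustration.

```r
# Hypothetical variant of match_titles(): returns NA when no title matches
# and joins multiple matches with ", ".
match_titles_na <- function(target, titles, max.distance = 0.3) {
  # One logical vector per title: does the title approximately occur in each text?
  hits <- vapply(
    titles,
    function(title) agrepl(title, target, max.distance = max.distance,
                           ignore.case = TRUE, fixed = TRUE),
    logical(length(target))
  )
  # vapply() returns a plain vector when there is a single text;
  # restore the texts-by-titles matrix shape in every case.
  hits <- matrix(hits, nrow = length(target), ncol = length(titles))
  # Collapse each row (one text) into one string, or NA if nothing matched.
  apply(hits, 1, function(row) {
    if (any(row)) paste(titles[row], collapse = ", ") else NA_character_
  })
}

mentions <- match_titles_na(
  c("I really liked V for Vendetta", "12345"),
  c("Secret in Their Eyes", "V for Vendetta")
)
print(mentions)
```

Assigning the result with `dt2$mentions <- match_titles_na(dt2$Text, dt1$title)` would give the column name used in the question. Returning `NA_character_` keeps the column a character vector, so `is.na()` can be used to filter texts with no mention.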

Source: a similar question on Stack Overflow, "Approximate string matching between two datasets in R": https://stackoverflow.com/questions/61269395/
