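I have the following dataset, which contains movie titles and their genres, and a second dataset containing plain text that may or may not refer to those titles: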
dt1
title                    genre
Secret in Their Eyes     Dramas
V for Vendetta           Action & Adventure
Bottersnikes & Gumbles   Kids' TV
...                      ...
and
dt2
id  Text
1   "I really liked V for Vendetta"
2   "Bottersnikes & Gumbles was a great film .... "
3   " In any case, in my opinion bottersnikes &gumbles was a great film ..."
4   "@thewitcher was an interesting series"
5   "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
... etc
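What I would like is a function that takes the titles in dt1 and tries to find them in the text of dt2:
If it finds an exact or approximate match, I would like a column in dt2 stating which title is mentioned in the text. If more than one title is mentioned, I would like all of them, separated by commas.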
dt2
id  Text                                                                      mentions
1   "I really liked V for Vendetta"                                           "V for Vendetta"
2   "Bottersnikes & Gumbles was a great film .... "                           "Bottersnikes & Gumbles"
3   " In any case, in my opinion bottersnikes &gumbles was a great film ..."  "Bottersnikes & Gumbles"
4   "@thewitcher was an interesting series"                                   NA
5   "Secret in Their Eye is a terrible film! but I Like V per Vendetta"       "Secret in Their Eyes, V for Vendetta"
... etc
Best answer
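You can do fuzzy matching with agrep() (the code below uses its logical variant, agrepl()). Here I use lapply() to apply it to each title, which gives, for every title, a logical vector saying which texts match. I then use apply() on a data.frame of those matches to build the vector of matched titles.
You can tune the max.distance value, but 0.3 works well on your example.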
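To make the effect of max.distance more concrete, here is a small sketch that is not part of the original answer. It checks one title against three short texts, once with no edits allowed and once with an integer limit of two edits. The variable names and the shortened texts are chosen only for illustration.

```r
# Illustrative only: how max.distance changes what agrepl() accepts.
title <- "V for Vendetta"
texts <- c(
  "I really liked V for Vendetta",          # exact mention
  "but I Like V per Vendetta",              # "per" instead of "for": two substitutions
  "@thewitcher was an interesting series"   # unrelated text
)

# max.distance = 0 allows no edits, so only the exact mention matches.
exact <- agrepl(title, texts, max.distance = 0,
                ignore.case = TRUE, fixed = TRUE)

# An integer max.distance is a maximum number of edits; 2 covers "per" vs "for".
two_edits <- agrepl(title, texts, max.distance = 2,
                    ignore.case = TRUE, fixed = TRUE)

print(data.frame(text = texts, exact = exact, two_edits = two_edits))
```

A value below 1, such as the 0.3 used in the complete solution that follows, is instead read as a fraction of the pattern length, so longer titles tolerate more edits than short ones.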
# Example data from the question
dt1 <- data.frame(
  title = c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"),
  genre = c("Dramas", "Action & Adventure", "Kids' TV"),
  stringsAsFactors = FALSE
)
dt2 <- data.frame(
  id = 1:5,
  Text = c(
    "I really liked V for Vendetta",
    "Bottersnikes & Gumbles was a great film .... ",
    "In any case, in my opinion bottersnikes &gumbles was a great film ...",
    "@thewitcher was an interesting series",
    "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
  ),
  stringsAsFactors = FALSE
)

match_titles <- function(target, titles) {
  # For each title (the pattern), get a logical vector over the texts in `target`
  matches <- lapply(titles, agrepl, target,
    max.distance = 0.3,
    ignore.case = TRUE, fixed = TRUE
  )
  # One row per text, one column per title; paste the titles that matched
  matched_titles <- apply(
    data.frame(matches), 1,
    function(y) paste(titles[y], collapse = ",")
  )
  matched_titles
}

dt2$titles <- match_titles(dt2$Text, dt1$title)
dt2
## id Text
## 1 1 I really liked V for Vendetta
## 2 2 Bottersnikes & Gumbles was a great film ....
## 3 3 In any case, in my opinion bottersnikes &gumbles was a great film ...
## 4 4 @thewitcher was an interesting series
## 5 5 Secret in Their Eye is a terrible film! but I Like V per Vendetta
## titles
## 1 V for Vendetta
## 2 Bottersnikes & Gumbles
## 3 Bottersnikes & Gumbles
## 4
## 5 Secret in Their Eyes,V for Vendetta
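The output above has an empty string where no title matched, and the column is called titles, whereas the question asked for NA and a column named mentions. The following is a sketch of one possible variant, not the answerer's code: match_mentions and its max_distance argument are hypothetical names, and it assumes the same agrepl() settings as above.

```r
# Hypothetical variant: NA when nothing matches, titles joined with ", ".
match_mentions <- function(texts, titles, max_distance = 0.3) {
  # One column per title: does the title approximately occur in each text?
  hits <- vapply(
    titles,
    function(title) agrepl(title, texts,
                           max.distance = max_distance,
                           ignore.case = TRUE, fixed = TRUE),
    logical(length(texts))
  )
  # vapply() returns a plain vector for a single text; force a matrix shape.
  hits <- matrix(hits, nrow = length(texts), ncol = length(titles))
  mentions <- apply(hits, 1, function(row) paste(titles[row], collapse = ", "))
  # Replace empty results with NA, as requested in the question.
  mentions[!nzchar(mentions)] <- NA_character_
  mentions
}

titles <- c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles")
texts <- c(
  "I really liked V for Vendetta",
  "Bottersnikes & Gumbles was a great film .... ",
  "In any case, in my opinion bottersnikes &gumbles was a great film ...",
  "@thewitcher was an interesting series",
  "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
)
mentions <- match_mentions(texts, titles)
print(mentions)
```

With the data frames from the answer, this would be called as dt2$mentions <- match_mentions(dt2$Text, dt1$title).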
Source: "r - Approximate string matching between two datasets in R", a question on Stack Overflow: https://stackoverflow.com/questions/61269395/