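I am trying to search for keywords in a large text in R. Once one is found, I want to extract 1 sentence before and after the keyword (including the sentence that contains the keyword). Ideally, I would like to be able to change this code to extract up to 3 sentences around the keyword. Sample data is below.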
text <- "This is an article about random things. Usually, there are a few sentences that are irrelevant to what I am interested in. Then in the middle, there is a sentence that I want to extract. Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health. Jon Doe is a key player in this realm. Then the article goes on talking about something else that I don't care about."
keywords <- c("water quality", "health")
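So, for the text above, I want to search the text for "water quality" and "health", and when there is a match, extract everything from "Then in the middle, there is..." through "Jon Doe is a key player in this realm."
Ultimately, I want to repeat this across multiple rows, each with its own text.
I have looked into using stringr/regex, but it does not give me what I want: I cannot extract complete sentences. Any ideas?
Code I have tried: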
str_extract_all(text, paste0("([^\\s]+\\s){5}", keywords, "(\\s[^\\s]+){5}"))
-> This gives me a few words on either side
gsub(".*?([^\\.]*('water quality'|health)[^\\.]*).*","\\1", text, ignore.case = TRUE)
-> Also off
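Best answer
Create a pattern to search for from the keywords, put the data in a tibble, split it into sentences (splitting on periods), and, for each row n where the pattern is found, select rows n-1, n, and n+1.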
library(dplyr)
library(tidyr)
keywords <- c("water quality", "health")
pat <- paste0(keywords, collapse = '|')
pat
#[1] "water quality|health"
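tibble(text) %>%
  # one row per sentence: split on a period and any whitespace after it
  separate_rows(text, sep = '\\.\\s*') %>%
  slice({
    # rows whose sentence matches any keyword
    tmp <- grep(pat, text, ignore.case = TRUE)
    # matched rows plus one neighbour on each side
    sort(unique(c(tmp - 1, tmp, tmp + 1)))
  })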
# text
# <chr>
#1 Then in the middle, there is a sentence that I want to extract
#2 Water quality is a serious concern in Akron, Ohio
#3 It can impact ecological systems and human health
#4 Jon Doe is a key player in this realm
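The question also asks for a window of up to 3 sentences and for running this over many rows, while the pipeline above fixes the window at 1 and handles a single text. Here is a minimal base R sketch of one way to generalize it. It is not part of the original answer: `extract_context`, its `window` argument, and the example data frame `df` are hypothetical names chosen for illustration. It assumes a sentence ends at `.`, `!` or `?` followed by whitespace, so it will split wrongly on abbreviations such as "Dr. Smith", and it assumes the keywords contain no regex metacharacters.

```r
# Hypothetical helper (not from the original answer): returns every sentence
# that contains a keyword, plus `window` sentences on each side of it.
extract_context <- function(text, keywords, window = 1) {
  # Split on whitespace that follows sentence-ending punctuation,
  # so each sentence keeps its final period.
  sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

  # Same alternation pattern as in the answer, e.g. "water quality|health".
  pat <- paste0(keywords, collapse = "|")
  hits <- grep(pat, sentences, ignore.case = TRUE)
  if (length(hits) == 0) return(NA_character_)

  # All sentence indices within `window` of a hit, clipped to the valid range.
  idx <- unlist(lapply(hits, function(i) seq(i - window, i + window)))
  idx <- sort(unique(idx[idx >= 1 & idx <= length(sentences)]))

  paste(sentences[idx], collapse = " ")
}

text <- "This is an article about random things. Usually, there are a few sentences that are irrelevant to what I am interested in. Then in the middle, there is a sentence that I want to extract. Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health. Jon Doe is a key player in this realm. Then the article goes on talking about something else that I don't care about."
keywords <- c("water quality", "health")

# Illustrative data frame with one text per row (the second row has no match).
df <- data.frame(
  id = 1:2,
  text = c(text, "Nothing relevant here. Just filler."),
  stringsAsFactors = FALSE
)

# Apply the helper row by row; rows without a match get NA.
df$context <- vapply(
  df$text, extract_context, character(1),
  keywords = keywords, window = 1, USE.NAMES = FALSE
)
```

Changing `window = 1` to `window = 3` widens the extraction to three sentences on each side. Overlapping windows from neighbouring hits are merged by `unique()`, so no sentence is repeated.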
The original question, "r - Extract multiple sentences around a keyword from a text cell", is on Stack Overflow: https://stackoverflow.com/questions/66449007/