r - Extract multiple sentences around a keyword from a text cell

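I am trying to search for keywords in a large text in R. When one is found, I want to extract the sentence that contains the keyword plus one sentence before and one sentence after it. Ideally, I would like to be able to change the code so that it extracts up to 3 sentences on either side of the keyword. Sample data is below.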

text <- "This is an article about random things. Usually, there are a few sentences that are irrelevant to what I am interested in. Then in the middle, there is a sentence that I want to extract. Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health. Jon Doe is a key player in this realm. Then the article goes on talking about something else that I don't care about."

keywords <- c("water quality", "health")

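So for the text above, I want to search for "water quality" and "health", and when there is a match, extract everything from "Then in the middle, there is..." through "Jon Doe is a key player in this realm."

Eventually, I want to repeat this over multiple rows, each with its own text.

I have looked into using stringr/regex, but it isn't giving me what I want: I can't extract complete sentences. Any ideas?

Code I have tried: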

str_extract_all(text, paste0("([^\\s]+\\s){5}", keywords, "(\\s[^\\s]+){5}"))

-> this gives me a few words on either side of the keyword, not whole sentences

gsub(".*?([^\\.]*('water quality'|health)[^\\.]*).*","\\1", text, ignore.case = TRUE)

-> also not quite right

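Best answer

Build a search pattern from the keywords, put the data in a tibble, split it into one sentence per row (splitting on periods), and for every row n where the pattern matches, select rows n-1, n, and n+1.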

library(dplyr)
library(tidyr)

keywords <- c("water quality", "health")
pat <- paste0(keywords, collapse = '|')
pat
#[1] "water quality|health"

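tibble(text) %>%
  # one row per sentence; splitting on the period also drops it
  separate_rows(text, sep = '\\.\\s*') %>%
  slice({
    # row numbers of the sentences that contain a keyword
    tmp <- grep(pat, text, ignore.case = TRUE)
    # each matching row plus the rows directly before and after it
    sort(unique(c(tmp-1, tmp, tmp + 1)))
  })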

#  text                                                          
#  <chr>                                                         
#1 Then in the middle, there is a sentence that I want to extract
#2 Water quality is a serious concern in Akron, Ohio             
#3 It can impact ecological systems and human health             
#4 Jon Doe is a key player in this realm
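
The question also asks for an adjustable window (up to 3 sentences on either side) and for running this over many rows. The following is a minimal base R sketch of one way to do that, not part of the original answer. The helper name `extract_context`, its `window` argument, and the `docs` data frame are hypothetical and introduced only for illustration. It assumes simple prose in which every ".", "!" or "?" followed by whitespace ends a sentence, so abbreviations such as "e.g." would be split incorrectly.

```r
# Sample data from the question.
text <- "This is an article about random things. Usually, there are a few sentences that are irrelevant to what I am interested in. Then in the middle, there is a sentence that I want to extract. Water quality is a serious concern in Akron, Ohio. It can impact ecological systems and human health. Jon Doe is a key player in this realm. Then the article goes on talking about something else that I don't care about."
keywords <- c("water quality", "health")

# Hypothetical helper: return the sentences that lie within `window`
# sentences of any keyword match, pasted back into a single string.
extract_context <- function(text, keywords, window = 1) {
  # Split on whitespace that follows sentence-ending punctuation, so each
  # sentence keeps its final period (unlike splitting on the period itself).
  sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

  # Keywords are treated as regular expressions, as in the answer above.
  pat <- paste(keywords, collapse = "|")
  hits <- grep(pat, sentences, ignore.case = TRUE)

  # No keyword found: return a missing value rather than an empty string.
  if (length(hits) == 0) return(NA_character_)

  # For each hit, take the indices from hit - window to hit + window.
  idx <- unlist(lapply(hits, function(i) seq(i - window, i + window)))

  # Drop duplicates and indices that fall before the first or after the
  # last sentence.
  idx <- sort(unique(idx[idx >= 1 & idx <= length(sentences)]))

  paste(sentences[idx], collapse = " ")
}

# One sentence on either side of each match.
one_each_side <- extract_context(text, keywords, window = 1)

# Applying the helper to several rows: `docs` is a made-up two-row example.
docs <- data.frame(
  id = 1:2,
  text = c(text, "Nothing relevant here."),
  stringsAsFactors = FALSE
)
docs$context <- vapply(
  docs$text, extract_context, character(1),
  keywords = keywords, window = 1, USE.NAMES = FALSE
)
```

Changing `window` to 2 or 3 widens the selection. Overlapping windows from neighbouring matches are merged by `unique()`, so no sentence is repeated, and rows without any keyword get `NA` in `context`.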

Regarding "r - Extract multiple sentences around a keyword from a text cell", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66449007/
