regex - How to extract sentences containing specific person names using R

I am using R to extract sentences containing specific person names from text. Here is a sample paragraph:

Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.

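This short paragraph contains several person names, for example Johann Reuchlin, Melanchthon, and Johann Eck. With the help of the openNLP package, three person names (Martin Luther, Paul, and Melanchthon) can be correctly extracted and recognized. That leaves me with two questions: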

How can I extract the sentences containing these names?

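Since the output of the named entity recognizer is not that promising, suppose I mark each name with "[[ ]]", for example [[Johann Reuchlin]], [[Melanchthon]]. How can I extract the sentences containing these name expressions [[A]], [[B]] ...?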

Best answer

This uses `strsplit` and `grep`. First I created an object `para` that holds your paragraph.

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]


> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                                    
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"                                                                               
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Or, a bit cleaner:

sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]
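
Splitting on `"\\."` drops the final period and leaves a leading space on every sentence after the first, and a bare `Paul` pattern would also match inside a longer word. The following is a minimal sketch of one way to tidy that up, not part of the original answer: it assumes sentences end in `.`, `!` or `?` followed by whitespace (so abbreviations such as "Dr." would still be split incorrectly), and the names `pattern` and `hits` are introduced here for illustration only.

```r
# The paragraph from the question; "\u00fc" is the u-umlaut in "Tübingen".
para <- paste(
  "Opposed as a reformer at T\u00fcbingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin.",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# Split on the whitespace that follows ., ! or ? so the punctuation is kept
# with its sentence; trimws() removes any stray surrounding whitespace.
sentences <- trimws(unlist(strsplit(para, "(?<=[.!?])\\s+", perl = TRUE)))

# Wrap the alternation in \\b ... \\b so each name must match as a whole word.
pattern <- paste0("\\b(", paste(toMatch, collapse = "|"), ")\\b")

# grepl() returns one TRUE/FALSE per sentence, which indexes the vector.
hits <- sentences[grepl(pattern, sentences, perl = TRUE)]
print(hits)
```

The lookbehind `(?<=[.!?])` needs `perl = TRUE`; for text with abbreviations or decimal numbers a dedicated sentence tokenizer would likely be more reliable than a regex split.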

If you want the sentences for each person as separate return values, then:

toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)

[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

Edit 3: To include each person's name in the result, do something simple such as:

foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
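
As an alternative to prepending the name to the result vector, the list elements themselves can be named after the search terms. This is only a sketch under the assumption that the names should be matched literally (hence `fixed = TRUE`); the shortened `sentences` vector and the name `byName` are illustrative and not from the original answer.

```r
# A shortened, illustrative set of sentences from the paragraph.
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture"
)
toMatch <- c("Paul", "Melanchthon", "Martin Luther")

# sapply() with simplify = FALSE returns a list; because the input is a
# character vector, its values become the names of the list elements.
byName <- sapply(
  toMatch,
  function(name) sentences[grepl(name, sentences, fixed = TRUE)],
  simplify = FALSE
)
print(byName)
```

With a named list, `byName[["Melanchthon"]]` retrieves the sentences for one person directly, and a name that matches nothing yields an empty character vector rather than a vector holding only the name.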

Edit 4:

If you want to find sentences that contain more than one person/place/thing (word), add a pattern that requires both of them, for example:

toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")

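and set `perl` to `TRUE`, because lookahead assertions such as `(?=...)` are only supported by the Perl-compatible regex engine: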

foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = TRUE)])}


> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"                                                                                                                                         
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] "Paul"                                                                   
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] "Melanchthon"                                                                                                                          
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"                                                                                                     
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
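
The lookahead pattern `(?=.*Melanchthon)(?=.*Scripture)` expresses "contains both words". A possibly easier-to-read alternative, sketched here under the assumption that the words should be matched literally, is to test each word separately and combine the results with a logical AND. The names `mustContain`, `matchesEach` and `containsAll` are hypothetical, and the `sentences` vector is a shortened illustration.

```r
# A shortened, illustrative set of sentences from the paragraph.
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture"
)
mustContain <- c("Melanchthon", "Scripture")

# One logical vector per required word: does each sentence contain it?
matchesEach <- lapply(mustContain, grepl, x = sentences, fixed = TRUE)

# Element-wise AND across all of those vectors.
containsAll <- Reduce(`&`, matchesEach)

print(sentences[containsAll])
```

This scales to any number of required words without building a longer regular expression, and `fixed = TRUE` avoids surprises when a word contains regex metacharacters.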

Edit 5: To answer your other question:

Given:

sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"

gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])

will give you the words inside the double brackets.

> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen"        "Wittenberg"      "Martin Luther"   "Johann Reuchlin"
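
The second question also asks for the sentences that contain a given `[[...]]` expression. A minimal sketch of that step, assuming the text has already been split into a `sentences` vector and the names are tagged exactly as `[[Name]]`, could look like this. The sample sentences and the variables `target`, `withTarget` and `namesPerSentence` are illustrative and not from the original answer.

```r
# Illustrative sentences with names tagged in double brackets.
sentences <- c(
  "He accepted a call to the University of [[Wittenberg]] by [[Martin Luther]]",
  "[[Melanchthon]] became professor of the Greek language in [[Wittenberg]]",
  "He was present at the disputation of Leipzig (1519) as a spectator"
)

# fixed = TRUE treats the brackets literally, so no escaping is needed.
target <- "[[Melanchthon]]"
withTarget <- sentences[grepl(target, sentences, fixed = TRUE)]
print(withTarget)

# The same bracket-stripping idea applied per sentence: regmatches() returns
# one vector of tagged expressions for each element of `sentences`.
tagged <- regmatches(sentences, gregexpr("\\[\\[.*?\\]\\]", sentences))
namesPerSentence <- lapply(tagged, function(m) gsub("\\[\\[|\\]\\]", "", m))
print(namesPerSentence)
```

Sentences without any tagged name produce an empty character vector in `namesPerSentence`, which makes them easy to filter out with `lengths(namesPerSentence) > 0`.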

This page on "regex - How to extract sentences containing specific person names using R" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/31535154/
