regex - How to extract sentences containing specific person names using R

Tags: regex r tm opennlp

I am using R to extract sentences containing specific person names from a text. Here is a sample paragraph:

Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments. Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium.



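This short paragraph contains several person names, such as Johann Reuchlin, Melanchthon, and Johann Eck. With the help of the openNLP package, three of the names (Martin Luther, Paul, and Melanchthon) can be extracted and recognized correctly. I have two questions:
  • How can I extract the sentences containing these names?
  • Since the output of the named entity recognizer is not very promising, if I wrap each name in "[[ ]]", e.g. [[Johann Reuchlin]], [[Melanchthon]], how can I extract the sentences containing these name expressions [[A]], [[B]] ...?

Best answer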

    Use `strsplit` and `grep`. First I created an object `para` that holds your paragraph as a single string.
    
    toMatch <- c("Martin Luther", "Paul", "Melanchthon")
    
    unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
    
    
    > unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
    [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
    [2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                                    
    [3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"                                                                               
    [4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"    
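The one-liner above assumes `para` already exists and leaves a leading space on every sentence after the first. A minimal self-contained sketch of the same approach follows; building `para` with `paste` and adding `trimws` are assumptions made here for illustration, not part of the original answer.

```r
# The sample paragraph from the question, stored as a single string.
para <- paste(
  "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin.",
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21.",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine.",
  "He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
)

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# Split on literal periods, then strip the space left at the start of each piece.
sentences <- trimws(unlist(strsplit(para, split = "\\.")))

# Keep the sentences that mention at least one of the names.
hits <- sentences[grepl(paste(toMatch, collapse = "|"), sentences)]
print(hits)
```

The sentence about the disputation of Leipzig is dropped because it names none of the three people.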
    

    Or, a little cleaner:
    sentences<-unlist(strsplit(para,split="\\."))
    sentences[grep(paste(toMatch, collapse="|"),sentences)]
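Splitting on `"\\."` removes the period from every sentence and would also split inside a decimal number. One possible alternative, offered only as a sketch and not as the answer's method, is to split on the whitespace that follows sentence-ending punctuation. The lookbehind pattern below is an assumption about what counts as a sentence boundary and will still break on abbreviations such as "Dr. Smith".

```r
# Two sentences taken from the sample paragraph.
para <- "Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine."

# Split on whitespace preceded by ., ! or ?; lookbehind requires perl = TRUE.
sentences <- unlist(strsplit(para, "(?<=[.!?])\\s+", perl = TRUE))

# Each sentence keeps its final period and has no leading space.
print(sentences[grepl("Paul", sentences)])
```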
    

    If you want the sentences for each person as separate return values, then:
    toMatch <- c("Martin Luther", "Paul", "Melanchthon")
    sentences<-unlist(strsplit(para,split="\\."))
    foo<-function(Match){sentences[grep(Match,sentences)]}
    lapply(toMatch,foo)
    
    [[1]]
    [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
    
    [[2]]
    [1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
    
    [[3]]
    [1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
    [2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
    

    Edit 3: To include each person's name in the result, do something simple like:
    foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}
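Another way to keep track of which result belongs to which person is to name the list elements rather than prepend the name to each vector. The sketch below is a variation, not the answer's code: `find_sentences` is a hypothetical helper, and the `\\b` word boundaries are an added assumption so that "Paul" would not also match a longer word such as "Paulus". It further assumes the names contain no regex metacharacters.

```r
# Three sentences from the sample paragraph, already split.
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)
toMatch <- c("Paul", "Melanchthon")

# Hypothetical helper: match the name only as a whole word.
find_sentences <- function(name) {
  sentences[grepl(paste0("\\b", name, "\\b"), sentences, perl = TRUE)]
}

# A list with one element per name, named after the name it was searched for.
by_name <- setNames(lapply(toMatch, find_sentences), toMatch)
print(by_name)
```

With a named list, `by_name[["Melanchthon"]]` returns only that person's sentences.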
    

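    Edit 4:

    If you want to find sentences that contain several people/places/things (words) at once, just add a pattern that combines them, for example: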
    toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")
    

    and set `perl = TRUE`, because lookaheads are only supported by the Perl-compatible regex engine:
    foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = TRUE)])}
    
    
    > lapply(toMatch,foo)
    [[1]]
    [1] "Martin Luther"                                                                                                                                         
    [2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
    
    [[2]]
    [1] "Paul"                                                                   
    [2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
    
    [[3]]
    [1] "Melanchthon"                                                                                                                          
    [2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
    [3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
    
    [[4]]
    [1] "(?=.*Melanchthon)(?=.*Scripture)"                                                                                                     
    [2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
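The pattern `(?=.*Melanchthon)(?=.*Scripture)` works because each lookahead asserts that its word appears somewhere later in the string without consuming any characters, so both conditions are checked from the same position. If lookaheads feel opaque, the same "all of these words" test can be sketched with one `grepl` call per term combined by a logical AND. The names `required` and `has_all` are hypothetical, and `fixed = TRUE` is an assumption that the terms are literal strings rather than patterns.

```r
# Three sentences from the sample paragraph, already split.
sentences <- c(
  "Melanchthon became professor of the Greek language in Wittenberg at the age of 21",
  "He studied the Scripture, especially of Paul, and Evangelical doctrine",
  "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
)
required <- c("Melanchthon", "Scripture")

# One logical vector per term, combined element-wise with AND.
has_all <- Reduce(`&`, lapply(required, function(term) {
  grepl(term, sentences, fixed = TRUE)
}))

print(sentences[has_all])
```

Only the last sentence contains both terms, so it is the only one returned.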
    

    Edit 5: To answer your other question:

    Given:
    sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"
    
    gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
    

    will give you the words inside the double brackets:
    > gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
    [1] "Tübingen"        "Wittenberg"      "Martin Luther"   "Johann Reuchlin"
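The code above pulls the bracketed words out of one sentence. The second question also asks how to get the sentences that contain such expressions, which can be sketched by combining the bracket pattern with the earlier filtering approach. The `marked` vector is hypothetical annotated input, shortened from the sample paragraph for illustration.

```r
# Hypothetical sentences annotated in the [[...]] style proposed in the question.
marked <- c(
  "He accepted a call to the University of Wittenberg by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]",
  "[[Melanchthon]] became professor of the Greek language in Wittenberg at the age of 21",
  "He was present at the disputation of Leipzig (1519) as a spectator"
)

# Sentences containing at least one [[...]] expression.
with_names <- marked[grepl("\\[\\[.*?\\]\\]", marked, perl = TRUE)]

# Sentences tagged with one specific name; fixed = TRUE avoids escaping the brackets.
target <- "[[Melanchthon]]"
with_target <- marked[grepl(target, marked, fixed = TRUE)]

print(with_names)
print(with_target)
```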
    

    This question, "regex - How to extract sentences containing specific person names using R", comes from Stack Overflow: https://stackoverflow.com/questions/31535154/
