r - Extracting in-text citations (strings) from text in R

Tags: r regex text gsub citations

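I am trying to write a function that lets me paste in written text and returns a list of the in-text citations used in the writing. For example, this is what I have so far: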

pull_cites <- function(text) {
  gsub("[\\(\\)]", "", regmatches(text, gregexpr("\\(.*?\\)", text))[[1]])
}
    
pull_cites("This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in 
    parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is 
    something I would want to be returned. I would also want multiple citations returned separately such as 
    (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015.")

In this example, it returns

[1] "cites"                              "abbr"                               "Smith 2010"                        
[4] "Smith 2010; Jones 2001; Brown 2020" "2015"

But I would like it to return something like this:

[1] "Smith 2010"
[2] "Smith 2010"                
[3] "Jones 2001"
[4] "Brown 2020"
[5] "Cooper 2015"

Any ideas on how to make this function more specific? I am using R. Thanks!

Best answer

With a regex that is not too difficult, we can do the following:

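library(tidyverse)  # provides stringr and the %>% pipe

pull_cites <- function(text) {
  # First alternative: "(Author ... YYYY)"; second alternative: "Author (YYYY"
  pattern <- "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))|[A-Z][a-z]+ \\([12][0-9]{3}[^()]*"
  str_extract_all(text, pattern, simplify = TRUE) %>%
    gsub("\\(", "", .) %>%  # drop the "(" kept by the second alternative
    str_split("; ") %>%     # split grouped citations into separate ones
    unlist()
}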

pull_cites("This is a test. I only want to select the (cites) in parenthesis. 
            I do not want it to return words in parenthesis that do not have years attached, 
            such as abbreviations (abbr). For example, citing (Smith 2010) is something I would 
            want to be returned. I would also want multiple citations returned separately such 
            as (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned 
            as Cooper 2015, and not just 2015. other aspects of life 
            history (Nye et al. 2010; Runge et al. 2010; Lesser 2016). In the Gulf of Maine, 
            annual sea surface temperature (SST) averages have increased a total of roughly 1.6 °C 
            since 1895 (Fernandez et al. 2020)")

[1] "Smith 2010"            "Smith 2010"           
[3] "Jones 2001"            "Brown 2020"           
[5] "Cooper 2015"           "Nye et al. 2010"      
[7] "Runge et al. 2010"     "Lesser 2016"          
[9] "Fernandez et al. 2020"

Explanation of the regex in str_extract_all():

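  • (?<=\\() is a lookbehind: it requires a left parenthesis ( immediately before the match without including it in the match (\\ is the double escape needed in an R string)
  • [A-Z][a-z][^()]* matches an uppercase letter, then a lowercase letter, then zero or more characters that are not parentheses ([^()]* was suggested by @WiktorStribiżew)
  • (?=\\)) is a lookahead: it requires a right parenthesis ) immediately after the match without including it in the match
  • [12][0-9]{3} matches the year; I assume a year starts with 1 or 2 and is followed by three more digits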
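
To see these pieces working together, here is a minimal base R sketch of the first alternative only. It is a variation on the answer, not the answer's own code: it swaps in regmatches() and gregexpr() with perl = TRUE (needed for lookarounds in base R) in place of stringr, and the sample sentence is made up for illustration.

```r
# Base R sketch of the first alternative only (citations inside parentheses).
# Assumption: sample text invented for this illustration.
text <- "An abbreviation (abbr), an acronym (SST), one citation (Smith 2010), and several (Nye et al. 2010; Lesser 2016)."

# Lookbehind and lookahead assert the parentheses without capturing them.
pattern <- "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))"

# perl = TRUE enables PCRE, which supports lookarounds.
inside <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

# Grouped citations come back as one string, so split them on "; ".
cites <- unlist(strsplit(inside, "; ", fixed = TRUE))
print(cites)
```

Because the parentheses are only asserted, no cleanup of ( or ) is needed for this branch; (abbr) and (SST) are skipped because they do not start with an uppercase letter followed by a lowercase one, or do not end in a year.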

The regex below handles the special case of the pattern Cooper (2015):

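  • [A-Z][a-z]+ \\([12][0-9]{3}[^()]* matches an uppercase letter followed by one or more lowercase letters, a space, a left parenthesis (, and then the "year" defined above, plus any following characters that are not parentheses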
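
The narrative-citation branch can be tried on its own in the same way. This is again a hedged base R sketch rather than the answer's code, with an invented sample sentence; it also shows that the trailing [^()]* keeps anything that follows the year inside the parentheses.

```r
# Base R sketch of the second alternative only (Author (YYYY) form).
# Assumption: sample text invented for this illustration.
text <- "Cooper (2015) and Jones (2001, p. 4) agree, but a bare year (2015) or (Smith 2010) is not matched by this branch."

# Capitalised word, space, "(", a year, then any non-parenthesis characters.
pattern <- "[A-Z][a-z]+ \\([12][0-9]{3}[^()]*"

raw <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

# The "(" is part of the match here, so remove it afterwards.
narrative <- gsub("(", "", raw, fixed = TRUE)
print(narrative)
```

Note that this branch expects a single capitalised word directly before the parenthesis, so a narrative citation such as Nye et al. (2010) would not be picked up by it as written.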

Regarding "r - Extracting in-text citations (strings) from text in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71396883/
