r - Extracting in-text citations (strings) from text in R

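I am trying to write a function that lets me paste in written text and returns a list of the in-text citations used in the writing. For example, this is what I have so far: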

pull_cites <- function(text) {
  gsub("[\\(\\)]", "", regmatches(text, gregexpr("\\(.*?\\)", text))[[1]])
}
    
pull_cites("This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in 
    parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is 
    something I would want to be returned. I would also want multiple citations returned separately such as 
    (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015.")

In this example, it returns

[1] "cites"                              "abbr"                               "Smith 2010"                        
[4] "Smith 2010; Jones 2001; Brown 2020" "2015"

But I would like it to return something like this:

[1] "Smith 2010"
[2] "Smith 2010"                
[3] "Jones 2001"
[4] "Brown 2020"
[5] "Cooper 2015"

Any ideas on how to make this function more specific? I am using R. Thanks!

Best answer

With a regular expression that is not too complicated, we can do the following:

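library(tidyverse)

pull_cites <- function(text) {
  # First alternative: citations fully inside parentheses, e.g. (Smith 2010; Jones 2001)
  # Second alternative: narrative citations, e.g. Cooper (2015)
  pattern <- "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))|[A-Z][a-z]+ \\([12][0-9]{3}[^()]*"
  str_extract_all(text, pattern, simplify = TRUE) %>%
    gsub("\\(", "", .) %>%    # drop the "(" kept by the second alternative
    str_split(., "; ") %>%    # split groups of several citations
    unlist()
}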

pull_cites("This is a test. I only want to select the (cites) in parenthesis. 
            I do not want it to return words in parenthesis that do not have years attached, 
            such as abbreviations (abbr). For example, citing (Smith 2010) is something I would 
            want to be returned. I would also want multiple citations returned separately such 
            as (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned 
            as Cooper 2015, and not just 2015. other aspects of life 
            history (Nye et al. 2010; Runge et al. 2010; Lesser 2016). In the Gulf of Maine, 
            annual sea surface temperature (SST) averages have increased a total of roughly 1.6 °C 
            since 1895 (Fernandez et al. 2020)")

[1] "Smith 2010"            "Smith 2010"           
[3] "Jones 2001"            "Brown 2020"           
[5] "Cooper 2015"           "Nye et al. 2010"      
[7] "Runge et al. 2010"     "Lesser 2016"          
[9] "Fernandez et al. 2020"
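
For readers who would rather not load the tidyverse, the same pattern can be applied with base R, as in the question's original attempt. This is only a sketch: the name pull_cites_base is hypothetical, and it assumes the pattern behaves the same under PCRE (perl = TRUE) as under stringr's ICU engine, which holds for the constructs used here.

```r
# Base R sketch of the same idea; pull_cites_base is a hypothetical name.
pull_cites_base <- function(text) {
  pattern <- "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))|[A-Z][a-z]+ \\([12][0-9]{3}[^()]*"
  # perl = TRUE is required because lookbehind/lookahead are PCRE features
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  # Remove the "(" captured by the narrative-citation alternative
  hits <- gsub("(", "", hits, fixed = TRUE)
  # Split groups such as "Smith 2010; Jones 2001" into single citations
  unlist(strsplit(hits, "; ", fixed = TRUE))
}

demo <- "citing (Smith 2010) and (Smith 2010; Jones 2001; Brown 2020), plus Cooper (2015), not (abbr)."
pull_cites_base(demo)
# [1] "Smith 2010"  "Smith 2010"  "Jones 2001"  "Brown 2020"  "Cooper 2015"
```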

Explanation of the regex in str_extract_all():

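(?<=\\() is a positive lookbehind: the match must start immediately after a left parenthesis (, which is not itself included in the match (the backslash is doubled, \\, because it must be escaped inside an R string)
[A-Z][a-z][^()]* matches one uppercase letter, followed by one lowercase letter, followed by zero or more characters that are not parentheses ([^()]* was suggested by @WiktorStribiżew)
(?=\\)) is a positive lookahead: the match must be immediately followed by a right parenthesis ), which is not itself included in the match
[12][0-9]{3} matches the year; I assume a year starts with 1 or 2 and is followed by three more digits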
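
To see what these pieces do together, here is a small sketch that runs only the first alternative of the pattern. The variable names and the sample sentence are made up for illustration.

```r
library(stringr)

# First alternative only: citations fully enclosed in parentheses
inside_parens <- "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))"

x <- "See (abbr), (Smith 2010) and (Smith 2010; Jones 2001)."
found <- str_extract_all(x, inside_parens)[[1]]
found
# [1] "Smith 2010"             "Smith 2010; Jones 2001"

# Groups of several citations are separated afterwards
unlist(str_split(found, "; "))
# [1] "Smith 2010" "Smith 2010" "Jones 2001"
```

The lookarounds keep the parentheses out of the result, and (abbr) is skipped because it neither starts with an uppercase letter nor ends in a four-digit year.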

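The second alternative of the regex handles the special case of the pattern Cooper (2015):

[A-Z][a-z]+ \\([12][0-9]{3}[^()]* matches an uppercase letter followed by one or more lowercase letters, then a space, then a left parenthesis (, then the "year" defined above, plus any further non-parenthesis characters before the closing parenthesis. The ( captured here is removed afterwards by gsub().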
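
The narrative-citation alternative can be tried on its own in the same way. Again, this is a sketch with made-up variable names and sample sentences.

```r
library(stringr)

# Second alternative only: Author (Year) written in running text
narrative <- "[A-Z][a-z]+ \\([12][0-9]{3}[^()]*"

y <- "I would also want Cooper (2015) returned, but not SST (abbr)."
raw <- str_extract_all(y, narrative)[[1]]
raw
# [1] "Cooper (2015"

# The left parenthesis is part of the match, so it is stripped afterwards
gsub("\\(", "", raw)
# [1] "Cooper 2015"

# Only one capitalized word directly before the parenthesis is captured
str_extract_all("as shown by Nye et al. (2010)", narrative)[[1]]
# character(0)
```

As the last call shows, a narrative citation with "et al." between the name and the year is not picked up by this alternative as written.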

Regarding "r - Extracting in-text citations (strings) from text in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/71396883/
