Question: Hi all, I have this sample data frame containing institution names that I need to extract:
mydf<- data.frame(ID=c('1', '2', '3'), Institution=c('Univ of Space, TX, US', '[Bloggs, J., Smith, T.] Univ of Time, CA, US', '[Windz, P., Lol, D.] College of the World, CA, US' ))
I only need to extract the institution names, so that the result looks like this:
1 Univ of Space
2 Univ of Time
3 College of the World
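I don't care about any other characters in the institution string, only the name that ends at the first comma after it. The problem is that in some cases the institution name is preceded by a bracketed author list, and in others it appears on its own (as in the first row).
I wrote the following code to extract each of these two cases separately: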
ex_inst<- str_extract_all(mydf$Institution,"(?<=])(.+?)(?=,)", simplify = TRUE)
ex_inst2<- str_extract_all(mydf$Institution,"^(.+?)(?=,)", simplify = TRUE)
I am struggling to combine them. I looked into alternation and tried this:
ex_inst3<- str_extract_all(mydf$Institution,"^(.+?)(?=,)|(?<=])(.+?)(?=,)", simplify = TRUE)
But I'm not experienced with regular expressions and I'm confused by what it outputs:
[1,] "Univ of Space" ""
[2,] "[Bloggs" " Univ of Time"
[3,] "[Windz" " College of the World"
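What is the best way to combine them with stringr? Could I use some kind of if else statement? Thanks.
Best answer
We can use str_replace to capture the characters that are not a comma (,) while removing the square brackets and anything inside them: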
library(stringr)
str_replace(mydf$Institution,"^(\\[[^]]*\\]\\s*)?([^,]+),.*", "\\2")
#[1] "Univ of Space" "Univ of Time" "College of the World"
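As a sketch of how the pieces of that pattern fit together, the same idea can be written with str_match, which returns the capture group directly instead of replacing the rest of the string. This is an illustrative variant, not the answer's own code; the variable names x, pattern and inst are made up for the example.

```r
library(stringr)

# Sample input mirroring the question's data
x <- c("Univ of Space, TX, US",
       "[Bloggs, J., Smith, T.] Univ of Time, CA, US",
       "[Windz, P., Lol, D.] College of the World, CA, US")

# ^                    anchor at the start of the string
# (?:\[[^\]]*\]\s*)?   optional "[...]" author block plus trailing whitespace (not captured)
# ([^,]+)              capture group: everything up to the first comma
pattern <- "^(?:\\[[^\\]]*\\]\\s*)?([^,]+)"

# str_match() returns a character matrix:
# column 1 is the full match, column 2 is the first capture group
inst <- str_match(x, pattern)[, 2]
inst
```

Because the bracketed block is optional, one pattern covers both the rows with an author list and the rows without one, so no alternation is needed.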
Or use the same pattern with sub:
sub("^(\\[[^]]*\\]\\s*)?([^,]+),.*", "\\2", mydf$Institution)
#[1] "Univ of Space" "Univ of Time" "College of the World"
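Since the question asks about an if else approach, here is a minimal sketch of what that could look like with ifelse, str_detect and str_extract. It assumes every bracketed author list is followed by a single whitespace character before the institution name; the names x, has_authors and inst are hypothetical.

```r
library(stringr)

# Sample input mirroring the question's data
x <- c("Univ of Space, TX, US",
       "[Bloggs, J., Smith, T.] Univ of Time, CA, US",
       "[Windz, P., Lol, D.] College of the World, CA, US")

# TRUE for rows that start with a bracketed author list
has_authors <- str_detect(x, "^\\[")

inst <- ifelse(
  has_authors,
  # bracket case: text after "] " up to the next comma
  str_extract(x, "(?<=\\]\\s)[^,]+"),
  # plain case: text from the start up to the first comma
  str_extract(x, "^[^,]+")
)
inst
```

This works, but it runs two extractions per row and needs a third pattern to choose between them, which is why the single pattern with an optional group is usually the simpler option.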
Update
If an entry contains multiple institutions, one option is to split it into its components and then apply the same code as above:
v1 <- unlist(strsplit(as.character(mydf$Institution), ";\\s(?=\\[)", perl = TRUE))
sub("^(\\[[^]]*\\]\\s*)?([^,]+),.*", "\\2", v1)
#[1] "Univ of Space" "Univ of Time"
#[3] "College of the World" "College of the World" "Space Institute"
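To see why the split pattern uses a lookahead, the sketch below applies it to the fourth entry alone. The pattern only splits on a semicolon and whitespace that are followed by an opening bracket, so the semicolon inside "[Bon, D.; Wilson, M.]" is left untouched. The names x and pieces are made up for the example.

```r
# Fourth entry from the data: two institutions separated by "; "
x <- "[Windz, P., Lol, D.] College of the World, CA, US; [Bon, D.; Wilson, M.] Space Institute, TX, US"

# ";\\s"      matches the separator itself (consumed by the split)
# "(?=\\[)"   lookahead: only split when a "[" follows (not consumed)
pieces <- strsplit(x, ";\\s(?=\\[)", perl = TRUE)[[1]]

pieces
length(pieces)
```

Note that this assumes every additional institution in an entry starts with a bracketed author list; a second institution without brackets would not be split off.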
If we need to append the result as a new column:
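# Split each entry into its institutions, keeping track of the ID
lst1 <- setNames(strsplit(as.character(mydf$Institution),
                          ";\\s(?=\\[)", perl = TRUE), mydf$ID)
# Long format: one row per institution, with columns 'values' and 'ind' (the ID)
mydf2 <- stack(lst1)
# Keep only the institution name in each piece
mydf2$values <- sub("^(\\[[^]]*\\]\\s*)?([^,]+),.*", "\\2", mydf2$values)
# Join back to the original data and paste the names together per ID
out1 <- aggregate(values ~ ., merge(mydf, mydf2, by.x = "ID", by.y = "ind"),
                  FUN = paste, collapse = '; ')
# aggregate() does not preserve the original row order, so sort by ID
out1[order(as.numeric(out1$ID)),]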
# ID Institution values
#4 1 Univ of Space, TX, US Univ of Space
#1 2 [Bloggs, J., Smith, T.] Univ of Time, CA, US Univ of Time
#2 3 [Windz, P., Lol, D.] College of the World, CA, US College of the World
#3 4 [Windz, P., Lol, D.] College of the World, CA, US; [Bon, D.; Wilson, M.] Space Institute, TX, US College of the World; Space Institute
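A more compact way to get the same new column, shown here only as an alternative sketch and not as the answer's own method, is to process each split entry with vapply and paste the names back together. This keeps the original row order, so no merge or reordering is needed. The names pattern and parts are hypothetical.

```r
# Data as given in the answer's "Data" section
mydf <- data.frame(
  ID = c("1", "2", "3", "4"),
  Institution = c(
    "Univ of Space, TX, US",
    "[Bloggs, J., Smith, T.] Univ of Time, CA, US",
    "[Windz, P., Lol, D.] College of the World, CA, US",
    "[Windz, P., Lol, D.] College of the World, CA, US; [Bon, D.; Wilson, M.] Space Institute, TX, US"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer: optional "[...]" block, then the name up to the comma
pattern <- "^(\\[[^]]*\\]\\s*)?([^,]+),.*"

# One character vector of pieces per row
parts <- strsplit(mydf$Institution, ";\\s(?=\\[)", perl = TRUE)

# Extract the name from each piece and collapse the names of a row into one string
mydf$values <- vapply(
  parts,
  function(p) paste(sub(pattern, "\\2", p), collapse = "; "),
  character(1)
)

mydf[, c("ID", "values")]
```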
Data
mydf <- structure(list(ID = c("1", "2", "3", "4"), Institution = c("Univ of Space, TX, US",
"[Bloggs, J., Smith, T.] Univ of Time, CA, US", "[Windz, P., Lol, D.] College of the World, CA, US",
"[Windz, P., Lol, D.] College of the World, CA, US; [Bon, D.; Wilson, M.] Space Institute, TX, US"
)), class = "data.frame", row.names = c(NA, -4L))
NOTE: Updated with the data from the comments.
Regarding "r - How to combine two regular expressions with if else using stringr", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59295825/