regex - R: regular expression in strsplit (finding ", " followed by a capital letter)

Tags: regex r strsplit

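Suppose I have a vector containing some strings that I want to split based on a regular expression.

More precisely, I want to split the strings at a comma, followed by a space, followed by a capital letter. As far as I understand, the regex for this is /(, [A-Z])/g, which worked fine when I tested it outside R.

When I try to do this in R, the regex does not seem to work. For example: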

x <- c("Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)",
  "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)")

strsplit(x, "/(, [A-Z])/g")
[[1]]
[1] "Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)"

It does not find any split. What am I doing wrong here?

Any help is much appreciated!

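Best answer

R patterns are plain strings, so the / delimiters and the g flag are matched literally and must be dropped; strsplit already splits at every match. Here is a solution: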

strsplit(x, ", (?=[A-Z])", perl=T)

See the IDEONE demo.

Output:

[[1]]
[1] "Non MMF investment funds"                                       
[2] "Insurance corporations"                                         
[3] "Assets (Net Acquisition of)"                                    
[4] "Loans"                                                          
[5] "Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations"                                                                                
[2] "Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds"
[3] "Assets (Net Acquisition of)"                                                                               
[4] "Loans"                                                                                                     
[5] "Short-term original maturity (up to 1 year)"

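The regex ", (?=[A-Z])" contains a lookahead, (?=[A-Z]), which checks for a capital letter without consuming it, so the letter stays at the start of the next piece. In R, a regex that contains lookarounds requires perl=T (short for perl=TRUE).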
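To see why the lookahead matters, here is a minimal sketch comparing a pattern that consumes the capital letter with the lookahead version. The sample string and the variable names are made up for illustration.

```r
# Hypothetical sample string, shortened from the data in the question
sample_text <- "Loans, Long-term original maturity, Assets"

# Without a lookahead the capital letter is part of the match,
# so strsplit removes it together with the ", "
consumed <- strsplit(sample_text, ", [A-Z]")[[1]]

# With a lookahead only the ", " is removed; the capital letter is kept
kept <- strsplit(sample_text, ", (?=[A-Z])", perl = TRUE)[[1]]

print(consumed)  # "Loans" "ong-term original maturity" "ssets"
print(kept)      # "Loans" "Long-term original maturity" "Assets"
```

The first result shows the damage a plain ", [A-Z]" pattern would do: every piece after the first loses its initial letter.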

If the space is optional, or if there can be a double space between the comma and the capital letter, use

strsplit(x, ",\\s*(?=[A-Z])", perl=T)
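As a quick check of the \\s* variant, the sketch below uses a made-up string with no space, two spaces, and a comma followed by a lowercase word. The input and variable names are assumptions for illustration only.

```r
# Hypothetical input with inconsistent spacing after the commas
messy <- "Loans,Assets,  Short-term maturity, other items"

# \\s* allows zero or more whitespace characters between the comma
# and the capital letter that the lookahead checks for
parts <- strsplit(messy, ",\\s*(?=[A-Z])", perl = TRUE)[[1]]

print(parts)  # "Loans" "Assets" "Short-term maturity, other items"
```

The comma before "other" is left alone because the lookahead does not find a capital letter there.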

There is also a variant that supports Unicode letters (using \\p{Lu}):

strsplit(x, ", (?=\\p{Lu})", perl=T)
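The difference from [A-Z] only shows up with non-ASCII capitals. The following sketch assumes an R build whose PCRE library has Unicode property support (the usual case). The sample string is invented, and it is written with \u escapes so that R marks it as UTF-8 regardless of the file encoding.

```r
# Hypothetical string: "Crédits, Épargne, Österreich, autres postes"
accented <- "Cr\u00e9dits, \u00c9pargne, \u00d6sterreich, autres postes"

# [A-Z] only covers ASCII capitals, so no split point is found here
ascii_parts <- strsplit(accented, ", (?=[A-Z])", perl = TRUE)[[1]]

# \\p{Lu} matches any Unicode uppercase letter, including É and Ö
unicode_parts <- strsplit(accented, ", (?=\\p{Lu})", perl = TRUE)[[1]]

print(length(ascii_parts))
print(unicode_parts)
```

If the data is known to be plain ASCII, [A-Z] is sufficient and the two patterns behave the same.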

This question, regex - R: regular expression in strsplit (finding ", " followed by a capital letter), comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/33759645/
