R regex for matching comma-separated parts in a column/vector

The original title of this question was "R Regex for word boundary excluding space". It reflected the way I was approaching the problem. However, this is a better solution to my particular problem. It should work as long as a single, consistent delimiter separates the items within a 'cell'.

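This must be simple, but I am stuck. I have a data frame column in which each cell (row) is a comma-separated list of items. I want to find the rows that contain a specific item.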

df <- data.frame(nms = c("XXXCAP,XXX CAPITAL LIMITED",
                         "XXX,XXX POLYMERS LIMITED, 3455",
                         "YYY,XXX REP LIMITED,999,XXX"),
                 b = c("A", "X", "T"))

                             nms b
1     XXXCAP,XXX CAPITAL LIMITED A
2 XXX,XXX POLYMERS LIMITED, 3455 X
3    YYY,XXX REP LIMITED,999,XXX T

I want to search for rows that have item XXX. Rows 2 and 3 should match. Row 1 has the string XXX as part of a larger string and obviously should not match.

However, because the second XXX in row 1 sits between a comma and a space, it still counts as a whole word, and I am having trouble filtering it out with \\b or [[:<:]]

grep("\\bXXX\\b", df$nms, value = FALSE) # matches rows 1, 2 and 3

The simplest way is of course strsplit(), but I would like to avoid it. Any suggestions regarding performance are welcome.

Best answer

When \b does not "work", the problem usually lies in the definition of a "whole word".

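A word boundary can occur in one of three positions:

Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.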
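
To see why \b lets row 1 through, here is a small sketch on three made-up strings (the `samples` vector is illustrative and not part of the question's data). A space is a non-word character, so \b finds a boundary between XXX and a following space just as it does next to a comma.

```r
# Illustrative strings only; not taken from the question's data frame.
samples <- c("XXXCAP", "XXX CAPITAL", "a,XXX,b")

# \b only requires a word/non-word transition, so a space counts as a boundary.
hits <- grepl("\\bXXX\\b", samples, perl = TRUE)
print(hits)
# "XXXCAP":      no boundary between X and C, so no match
# "XXX CAPITAL": boundary before the space, so it matches
# "a,XXX,b":     boundaries at both commas, so it matches
```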

It seems you only want to match a word that sits between commas or the start/end of the string.

You can use a PCRE regex (note the perl=TRUE argument) such as

(?<![^,])XXX(?![^,])

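See the regex demo (the expression was "converted" to use positive lookarounds there because the demo runs against a single multiline string).

Details

(?<![^,]) (equivalent to (?<=^|,) ) - the match must be preceded by the start of the string or a comma
XXX - the literal word XXX
(?![^,]) (equivalent to (?=$|,) ) - the match must be followed by the end of the string or a comma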

R demo:

> grep("(?<![^,])XXX(?![^,])",df$nms, value = FALSE, perl=TRUE)
## => [1] 2 3
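
As a quick check of the equivalence stated in the details, the sketch below runs both the negative-lookaround pattern and its positive-lookaround form against the question's data frame and compares the results. The variable names `neg` and `pos` are only for illustration.

```r
# Same data frame as in the question.
df <- data.frame(
  nms = c("XXXCAP,XXX CAPITAL LIMITED",
          "XXX,XXX POLYMERS LIMITED, 3455",
          "YYY,XXX REP LIMITED,999,XXX"),
  b = c("A", "X", "T")
)

# Negative lookarounds: "not preceded / not followed by a non-comma character".
neg <- grepl("(?<![^,])XXX(?![^,])", df$nms, perl = TRUE)

# Positive lookarounds: "preceded by start or comma, followed by end or comma".
pos <- grepl("(?<=^|,)XXX(?=$|,)", df$nms, perl = TRUE)

print(identical(neg, pos))  # both forms select the same rows
print(df[neg, ])            # the matching rows of the data frame
```

Both forms are zero-width, so they test the neighbouring characters without making the commas part of the match.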

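An equivalent TRE regex (R's default engine, so no perl=TRUE is needed) looks like this. It consumes the surrounding commas instead of only looking at them, which makes no difference to which rows grep returns: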

> grep("(?:^|,)XXX(?:$|,)",df$nms, value = FALSE)

Note that non-capturing groups are used to match either the start of the string or a comma (see (?:^|,) ) and either the end of the string or a comma (see (?:$|,) ).
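
Since the approach depends only on the delimiter, it can be wrapped in a small function. The sketch below is one possible way to do that; `has_item` and its `sep` argument are hypothetical names, and it assumes `sep` is a single ordinary character such as `,`, `;` or `|`. The item is quoted with \Q...\E so that regex metacharacters inside it are taken literally.

```r
# Hypothetical helper: TRUE where `item` is exactly one of the `sep`-separated fields.
# Assumes `sep` is one ordinary character (e.g. ",", ";" or "|") and that
# `item` does not itself contain the sequence \E.
has_item <- function(x, item, sep = ",") {
  pattern <- paste0("(?<![^", sep, "])\\Q", item, "\\E(?![^", sep, "])")
  grepl(pattern, x, perl = TRUE)
}

nms <- c("XXXCAP,XXX CAPITAL LIMITED",
         "XXX,XXX POLYMERS LIMITED, 3455",
         "YYY,XXX REP LIMITED,999,XXX")

print(which(has_item(nms, "XXX")))   # rows where XXX is a whole field
print(which(has_item(nms, "999")))   # rows where 999 is a whole field
print(which(has_item(nms, "3455")))  # no rows: the field is " 3455", with a leading space
```

Whitespace is treated as part of a field here, so " 3455" and "3455" are different items; trimming would need to be handled separately if that matters for the data.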

This question about an R regex for matching comma-separated parts in a column/vector comes from Stack Overflow: https://stackoverflow.com/questions/51466957/
