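Regular expression to extract all text after "--!!" in R dplyr

I am trying to use dplyr in R to extract the substring that follows a given string in a data frame column. The data frame is first filtered on certain values of the variable name, as in the example below. I want to store the result in a new variable called income_rent.

I am new to regular expressions. My attempt was: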

income_cashrent <- v18 %>% 
filter(str_detect(name, "B25122")) %>% 
mutate(income_rent = str_extract(label, "[^--!!]*$"))

However, the result I get is: Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) : Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

The first four rows of label (the column passed to str_extract) are:

Estimate!!Total
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100

The desired result is:

[not sure how to indicate an empty result here]
Less than $10,000
Less than $10,000!!With cash rent
Less than $10,000!!With cash rent!!Less than $100

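So far I have not been able to debug this, even after consulting other regex examples on Stack Overflow. Any guidance would be most welcome. Thanks in advance!

Best answer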

regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE))
# [[1]]
# character(0)
# [[2]]
# [1] "Less than $10,000"
# [[3]]
# [1] "Less than $10,000!!With cash rent"
# [[4]]
# [1] "Less than $10,000!!With cash rent!!Less than $100"

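If you unlist from here, you will notice that you "lose" the first entry, because its match is empty. I am not sure whether that is a problem for you.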

unlist(regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE)))
# [1] "Less than $10,000"                                
# [2] "Less than $10,000!!With cash rent"                
# [3] "Less than $10,000!!With cash rent!!Less than $100"

If that is a problem, then replace the empty matches with NA before unlisting:

vecout <- regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE))
unlist(replace(vecout, lengths(vecout) < 1, NA))
# [1] NA                                                 
# [2] "Less than $10,000"                                
# [3] "Less than $10,000!!With cash rent"                
# [4] "Less than $10,000!!With cash rent!!Less than $100"

(Or you can replace them with "" instead.)
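
As a variation on the approach above, regexpr (first match only) can stand in for gregexpr, since each string contains at most one match that runs to the end. The sketch below assumes shortened, made-up labels rather than the real data, and the names m, hit and out are only illustrative. It fills a vector of NA and overwrites the positions that matched, so no list handling is needed.

```r
# Shortened, hypothetical labels standing in for the real ones
vec <- c("Estimate!!Total",
         "Estimate!!Total!!Income --!!Less than $10,000",
         "Estimate!!Total!!Income --!!Less than $10,000!!With cash rent")

# regexpr() returns -1 for strings without a match
m <- regexpr("(?<=--!!).*", vec, perl = TRUE)
hit <- m != -1L

# Start with all NA, then fill in only the strings that matched;
# regmatches() with regexpr() data returns matched strings only
out <- rep(NA_character_, length(vec))
out[hit] <- regmatches(vec, m)
out
# [1] NA
# [2] "Less than $10,000"
# [3] "Less than $10,000!!With cash rent"
```

The result has the same length as the input, which is what mutate() needs.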

In a dplyr pipeline:

library(dplyr)

tibble(vec = c("Estimate!!Total",
               "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000",
               "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent",
               "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100")) %>%
  mutate(out = regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE)),
         out = replace(out, lengths(out) < 1, NA),  # empty matches become NA
         out = unlist(out))
# A tibble: 4 x 2
#   vec                                             out                           
#   <chr>                                           <chr>                         
# 1 Estimate!!Total                                 <NA>                          
# 2 Estimate!!Total!!Household income in the past ~ Less than $10,000             
# 3 Estimate!!Total!!Household income in the past ~ Less than $10,000!!With cash ~
# 4 Estimate!!Total!!Household income in the past ~ Less than $10,000!!With cash ~
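
Since the question already uses stringr, the same lookbehind can most likely be passed straight to str_extract, which returns NA when nothing matches. This is a minimal sketch, not part of the original answer: the tibble labels, its shortened strings, and the name result are assumptions made for illustration, while income_rent and label follow the question.

```r
library(dplyr)
library(stringr)

# Shortened, hypothetical labels standing in for the real ones
labels <- tibble(label = c(
  "Estimate!!Total",
  "Estimate!!Total!!Income --!!Less than $10,000",
  "Estimate!!Total!!Income --!!Less than $10,000!!With cash rent"
))

# (?<=--!!) asserts that "--!!" comes immediately before the match,
# and .* then takes everything up to the end of the string
result <- labels %>%
  mutate(income_rent = str_extract(label, "(?<=--!!).*"))

result$income_rent
# [1] NA
# [2] "Less than $10,000"
# [3] "Less than $10,000!!With cash rent"
```

The pattern in the question probably failed because "--" inside square brackets is treated by the ICU engine behind stringr as a set operator rather than as two literal hyphens. Outside a character class the hyphens are plain literals, so the lookbehind needs no escaping.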

Data:

vec <- c("Estimate!!Total",
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000",
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent",
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100")
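
A side note on the attempt in the question: even with a syntactically valid pattern, a negated character class excludes individual characters, not the sequence "--!!", so it would not give the desired result. The sketch below compares the two ideas in base R on one shortened, made-up label (x, class_match and lookbehind_match are illustrative names only).

```r
# Shortened, hypothetical label
x <- "Total --!!Less than $10,000!!With cash rent"

# Negated class: the trailing run that contains neither "-" nor "!"
class_match <- regmatches(x, regexpr("[^-!]*$", x))
class_match
# [1] "With cash rent"

# Lookbehind: everything after the literal sequence "--!!"
lookbehind_match <- regmatches(x, regexpr("(?<=--!!).*", x, perl = TRUE))
lookbehind_match
# [1] "Less than $10,000!!With cash rent"
```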

Source: a similar question on Stack Overflow about a regular expression to extract all text after "--!!" in R dplyr: https://stackoverflow.com/questions/61529474/
