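Regular expression to extract all text after "--!!" in R dplyr

Tags: r regex dplyr

I am using dplyr in R to extract the substring that follows a marker string in a data frame column. The data frame is first filtered to certain values of the variable name, as in the example below, and I want to store the extracted text in a new variable called income_rent.

I am new to regular expressions. My attempt was: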

income_cashrent <- v18 %>% 
filter(str_detect(name, "B25122")) %>% 
mutate(income_rent = str_extract(label, "[^--!!]*$"))

However, the result I get is: Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) : Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

The first four values of label (the column I am extracting from) are:

Estimate!!Total
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100

The desired result is:

[not sure how to indicate an empty result here]
Less than $10,000
Less than $10,000!!With cash rent
Less than $10,000!!With cash rent!!Less than $100

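So far I have not been able to debug this by looking at other regex examples on Stack Overflow. Any guidance would be most welcome. Thanks in advance!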

Best answer

regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE))
# [[1]]
# character(0)
# [[2]]
# [1] "Less than $10,000"
# [[3]]
# [1] "Less than $10,000!!With cash rent"
# [[4]]
# [1] "Less than $10,000!!With cash rent!!Less than $100"
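
For context on why the pattern changed: a character class such as `[^...]` matches single characters from a set, not the literal sequence `--!!`, so it cannot express "everything after this marker". In addition, ICU regular expressions (which stringr uses through stringi) treat `--` inside a character class as a set-difference operator, which is likely what triggers the syntax error in the question. The lookbehind `(?<=--!!)` instead requires the match to start right after the literal marker. A minimal base R sketch of the difference, using a shortened, made-up label `x`:

```r
# Shortened sample label (an assumption for illustration, not the full census label).
x <- "Estimate!!Total!!Household income --!!Less than $10,000!!With cash rent"

# Character class: the trailing run of characters that are neither "!" nor "-".
# This stops at the last "!" and loses part of the wanted text.
class_match <- regmatches(x, regexpr("[^!-]*$", x, perl = TRUE))
class_match
# [1] "With cash rent"

# Lookbehind: everything that follows the first literal "--!!".
lookbehind_match <- regmatches(x, regexpr("(?<=--!!).*", x, perl = TRUE))
lookbehind_match
# [1] "Less than $10,000!!With cash rent"
```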

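If you unlist from here, you will notice that you "lose" the first entry, because it has no match. I am not sure whether that is a problem for you.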

unlist(regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE)))
# [1] "Less than $10,000"                                
# [2] "Less than $10,000!!With cash rent"                
# [3] "Less than $10,000!!With cash rent!!Less than $100"

If it is a problem, then replace the empty matches with NA before unlisting:

vecout <- regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE))
unlist(replace(vecout, lengths(vecout) < 1, NA))
# [1] NA                                                 
# [2] "Less than $10,000"                                
# [3] "Less than $10,000!!With cash rent"                
# [4] "Less than $10,000!!With cash rent!!Less than $100"

(Or you could replace with "" instead.)

In a dplyr pipeline:
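
Since each label contains at most one match here, the list step can also be avoided. The following is a minimal sketch, not part of the original answer: it uses `regexpr()` (first match only) and fills a pre-allocated NA vector, so the result keeps the same length as the input. The name `out` is just an illustrative choice.

```r
vec <- c("Estimate!!Total",
         "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000",
         "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent")

# regexpr() returns the start position of the first match, or -1 when there is none.
m <- regexpr("(?<=--!!).*", vec, perl = TRUE)

# Start with all NA, then fill in only the positions that matched.
out <- rep(NA_character_, length(vec))
out[m != -1] <- regmatches(vec, m)
out
# [1] NA
# [2] "Less than $10,000"
# [3] "Less than $10,000!!With cash rent"
```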

library(dplyr)

tibble(vec = c("Estimate!!Total",
               "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000",
               "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent",
               "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100")) %>%
  mutate(out = regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE)),
         out = replace(out, lengths(out) < 1, NA),  # rows with no match become NA
         out = unlist(out))
# A tibble: 4 x 2
#   vec                                             out                           
#   <chr>                                           <chr>                         
# 1 Estimate!!Total                                 <NA>                          
# 2 Estimate!!Total!!Household income in the past ~ Less than $10,000             
# 3 Estimate!!Total!!Household income in the past ~ Less than $10,000!!With cash ~
# 4 Estimate!!Total!!Household income in the past ~ Less than $10,000!!With cash ~
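
If you would rather stay with stringr, as in the question, the same lookbehind should also work with `str_extract()`, which returns NA for rows without a match, so no list handling is needed. This is a sketch that assumes the dplyr and stringr packages are installed; the tibble `labels` and its shortened values are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the filtered data frame from the question.
labels <- tibble(label = c("Estimate!!Total",
                           "Estimate!!Total!!Household income --!!Less than $10,000",
                           "Estimate!!Total!!Household income --!!Less than $10,000!!With cash rent"))

# str_extract() keeps one result per row and yields NA where "--!!" is absent.
result <- labels %>%
  mutate(income_rent = str_extract(label, "(?<=--!!).*"))

result$income_rent
# [1] NA
# [2] "Less than $10,000"
# [3] "Less than $10,000!!With cash rent"
```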

Data:

vec <- c("Estimate!!Total",
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000",
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent",
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100")

Source: the original question on Stack Overflow, "Regular expression to extract all text after "--!!" in R dplyr": https://stackoverflow.com/questions/61529474/
