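I am trying to use dplyr in R to extract the substring that follows a particular string in a data frame variable. The data frame is filtered down to certain instances of the variable name, as in the example below. I want to store the result in a new variable called income_rent.
I am new to regular expressions. My attempt is: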
income_cashrent <- v18 %>%
filter(str_detect(name, "B25122")) %>%
mutate(income_rent = str_extract(label, "[^--!!]*$"))
However, the result I get is:
Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) : Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
The first four rows of name are:
Estimate!!Total
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100
The desired result is:
[not sure how to indicate an empty result here]
Less than $10,000
Less than $10,000!!With cash rent
Less than $10,000!!With cash rent!!Less than $100
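So far I have not been able to debug it, even after consulting other regex examples on Stack Overflow. Any guidance would be most welcome. Thanks in advance!
Best answer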
regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE))
# [[1]]
# character(0)
# [[2]]
# [1] "Less than $10,000"
# [[3]]
# [1] "Less than $10,000!!With cash rent"
# [[4]]
# [1] "Less than $10,000!!With cash rent!!Less than $100"
If you unlist from here, you will notice that you "lose" the first entry; I am not sure whether that is a problem.
unlist(regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE)))
# [1] "Less than $10,000"
# [2] "Less than $10,000!!With cash rent"
# [3] "Less than $10,000!!With cash rent!!Less than $100"
If that is a problem, then:
vecout <- regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE))
unlist(replace(vecout, lengths(vecout) < 1, NA))
# [1] NA
# [2] "Less than $10,000"
# [3] "Less than $10,000!!With cash rent"
# [4] "Less than $10,000!!With cash rent!!Less than $100"
(Or you could replace with "" instead.)
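Since only one match per string is wanted here, a variation on the approach above is to use regexpr instead of gregexpr, which avoids the intermediate list altogether. This is only a sketch in base R: the vector is a shortened copy of the sample data, and the name out is chosen for illustration.

```r
# Shortened copy of the sample labels from the question.
vec <- c(
  "Estimate!!Total",
  "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000",
  "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent"
)

# regexpr() finds at most one match per element; -1 marks "no match".
m <- regexpr("(?<=--!!).*", vec, perl = TRUE)

# Start from an all-NA vector, then fill in only the elements that matched.
# regmatches() with a regexpr() result returns just the matched pieces.
out <- rep(NA_character_, length(vec))
out[m != -1] <- regmatches(vec, m)
out
```

The result has the same length as the input, with NA where "--!!" does not occur, so it can be assigned directly to a data frame column.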
In a dplyr pipeline:
library(dplyr)

tibble(vec = c("Estimate!!Total",
               "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000",
               "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent",
               "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100")) %>%
  # Extract into a list column, turn empty matches into NA, then flatten.
  mutate(out = regmatches(vec, gregexpr("(?<=--!!).*", vec, perl = TRUE)),
         out = replace(out, lengths(out) < 1, NA),
         out = unlist(out))
# A tibble: 4 x 2
#   vec                                             out
#   <chr>                                           <chr>
# 1 Estimate!!Total                                 <NA>
# 2 Estimate!!Total!!Household income in the past ~ Less than $10,000
# 3 Estimate!!Total!!Household income in the past ~ Less than $10,000!!With cash ~
# 4 Estimate!!Total!!Household income in the past ~ Less than $10,000!!With cash ~
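For readers who would rather stay with stringr, as in the question, the same lookbehind should also work with str_extract, which returns NA when nothing matches. The pattern in the question, "[^--!!]*$", most likely fails because stringr uses the ICU regex engine, where -- inside a character class appears to be read as a set-difference operator rather than as two literal hyphens. The sketch below assumes dplyr and stringr are installed; the tibble labels is a hypothetical stand-in for the filtered data frame.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the filtered data frame in the question.
labels <- tibble(label = c(
  "Estimate!!Total",
  "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent"
))

result <- labels %>%
  # Lookbehind: keep everything that follows the literal "--!!".
  # str_extract() yields NA for rows where the pattern does not occur.
  mutate(income_rent = str_extract(label, "(?<=--!!).*"))

result$income_rent
```

Because str_extract already returns one value per row, there is no list column to flatten and no separate NA replacement step.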
Data:
vec <- c("Estimate!!Total",
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000",
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent",
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100")
A similar question about a regex for extracting all text after "--!!" in R dplyr can be found on Stack Overflow: https://stackoverflow.com/questions/61529474/