I am trying to extract a certain number of words after a specific string.
library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
For example, to extract the 4 words after "source", I learned from another question to use this code:
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))
This works well. However, if I try to select 8 words, it does not recognize "/" and returns NA for the first string.
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){8}'))
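The question is: is there a regular expression that includes special characters (or bypasses them), so that I can still extract the desired words? I noticed that the same thing happens with other characters (such as -) or with double spaces.
The expected output for 8 words should look like this: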
from animal origin as Vitamin A / all-trans-Retinol
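It does not matter whether / and - are counted as words, because I can always raise the quantifier (in my case, I do not mind extracting more than I need).
Thanks
Best answer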
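You can rely on \S, the shorthand character class that matches any non-whitespace character:
(?<=source:\s)\S+(?:\s+\S+){3,7}\b
See the regex demo. Details:
- (?<=source:\s) - a position immediately preceded by source: and a whitespace character
- \S+ - one or more non-whitespace characters
- (?:\s+\S+){3,7} - three to seven occurrences of 1+ whitespace characters followed by 1+ non-whitespace characters
- \b - a word boundary.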
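To illustrate why switching from \w to \S matters, here is a minimal sketch comparing the two patterns. The sample string s is made up for illustration and is not part of the question's data.

```r
library(stringr)

# Hypothetical sample string containing "/" and a hyphenated token
s <- "source: Vitamin A / all-trans-Retinol and more text here"

# \w+ cannot match "/", so four consecutive "word + whitespace" groups
# are never found right after "source: " and the result is NA
w_result <- str_extract(s, "(?<=source:\\s)(\\w+,?\\s){4}")
w_result
# NA

# \S+ matches any run of non-whitespace characters, so "/" and
# "all-trans-Retinol" each count as one token
s_result <- str_extract(s, "(?<=source:\\s)\\S+(?:\\s+\\S+){3}\\b")
s_result
# "Vitamin A / all-trans-Retinol"
```

Because tokens are separated with \s+, runs of two or more spaces between words are handled as well.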
See the R demo online:
library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
stringr::str_extract(x$end, "(?<=source:\\s)\\S+(?:\\s+\\S+){3,7}\\b")
Output:
[1] "from animal origin as Vitamin A / all-trans-Retinol"
[2] "Eggs, liver, certain fish species such as sardines"
[3] "Leafy green vegetables such as spinach; egg yolks"
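If the number of words needs to vary, the pattern can be built from a parameter. The helper below is only a sketch: extract_after is a hypothetical name (it is not part of stringr), and it assumes the marker contains no regex metacharacters. Unlike the {3,7} quantifier above, it uses {0,n-1}, so it returns up to n tokens and accepts as few as one.

```r
library(stringr)

# Hypothetical helper: extract up to `n` whitespace-separated tokens
# that follow `marker` and a single whitespace character
extract_after <- function(text, marker = "source:", n = 8) {
  stopifnot(n >= 1)
  # First token, then between 0 and n - 1 further tokens,
  # with the match ending at a word boundary
  pattern <- sprintf("(?<=%s\\s)\\S+(?:\\s+\\S+){0,%d}\\b", marker, n - 1)
  str_extract(text, pattern)
}

four <- extract_after(
  "source: Leafy green vegetables such as spinach; egg yolks; liver",
  n = 4
)
four
# "Leafy green vegetables such"
```

The trailing \b trims punctuation such as ; or , from the end of the match, which is why a token like "spinach;" would come back as "spinach" when it is the last one extracted.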
Regarding r - extracting a certain number of words or special characters after a string in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63927507/