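r - Using the match reset token \K with stringr functions

While answering the question "Creating a dataframe with text from a website", I ran into a strange case that I cannot explain.

Suppose the following lines have been copied to the clipboard: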

Leading Men (Average American male: 5 feet 9.5 inches)

Dolph Lundgren — 6 feet 5 inches
John Cleese — 6 feet 5 inches

Leading Ladies (Average American female: 5 feet 4 inches)

Uma Thurman — 6 feet 0 inches
Brooke Shields — 6 feet 0 inches

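I posted the solution below, which extracts the gender from each header line and fills it down into the rows that follow. The problem is that it extracts the word "Leading" along with the gender. I expected \K (the match reset token) to get rid of it, but that does not work.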

web.lines <- read.delim("clipboard", header = FALSE) # reading data from clipboard

library(tidyverse)

web.lines %>% 
  mutate(gender = str_extract(V1, "Leading\\s+\\b(\\w+)\\b")) %>%
  fill(gender , .direction = "down") %>% 
  group_by(gender) %>% 
  slice(-1) %>% # removing the headers
  separate(V1, into = c("Name", "Height"), sep = " — ") 

#> # A tibble: 4 x 3
#> # Groups:   gender [2]
#>   Name           Height          gender        
#>   <chr>          <chr>           <chr>         
#> 1 Uma Thurman    6 feet 0 inches Leading Ladies
#> 2 Brooke Shields 6 feet 0 inches Leading Ladies
#> 3 Dolph Lundgren 6 feet 5 inches Leading Men   
#> 4 John Cleese    6 feet 5 inches Leading Men

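What I tried is Leading\\s+\\K\\w+, which seems to work in the demo at https://regex101.com/r/pYaW7a/1 but not with str_extract.

Best answer

You do not need \K in stringr regex functions, which do not support it anyway (see the ICU regex syntax documentation), because you have the str_match / str_match_all functions.

The \K match reset operator is supported by the PCRE, Perl, Onigmo, Python PyPi regex, and Boost regex libraries. It is therefore also available in base R regex functions through the perl=TRUE argument, where it omits the text matched before the current position from the result. The same effect can be achieved with a capturing group. The problem with str_extract and str_extract_all is that they do not keep captured substrings in their output, whereas str_match / str_match_all do.

See the R demo: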
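For comparison, here is a minimal base R sketch of the \K approach with perl=TRUE. The vector x and the variable names are made up for illustration; they stand in for the V1 column of the question's data.

```r
# Hypothetical sample input mirroring the question's lines
x <- c("Leading Men (Average American male: 5 feet 9.5 inches)",
       "Dolph Lundgren \u2014 6 feet 5 inches",
       "Leading Ladies (Average American female: 5 feet 4 inches)")

# \K discards "Leading" and the whitespace from the reported match;
# perl = TRUE switches base R to the PCRE engine, which supports \K
m <- regexpr("Leading\\s+\\K\\w+", x, perl = TRUE)

# regmatches() drops non-matching elements, so fill a vector of NAs
# to keep the result aligned with the input
gender <- rep(NA_character_, length(x))
gender[m != -1] <- regmatches(x, m)

print(gender)
```

This keeps one value per input line, so the result could be used inside mutate() in the same way as the str_match() call.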

web.lines %>% 
  mutate(gender = str_match(V1, "Leading\\s+(\\w+)")[,2]) %>%
  fill(gender , .direction = "down") %>% 
  group_by(gender) %>% 
  slice(-1) %>% # removing the headers
  separate(V1, into = c("Name", "Height"), sep = " — ")

Output:

# A tibble: 4 x 3
# Groups:   gender [2]
  Name           Height          gender
  <chr>          <chr>           <chr> 
1 Uma Thurman    6 feet 0 inches Ladies
2 Brooke Shields 6 feet 0 inches Ladies
3 Dolph Lundgren 6 feet 5 inches Men   
4 John Cleese    6 feet 5 inches Men

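Here, str_match(V1, "Leading\\s+(\\w+)")[,2] matches the word Leading followed by one or more whitespace characters, captures the one or more word characters that come next, and returns only the captured value by taking column [,2] of the result.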
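To make the [,2] indexing concrete, the following small sketch contrasts the two functions on a hypothetical two-element vector x (not part of the original answer). It assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical sample input: one header line and one data line
x <- c("Leading Men (Average American male: 5 feet 9.5 inches)",
       "Dolph Lundgren \u2014 6 feet 5 inches")

# str_extract() returns the whole match, so "Leading" is included
whole <- str_extract(x, "Leading\\s+(\\w+)")

# str_match() returns a character matrix: column 1 holds the whole match,
# column 2 holds the first capturing group; non-matching rows are NA
m <- str_match(x, "Leading\\s+(\\w+)")
captured <- m[, 2]

print(whole)
print(captured)
```

The NA for the non-matching line is what lets fill() propagate the gender downwards in the pipeline.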

Note that the word boundaries in the original pattern are redundant: there is already an implicit word boundary between the whitespace and the first word character, and because \w+ is greedy it always stops at a word boundary, so the \b after it is implied as well.

Regarding "r - Using the match reset token \K with stringr functions", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64466784/
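A quick way to check this, assuming stringr is installed and using a hypothetical sample string x, is to compare the capture with and without the \b anchors:

```r
library(stringr)

# Hypothetical sample header line
x <- "Leading Ladies (Average American female: 5 feet 4 inches)"

# Pattern from the question, with explicit word boundaries
with_b <- str_match(x, "Leading\\s+\\b(\\w+)\\b")[, 2]

# Pattern from the answer, without them
without_b <- str_match(x, "Leading\\s+(\\w+)")[, 2]

# Both patterns capture the same text
print(identical(with_b, without_b))
```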