r - stringr: extract all matches from strings in a data.frame column when both the data frame and the vector of search strings are very large (> 10k)

Tags: r stringr

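Edit: I have a data frame in which column 1 holds the ids of some texts and column 2 holds the text itself as a string. I have a set of multiple words, and the task is to have stringr count how many times each word occurs in each text. The words are to be supplied as fixed strings, not as regular expressions.
Three questions stand out:
(1) How do I supply a vector of multiple words as a fixed (non-regex) pattern?
(2) How do I append the results to the data frame?
(3) How do I do this for very large data?
A previous answer by user @akrun addressed points (1) and (2), but (3) remains a problem. Here is a reproducible example.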

## create a very large data.frame with the text column to be analyzed
doc_number <- c()
doc_text <- c()

for(i in 1:60000){

# generate many random strings mentioning 'proposals'
doc_number[i] <- paste0("doc_",i)
set.seed(i+3)
doc_text[i] <- paste0("This is about proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")",
                      " and about proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")")

}
docs_example_df <- data.frame(doc_number, doc_text)

head(docs_example_df) # resulting df has 'doc_text' column which mentions proposals
> head(docs_example_df)
  doc_number                                                    doc_text
1      doc_1 This is about proposal (6623/k) and about proposal (3866/c)
2      doc_2 This is about proposal (3254/k) and about proposal (2832/u)
3      doc_3 This is about proposal (7964/j) and about proposal (1940/n)
4      doc_4 This is about proposal (8582/g) and about proposal (3753/o)
5      doc_5 This is about proposal (4254/b) and about proposal (5686/l)
6      doc_6 This is about proposal (2588/f) and about proposal (9786/c)


# create a very large vector of 'proposals' I want to extract from doc_text
my_proposals <- c()

for(i in 1:20000){

  set.seed(i+8)
  my_proposals[i] <- paste0("proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")")

}

head(my_proposals) # long list of 'proposals' I wish to locate
> head(my_proposals)
[1] "proposal (2588/f)" "proposal (1490/i)" "proposal (2785/b)" "proposal (5545/z)" "proposal (6988/j)" "proposal (1264/i)"

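@akrun's previous answer (see below) recommended several solutions that work for small data.frames. But with objects of more than 20k elements like these, the functions either hang or throw an error, for example:
Problem with mutate() input matches. x Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
So, in short: how do I apply a very long vector of search strings to a data.frame that is also very long, and store the extracted matches in a list column of the data.frame?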

Thanks, everyone.

Best answer

We can paste the keywords together into a single pattern and wrap it in regex() instead of fixed(). dplyr 1.0.0 introduced several new functions, one of which is across().

library(dplyr) #1.0.0
library(stringr)
test_df %>%
  mutate(matches = str_extract_all(text,
                pattern = regex(str_c(keywords, collapse = "|"))))
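The snippets in this answer rely on test_df and keywords, which are not defined anywhere on this page. A minimal reconstruction, inferred from the printed outputs further down (the exact original definitions are an assumption), might look like this:

```r
library(stringr)

# Reconstructed from the printed results in this answer; the original
# definitions are not shown on this page, so treat these as an assumption.
test_df <- data.frame(
  doc_id = c("doc1", "doc2", "doc3", "doc4", "doc5"),
  text = c(
    "This text refers to water",
    "This text refers to water and alcohol",
    "This text refers to alcoolh",
    "This text refers to gasoline and more gasoline",
    "This text refers to (h2o)"
  ),
  stringsAsFactors = FALSE
)
keywords <- c("water", "alcohol", "gasoline", "(h2o)")

# One alternation pattern built from all keywords, interpreted as a regex
pattern <- regex(str_c(keywords, collapse = "|"))
matches <- str_extract_all(test_df$text, pattern)
```

Because the pattern is a regex, the parentheses in "(h2o)" act as a group, so the match extracted from doc5 is "h2o" without the parentheses, and doc3 gets an empty result.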

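If we need the final expected output, then after creating the list column matches, unnest() it to expand the rows, get the counts with count(), and use pivot_wider() to reshape to 'wide' format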
library(tidyr)
test_df %>%
   mutate(matches = str_extract_all(text, pattern = regex(str_c(keywords, collapse = "|")))) %>% 
   unnest(c(matches)) %>% 
   count(across(doc_id:matches)) %>% 
   pivot_wider(names_from = matches, values_from = n, values_fill = list(n = 0))
# A tibble: 4 x 6
#  doc_id text                                           water alcohol gasoline   h2o
#  <chr>  <chr>                                          <int>   <int>    <int> <int>
#1 doc1   This text refers to water                          1       0        0     0
#2 doc2   This text refers to water and alcohol              1       1        0     0
#3 doc4   This text refers to gasoline and more gasoline     0       0        2     0
#4 doc5   This text refers to (h2o)                          0       0        0     1
If we have dplyr < 1.0.0, then instead of across() just name the columns in count()
... %>%
count(doc_id, text, matches)
... %>%
or convert the column names to symbols and splice them into count()
 ... %>%
   count(!!! rlang::syms(names(.)))
... %>%

 

In the approach above, 'doc3' is dropped because it has no matches. To keep it, specify keep_empty = TRUE in unnest()
test_df %>%
    mutate(matches = str_extract_all(text, 
          pattern = regex(str_c(keywords, collapse = "|")))) %>% 
    unnest(c(matches), keep_empty = TRUE) %>% 
    count(across(doc_id:matches)) %>% 
    mutate(n = replace(n, is.na(matches), 0)) %>% 
    pivot_wider(names_from = matches, values_from = n, values_fill = list(n = 0)) %>%
    select(-`NA`)
# A tibble: 5 x 6
#  doc_id text                                           water alcohol gasoline   h2o
#  <chr>  <chr>                                          <dbl>   <dbl>    <dbl> <dbl>
#1 doc1   This text refers to water                          1       0        0     0
#2 doc2   This text refers to water and alcohol              1       1        0     0
#3 doc3   This text refers to alcoolh                        0       0        0     0
#4 doc4   This text refers to gasoline and more gasoline     0       0        2     0
#5 doc5   This text refers to (h2o)                          0       0        0     1
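One caveat with the pasted alternation: any regex metacharacters inside the keywords, such as the parentheses in the question's "proposal (2588/f)", are interpreted as regex syntax rather than literal text. A hedged sketch of one workaround is to backslash-escape the keywords before collapsing them. The helper escape_regex below is hypothetical (it is not part of stringr), and the sample text is made up in the style of the question's data:

```r
library(stringr)

text <- "This is about proposal (2588/f) and about proposal (9786/c)"
keywords <- c("proposal (2588/f)", "proposal (1490/i)")

# Hypothetical helper: put a backslash in front of regex metacharacters
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

# Unescaped: "(2588/f)" is a group, so the pattern looks for "proposal 2588/f"
raw_pattern <- str_c(keywords, collapse = "|")
raw_hits <- str_extract_all(text, regex(raw_pattern))[[1]]

# Escaped: the parentheses are matched literally
escaped_pattern <- str_c(escape_regex(keywords), collapse = "|")
escaped_hits <- str_extract_all(text, regex(escaped_pattern))[[1]]
```

Here raw_hits is empty while escaped_hits contains the keyword with its parentheses. Whether a single alternation of 20k escaped keywords is fast enough on 60k rows is something to measure rather than assume.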

Besides the approaches above, a simpler option is to use str_count()
library(purrr)
map_dfc(set_names(keywords, keywords), ~ 
      str_count(test_df$text, .x)) %>% 
   bind_cols(test_df, .)
#  doc_id                                           text water alcohol gasoline (h2o)
#1   doc1                      This text refers to water     1       0        0     0
#2   doc2          This text refers to water and alcohol     1       1        0     0
#3   doc3                    This text refers to alcoolh     0       0        0     0
#4   doc4 This text refers to gasoline and more gasoline     0       0        2     0
#5   doc5                      This text refers to (h2o)     0       0        0     1
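Since the question asks for the words to be treated as fixed strings, it may help to see how str_count() behaves with and without fixed(). This is a small sketch on made-up input (the two-element text vector is an assumption for illustration), not a change to the approach above:

```r
library(stringr)

text <- c("This text refers to (h2o)", "This text refers to h2o")
keywords <- c("water", "(h2o)")

# Default: each keyword is a regex, so "(h2o)" is a group that matches "h2o"
regex_counts <- sapply(keywords, function(k) str_count(text, k))

# fixed(): each keyword is compared literally, parentheses included
fixed_counts <- sapply(keywords, function(k) str_count(text, fixed(k)))
```

Both results are matrices with one column per keyword. With the default regex interpretation the "(h2o)" column counts a match in both texts; with fixed() only the first text, which actually contains the parentheses, is counted.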

Or use base R
test_df[keywords] <-  lapply(keywords, function(x) 
        lengths(regmatches(test_df$text, gregexpr(x, test_df$text))))

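Note that although str_extract() is vectorized over pattern, that vectorization is element-wise: it applies when pattern has the same length as the column, in which case each string is matched against its corresponding pattern element.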
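For the very large case in the question, a different technique from the ones above may be worth trying: because every search string there has the same shape, one generic regex can pull out all candidates, and a plain %in% lookup can then keep only the wanted ones. This is only a sketch on small stand-in data; the object names docs and wanted, the generic pattern, and the sample values are assumptions, and the approach only applies when the search strings share a common form.

```r
library(stringr)

# Small stand-ins for the question's 60k-row data frame and 20k-element vector
docs <- data.frame(
  doc_number = c("doc_1", "doc_2", "doc_3"),
  doc_text = c(
    "This is about proposal (6623/k) and about proposal (3866/c)",
    "This is about proposal (3254/k) and about proposal (2832/u)",
    "This is about proposal (7964/j) and about proposal (1940/n)"
  ),
  stringsAsFactors = FALSE
)
wanted <- c("proposal (3866/c)", "proposal (3254/k)", "proposal (1111/a)")

# Step 1: a single short regex; the literal parentheses are escaped
generic <- "proposal \\(\\d{4}/[a-z]\\)"
candidates <- str_extract_all(docs$doc_text, generic)

# Step 2: keep only the candidates that appear in the wanted vector
kept <- lapply(candidates, function(x) x[x %in% wanted])

# Store a count and a collapsed string per document
docs$n_matches <- lengths(kept)
docs$matches <- vapply(kept, paste, character(1), collapse = "; ")
```

The design choice here is to keep the regex tiny and move the 20k-way comparison into %in%, so the pattern never has to contain the full keyword list.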

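Source: a similar question on Stack Overflow, "r - stringr: extract all matches from strings in a data.frame column when both the data frame and the vector of search strings are very large (> 10k)": https://stackoverflow.com/questions/62841189/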
