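Edit: I have a data frame in which column 1 holds an id for each text and column 2 holds the text itself as a string. I also have a set of multiple words, and the task is to use stringr
to count how many times each word occurs in each text. The words are to be supplied as fixed strings, not as regular expressions.
Three questions stand out:
(1) How do I supply a vector of multiple words as a fixed (non-regex) pattern?
(2) How do I append the results to the data frame?
(3) How do I do this for very large data?
An earlier answer by user @akrun covered points (1) and (2), but (3) is still a problem. Here is a reproducible example.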
## create a very large data.frame with the text column to be analyzed
doc_number <- c()
doc_text <- c()
for(i in 1:60000){
# generate many random strings mentioning 'proposals'
doc_number[i] <- paste0("doc_",i)
set.seed(i+3)
doc_text[i] <- paste0("This is about proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")",
" and about proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")")
}
docs_example_df <- data.frame(doc_number, doc_text)
head(docs_example_df) # resulting df has 'doc_text' column which mentions proposals
> head(docs_example_df)
doc_number doc_text
1 doc_1 This is about proposal (6623/k) and about proposal (3866/c)
2 doc_2 This is about proposal (3254/k) and about proposal (2832/u)
3 doc_3 This is about proposal (7964/j) and about proposal (1940/n)
4 doc_4 This is about proposal (8582/g) and about proposal (3753/o)
5 doc_5 This is about proposal (4254/b) and about proposal (5686/l)
6 doc_6 This is about proposal (2588/f) and about proposal (9786/c)
# create a very large vector of 'proposals' I want to extract from doc_text
my_proposals <- c()
for(i in 1:20000){
set.seed(i+8)
my_proposals[i] <- paste0("proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")")
}
head(my_proposals) # long list of 'proposals' I wish to locate
> head(my_proposals)
[1] "proposal (2588/f)" "proposal (1490/i)" "proposal (2785/b)" "proposal (5545/z)" "proposal (6988/j)" "proposal (1264/i)"
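@akrun's previous answer (see below) recommends several solutions that work for a small data.frame. With objects of more than 20k elements like these, however, the functions either hang or throw an error, for example: Problem with mutate() input matches. x Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
So, in short: how do I apply a very long vector of search strings to a data.frame that is also very long, and store the extracted matches in a list column of that data.frame? Thanks, everyone.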
Best answer
We can paste the keywords together into a single pattern and wrap it in regex instead of fixed. dplyr 1.0.0 introduced a number of new functions, one of which is across.
library(dplyr) #1.0.0
library(stringr)
test_df %>%
mutate(matches = str_extract_all(text,
pattern = regex(str_c(keywords, collapse = "|"))))
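One caveat with a collapsed pattern is that regex metacharacters inside a keyword, such as parentheses, are read as regex syntax rather than as literal text. A minimal sketch, assuming the keywords never contain the sequence \E, is to quote each keyword with \Q...\E before collapsing. The keywords and text vectors here are hypothetical stand-ins, not the data from the question.

```r
library(stringr)

# Hypothetical stand-ins for the keywords and the text column
keywords <- c("water", "alcohol", "gasoline", "(h2o)")
text <- c("This text refers to water and alcohol",
          "This text refers to (h2o)")

# \Q...\E makes the regex engine treat everything in between literally,
# so "(h2o)" matches the parentheses instead of opening a capture group
pattern <- str_c("\\Q", keywords, "\\E", collapse = "|")

matches <- str_extract_all(text, regex(pattern))
```

With this quoting, the second element of matches keeps the parentheses around h2o instead of returning only the text inside them.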
If we need the final expected output, then after creating the list column matches, unnest it to expand the rows, get the count, and use pivot_wider to reshape it to 'wide' format:
library(tidyr)
test_df %>%
mutate(matches = str_extract_all(text, pattern = regex(str_c(keywords, collapse = "|")))) %>%
unnest(c(matches)) %>%
count(across(doc_id:matches)) %>%
pivot_wider(names_from = matches, values_from = n, values_fill = list(n = 0))
# A tibble: 4 x 6
# doc_id text water alcohol gasoline h2o
# <chr> <chr> <int> <int> <int> <int>
#1 doc1 This text refers to water 1 0 0 0
#2 doc2 This text refers to water and alcohol 1 1 0 0
#3 doc4 This text refers to gasoline and more gasoline 0 0 2 0
#4 doc5 This text refers to (h2o) 0 0 0 1
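If we have dplyr < 1.0.0, then instead of across, just specify the column names in count:
... %>%
count(doc_id, text, matches)
... %>%
or convert the column names to symbols and splice them in:
... %>%
count(!!! rlang::syms(names(.)))
... %>%
In the approach above, 'doc3' is dropped because it has no matches. If we need to keep it, specify keep_empty = TRUE in unnest: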
test_df %>%
mutate(matches = str_extract_all(text,
pattern = regex(str_c(keywords, collapse = "|")))) %>%
unnest(c(matches), keep_empty = TRUE) %>%
count(across(doc_id:matches)) %>%
mutate(n = replace(n, is.na(matches), 0)) %>%
pivot_wider(names_from = matches, values_from = n, values_fill = list(n = 0)) %>%
select(-`NA`)
# A tibble: 5 x 6
# doc_id text water alcohol gasoline h2o
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 doc1 This text refers to water 1 0 0 0
#2 doc2 This text refers to water and alcohol 1 1 0 0
#3 doc3 This text refers to alcoolh 0 0 0 0
#4 doc4 This text refers to gasoline and more gasoline 0 0 2 0
#5 doc5 This text refers to (h2o) 0 0 0 1
Besides the approaches above, a simpler option is to use str_count:
library(purrr)
map_dfc(set_names(keywords, keywords), ~
str_count(test_df$text, .x)) %>%
bind_cols(test_df, .)
# doc_id text water alcohol gasoline (h2o)
#1 doc1 This text refers to water 1 0 0 0
#2 doc2 This text refers to water and alcohol 1 1 0 0
#3 doc3 This text refers to alcoolh 0 0 0 0
#4 doc4 This text refers to gasoline and more gasoline 0 0 2 0
#5 doc5 This text refers to (h2o) 0 0 0 1
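Since the question asks for fixed rather than regex matching, the same str_count call can be given a fixed() pattern. The following is only a sketch: test_df and keywords are hypothetical stand-ins, because the original objects are not shown here.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical stand-ins for test_df and keywords
test_df <- data.frame(
  doc_id = c("doc1", "doc5"),
  text = c("This text refers to water", "This text refers to (h2o)")
)
keywords <- c("water", "(h2o)")

# fixed() compares the keyword literally, so "(" and ")" need no escaping
counts <- map_dfc(set_names(keywords, keywords),
                  ~ str_count(test_df$text, fixed(.x)))

# One count column per keyword, appended to the original data frame
result <- bind_cols(test_df, counts)
```

Literal matching also skips regex compilation, which may help when there are many keywords, although this still makes one pass over the text column per keyword.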
Or use base R:
test_df[keywords] <- lapply(keywords, function(x)
lengths(regmatches(test_df$text, gregexpr(x, test_df$text))))
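The base R version can likewise be switched to literal matching by passing fixed = TRUE to gregexpr. This is a sketch with hypothetical stand-in data, not the original test_df.

```r
# Hypothetical stand-ins for test_df and keywords
test_df <- data.frame(
  doc_id = c("doc1", "doc5"),
  text = c("This text refers to water", "This text refers to (h2o)")
)
keywords <- c("water", "(h2o)")

# fixed = TRUE makes gregexpr treat each keyword as a literal string;
# regmatches returns character(0) for rows with no match, so lengths gives 0
test_df[keywords] <- lapply(keywords, function(x)
  lengths(regmatches(test_df$text, gregexpr(x, test_df$text, fixed = TRUE))))
```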
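Although str_extract is vectorized over pattern, that applies when the pattern vector has the same length as the column, in which case each pattern is matched against the corresponding element.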
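For question (3), one option that avoids building a single regex with 20,000 alternatives is to extract every proposal-shaped token with one generic pattern and then keep only the tokens that appear in the search vector. This is a different technique from the approaches above, and it assumes every search string has the same "proposal (dddd/x)" shape as in the reproducible example. The names docs, wanted and shape are hypothetical.

```r
library(stringr)

# Small stand-ins for docs_example_df and my_proposals from the question
docs <- data.frame(
  doc_number = c("doc_1", "doc_2", "doc_3"),
  doc_text = c(
    "This is about proposal (6623/k) and about proposal (3866/c)",
    "This is about proposal (3254/k) and about proposal (2832/u)",
    "This is about proposal (7964/j) and about proposal (1940/n)"
  )
)
wanted <- c("proposal (3866/c)", "proposal (3254/k)", "proposal (2832/u)")

# One generic pattern describing the shape of any proposal (an assumption)
shape <- "proposal \\(\\d{4}/[a-z]\\)"

# Step 1: pull out every candidate token, one character vector per row
candidates <- str_extract_all(docs$doc_text, shape)

# Step 2: keep only candidates that are in the search vector;
# %in% does a hashed lookup rather than testing 20,000 patterns per row
docs$matches <- lapply(candidates, function(x) x[x %in% wanted])
docs$n_matches <- lengths(docs$matches)
```

The result is a list column matches holding the found proposals per document, plus a count per row. If the search strings do not share a common shape, this shortcut does not apply.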
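For "r - Stringr: extract all matches from strings in a data.frame column, where the data frame and the vector of search strings are very large (> 10k)", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62841189/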