I have a dataframe with speech data, like this:
df <- data.frame(
id = 1:12,
partcl = c("yeah yeah yeah absolutely", "well you know it 's", "oh well yeah that's right",
"yeah I mean well oh", "well erm well Peter will be there", "well yeah well",
"yes yes yes totally", "yeah yeah yeah yeah", "well well I did n't do it",
"er well yeah that 's true", "oh hey where 's he gone?", "er"
))
and a vector of keywords, parts:
parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")
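What I need to do is filter the rows that contain at least two distinct parts values. What I can already do is filter the rows that contain at least two parts values, regardless of whether they are distinct or identical: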
library(dplyr)
library(stringr)  # provides str_count()
df %>%
  filter(
    str_count(partcl, paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")) > 1
  )
id partcl
1 1 yeah yeah yeah absolutely
2 3 oh well yeah that's right
3 4 yeah I mean well oh
4 5 well erm well Peter will be there
5 6 well yeah well
6 7 yes yes yes totally
7 8 yeah yeah yeah yeah
8 9 well well I did n't do it
9 10 er well yeah that 's true
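The over-selection comes from str_count() counting every match, including repeats of the same keyword. A minimal sketch on two of the strings above (the names pattern, repeated and mixed are introduced here for illustration only):

```r
library(stringr)

parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")

# One alternation pattern with a word boundary on each side
pattern <- paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")

# A single keyword repeated three times is counted three times
repeated <- str_count("yeah yeah yeah absolutely", pattern)

# Two distinct keywords, one of them repeated, also gives three
mixed <- str_count("well yeah well", pattern)
```

Both strings get the same count, so the count alone cannot tell them apart.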
How can I require that the matched parts be distinct, so that the result looks like this:
id partcl
1 3 oh well yeah that's right
2 4 yeah I mean well oh
3 6 well yeah well
4 10 er well yeah that 's true
Best answer
This may help: extract the keywords with str_extract_all, then check with n_distinct to keep only the rows that have more than one unique keyword.
library(dplyr)
library(stringr)
library(purrr)
df %>%
filter(map_lgl(str_extract_all(partcl,
paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")),
~ n_distinct(.x) > 1))
-output
id partcl
1 3 oh well yeah that's right
2 4 yeah I mean well oh
3 6 well yeah well
4 10 er well yeah that 's true
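To make the intermediate steps visible, here is a minimal sketch of the same idea on three of the strings, split into separate steps. The variable names (pattern, matches, n_unique, keep) are illustrative and not part of the answer, and base R's unique() and lengths() stand in for n_distinct here, assuming the goal is simply the number of different keywords per row:

```r
library(stringr)

partcl <- c("yeah yeah yeah absolutely", "oh well yeah that's right", "er")
parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")

# One alternation pattern with a word boundary on each side
pattern <- paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")

# str_extract_all() returns a list: one character vector of matches per string
matches <- str_extract_all(partcl, pattern)

# Number of different keywords found in each string
n_unique <- lengths(lapply(matches, unique))

# Keep only the strings that contain more than one different keyword
keep <- n_unique > 1
```

Here only the second string is kept: the first contains a single keyword repeated three times, and the third contains just one keyword.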
Regarding "r - Filter rows on the condition that at least two different keywords must be present", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/71053784/