R tidytext: remove words if they are part of a relevant bigram, keep them otherwise

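Using unnest_tokens, I want to create a tidy text tibble that combines two different kinds of tokens: single words and bigrams. The reasoning is that sometimes single words are the more sensible unit of study, and sometimes higher-order n-grams are.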

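If two words show up as a "sensible" bigram, I want to store the bigram instead of the individual words. If the same words show up in a different context (i.e. not as that bigram), I want to keep them as single words.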

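In the silly example below, "of the" is an important bigram. I therefore want to remove the single words "of" and "the" wherever they actually appear in the text as "of the". If "of" and "the" appear in other combinations, I want to keep them as single words.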

library(janeaustenr)
library(data.table)
library(dplyr)
library(tidytext)
library(tidyr)


# make unigrams
tide <- unnest_tokens(austen_books() , output = word, input = text )
# make bigrams
tide2 <- unnest_tokens(austen_books(), output = bigrams, input = text, token = "ngrams", n = 2)

# keep only most frequent bigrams (in reality use more sensible metric)
keepbigram <- names( sort( table(tide2$bigrams), decreasing = TRUE)[1:10]  )
keepbigram
tide2 <- tide2[tide2$bigrams %in% keepbigram,]

# this removes all unigrams which show up in relevant bigrams
biwords <- unlist( strsplit( keepbigram, " ") )
biwords
tide[!(tide$word %in% biwords),]

# want to keep biwords in tide if they are not part of bigrams

Best answer

You can do this by replacing the bigrams you are interested in with compound words in the text before tokenization (i.e. before unnest_tokens):

keepbigram_new <- stringi::stri_replace_all_regex(keepbigram, "\\s+", "_")
keepbigram_new
#>  [1] "of_the"   "to_be"    "in_the"   "it_was"   "i_am"     "she_had" 
#>  [7] "of_her"   "to_the"   "she_was"  "had_been"

Using _ instead of whitespace is common practice. stringi::stri_replace_all_regex is much the same as gsub from base R or str_replace_all from stringr, but a bit faster and with more features.
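As a quick check of that equivalence, here is a minimal sketch on a made-up vector (bigrams is hypothetical and stands in for keepbigram). For a simple pattern like this one, base R's gsub and stringi should give the same result:

```r
library(stringi)

# Hypothetical stand-in for keepbigram; the third entry has two spaces on purpose
bigrams <- c("of the", "to be", "had  been")

# Collapse each run of whitespace into a single underscore, two ways
with_gsub    <- gsub("\\s+", "_", bigrams)
with_stringi <- stri_replace_all_regex(bigrams, "\\s+", "_")

identical(with_gsub, with_stringi)
```

The practical difference shows up in the next step, where stringi's vectorize_all = FALSE applies a whole vector of patterns to each string in a single call.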

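Now replace the bigrams in the text with these new compound words before tokenizing. I use the word-boundary regular expression (\\b) at the start and end of each bigram so that, for example, "of them" is not captured by accident: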

topwords <- austen_books() %>% 
  mutate(text = stringi::stri_replace_all_regex(text, paste0("\\b", keepbigram, "\\b"), keepbigram_new, vectorize_all = FALSE)) %>% 
  unnest_tokens(output = word, input = text) %>% 
  count(word, sort = TRUE) %>% 
  mutate(rank = seq_along(word))
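To see what the word boundaries do in isolation, here is a small sketch on made-up strings (x is hypothetical and not taken from the Austen data). Without \\b, the pattern also fires inside "of them" and "proof the":

```r
library(stringi)

# Hypothetical test strings
x <- c("the roof of the house", "one of them", "proof the dough")

# With word boundaries: only the standalone bigram "of the" is replaced
with_boundary <- stri_replace_all_regex(x, "\\bof the\\b", "of_the")

# Without word boundaries: partial matches are replaced as well
without_boundary <- stri_replace_all_regex(x, "of the", "of_the")

with_boundary
without_boundary
```

With the boundaries in place, only the first string should change.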

Looking at the most frequent words, the first bigram now appears at rank 40:

topwords %>% 
  slice(1:4, 39:41)
#> # A tibble: 7 x 3
#>   word       n  rank
#>   <chr>  <int> <int>
#> 1 and    22515     1
#> 2 to     20152     2
#> 3 the    20072     3
#> 4 of     16984     4
#> 5 they    2983    39
#> 6 of_the  2833    40
#> 7 from    2795    41
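To confirm that words outside the chosen bigrams survive as unigrams, the same pipeline can be run on a tiny made-up corpus. This is a sketch under assumptions: toy, keep and keep_new are hypothetical names standing in for austen_books(), keepbigram and keepbigram_new.

```r
library(dplyr)
library(tidytext)
library(stringi)

# Hypothetical three-line corpus
toy <- tibble(text = c("the end of the road", "one of them left", "the cat"))

keep     <- c("of the")
keep_new <- stri_replace_all_regex(keep, "\\s+", "_")

tokens <- toy %>%
  # Replace each kept bigram with its underscore compound before tokenizing
  mutate(text = stri_replace_all_regex(text, paste0("\\b", keep, "\\b"),
                                       keep_new, vectorize_all = FALSE)) %>%
  unnest_tokens(output = word, input = text)

tokens$word
```

Here "of the" in the first line should become the single token of_the, while the "of" in "one of them" and the other occurrences of "the" remain separate words.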

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/60721175/
