r - tm_map and stopwords fail to remove unwanted words from a corpus created in R

I have a result data frame containing the following data:

                   word freq
credit           credit  790
account         account  451
xxxxxxxx       xxxxxxxx  430
report           report  405
information information  368
reporting     reporting  345
consumer       consumer  331
accounts       accounts  300
debt               debt  170
company         company  152
xxxxxx           xxxxxx  147

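I want to do the following:

Remove every word that consists of two or more x's, such as xx, xxx, xxxx, and so on. These words can appear in lowercase or uppercase, so they have to be converted to lowercase first and then removed.

I am using tm_map to remove stopwords, but it does not seem to work: the unwanted words shown above still end up in the data frame.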

library(tm)

myCorpus <- Corpus(VectorSource(df$txt))
myStopwords <- c(stopwords('english'),"xxx", "xxxx", "xxxxx", 
                 "XXX", "XXXX", "XXXXX", "xxxx", "xxx", "xx", "xxxxxxxx",
                 "xxxxxxxx", "XXXXXX", "xxxxxx", "XXXXXXX", "xxxxxxx", "XXXXXXXX", "xxxxxxxx")
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

myTdm <- as.matrix(TermDocumentMatrix(myCorpus))
v <- sort(rowSums(myTdm), decreasing=TRUE)
FreqMat <- data.frame(word = names(v), freq=v, row.names = F)
head(FreqMat, 10)

The code above does not remove the unwanted words from the corpus.

Is there another way to solve this?

Best answer

One possibility involving dplyr and stringr:

library(dplyr)
library(stringr)

# Lowercase each word, then keep only words with at most one "x"
df %>%
 mutate(word = tolower(word)) %>%
 filter(str_count(word, fixed("x")) <= 1)

         word freq
1      credit  790
2     account  451
3      report  405
4 information  368
5   reporting  345
6    consumer  331
7    accounts  300
8        debt  170
9     company  152
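
To make the counting rule in the filter above concrete, here is a minimal base R sketch of the same idea. The sample words and the helper name `count_x` are hypothetical and are not part of the original data. Note that the rule counts every "x" in a word, not only consecutive ones, so an ordinary word with two x's (here "xerox") is dropped as well.

```r
# Hypothetical sample words, not taken from the original data frame
words <- c("credit", "xxxxxxxx", "XXXXXX", "xerox", "tax")

# Count literal "x" characters per word after lowercasing
# (hypothetical helper mirroring the count used in the filter above)
count_x <- function(w) {
  w <- tolower(w)
  lengths(regmatches(w, gregexpr("x", w, fixed = TRUE)))
}

x_counts <- count_x(words)

# Keep only the words that contain at most one "x"
kept <- words[x_counts <= 1]

print(data.frame(word = words, x_count = x_counts))
print(kept)
```

If words such as "xerox" should survive, the rule would need to look at the shape of the word rather than the raw count.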

Or a base R possibility using similar logic:

df[sapply(df[, 1], 
          function(x) length(grepRaw("x", tolower(x), all = TRUE, fixed = TRUE)) <= 1, 
          USE.NAMES = FALSE), ]
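
As a variation on the base R approach, the sketch below drops only the words made up entirely of x's, using a case-insensitive regular expression instead of a character count. This assumes the unwanted entries are pure masking tokens such as "xxxx"; the pattern `mask_pattern` and the small data frame (with one token uppercased to show the case handling) are illustrative assumptions, not part of the original answer.

```r
# Small stand-in for the frequency table from the question
# (one token is uppercased here purely for illustration)
df <- data.frame(
  word = c("credit", "account", "xxxxxxxx", "report", "XXXXXX"),
  freq = c(790, 451, 430, 405, 147),
  stringsAsFactors = FALSE
)

# Hypothetical pattern: a word consisting only of two or more x's
mask_pattern <- "^x{2,}$"

# ignore.case = TRUE handles "XXXXXX" without lowercasing first
is_mask <- grepl(mask_pattern, df$word, ignore.case = TRUE)

cleaned <- df[!is_mask, ]
cleaned$word <- tolower(cleaned$word)
print(cleaned)
```

Anchoring the pattern with `^` and `$` keeps ordinary words that merely contain an "x"; whether that is the desired behavior depends on the data.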

Regarding "r - tm_map and stopwords fail to remove unwanted words from a corpus created in R", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57656674/
