Warnings in R stemCompletion and error in TermDocumentMatrix

I followed the instructions from here.

In slide no. 9, tolower has an issue in package tm 0.6 and above, so I used

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

That part duplicates an existing Stack Overflow question, but I still get an error when running stemCompletion:

myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)

I followed this instruction and converted both variables, myCorpus and myCorpusCopy, to PlainTextDocument:

corpus <- tm_map(corpus, PlainTextDocument)

After that I was able to run

myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)

but I get 50 warnings:

There were 50 or more warnings (use warnings() to see the first 50)
warnings()

All 50 warnings are the same:

1: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
3: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
4: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
5: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
6: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
7: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
8: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
9: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
10: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used

I tried ignoring the warnings and creating a TermDocumentMatrix():

tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

and I get the error:

Error: inherits(doc, "TextDocument") is not TRUE

Best answer

Here is how to create a stemmed term-document matrix and then complete the stemmed tokens afterwards:

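library(tm)  # stemming = TRUE also needs the SnowballC package to be installed
txt <- " was followed the instruction from here In slide no. 9 tolower has issue in package tm 0.6 and above I have used "
myCorpus <- Corpus(VectorSource(txt))
# Wrap tolower in content_transformer(), as tm 0.6 and above expects
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# Stem the terms while building the matrix instead of stemming the corpus itself
tdm <- TermDocumentMatrix(myCorpus, control = list(stemming = TRUE))
# Complete each stem (one string per row name), using the corpus as the dictionary
cbind(stems = rownames(tdm), completed = stemCompletion(rownames(tdm), myCorpus))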
#          stems      completed    
# 0.6      "0.6"      "0.6"        
# abov     "abov"     "above"      
# and      "and"      "and"        
# follow   "follow"   "followed"   
# from     "from"     "from"       
# has      "has"      "has"        
# have     "have"     "have"       
# here     "here"     "here"       
# instruct "instruct" "instruction"
# issu     "issu"     "issue"      
# no.      "no."      "no."        
# packag   "packag"   "package"    
# slide    "slide"    "slide"      
# the      "the"      "the"        
# tolow    "tolow"    "tolower"    
# use      "use"      "used"       
# was      "was"      "was"
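
As a rough illustration of why the two symptoms in the question may arise, the sketch below uses base R only and does not call tm. It assumes, based on the warning text, that the lookup inside stemCompletion has the form grep(sprintf("^%s", w), dictionary, value = TRUE), and that after the tm_map(myCorpus, stemCompletion, ...) call the corpus elements were no longer TextDocument objects. The names `dictionary`, `stems`, `plain` and `doc` are hypothetical and chosen for this example.

```r
# Base R only; no tm functions are called here.
dictionary <- c("instruction", "issue", "package")  # hypothetical dictionary words
stems <- c("instruct", "issu")                      # hypothetical stems

# Passing several stems at once gives grep() a pattern of length 2.
# Capture the resulting condition message instead of printing it.
msg <- tryCatch(
  {
    grep(sprintf("^%s", stems), dictionary, value = TRUE)
    NA_character_
  },
  warning = function(w) conditionMessage(w),
  error = function(e) conditionMessage(e)
)

# Looking up one stem at a time keeps the pattern at length 1,
# which is what happens when a vector of single stems is completed.
completed <- vapply(
  stems,
  function(w) grep(sprintf("^%s", w), dictionary, value = TRUE)[1],
  character(1)
)

# A bare character string does not inherit from "TextDocument" ...
plain <- "was followed the instruction"
# ... while an object carrying that class does (class set by hand, for illustration only).
doc <- structure(list(content = plain),
                 class = c("PlainTextDocument", "TextDocument"))
is_doc <- c(plain = inherits(plain, "TextDocument"),
            doc = inherits(doc, "TextDocument"))

print(msg)
print(completed)
print(is_doc)
```

If that reading is right, it would explain why the answer completes rownames(tdm), a vector of single stems, rather than mapping stemCompletion over whole documents.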

This question, "Warnings in R stemCompletion and error in TermDocumentMatrix", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/30321770/
