Warnings in R stemCompletion and error in TermDocumentMatrix

Tags: r, text-mining, tm

I was following the instructions from here.

In slide no. 9, tolower has an issue in package tm 0.6 and above, so I used

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

That part duplicates this Stack Overflow question, but I still get an error when running stemCompletion

myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)

I followed this instruction to convert the variables myCorpus and myCorpusCopy to PlainTextDocument

corpus <- tm_map(corpus, PlainTextDocument)

After that I was able to execute

myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)

but I get 50 warnings

There were 50 or more warnings (use warnings() to see the first 50)
warnings()

All 50 warnings look like this:

1: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
3: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
4: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
5: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
6: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
7: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
8: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
9: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used
10: In grep(sprintf("^%s", w), dictionary, value = TRUE) : argument 'pattern' has length > 1 and only the first element will be used

I tried ignoring the warnings and creating a TermDocumentMatrix()

tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

and I get the error:

Error: inherits(doc, "TextDocument") is not TRUE

Best answer

Here is a way to create a stemmed term-document matrix and then complete the stemmed tokens afterwards:

txt <- " was followed the instruction from here In slide no. 9 tolower has issue in package tm 0.6 and above I have used "
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
tdm <- TermDocumentMatrix(myCorpus, control = list(stemming = TRUE)) 
cbind(stems = rownames(tdm), completed = stemCompletion(rownames(tdm), myCorpus))  
#          stems      completed    
# 0.6      "0.6"      "0.6"        
# abov     "abov"     "above"      
# and      "and"      "and"        
# follow   "follow"   "followed"   
# from     "from"     "from"       
# has      "has"      "has"        
# have     "have"     "have"       
# here     "here"     "here"       
# instruct "instruct" "instruction"
# issu     "issu"     "issue"      
# no.      "no."      "no."        
# packag   "packag"   "package"    
# slide    "slide"    "slide"      
# the      "the"      "the"        
# tolow    "tolow"    "tolower"    
# use      "use"      "used"       
# was      "was"      "was"    
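
If you do want completed words back inside the text rather than in a lookup table, here is a minimal sketch of one possible workaround. The warning in the question suggests stemCompletion was handed something other than a vector of single-word stems, so the idea is to split the text into words first and complete each one. The helper name complete_text and the small dictionary vector are made up for illustration, and type = "first" is only picked here to keep the result predictable.

```r
library(tm)

# Hypothetical helper (not part of tm): complete every stem in one string.
complete_text <- function(text, dictionary) {
  # stemCompletion works on a character vector of stems, so split into words first
  stems <- unlist(strsplit(text, "\\s+"))
  stems <- stems[nzchar(stems)]
  completed <- stemCompletion(stems, dictionary = dictionary, type = "first")
  # Fall back to the original stem when the dictionary has no completion for it
  missing <- is.na(completed) | !nzchar(completed)
  completed[missing] <- stems[missing]
  paste(unname(completed), collapse = " ")
}

# Assumed toy dictionary; in the question this role is played by myCorpusCopy
dictionary <- c("followed", "instruction", "package")
result <- complete_text("follow instruct packag tm", dictionary)
print(result)
```

To use something like this on a corpus you would still need to wrap it (for example with content_transformer) so that tm_map gets text documents back, which is what the inherits(doc, "TextDocument") error is complaining about.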

Source: a similar question on Stack Overflow about warnings in R stemCompletion and the error in TermDocumentMatrix: https://stackoverflow.com/questions/30321770/
