r - The ten highest column values in an R data frame

Tags: r max tm keyword-search

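I am currently working on a project that extracts keywords from blocks of text. Below is a sample of the first three entries in the initial list. (Sorry for the length.)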

descriptest<-c("Columbia University is one of the world's most important centers of research and at the same time a distinctive and distinguished learning environment for undergraduates and graduate students in many scholarly and professional fields. The University recognizes the importance of its location in New York City and seeks to link its research and teaching to the vast resources of a great metropolis. It seeks to attract a diverse and international faculty and student body, to support research and teaching on global issues, and to create academic relationships with many countries and regions. It expects all areas of the university to advance knowledge and learning at the highest level and to convey the products of its efforts to the world.", 
"", "UMass Amherst was born in 1863 as a land-grant agricultural college set on 310 rural acres with four faculty members, four wooden buildings, 56 students and a curriculum combining modern farming, science, technical courses, and liberal arts.\n\nOver time, the curriculum, facilities, and student body outgrew the institution's original mission. In 1892 the first female student enrolled and graduate degrees were authorized. By 1931, to reflect a broader curriculum, \"Mass Aggie\" had become Massachusetts State College. In 1947, \"Mass State\" became the University of Massachusetts at Amherst.\n\nImmediately after World War II, the university experienced rapid growth in facilities, programs and enrollment, with 4000 students in 1954. By 1964, undergraduate enrollment jumped to 10,500, as Baby Boomers came of age. The turbulent political environment also brought a \"sit-in\" to the newly constructed Whitmore Administration Building. By the end of the decade, the completion of Southwest Residential Complex, the Alumni Stadium and the establishment of many new academic departments gave UMass Amherst much of its modern stature.\n\nIn the 1970s continued growth gave rise to a shuttle bus service on campus as well as several important architectural additions: the Murray D. Lincoln Campus Center, with a hotel, office space, fine dining restaurant, campus store and passageway to a multi-level parking garage; the W.E.B. Du Bois Library, named \"tallest library in the world\" upon its completion in 1973; and the Fine Arts Center, with performance space for world-class music, dance and theater.\n\nThe next two decades saw the emergence of UMass Amherst as a major research facility with the construction of the Lederle Graduate Research Center and the Conte National Polymer Research Center. Other programs excelled as well. In 1996 UMass Basketball became Atlantic 10 Conference champs and went to the NCAA Final Four. Before the millennium, both the William D. 
Mullins Center, a multi-purpose sports and convocation facility, and the Paul Robsham Visitors Center bustled with activity, welcoming thousands of visitors to the campus each year.\n\nUMass Amherst entered the 21st century as the flagship campus of the state's five-campus University system, and enrollment of nearly 24,000 students and a national and international reputation for excellence.")

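I would like to do this in R with the tm package, because a DocumentTermMatrix gives a clean matrix to work with on large data. I also use TfIdf weighting to rank the keywords of each entry against the keywords of the whole corpus.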

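I am stuck. I can use max.col to get the top keyword, but my matrix has several maxima with equal values, and I do not want only the maximum: what I really want is the ten highest values in each row. Here is the sample code: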

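library(RWeka)
library(tm)
library(koRpus)
library(RKEA)
library(corpora)
library(wordcloud)
library(plyr)

# Map a column index of z2 to the corresponding column (term) name.
changeindextoname <- function(indexnumber) {
  name <- colnames(z2[indexnumber])
  return(name)
}

# Apply the cleaning steps to a corpus and return the cleaned corpus.
removestuff <- function(d) {
  d <- tm_map(d, content_transformer(tolower))  # wrapper needed by current tm versions
  d <- tm_map(d, removePunctuation)
  d <- tm_map(d, removeNumbers)
  d <- tm_map(d, stripWhitespace)
  d <- tm_map(d, skipWords)  # skipWords is not defined in the question's code
  d <- tm_map(d, removeWords, stopwords('english'))
  return(d)
}

descripcorpora <- Corpus(VectorSource(descriptest))
descripcorpora <- removestuff(descripcorpora)
ddtm <- DocumentTermMatrix(descripcorpora, control = list(weighting = weightTfIdf, stopwords = TRUE))
# as.matrix() gives the full matrix; inspect() is meant for printing.
f2 <- as.data.frame(as.matrix(ddtm))
z2 <- f2
z3 <- max.col(z2)  # one column index per row; ties are broken at random by default
dfwithmax <- cbind(z3, z2)
dfwithmax$word <- lapply(dfwithmax$z3, changeindextoname)
finaldf <- subset(dfwithmax, select = c("z3", "word", "learning", "tallest", "center", "seeks", "teaching"))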

The final df looks like this:

finaldf
  z3     word   learning     tallest     center      seeks   teaching
1 106 learning 0.04953008 0.000000000 0.00000000 0.04953008 0.04953008
2 183  tallest 0.00000000 0.000000000 0.00000000 0.00000000 0.00000000
3  35   center 0.00000000 0.007204375 0.04322625 0.00000000 0.00000000

This approach seems to work, but it cannot account for the fact that in row 1 "seeks", "learning" and "teaching" all have the same value.

Also, when all columns are zero (as in row 2), max.col still returns an index. How can I get rid of that?

I am trying to avoid looping over the columns or rows, because the matrix is very large and that would take a long time.

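I would greatly appreciate any suggestions or ideas on how to write a function that can be applied to, or looped over, each column and add the results to a list, so that I can then apply the changeindextoname function and get back the column names in a list.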

Thanks in advance!

Best answer

The five highest values for each document:

apply(as.matrix(ddtm),1,function(x) 
         colnames(as.matrix(ddtm))[order(x,decreasing=TRUE)[1:5]])

  Docs
       1            2            3        
  [1,] "teaching"   "york"       "center" 
  [2,] "seeks"      "year"       "umass"  
  [3,] "learning"   "worlds"     "campus" 
  [4,] "university" "worldclass" "amherst"
  [5,] "research"   "world"      "four"   
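
Note that document 2 is an empty string, so the five terms listed for it all have weight zero, and that `[1:5]` cuts ties at the boundary arbitrarily. As a minimal sketch of one way to handle both points, the helper below (`top_terms` is a hypothetical name, and the toy matrix `m` with made-up weights stands in for `as.matrix(ddtm)`) drops zero weights and keeps every term tied with the n-th highest value, so a row may return more than n terms or none at all:

```r
# Toy weight matrix standing in for as.matrix(ddtm); the values are made up.
m <- matrix(
  c(0.5, 0.5, 0.2, 0.0,
    0.0, 0.0, 0.0, 0.0,
    0.1, 0.0, 0.4, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("learning", "seeks", "center", "umass"))
)

# Hypothetical helper: names of the n largest strictly positive entries of
# a named numeric vector. Ties with the n-th value are all kept.
top_terms <- function(x, n = 10) {
  pos <- x[x > 0]                          # drop zero weights
  if (length(pos) == 0) return(character(0))
  ranked <- pos[order(-pos, names(pos))]   # highest first, ties alphabetical
  cutoff <- ranked[min(n, length(ranked))] # value of the n-th best term
  names(ranked)[ranked >= cutoff]
}

# One character vector per document; a list is used because the lengths differ.
top_list <- lapply(seq_len(nrow(m)), function(i) top_terms(m[i, ], n = 2))
names(top_list) <- rownames(m)
```

With real data, `m <- as.matrix(ddtm)` would replace the toy matrix and `n = 10` would give the top ten; the dense conversion is the memory-heavy step, so this assumes the matrix fits in memory.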

Note that you did not provide the code for skipWords, so I used this:

skipWords <- function(x) removeWords(x, c(stopwords("english")))

Also see tm_reduce for rewriting the removestuff function:

removestuff <- tm_reduce(x, list(tolower, removePunctuation, ...))
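
Since the question says the matrix is very large, converting it with `as.matrix` may not be practical. tm stores a DocumentTermMatrix as a sparse triplet matrix (from the slam package) with components `i` (row index), `j` (column index) and `v` (value), so the top terms can also be read off the non-zero entries directly. The following is only a sketch under that assumption: `stm` is a plain list with made-up values that imitates the structure, and `top_terms_sparse` is a hypothetical helper, not part of tm.

```r
# Stand-in for a DocumentTermMatrix in triplet form; values are made up.
stm <- list(
  i = c(1L, 1L, 1L, 3L, 3L, 3L),        # document (row) index of each entry
  j = c(1L, 2L, 3L, 1L, 3L, 4L),        # term (column) index of each entry
  v = c(0.5, 0.5, 0.2, 0.1, 0.4, 0.3),  # weight of each entry
  nrow = 3L, ncol = 4L,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("learning", "seeks", "center", "umass"))
)

# Hypothetical helper: top-n terms per document without building a dense matrix.
top_terms_sparse <- function(x, n = 10) {
  doc_names  <- x$dimnames[[1]]
  term_names <- x$dimnames[[2]]
  nz <- which(x$v > 0)                   # weighting can leave stored zeros
  # Sort the entries once: by document, then descending weight, then column.
  o <- nz[order(x$i[nz], -x$v[nz], x$j[nz])]
  docs  <- factor(doc_names[x$i[o]], levels = doc_names)
  terms <- term_names[x$j[o]]
  # split() keeps the sorted order within each document; head() trims to n.
  lapply(split(terms, docs), head, n = n)
}

res <- top_terms_sparse(stm, n = 2)
```

Documents without any positive weight come back as empty character vectors. Unlike a tie-aware cutoff, `head` simply truncates at n, so ties at the boundary are resolved by column order here.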

For "r - The ten highest column values in an R data frame", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20747323/
