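This is a rewrite of the original post. I'm looking to eliminate a dependency on plyr.
I tried splicing tapply and lapply into my code. tapply works with one grouping variable (sex) but not with two (sex, adult). Dropping in the lapply suggestions does not return a list of words per grouping variable; it returns one big word list with the grouping variable at the top (so for person it returns a single word list rather than one word list per person).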
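One possible reading of the tapply behavior described above, sketched on made-up data (the names `words_by_row`, `sex`, and `adult` below are hypothetical stand-ins, not the real columns): with two grouping factors, tapply returns a two-dimensional array of list mode rather than a flat list, and empty combinations are left as NULL cells.

```r
# Hypothetical toy input: one character vector of words per row.
words_by_row <- list(c("a", "b"), "c", c("d", "e"))
sex   <- factor(c("m", "f", "m"))
adult <- factor(c(0, 0, 1))

# One grouping factor: a one-dimensional result, one cell per level of sex.
one <- tapply(words_by_row, sex, unlist)

# Two grouping factors: a sex-by-adult array of list mode.
two <- tapply(words_by_row, list(sex, adult), unlist)

dim(two)          # one dimension per grouping factor
two[["m", "0"]]   # words for sex == "m", adult == 0
two[["f", "1"]]   # empty combination, left as NULL
```

That shape is different from the flat, named list that dlply produces, which may be why the two-variable case looks like it fails.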
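I apologize for the length of this post, but without the actual function I'm working on it didn't seem possible to give you enough context to help me.
I'll post my attempts at changing the function based on your suggestions as an answer rather than here, to avoid adding to an already bloated post. Also, please don't comment on the extra user-defined functions unless it helps with the main problem. They are works in progress and are only here to show you the problem.
Correct output with plyr: http://pastebin.com/mr9FvjpF
The data frame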
DATA<-structure(list(person = structure(c(4L, 1L, 5L, 4L, 1L, 3L, 1L,
4L, 3L, 2L, 1L), .Label = c("greg", "researcher", "sally", "sam",
"teacher"), class = "factor"), sex = structure(c(2L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor"),
adult = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L), state = structure(c(2L,
7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L), .Label = c("Shall we move on? Good then.",
"Computer is fun. Not too fun.", "I distrust you.",
"How can we be certain?", "I am telling the truth!", "Im hungry. Lets eat. You already?",
"No its not, its ****.", "There is no way.", "What should we do?",
"What are you talking about?", "You liar, it stinks!"
), class = "factor"), code = structure(c(1L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 2L, 3L), .Label = c("K1", "K10", "K11",
"K2", "K3", "K4", "K5", "K6", "K7", "K8", "K9"), class = "factor")), .Names = c("person",
"sex", "adult", "state", "code"), row.names = c(NA, -11L), class = "data.frame")
#=====================
User-defined helper functions the main function depends on
Trim<-function (x) gsub("^\\s+|\\s+$", "", x)
bracketX <- function(text, bracket = 'all') {
    switch(bracket,
        square = sapply(text, function(x) gsub("\\[.+?\\]", "", x)),
        round  = sapply(text, function(x) gsub("\\(.+?\\)", "", x)),
        curly  = sapply(text, function(x) gsub("\\{.+?\\}", "", x)),
        all = {
            P1 <- sapply(text, function(x) gsub("\\[.+?\\]", "", x))
            P1 <- sapply(P1, function(x) gsub("\\(.+?\\)", "", x))
            sapply(P1, function(x) gsub("\\{.+?\\}", "", x))
        })
}
words <- function(x){as.vector(unlist(strsplit(x, " ")))}
word.split <- function(x) lapply(x, words)
strip <- function(x){
sentence <- gsub('[[:punct:]]', '', as.character(x))
sentence <- gsub('[[:cntrl:]]', '', sentence)
sentence <- gsub('\\d+', '', sentence)
Trim(tolower(sentence))
}
#=====================
The function of interest
textLISTER <- function(dataframe = DFwcweb, text.var = "dialogue", group.vars = "person") {
    require(plyr)
    DF <- dataframe
    # The word-list column is named dia2word, matching the name used in the answer
    DF$dia2word <- Trim(as.character(bracketX(dataframe[, text.var])))
    DF$dia2word <- as.vector(word.split(strip(DF$dia2word)))
    # I'd like to get rid of the plyr dependency in the line below.
    # Referring to the bare column name lets summarise see each group's rows.
    dlply(DF, c(group.vars), summarise, words = as.vector(unlist(dia2word)))
}
#=====================
Currently the code works with one or more grouping variables.
textLISTER(DATA, 'state', 'person')
textLISTER(DATA, 'state', c('sex','adult'))
Best answer
How about
d1 <- dlply(DF, .(sex, adult), summarise, words=as.vector(unlist(dia2word)))
d2 <- dlply(DF, .(person), summarise, words=as.vector(unlist(dia2word)))
ff <- function(x) {
u <- unlist(x)
data.frame(words=u,
row.names=seq(length(u)),
stringsAsFactors=FALSE)
}
d1B <- with(DF,lapply(split(dia2word,list(adult,sex)),ff))
all.equal(d1,d1B,check.attributes=FALSE) ## TRUE
d2B <- with(DF,lapply(split(dia2word,person),ff))
all.equal(d2,d2B,check.attributes=FALSE) ## TRUE
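The split/lapply pattern above can be tried in isolation. This is a minimal sketch on a made-up data frame (`toy` and its contents are hypothetical; only the column name `dia2word` is borrowed from the code above). It also uses `drop = TRUE`, which the code above does not need because every sex/adult combination occurs in DATA; without it, split keeps an empty element for combinations that never occur.

```r
# Hypothetical toy data with a list column holding one word vector per row.
toy <- data.frame(sex   = c("m", "m", "f", "m"),
                  adult = c(0L, 0L, 0L, 1L),
                  stringsAsFactors = TRUE)
toy$dia2word <- list(c("computer", "is", "fun"),
                     c("no", "its", "not"),
                     c("how", "can", "we"),
                     c("what", "should", "we", "do"))

# A list of grouping vectors is combined into one factor by split();
# drop = TRUE omits the empty adult == 1, sex == "f" combination.
groups <- split(toy$dia2word, list(toy$adult, toy$sex), drop = TRUE)

# Flatten each group's list of word vectors into one character vector.
word_lists <- lapply(groups, unlist, use.names = FALSE)

names(word_lists)
word_lists[["0.m"]]
```

The element names are built from the grouping values joined with a dot, with the first grouping vector varying fastest.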
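Edit: I didn't look closely at your code, but it seems your problem may have to do with specifying the components to split on as strings. Here is a variant that may work better inside your code.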
target <- "dia2word"
categ <- c("adult","sex")
d1C <- lapply(split(DF[[target]],lapply(categ,getElement,object=DF)),ff)
all.equal(d1,d1C,check.attributes=FALSE) ## all.equal compares two objects at a time
categ <- "person"
d2C <- lapply(split(DF[[target]],lapply(categ,getElement,object=DF)),ff)
all.equal(d2,d2C,check.attributes=FALSE) ## all.equal compares two objects at a time
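Putting the string-based variant into a function might look like the sketch below. This is an assumption about how it could be wired in, not the asker's final code: `textLISTER2` is a hypothetical name, and a simplified cleaning step (strip punctuation, lowercase, split on whitespace) stands in for the Trim/bracketX/strip helpers so the example runs on its own.

```r
# Hypothetical plyr-free variant of textLISTER using only base R.
textLISTER2 <- function(dataframe, text.var, group.vars) {
  # Simplified stand-in for the question's cleaning helpers.
  txt <- tolower(gsub("[[:punct:]]", "", as.character(dataframe[[text.var]])))
  word_col <- strsplit(trimws(txt), "\\s+")
  # Grouping columns are selected by name, so group.vars can be a
  # character vector of any length; drop = TRUE skips empty combinations.
  groups <- split(word_col, dataframe[group.vars], drop = TRUE)
  lapply(groups, function(x) {
    u <- unlist(x)
    data.frame(words = u, row.names = seq_along(u), stringsAsFactors = FALSE)
  })
}

# Made-up example data.
toy <- data.frame(person = c("sam", "greg", "sam"),
                  state  = c("Computer is fun.", "No its not.", "You liar!"),
                  stringsAsFactors = FALSE)
out <- textLISTER2(toy, "state", "person")
out$sam$words
```

With several grouping variables the element names are the group values joined by a dot, and the first variable in `group.vars` varies fastest, so the order of elements can differ from what dlply returns.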
This question about eliminating a plyr dependency in R comes from Stack Overflow: https://stackoverflow.com/questions/8541164/