Building on the question "More efficient means of creating a corpus and DTM", I have put together my own method for building a term-document matrix from a large corpus, which (I hope) does not require terms x documents amounts of memory.
# Builds a sparse TermDocumentMatrix without ever allocating a dense
# terms x documents matrix. Requires: tm, slam, dplyr (and the %>% pipe).
sparseTDM <- function(vc) {
  id      <- unlist(lapply(vc, function(x) x$meta$id))
  content <- unlist(lapply(vc, function(x) x$content))
  out <- strsplit(content, "\\s", perl = TRUE)
  names(out) <- id
  lev.terms <- sort(unique(unlist(out)))
  lev.docs  <- id
  # v1: for each document, the sorted integer indices of its terms
  v1 <- lapply(
    out,
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = lev.terms
  )
  # v2: the matching document index, repeated once per term occurrence
  v2 <- lapply(
    seq_along(v1),
    function(i, x, n) {
      rep(i, length(x[[i]]))
    },
    x = v1,
    n = names(v1)
  )
  # Count occurrences of each (term, document) pair
  stm <- data.frame(i = unlist(v1), j = unlist(v2)) %>%
    group_by(i, j) %>%
    tally() %>%
    ungroup()
  tmp <- simple_triplet_matrix(
    i = stm$i,
    j = stm$j,
    v = stm$n,
    nrow = length(lev.terms),
    ncol = length(lev.docs),
    dimnames = list(Terms = lev.terms, Docs = lev.docs)
  )
  as.TermDocumentMatrix(tmp, weighting = weightTf)
}
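A minimal usage sketch (the corpus contents and document count here are made up for illustration; it assumes the tm, slam, and dplyr packages are installed, and that each corpus element carries an id in its metadata, as `sparseTDM` expects):

```r
library(tm)     # VCorpus, as.TermDocumentMatrix, weightTf
library(slam)   # simple_triplet_matrix
library(dplyr)  # %>%, group_by, tally, ungroup

# A tiny volatile corpus; VectorSource assigns default ids to each document.
vc <- VCorpus(VectorSource(c("a b c", "b c d")))

tdm <- sparseTDM(vc)
inspect(tdm)
```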
It slows down when computing v1. After it had been running for 30 minutes I stopped it.

I prepared a small example:
b = paste0("string", 1:200000)
a = sample(b, 80)
microbenchmark(
  lapply(
    list(a = a),
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = b
  )
)
The result is:
Unit: milliseconds
expr min lq mean median uq max neval
... 25.80961 28.79981 31.59974 30.79836 33.02461 98.02512 100
id and content have 126522 elements and lev.terms has 155591 elements, so it seems I stopped the processing too early. Since I will eventually be working with ~6M documents I need to ask... Is there any way to speed up this fragment of code?
Best answer
For now I've sped it up by replacing

sort(as.integer(factor(x, levels = lev, ordered = TRUE)))

with

ind = which(lev %in% x)
cnt = as.integer(factor(x, levels = lev[ind], ordered = TRUE))
sort(ind[cnt])
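Why this is faster: `factor(x, levels = lev)` has to match every token against the full ~155k-level vocabulary, whereas `which(lev %in% x)` first restricts the levels to the handful of terms actually present in the document, builds the factor over that small subset, and then maps the small codes back to positions in the full vocabulary. A quick equivalence check (a sketch; `lev` and `x` are stand-ins for `lev.terms` and one document's tokens):

```r
lev <- paste0("string", 1:200000)   # full term vocabulary
x   <- sample(lev, 80)              # one document's tokens

# original expression
slow <- sort(as.integer(factor(x, levels = lev, ordered = TRUE)))

# sped-up replacement: factor over only the levels present in x,
# then translate the small codes back to full-vocabulary positions
ind <- which(lev %in% x)
cnt <- as.integer(factor(x, levels = lev[ind], ordered = TRUE))
fast <- sort(ind[cnt])

identical(slow, fast)  # TRUE
```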
Now the timings are:
expr min lq mean median uq max neval
... 5.248479 6.202161 6.892609 6.501382 7.313061 10.17205 100
A similar question on Stack Overflow, "R - slow sorting of an ordered factor": https://stackoverflow.com/questions/29463464/