r - How does tm interact with snow?

Tags: r tm snow

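The High-Performance Computing task view states that tm can use snow for parallel text mining (High-Performance and Parallel Computing with R). However, I have not found any examples showing how to do this, although I did find some discussion of parallel computing with tm (R/Finance 2012). Can anyone explain how tm interfaces with a cluster created by snow?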

Edit: see BenBarnes's comment below. Specifically:

According to ?tm_startCluster, that function looks for an MPI cluster (not a SOCK cluster) and "allow[s] 'tm' to use a cluster". Perhaps that would be an alternative to hadoop, since, given a few prerequisites, snow can set up an MPI cluster.

Best answer

An LMGTFY search using "r-project tm parallel" as the search strategy returned this as the third hit:

Distributed Text Mining with tm

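Copied directly from the slides:
Solution:
1. Distributed storage
Data set copied to the DFS ("DistributedCorpus")
Only meta information about the corpus remains in memory
2. Parallel computation
Computational operations (Map) on all elements in parallel
MapReduce paradigm
Workhorses: tm_map() and TermDocumentMatrix()
Processed documents (revisions) can be retrieved on demand.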

Implemented in a "plugin" package for tm: tm.plugin.dc.

#Distributed Text Mining in R 
> library("tm.plugin.dc") 
> dc <- DistributedCorpus(DirSource("Data/reuters"), 
                          list(reader = readReut21578XML) ) 
> dc <- as.DistributedCorpus(Reuters21578) 
> summary(dc) 
#A corpus with 21578 text documents 
#The metadata consists of 2 tag-value pairs and a data frame 
#Available tags are: 
#create_date creator 
#Available variables in the data frame are: 
#MetaID 
#--- Distributed Corpus --- 
#Available revisions: 
#20100417144823 
#Active revision: 20100417144823 
#DistributedCorpus: Storage 
#- Description: Local Disk Storage 
#- Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2 
#- Current chunk size [bytes]: 10485760 
> dc <- tm_map(dc, stemDocument)
> print(object.size(Reuters21578), units = "Mb") 
#109.5 Mb 
> dc 
#A corpus with 21578 text documents 
> dc_storage(dc) 
#DistributedCorpus: Storage 
#- Description: Local Disk Storage 
#- Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2 
#- Current chunk size [bytes]: 10485760 
> dc[[3]] 
#----------
#Texas Commerce Bancshares Inc's Texas 
#Commerce Bank-Houston said it filed an application with the 
#Comptroller of the Currency in an effort to create the largest 
#banking network in Harris County. 
#The bank said the network would link 31 banks having 
#13.5 billion dlrs in assets and 7.5 billion dlrs in deposits. 
#Reuter 
#---------
> print(object.size(dc), units = "Mb") 
# 0.6 Mb
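
The Map and Reduce steps named in the slides can be illustrated outside tm.plugin.dc. What follows is only a minimal sketch, not the plugin's implementation: it assumes the base parallel package (whose cluster functions are derived from snow), and the documents and the helper count_terms() are made up for illustration.

```r
library(parallel)  # ships with R; its cluster API is derived from snow

# Toy documents standing in for a corpus (hypothetical data).
docs <- c("Snow falls on the cluster",
          "The cluster mines text",
          "Text mining with snow")

# Map step: split one document into lowercase terms and count them.
count_terms <- function(doc) {
  terms <- strsplit(tolower(doc), "[^a-z]+")[[1]]
  table(terms[nzchar(terms)])
}

cl <- makeCluster(2)                         # two local worker processes
per_doc <- parLapply(cl, docs, count_terms)  # one table of counts per document
stopCluster(cl)

# Reduce step: merge the per-document counts into one term-frequency vector.
all_counts <- unlist(lapply(per_doc, function(tab) {
  setNames(as.vector(tab), names(tab))
}))
term_freq <- tapply(all_counts, names(all_counts), sum)
print(term_freq)
```

Each worker only ever sees the documents assigned to it, which is the property that lets the real plugin keep the corpus on distributed storage and hold just the metadata in memory.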

A further search using the terms tm, snow, parLapply ... produces this link:

With this code:
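library(snow)
# Start four local worker processes connected over sockets.
cl <- makeCluster(4, type="SOCK")

# Prompt before each new plot is drawn.
par(ask=TRUE)

# The task only sleeps; 'mat' is passed solely to add communication cost.
bigsleep <- function(sleeptime, mat) Sys.sleep(sleeptime)
bigmatrix <- matrix(0, 2000, 2000)
sleeptime <- rep(1, 100)

# clusterApply sends the matrix along with every one of the 100 tasks.
tm <- snow.time(clusterApply(cl, sleeptime, bigsleep, bigmatrix))
plot(tm)
cat(sprintf("Elapsed time for clusterApply: %f\n", tm$elapsed))

# parLapply splits the tasks into one chunk per worker, so the matrix is sent once per worker.
tm <- snow.time(parLapply(cl, sleeptime, bigsleep, bigmatrix))
plot(tm)
cat(sprintf("Elapsed time for parLapply: %f\n", tm$elapsed))

stopCluster(cl)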
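
To connect a cluster to tm itself, newer tm releases document tm_parLapply_engine() for registering a cluster object that tm_map() then uses. The sketch below assumes such a release is installed (check with packageVersion("tm")); it does not use the older tm_startCluster() named in the question, it creates the cluster with the base parallel package rather than snow, and the sample documents are invented.

```r
library(tm)        # assumes a tm version that provides tm_parLapply_engine()
library(parallel)

# Invented sample documents.
docs <- c("Snow falls quietly.", "Clusters run workers.", "Text mining in R.")
corpus <- VCorpus(VectorSource(docs))

cl <- makeCluster(2)          # two local worker processes
tm_parLapply_engine(cl)       # register the cluster with tm

# tm_map() now dispatches the transformation through the registered cluster.
corpus_lower <- tm_map(corpus, content_transformer(tolower))

tm_parLapply_engine(NULL)     # restore the default (no parallel engine)
stopCluster(cl)

print(content(corpus_lower[[1]]))
```

For a corpus this small the cost of starting workers outweighs any gain; the point is only to show where the cluster object is handed to tm.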

Regarding "r - How does tm interact with snow?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11092621/
