r-caret - caret does not run in parallel

Tags: r-caret, domc

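Whether caret actually runs in parallel depends on the versions of R, caret, and the doMC package, as described in "Parallelizing Caret code".

Is anyone working in a similar environment? What is the latest R version on which caret parallelization works correctly?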

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=C                  LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] caret_6.0-52    ggplot2_1.0.1   lattice_0.20-31 doMC_1.3.3      iterators_1.0.7 foreach_1.4.2   RStudioAMI_0.2 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1         magrittr_1.5        splines_3.2.1       MASS_7.3-41         munsell_0.4.2       colorspace_1.2-6   
 [7] minqa_1.2.4         car_2.1-0           stringr_1.0.0       plyr_1.8.3          tools_3.2.1         pbkrtest_0.4-2     
[13] nnet_7.3-9          grid_3.2.1          gtable_0.1.2        nlme_3.1-120        mgcv_1.8-6          quantreg_5.19      
[19] MatrixModels_0.4-1  gtools_3.5.0        lme4_1.1-9          digest_0.6.8        Matrix_1.2-0        nloptr_1.0.4       
[25] reshape2_1.4.1      codetools_0.2-11    stringi_0.5-5       BradleyTerry2_1.0-6 scales_0.3.0        stats4_3.2.1       
[31] SparseM_1.7         brglm_0.5-9         proto_0.3-10

Update 1: My code is as follows:

library(doMC) ; registerDoMC(cores=4)
library(caret)
classification_formula <- as.formula(paste("target" ,"~",
                                             paste(names(m_input_data)[!names(m_input_data)=='target'],collapse="+")))

CVfolds <- 2
CVreps  <- 5
ma_control <- trainControl(method = "repeatedcv",
                             number = CVfolds,
                             repeats = CVreps ,
                             returnResamp = "final" ,
                             classProbs = T,
                             summaryFunction = twoClassSummary,
                             allowParallel = TRUE,verboseIter = TRUE)
rf_tuneGrid <- expand.grid(mtry = seq(2, 32, length.out = 6))
rf <- train(classification_formula,
            data = m_input_data,
            method = "rf",
            metric = "ROC",
            trControl = ma_control,
            tuneGrid = rf_tuneGrid,
            ntree = 101)

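Update 2: When I run from the command line, only one core is working. When I run the same scripts from RStudio, parallelization does work, since I can see 4 processes in top. But after about a second the following error occurs: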

  Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : 
   attempt to set an attribute on NULL 
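
A quick way to tell whether foreach really has a parallel backend in a given session (RStudio or Rscript on the command line) is to ask it directly. The sketch below is only a diagnostic, not a fix for the error above. It assumes a Unix-like system with doMC installed; the choice of 2 cores and the name worker_pids are illustrative.

```r
library(doMC)      # doMC depends on foreach, iterators and parallel
library(foreach)

registerDoMC(cores = 2)

# What foreach believes is registered in this session
print(getDoParRegistered())   # is any backend registered?
print(getDoParName())         # name of the registered backend
print(getDoParWorkers())      # number of workers foreach will use

# Ask each task for the PID of the process that executed it
worker_pids <- foreach(i = 1:4, .combine = c) %dopar% Sys.getpid()

print(unique(worker_pids))    # PIDs of the forked workers
print(Sys.getpid())           # PID of the parent R session
```

If every PID returned by the loop equals the parent PID, the loop ran sequentially in the main session, which would match the "only one core" symptom.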

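Update 4:

Hi, the problem seems to have been a terminated R session. Each time I start the AWS instance I run the R code, and I now also refresh the R engine: every time I reload RStudio in the browser, I do Session -> Restart R. With that, it seems to run. I am now checking whether the same holds when running the scripts from the Ubuntu command line.

In general, it runs but does not finish. caret parallelizes at the data level, meaning it can process each resample in a separate process. But if each resample is still large (100,000 / 2 rows (number of folds = 2) x 2,000 features), this may be too much for a single processor to complete. Am I right?

I think the parallelism needs to be at the algorithm level, meaning that each algorithm would itself run on multiple cores. Is such an algorithm implementation available in caret?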
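
The sizes mentioned above can be turned into a rough memory estimate. The sketch below assumes all 2,000 features are stored as 8-byte doubles and that 4 workers are active, as in registerDoMC(cores = 4). The column types are not given in the question, so treat the result as an order-of-magnitude figure for the training matrix alone, before any internal copies made by the model.

```r
n_rows     <- 100000   # total samples, from the question
n_folds    <- 2        # CVfolds in the question
n_features <- 2000     # from the question
n_workers  <- 4        # registerDoMC(cores = 4) in the question

# With k-fold CV each model is trained on (k - 1) / k of the rows
train_rows <- n_rows * (n_folds - 1) / n_folds

# Assumption: every feature is a double (8 bytes per value)
bytes_train <- train_rows * n_features * 8

gib_per_worker <- bytes_train / 1024^3
gib_all        <- gib_per_worker * n_workers

cat(sprintf("training rows per resample: %d\n", as.integer(train_rows)))
cat(sprintf("training matrix per worker: %.2f GiB\n", gib_per_worker))
cat(sprintf("across %d workers:          %.2f GiB\n", n_workers, gib_all))
```

Comparing the last figure with the RAM of the instance gives a first indication of whether the workers are likely to start swapping.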
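
One way to picture parallelism at the algorithm level for a random forest, outside of caret, is to grow several smaller forests on separate workers and merge them. The sketch below is not how caret works internally. It assumes the randomForest and doMC packages are installed, uses the built-in iris data as a stand-in, and the split into 4 sub-forests of 25 trees is arbitrary.

```r
library(randomForest)
library(doMC)
library(foreach)

registerDoMC(cores = 2)

x <- iris[, -5]      # predictors (stand-in data)
y <- iris$Species    # class labels

# Grow 4 sub-forests of 25 trees each in parallel, then merge them
# into a single forest with randomForest::combine()
rf_par <- foreach(ntree = rep(25, 4),
                  .combine = randomForest::combine,
                  .multicombine = TRUE,
                  .packages = "randomForest") %dopar% {
  randomForest(x, y, ntree = ntree)
}

print(rf_par$ntree)          # total number of trees in the merged forest
head(predict(rf_par, x))     # the merged forest predicts like any other
```

Note that, according to the randomForest documentation, a forest produced by combine() no longer carries the out-of-bag error components (err.rate, confusion), so performance has to be estimated separately.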

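Best answer

I have the latest version for the Linux platform, R version 3.2.2 (2015-08-14, "Fire Safety"), and parallelization works fine. Could you provide the code that does not run in parallel?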

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] kernlab_0.9-22  doMC_1.3.3      iterators_1.0.7 foreach_1.4.2   caret_6.0-52    ggplot2_1.0.1   lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0         compiler_3.2.2      nloptr_1.0.4        plyr_1.8.3          tools_3.2.2         digest_0.6.8       
 [7] lme4_1.1-9          nlme_3.1-122        gtable_0.1.2        mgcv_1.8-7          Matrix_1.2-2        brglm_0.5-9        
[13] SparseM_1.7         proto_0.3-10        BradleyTerry2_1.0-6 stringr_1.0.0       gtools_3.5.0        MatrixModels_0.4-1 
[19] stats4_3.2.2        grid_3.2.2          nnet_7.3-10         minqa_1.2.4         reshape2_1.4.1      car_2.0-26         
[25] magrittr_1.5        scales_0.3.0        codetools_0.2-11    MASS_7.3-43         splines_3.2.2       pbkrtest_0.4-2     
[31] colorspace_1.2-6    quantreg_5.18       stringi_0.5-5       munsell_0.4.2      

I ran your code on my local machine with the BreastCancer dataset, and it runs in parallel without any problems. I am using RStudio version 0.98.1103.

library(caret)
library(mlbench)
data(BreastCancer)

library(doMC)  
registerDoMC(cores=2)

classification_formula <- as.formula(paste("Class" ,"~",
                                         paste(names(BreastCancer)[!names(BreastCancer)=='Class'],collapse="+")))

CVfolds <- 2
CVreps  <- 5
ma_control <- trainControl(method = "repeatedcv",
                           number = CVfolds,
                           repeats = CVreps ,
                           returnResamp = "final" ,
                           classProbs = T,
                           summaryFunction = twoClassSummary,
                           allowParallel = TRUE,verboseIter = TRUE)

rf_tuneGrid = expand.grid(mtry = seq(2,32, length.out = 6))

#Notice, it might be easier just to use Class~. 
#instead of classification_formula
rf <- train(classification_formula , 
            data = BreastCancer , 
            method = "rf", 
            metric="ROC" ,
            trControl = ma_control, 
            tuneGrid = rf_tuneGrid , 
            ntree = 101)

> rf
Random Forest 

699 samples
 10 predictors
  2 classes: 'benign', 'malignant' 

No pre-processing
Resampling: Cross-Validated (2 fold, repeated 5 times) 
Summary of sample sizes: 341, 342, 342, 341, 342, 341, ... 
Resampling results across tuning parameters:

 mtry  ROC        Sens       Spec       ROC SD       Sens SD      Spec SD    
   2    0.9867820  1.0000000  0.0000000  0.005007691  0.000000000  0.000000000
   8    0.9899107  0.9549550  0.9640196  0.002243649  0.006714919  0.017247716
  14    0.9907072  0.9558559  0.9631933  0.003028258  0.012345228  0.008019979
  20    0.9909514  0.9635135  0.9556513  0.003268291  0.006864342  0.010471005
  26    0.9911480  0.9630631  0.9539706  0.003384987  0.005113930  0.010628533
  32    0.9911485  0.9657658  0.9522969  0.002973508  0.004842197  0.004090206

ROC was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 32. 
> 
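
The comment in the code above suggests that Class ~ . may be simpler than pasting the formula together. A minimal sketch of why the two forms select the same predictors follows. The data frame toy and its columns are hypothetical stand-ins, not the BreastCancer data.

```r
# Toy data frame standing in for the real data (hypothetical columns)
toy <- data.frame(Class = factor(c("a", "b", "a", "b")),
                  x1 = c(1, 2, 3, 4),
                  x2 = c(4, 3, 2, 1))

# Formula assembled by pasting all non-response column names together
f_built <- as.formula(paste("Class", "~",
                            paste(setdiff(names(toy), "Class"), collapse = "+")))

# Shorthand: "." expands to every column except the response
f_dot <- Class ~ .

labels_built <- attr(terms(f_built), "term.labels")
labels_dot   <- attr(terms(f_dot, data = toy), "term.labels")

print(labels_built)
print(identical(labels_built, labels_dot))
```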

Source: a similar question about caret not running in parallel, found on Stack Overflow: https://stackoverflow.com/questions/32514370/
