r - caret does not run in parallel

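Whether caret actually runs in parallel depends on the versions of R, caret, and the doMC package, as described in "Parallelizing Caret code".

Is anyone working in a similar environment? What is the latest R version on which caret parallelization works correctly?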

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=C                  LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] caret_6.0-52    ggplot2_1.0.1   lattice_0.20-31 doMC_1.3.3      iterators_1.0.7 foreach_1.4.2   RStudioAMI_0.2 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1         magrittr_1.5        splines_3.2.1       MASS_7.3-41         munsell_0.4.2       colorspace_1.2-6   
 [7] minqa_1.2.4         car_2.1-0           stringr_1.0.0       plyr_1.8.3          tools_3.2.1         pbkrtest_0.4-2     
[13] nnet_7.3-9          grid_3.2.1          gtable_0.1.2        nlme_3.1-120        mgcv_1.8-6          quantreg_5.19      
[19] MatrixModels_0.4-1  gtools_3.5.0        lme4_1.1-9          digest_0.6.8        Matrix_1.2-0        nloptr_1.0.4       
[25] reshape2_1.4.1      codetools_0.2-11    stringi_0.5-5       BradleyTerry2_1.0-6 scales_0.3.0        stats4_3.2.1       
[31] SparseM_1.7         brglm_0.5-9         proto_0.3-10

Update 1: My code is as follows:

library(doMC) ; registerDoMC(cores=4)
library(caret)
classification_formula <- as.formula(paste("target" ,"~",
                                             paste(names(m_input_data)[!names(m_input_data)=='target'],collapse="+")))

CVfolds <- 2
CVreps  <- 5
ma_control <- trainControl(method = "repeatedcv",
                             number = CVfolds,
                             repeats = CVreps ,
                             returnResamp = "final" ,
                             classProbs = TRUE,
                             summaryFunction = twoClassSummary,
                             allowParallel = TRUE,verboseIter = TRUE)
rf_tuneGrid <- expand.grid(mtry = seq(2, 32, length.out = 6))
rf <- train(classification_formula,
            data = m_input_data,
            method = "rf",
            metric = "ROC",
            trControl = ma_control,
            tuneGrid = rf_tuneGrid,
            ntree = 101)

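Update 2: When I run the script from the command line, only one core is working. When I run the same script from RStudio, parallelization works, since I see 4 processes in top. But after a second this error occurs: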

  Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : 
   attempt to set an attribute on NULL
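
A quick way to compare the command-line session with the RStudio session is to ask foreach which backend it sees and which processes actually run the tasks. The following is only a minimal sketch, assuming doMC is installed and the platform supports forking (Linux or macOS); the choice of 2 cores and 4 tasks is arbitrary.

```r
library(foreach)
library(doMC)

# Register a forking backend with 2 workers (arbitrary choice for this sketch)
registerDoMC(cores = 2)

# Report what foreach believes is registered in this session
cat("registered:", getDoParRegistered(), "\n")
cat("backend:   ", getDoParName(), "\n")
cat("workers:   ", getDoParWorkers(), "\n")

# Each task returns the PID of the process that executed it
pids <- foreach(i = 1:4, .combine = c) %dopar% Sys.getpid()
cat("parent PID:", Sys.getpid(), "\n")
cat("task PIDs: ", pids, "\n")
```

If the worker count reported here is 1, or the task PIDs equal the parent PID, the session is running sequentially regardless of allowParallel = TRUE.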

Update 4:

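Hi, the problem seems to have been a terminated R session. Every time I started the AWS instance I would just run the R code; now I refresh the R engine first. Each time I reload RStudio in the browser, I do Session -> Restart R. It seems to run that way. I am now checking whether the same holds when running the script from the Ubuntu command line.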

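In general, it runs but does not finish. caret parallelizes at the data level, which means it can process each resample in a separate process. But if each sample is still large (100,000 / 2 rows, since the number of folds is 2, by 2,000 features), it may be hard for a single processor to complete. Am I right?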
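
The size estimate above can be made concrete with a little arithmetic. This is a rough sketch only: it assumes every feature is stored as an 8-byte double and ignores any extra copies made by the model code or by the parallel backend; the worker count of 4 comes from registerDoMC(cores=4) in the question.

```r
n_rows     <- 100000  # total rows, from the question
n_folds    <- 2       # CVfolds in the question
n_features <- 2000    # feature count, from the question
n_workers  <- 4       # cores registered in the question

# With k-fold CV each model is fit on (k - 1) / k of the rows
rows_per_fit <- n_rows * (n_folds - 1) / n_folds

# Assumption: all features are doubles (8 bytes each)
gb_per_worker  <- rows_per_fit * n_features * 8 / 1e9
gb_all_workers <- gb_per_worker * n_workers

cat("rows per fit:       ", rows_per_fit, "\n")
cat("GB per worker:      ", gb_per_worker, "\n")
cat("GB across workers:  ", gb_all_workers, "\n")
```

Under these assumptions each worker holds about 0.8 GB of training data and four workers about 3.2 GB in total, before counting anything the random forest itself allocates.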

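I think the parallelism has to be at the algorithm level, meaning that each algorithm itself runs on multiple cores. Is such an algorithm implementation available in caret?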
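
For reference, algorithm-level parallelism for a random forest can be sketched outside of caret by growing several small forests in parallel and merging them with randomForest::combine. This is not something the thread confirms caret does internally; it is a minimal illustration using foreach and randomForest directly, with the iris data and the tree counts chosen only for the example.

```r
library(foreach)
library(doMC)
library(randomForest)

registerDoMC(cores = 2)

# Toy data, used only for illustration
x <- iris[, 1:4]
y <- iris$Species

# Grow 4 sub-forests of 25 trees each in parallel, then merge them
rf_parallel <- foreach(ntree = rep(25, 4),
                       .combine = randomForest::combine,
                       .multicombine = TRUE,
                       .packages = "randomForest") %dopar% {
  randomForest(x, y, ntree = ntree)
}

# The merged object is a single forest holding all the trees
cat("total trees:", rf_parallel$ntree, "\n")
```

Here the work is split by trees rather than by resamples, so every worker still needs the full training set in memory.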

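Best answer

I have the latest version on a Linux platform, R version 3.2.2 (2015-08-14, "Fire Safety"), and parallelization works fine. Could you provide the code that does not run in parallel?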

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] kernlab_0.9-22  doMC_1.3.3      iterators_1.0.7 foreach_1.4.2   caret_6.0-52    ggplot2_1.0.1   lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0         compiler_3.2.2      nloptr_1.0.4        plyr_1.8.3          tools_3.2.2         digest_0.6.8       
 [7] lme4_1.1-9          nlme_3.1-122        gtable_0.1.2        mgcv_1.8-7          Matrix_1.2-2        brglm_0.5-9        
[13] SparseM_1.7         proto_0.3-10        BradleyTerry2_1.0-6 stringr_1.0.0       gtools_3.5.0        MatrixModels_0.4-1 
[19] stats4_3.2.2        grid_3.2.2          nnet_7.3-10         minqa_1.2.4         reshape2_1.4.1      car_2.0-26         
[25] magrittr_1.5        scales_0.3.0        codetools_0.2-11    MASS_7.3-43         splines_3.2.2       pbkrtest_0.4-2     
[31] colorspace_1.2-6    quantreg_5.18       stringi_0.5-5       munsell_0.4.2

I ran your code on my local machine with the BreastCancer dataset, and it runs in parallel without any problems. I am using RStudio version 0.98.1103.

library(caret)
library(mlbench)
data(BreastCancer)

library(doMC)  
registerDoMC(cores=2)

classification_formula <- as.formula(paste("Class" ,"~",
                                         paste(names(BreastCancer)[!names(BreastCancer)=='Class'],collapse="+")))

CVfolds <- 2
CVreps  <- 5
ma_control <- trainControl(method = "repeatedcv",
                           number = CVfolds,
                           repeats = CVreps ,
                           returnResamp = "final" ,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           allowParallel = TRUE,verboseIter = TRUE)

rf_tuneGrid = expand.grid(mtry = seq(2,32, length.out = 6))

#Notice, it might be easier just to use Class~. 
#instead of classification_formula
rf <- train(classification_formula , 
            data = BreastCancer , 
            method = "rf", 
            metric="ROC" ,
            trControl = ma_control, 
            tuneGrid = rf_tuneGrid , 
            ntree = 101)

> rf
Random Forest 

699 samples
 10 predictors
  2 classes: 'benign', 'malignant' 

No pre-processing
Resampling: Cross-Validated (2 fold, repeated 5 times) 
Summary of sample sizes: 341, 342, 342, 341, 342, 341, ... 
Resampling results across tuning parameters:

 mtry  ROC        Sens       Spec       ROC SD       Sens SD      Spec SD    
   2    0.9867820  1.0000000  0.0000000  0.005007691  0.000000000  0.000000000
   8    0.9899107  0.9549550  0.9640196  0.002243649  0.006714919  0.017247716
  14    0.9907072  0.9558559  0.9631933  0.003028258  0.012345228  0.008019979
  20    0.9909514  0.9635135  0.9556513  0.003268291  0.006864342  0.010471005
  26    0.9911480  0.9630631  0.9539706  0.003384987  0.005113930  0.010628533
  32    0.9911485  0.9657658  0.9522969  0.002973508  0.004842197  0.004090206

ROC was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 32. 
>
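
The comment in the code above suggests that Class ~ . may be easier than building the formula by pasting column names. A small sketch of why the two are interchangeable, using a made-up data frame (the column names target, a and b are hypothetical):

```r
# Hypothetical data frame, used only to compare the two formula styles
df <- data.frame(target = factor(c("yes", "no", "yes", "no")),
                 a = c(1, 2, 3, 4),
                 b = c(4, 3, 2, 1))

# Style used in the question: paste every non-target column name together
pasted <- as.formula(paste("target ~",
                           paste(setdiff(names(df), "target"), collapse = " + ")))

# Shorthand: the dot expands to every column except the response
dotted <- target ~ .

# Compare the predictor terms each formula resolves to for this data
pasted_terms <- attr(terms(pasted), "term.labels")
dotted_terms <- attr(terms(dotted, data = df), "term.labels")
print(pasted_terms)
print(dotted_terms)
```

Both formulas resolve to the same predictors for a given data frame, so the shorter form can be passed to train() together with the data argument.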

Regarding "r - caret does not run in parallel", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32514370/
