r - Incorrect predictions from ranger compared to randomForest

Tags: r random-forest

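I am trying out the ranger R package to speed up a large number of randomForest computations. While checking the predictions it returns, I noticed something odd: the predictions are completely wrong.

Below is a reproducible example comparing randomForest and ranger.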

data(iris)
library(randomForest)


iris_spec <- as.factor(iris$Species)
iris_dat <- as.matrix(iris[, !(names(iris) %in% "Species")])

set.seed(1234)

test_index <- sample(nrow(iris), 10)
train_index <- seq(1, nrow(iris))[-test_index]


iris_train <- randomForest(x = iris_dat[train_index, ], y = iris_spec[train_index], keep.forest = TRUE)
iris_pred <- predict(iris_train, iris_dat[test_index, ])

iris_train$confusion


##            setosa versicolor virginica class.error
## setosa         47          0         0  0.00000000
## versicolor      0         42         3  0.06666667
## virginica       0          4        44  0.08333333


cbind(as.character(iris_pred), as.character(iris_spec[test_index]))
##       [,1]         [,2]        
##  [1,] "setosa"     "setosa"    
##  [2,] "versicolor" "versicolor"
##  [3,] "versicolor" "versicolor"
##  [4,] "versicolor" "versicolor"
##  [5,] "virginica"  "virginica" 
##  [6,] "virginica"  "virginica" 
##  [7,] "setosa"     "setosa"    
##  [8,] "setosa"     "setosa"    
##  [9,] "versicolor" "versicolor"
## [10,] "versicolor" "versicolor"


library(ranger)


iris_train2 <- ranger(data = iris[train_index, ], dependent.variable.name = "Species", write.forest = TRUE)
iris_pred2 <- predict(iris_train2, iris[test_index, ])

iris_train2$classification.table


##             true
## predicted    setosa versicolor virginica
##   setosa         47          0         0
##   versicolor      0         41         3
##   virginica       0          4        45


cbind(as.character(iris_pred2$predictions), as.character(iris_spec[test_index]))

##       [,1]         [,2]        
##  [1,] "versicolor" "setosa"    
##  [2,] "virginica"  "versicolor"
##  [3,] "virginica"  "versicolor"
##  [4,] "virginica"  "versicolor"
##  [5,] "virginica"  "virginica" 
##  [6,] "virginica"  "virginica" 
##  [7,] "versicolor" "setosa"    
##  [8,] "versicolor" "setosa"    
##  [9,] "virginica"  "versicolor"
## [10,] "virginica"  "versicolor"


sessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 22 (Twenty Two)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ranger_0.2.7        randomForest_4.6-12
## 
## loaded via a namespace (and not attached):
## [1] magrittr_1.5  formatR_1.2.1 tools_3.2.2   Rcpp_0.12.1   stringi_0.5-5
## [6] knitr_1.11    stringr_1.0.0 evaluate_0.8

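As you can see, the overall confusion tables look comparable, but the predictions from ranger on the test rows are completely different. Has anyone else run into this?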
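To put a number on the mismatch, the labels printed in the two `cbind()` outputs can be compared directly. The following is a minimal base R sketch; `test_accuracy` is a hypothetical helper (it is not part of ranger or randomForest), and the label vectors are copied by hand from the output shown in the question.

```r
# Hypothetical helper: share of predictions that match the true labels.
test_accuracy <- function(pred, truth) {
  stopifnot(length(pred) == length(truth))
  mean(as.character(pred) == as.character(truth))
}

# True labels of the 10 held-out rows, as printed in the question.
truth <- c("setosa", "versicolor", "versicolor", "versicolor", "virginica",
           "virginica", "setosa", "setosa", "versicolor", "versicolor")

# randomForest matched the truth in all 10 rows of the printed output.
rf_pred <- truth

# ranger predictions, as printed in the question.
ranger_pred <- c("versicolor", "virginica", "virginica", "virginica", "virginica",
                 "virginica", "versicolor", "versicolor", "virginica", "virginica")

test_accuracy(rf_pred, truth)      # 1
test_accuracy(ranger_pred, truth)  # 0.2

# Cross-tabulate to see which classes are confused with which.
table(predicted = ranger_pred, true = truth)
```

In the printed output, every setosa row is predicted as versicolor and every versicolor row as virginica, so only the two virginica rows match.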

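Best answer

This is a bug. It has been fixed in the GitHub version (see https://github.com/mnwright/ranger/issues/6), but the change is not yet on CRAN. I will submit a new version to CRAN soon. In the meantime, please install the GitHub version: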

devtools::install_github("mnwright/ranger/ranger-r-package/ranger")

Update: fixed on CRAN as of November 10.
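Before relying on `predict()` output, it may help to check which ranger version is installed. The sketch below is only a rough guard: `is_possibly_affected` is a hypothetical helper, and treating every version up to 0.2.7 (the version in the question's `sessionInfo()`) as suspect is an assumption, since the answer does not say which release contains the fix.

```r
# Hypothetical helper: flag versions at or below the one shown in the question.
# Assumption: 0.2.7 is the only version confirmed (by the question) to be affected.
is_possibly_affected <- function(version, affected = "0.2.7") {
  package_version(version) <= package_version(affected)
}

if (requireNamespace("ranger", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("ranger"))
  if (is_possibly_affected(installed)) {
    message("ranger ", installed, " may contain the bug; consider updating:")
    message('  install.packages("ranger")')
  } else {
    message("ranger ", installed, " is newer than the version shown in the question.")
  }
} else {
  message("ranger is not installed.")
}
```

After updating, rerunning the reproducible example from the question is the most direct way to confirm that the predictions now agree with the true labels.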

Original question on Stack Overflow: https://stackoverflow.com/questions/33349097/
