RMSE error with a random forest model

Tags: r machine-learning random-forest rpart

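I am trying to train a random forest model, but I get the error below. Do I need different settings for a classification model to resolve the RMSE problem? I tried converting good to a factor, but that raised a new error.

Error: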

Error in train.default(x, y, weights = w, ...) : 
  Metric RMSE not applicable for classification models 
5 stop(paste("Metric", metric, "not applicable for classification models")) 
4 train.default(x, y, weights = w, ...) 
3 train(x, y, weights = w, ...) 
2 train.formula(good ~ ., data = train, method = "rf", trControl = trainControl(method = "cv", 
    5), ntree = 251) 
1 train(like ~ ., data = train, method = "rf", trControl = trainControl(method = "cv", 
    5), ntree = 251) 

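The code I use to train the model is below. I am trying to classify the records in the dataset as "good" based on the values in variables 1-3.

Code: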

library(caret)

set.seed(13518) # for reproducibility
inTrain <- createDataPartition(SampleTestData$good, p = 0.70, list = FALSE)
train <- SampleTestData[inTrain, ]
test_train <- SampleTestData[-inTrain, ]

# Train only if model1 does not already exist in the workspace
if (!exists("model1")) {
  # 5-fold cross-validation, random forest with 251 trees
  model1 <- train(good ~ ., data = train, method = "rf",
                  trControl = trainControl(method = "cv", 5), ntree = 251)
}

I have provided some sample data below. I used dput to output the data as the text that follows.

Data:

structure(list(good = c("True", "True", "True", "False", "False", 
"True", "True", "True", "True", "False", "True", "True", "True", 
"True", "False", "False", "False", "True", "False", "False", 
"True", "False", "True", "False", "True", "False", "True", "True", 
"False", "False", "True", "True", "False", "True", "True", "True", 
"True", "False", "False", "False", "False", "True", "False", 
"True", "True", "True", "False", "True", "False", "True", "False", 
"True", "True", "True", "False", "False", "True", "False", "True"
), variable1 = c("TRUE", "TRUE", "TRUE", "TRUE", 
"FALSE", "TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", 
"TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", 
"TRUE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE"), variable2 = c("TRUE", 
"TRUE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "TRUE", 
"FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "FALSE", 
"TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE", 
"FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", 
"TRUE", "TRUE", "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "FALSE", 
"TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", 
"TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"FALSE", "FALSE", "TRUE"), variable3 = c("FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", 
"TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"TRUE")), .Names = c("good", "variable1", "variable2", 
"variable3"), class = "data.frame", row.names = c(5078L, 
5087L, 5366L, 5568L, 7017L, 8123L, 8145L, 8525L, 11777L, 12355L, 
12586L, 12675L, 14912L, 15503L, 15530L, 15533L, 15598L, 15634L, 
15749L, 15842L, 16216L, 16718L, 16744L, 16792L, 17928L, 20351L, 
20417L, 21083L, 22382L, 23698L, 23807L, 23879L, 23900L, 30431L, 
30897L, 31084L, 31803L, 32007L, 32806L, 37487L, 37656L, 38284L, 
38291L, 38471L, 38786L, 40303L, 40724L, 41222L, 41248L, 41837L, 
42994L, 44423L, 45216L, 46233L, 47012L, 50446L, 52429L, 53197L, 
54590L))

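Accepted answer

Converting good to a factor does in fact appear to solve the problem. Every variable in the dataset takes the value TRUE or FALSE, and all of them are of type character. So why does random forest default to regression rather than classification in this case?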
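As a hedged illustration of the type issue described above, here is a minimal base-R sketch. The data frame `toy` is a hypothetical stand-in for SampleTestData, not the original data. As far as I know, caret's `train` documents "RMSE" as the default metric for regression and "Accuracy" for classification, and the default depends on whether the outcome is a factor; a character outcome is not a factor, which would be consistent with the RMSE default seen in the error.

```r
# Hypothetical toy data: every column is a character vector,
# mirroring the structure of the dput output in the question.
toy <- data.frame(
  good      = c("True", "False", "True", "True"),
  variable1 = c("TRUE", "FALSE", "TRUE", "TRUE"),
  stringsAsFactors = FALSE
)

# Inspect the class of each column before any conversion.
col_classes <- sapply(toy, class)
print(col_classes)

# A character outcome is not a factor.
outcome_is_factor_before <- is.factor(toy$good)

# After conversion the outcome is a factor with two levels.
toy$good <- as.factor(toy$good)
outcome_is_factor_after <- is.factor(toy$good)
outcome_levels <- levels(toy$good)
print(outcome_levels)
```

Running `sapply(SampleTestData, class)` on the real data is a quick way to confirm whether the outcome column is still a character vector.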

Code that fixes the problem:

SampleTestData$good = as.factor(SampleTestData$good)
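For context, the sketch below shows how the conversion might fit into the full training call. It makes several assumptions: `toy` is randomly generated placeholder data rather than the original dataset, `train_set` is a hypothetical name, and `metric = "Accuracy"` is passed explicitly only to make the intended classification metric visible. The caret part is guarded so that it runs only when caret and randomForest are installed.

```r
set.seed(13518)

# Hypothetical placeholder data with the same column names and types
# (character) as the question's dataset.
n <- 60
toy <- data.frame(
  good      = sample(c("True", "False"), n, replace = TRUE),
  variable1 = sample(c("TRUE", "FALSE"), n, replace = TRUE),
  variable2 = sample(c("TRUE", "FALSE"), n, replace = TRUE),
  variable3 = sample(c("TRUE", "FALSE"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# The fix from the answer: make the outcome a factor before training.
toy$good <- as.factor(toy$good)

# Only attempt training when both packages are available.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(caret)
  inTrain <- createDataPartition(toy$good, p = 0.70, list = FALSE)
  train_set <- toy[inTrain, ]

  # With a factor outcome, train() fits a classification model.
  model1 <- train(good ~ ., data = train_set, method = "rf",
                  metric = "Accuracy",
                  trControl = trainControl(method = "cv", number = 5),
                  ntree = 251)
  print(model1$metric)
}
```

Converting the outcome before calling createDataPartition keeps the training and test subsets consistent, since both inherit the same factor levels.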

A similar question about the RMSE error with a random forest model can be found on Stack Overflow: https://stackoverflow.com/questions/33334883/
