I am using the caret package to forecast a time series with the method treebag. caret estimates a bagged regression tree with 25 bootstrap replications.
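What I struggle to understand is how the final prediction of the "treebag model" relates to the predictions made by each of the 25 individual trees, depending on whether or not I use caret::preProcess.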
I am aware of this question and the resources linked in it, but I could not draw the right conclusions from them.
Here is an example using the economics data. Suppose I want to predict unemploy_rate, which has to be created first.
# packages
library(caret)
library(tidyverse)
# data
data("economics")
economics$unemploy_rate <- economics$unemploy / economics$pop * 100
x <- economics[, -c(1, 7)]
y <- economics[["unemploy_rate"]]
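I wrote a function that extracts the 25 trees from the train object, makes a prediction with each tree, averages these 25 predictions, and compares this average with the prediction from the train object. It returns a plot.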
predict_from_treebag <- function(model) {
  # extract 25 trees from train object
  bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")
  # make a prediction for each tree
  pred_trees <- map(bagged_trees, .f = predict, newdata = x)
  names(pred_trees) <- paste0("tree_", seq_along(pred_trees))
  # aggregate predictions
  pred_trees <- as.data.frame(pred_trees) %>%
    add_column(date = economics$date, .before = 1) %>%
    gather(tree, value, matches("^tree")) %>%
    group_by(date) %>%
    mutate(mean_pred_from_trees = mean(value)) %>%
    ungroup()
  # add prediction from train object
  # (repeated once per tree so it matches the long format; recent tibble
  # versions no longer recycle a shorter vector automatically)
  pred_trees$bagging_model_prediction <- rep(predict(model, x), times = length(bagged_trees))
  pred_trees <- pred_trees %>%
    gather(model, pred_value, 4:5)
  # plot
  p <- ggplot(data = pred_trees, aes(date)) +
    geom_line(aes(y = value, group = tree), alpha = .2) +
    geom_line(aes(y = pred_value, col = model)) +
    theme_minimal() +
    theme(
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
  p
}
Now I estimate two models: the first is unscaled, the second is centered and scaled.
preproc_opts <- list(unscaled = NULL,
                     scaled = c("center", "scale"))
# estimate the models
models <- map(preproc_opts, function(preproc)
  train(
    x = x,
    y = y,
    trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
    metric = "RMSE",
    method = "treebag",
    preProcess = preproc
  ))
# apply predict_from_treebag to each model
imap(.x = models,
     .f = ~{predict_from_treebag(.x) + labs(title = .y)})
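The results are as follows. For the unscaled model, the prediction is the average of the 25 trees. But when I use preProcess, why is the prediction of each of the 25 trees a constant?
Thank you for any advice on where I might be going wrong.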
Best answer
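The problem is in this part of the code:
pred_trees <- map(bagged_trees, .f = predict, newdata = x)
in the function predict_from_treebag.
The predict function called here is actually predict.rpart, because each extracted tree is an rpart object:
class(bagged_trees[[1]])
predict.rpart does not know that caret preprocessed the data, so it receives the raw predictors even though the trees were fitted on the centered and scaled ones.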
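To illustrate why the per-tree predictions can collapse to a constant, here is a minimal sketch on made-up data. It does not use caret at all: the centering and scaling are done by hand as a stand-in for preProcess = c("center", "scale"), and the names raw, scaled, and fit are purely illustrative. The tree learns split thresholds in standardized units, so raw values that are all far larger than those thresholds end up in the same leaf.

```r
library(rpart)

set.seed(1)

# Hypothetical toy data: one predictor on a large scale
raw <- data.frame(x = runif(200, min = 1000, max = 2000))
y <- 0.01 * raw$x + rnorm(200, sd = 0.1)

# Center and scale by hand (assumed stand-in for caret's preprocessing)
scaled <- data.frame(x = (raw$x - mean(raw$x)) / sd(raw$x))

# Fit the tree on the scaled predictor, as caret would after preprocessing
fit <- rpart(y ~ x, data = data.frame(x = scaled$x, y = y))

# Predicting on raw values: every value exceeds every split threshold
pred_raw <- predict(fit, newdata = raw)

# Predicting on scaled values: observations are spread across the leaves
pred_scaled <- predict(fit, newdata = scaled)

length(unique(pred_raw))
length(unique(pred_scaled))
```

The first count should be 1 and the second greater than 1, which mirrors the flat lines in the scaled plot from the question.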
Here is a quick fix:
predict_from_treebag <- function(model) {
  # extract 25 trees from train object
  bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")
  x <- economics[, -c(1, 7)]
  # apply the stored preprocessing (if any) before predicting with each tree
  newdata <- if (is.null(model$preProcess)) x else predict(model$preProcess, x)
  pred_trees <- map(bagged_trees, .f = predict, newdata = newdata)
  names(pred_trees) <- paste0("tree_", seq_along(pred_trees))
  # aggregate predictions
  pred_trees <- as.data.frame(pred_trees) %>%
    add_column(date = economics$date, .before = 1) %>%
    gather(tree, value, matches("^tree")) %>%
    group_by(date) %>%
    mutate(mean_pred_from_trees = mean(value)) %>%
    ungroup()
  # add prediction from train object
  # (repeated once per tree so it matches the long format; recent tibble
  # versions no longer recycle a shorter vector automatically)
  pred_trees$bagging_model_prediction <- rep(predict(model, x), times = length(bagged_trees))
  pred_trees <- pred_trees %>%
    gather(model, pred_value, 4:5)
  # plot
  p <- ggplot(data = pred_trees, aes(date)) +
    geom_line(aes(y = value, group = tree), alpha = .2) +
    geom_line(aes(y = pred_value, col = model)) +
    theme_minimal() +
    theme(
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
  p
}
Now, after running:
preproc_opts <- list(unscaled = NULL,
                     scaled = c("center", "scale"))
models <- map(preproc_opts, function(preproc)
  train(
    x = x,
    y = y,
    trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
    metric = "RMSE",
    method = "treebag",
    preProcess = preproc
  ))
map2(.x = models,
     .y = names(models),
     .f = ~{predict_from_treebag(.x) + labs(title = .y)})
the result is as expected.
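As a numerical check to go with the plots, one could compare the averaged per-tree predictions with the output of predict on the train object. This is only a sketch: it assumes, as the fix does, that the train object keeps its preprocessing in model$preProcess, and that the bagged trees are combined by a plain average. The seed and the names x_pp and tree_preds are illustrative.

```r
library(caret)
library(ggplot2)  # provides the economics data

data("economics", package = "ggplot2")
economics$unemploy_rate <- economics$unemploy / economics$pop * 100
x <- as.data.frame(economics[, -c(1, 7)])
y <- economics[["unemploy_rate"]]

set.seed(1)
model <- train(
  x = x,
  y = y,
  method = "treebag",
  trControl = trainControl(method = "none"),
  preProcess = c("center", "scale")
)

# Extract the individual rpart trees from the bagged model
trees <- lapply(model$finalModel$mtrees, function(m) m$btree)

# Apply the stored centering and scaling before using the trees directly
x_pp <- predict(model$preProcess, x)

# One column of predictions per tree
tree_preds <- sapply(trees, predict, newdata = x_pp)

# Compare the per-row average with caret's own prediction
all.equal(unname(rowMeans(tree_preds)), unname(predict(model, x)))
```

If the assumptions hold, the comparison should report that the two vectors agree. Replacing x_pp with x in the sapply call reproduces the mismatch described in the question.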
Source: the Stack Overflow question "r - Predictions of individual bagged tree models depend on preprocessing with caret": https://stackoverflow.com/questions/48111061/