r - Predictions of individual bagged trees depend on whether pre-processing is used with caret

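I am using the caret package to predict a time series with the method treebag. caret estimates a bagged regression tree with 25 bootstrap replications.

What I am struggling to understand is how the final prediction of the "treebag model" relates to the predictions made by each of the 25 individual trees, and why that relationship depends on whether or not I use caret::preProcess.

I am aware of this question and the resources linked in it, but I could not draw the right conclusions from them.

Here is an example using the economics data. Suppose I want to predict unemploy_rate, which has to be created first.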

# packages
library(caret)
library(tidyverse)

# data
data("economics")

economics$unemploy_rate <- economics$unemploy / economics$pop * 100
x <- economics[, -c(1, 7)]
y <- economics[["unemploy_rate"]]

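I wrote a function that extracts the 25 trees from the train object, makes a prediction with each tree, averages these 25 predictions, and compares the average with the prediction of the train object. It returns a plot.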

predict_from_treebag <- function(model) {
  # extract 25 trees from train object
  bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")

  # make a prediction for each tree
  pred_trees <- map(bagged_trees, .f = predict, newdata = x)
  names(pred_trees) <- paste0("tree_", seq_along(pred_trees))

  # aggregate predictions
  pred_trees <- as.data.frame(pred_trees) %>%
    add_column(date = economics$date, .before = 1) %>%
    gather(tree, value, matches("^tree")) %>%
    group_by(date) %>%
    mutate(mean_pred_from_trees = mean(value)) %>%
    ungroup()

  # add prediction from train object
  pred_trees$bagging_model_prediction = predict(model, x)
  pred_trees <- pred_trees %>%
    gather(model, pred_value, 4:5)

  # plot
  p <- ggplot(data = pred_trees, aes(date)) +
        geom_line(aes(y = value, group = tree), alpha = .2) +
        geom_line(aes(y = pred_value, col = model)) +
        theme_minimal() +
        theme(
         panel.grid.major = element_blank(),
         panel.grid.minor = element_blank(),
         legend.position = "bottom"
        )

  p

}

Now I estimate two models: the first is unscaled, the second is centered and scaled.

preproc_opts <- list(unscaled = NULL,
                     scaled = c("center", "scale"))

# estimate the models
models <- map(preproc_opts, function(preproc)
    train(
    x = x,
    y = y,
    trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
    metric = "RMSE",
    method = "treebag",
    preProcess = preproc
))

# apply predict_from_treebag to each model
imap(.x = models,
     .f = ~{predict_from_treebag(.x) + labs(title = .y)})

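In the resulting plots, the unscaled model's prediction is the average of the 25 trees. But when I use preProcess, why is the prediction from each of the 25 trees a constant?

Thank you for any advice on where I might be going wrong.

Best answer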

The problem is in this part of the code:

pred_trees <- map(bagged_trees, .f = predict, newdata = x)

in the function predict_from_treebag.

This predict call actually dispatches to predict.rpart, as you can see from

class(bagged_trees[[1]])

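predict.rpart does not know that you pre-processed the data in caret. The trees were fit on centered and scaled predictors, so their split thresholds are in scaled units. If you pass the raw x, most values lie far outside that range, so the rows tend to end up in the same leaf, which would explain the constant prediction from each tree.

Here is a quick fix: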
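To see the effect in isolation, here is a minimal sketch with a single rpart tree. It is not taken from the question: the mtcars data and the object names (x_raw, x_scaled, fit) are assumptions chosen only for illustration, and scale() stands in for caret's center/scale pre-processing.

```r
library(rpart)

# Illustrative data (an assumption, not the economics data from the question)
x_raw <- mtcars[, c("disp", "hp", "wt")]

# Center and scale the predictors, similar in spirit to preProcess = c("center", "scale")
x_scaled <- as.data.frame(scale(x_raw))

# Fit one regression tree on the scaled predictors
fit <- rpart(mpg ~ ., data = data.frame(mpg = mtcars$mpg, x_scaled))

# Predicting on data in the units the tree was trained on gives varying predictions
pred_scaled <- predict(fit, newdata = x_scaled)

# Predicting on the raw data sends every row down the same branch of each split
pred_raw <- predict(fit, newdata = x_raw)

length(unique(pred_scaled))
length(unique(pred_raw))
```

The first count should be larger than one, while the raw-data predictions collapse to a single value, which mirrors the flat lines in the scaled plot.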

predict_from_treebag <- function(model) {
  # extract 25 trees from train object
  bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")
  x <- economics[, -c(1, 7)]

  # preprocess the predictors the same way caret did during training (if at all)
  newdata <- if (is.null(model$preProcess)) x else predict(model$preProcess, x)

  # make a prediction for each tree
  pred_trees <- map(bagged_trees, .f = predict, newdata = newdata)
  names(pred_trees) <- paste0("tree_", seq_along(pred_trees))

  # aggregate predictions
  pred_trees <- as.data.frame(pred_trees) %>%
    add_column(date = economics$date, .before = 1) %>%
    gather(tree, value, matches("^tree")) %>%
    group_by(date) %>%
    mutate(mean_pred_from_trees = mean(value)) %>%
    ungroup()

  # add prediction from train object
  pred_trees$bagging_model_prediction = predict(model, x)
  pred_trees <- pred_trees %>%
    gather(model, pred_value, 4:5)

  # plot
  p <- ggplot(data = pred_trees, aes(date)) +
    geom_line(aes(y = value, group = tree), alpha = .2) +
    geom_line(aes(y = pred_value, col = model)) +
    theme_minimal() +
    theme(
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )

  p
}

Now, after running:

preproc_opts <- list(unscaled = NULL,
                     scaled = c("center", "scale"))

models <- map(preproc_opts, function(preproc)
  train(
    x = x,
    y = y,
    trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
    metric = "RMSE",
    method = "treebag",
    preProcess = preproc
  ))

map2(.x = models,
     .y = names(models),
     .f = ~{predict_from_treebag(.x) + labs(title = .y)})

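the result is as expected: with both models, the prediction of the train object is the average of the 25 individual trees.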
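If you prefer a numeric check over comparing lines in a plot, something like the following sketch could be used. The helper name check_treebag_average is hypothetical, and the sketch assumes that the bagged model aggregates its trees by a plain average, as observed above for the unscaled model.

```r
library(caret)

# Hypothetical helper: compares the mean of the per-tree predictions
# with the prediction of the caret train object.
check_treebag_average <- function(model, x) {
  # Apply caret's stored preprocessing, if the model was trained with one
  newdata <- if (is.null(model$preProcess)) x else predict(model$preProcess, x)

  # One column of predictions per bagged tree
  per_tree <- sapply(
    model$finalModel$mtrees,
    function(m) predict(m$btree, newdata = newdata)
  )

  # TRUE if the row-wise mean matches predict(model, x) up to numerical tolerance
  all.equal(as.numeric(rowMeans(per_tree)), as.numeric(predict(model, x)))
}
```

Calling check_treebag_average(models$scaled, x) and check_treebag_average(models$unscaled, x) should then agree with what the plots show.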

The original question can be found on Stack Overflow: https://stackoverflow.com/questions/48111061/
