r - Predictions from individual bagged trees depend on preprocessing with caret

Tags: r tree r-caret

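I am using the caret package to forecast a time series with method treebag. caret estimates a bagged regression tree with 25 bootstrap replications.

What I struggle to understand is how the final prediction of the "treebag model" relates to the predictions made by each of the 25 trees, depending on whether or not I use caret::preProcess.

I am aware of this question and the resources linked in it (but could not draw the right conclusions from them).

Here is an example using the economics data. Suppose I want to predict unemploy_rate, which has to be created first: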

# packages
library(caret)
library(tidyverse)

# data
data("economics")

economics$unemploy_rate <- economics$unemploy / economics$pop * 100
x <- economics[, -c(1, 7)]
y <- economics[["unemploy_rate"]]

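I wrote a function that extracts the 25 trees from the train object, makes a prediction with each tree, averages these 25 predictions, and compares this average with the prediction from the train object. It returns a plot.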

predict_from_treebag <- function(model) {
  # extract 25 trees from train object
  bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")

  # make a prediction for each tree
  pred_trees <- map(bagged_trees, .f = predict, newdata = x)
  names(pred_trees) <- paste0("tree_", seq_along(pred_trees))

  # aggregate predictions
  pred_trees <- as.data.frame(pred_trees) %>%
    add_column(date = economics$date, .before = 1) %>%
    gather(tree, value, matches("^tree")) %>%
    group_by(date) %>%
    mutate(mean_pred_from_trees = mean(value)) %>%
    ungroup()

  # add prediction from train object
  pred_trees$bagging_model_prediction = predict(model, x)
  pred_trees <- pred_trees %>%
    gather(model, pred_value, 4:5)

  # plot
  p <- ggplot(data = pred_trees, aes(date)) +
        geom_line(aes(y = value, group = tree), alpha = .2) +
        geom_line(aes(y = pred_value, col = model)) +
        theme_minimal() +
        theme(
         panel.grid.major = element_blank(),
         panel.grid.minor = element_blank(),
         legend.position = "bottom"
        )

  p

}

Now I estimate two models: the first is unscaled, the second is centered and scaled.

preproc_opts <- list(unscaled = NULL,
                     scaled = c("center", "scale"))

# estimate the models
models <- map(preproc_opts, function(preproc)
    train(
    x = x,
    y = y,
    trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
    metric = "RMSE",
    method = "treebag",
    preProcess = preproc
))

# apply predict_from_treebag to each model
imap(.x = models,
     .f = ~{predict_from_treebag(.x) + labs(title = .y)})

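The results are shown below. The unscaled model's prediction is the average of the 25 trees, but why is the prediction of each of the 25 trees a constant when I use preProcess?

Thanks for any advice on where I might be going wrong.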

[Figure: predictions of the 25 trees and of the train object, unscaled model]

[Figure: predictions of the 25 trees and of the train object, scaled model]

Best answer

The problem is this line in the function predict_from_treebag:

pred_trees <- map(bagged_trees, .f = predict, newdata = x)

The predict call here actually dispatches to predict.rpart, as the class of the extracted trees shows:

class(bagged_trees[[1]])

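predict.rpart does not know that caret preprocessed the data: the trees were fit on centered and scaled predictors, but here they receive the raw x. The raw values lie far from the scaled split points, so the rows tend to land in the same terminal node, which gives the constant predictions.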
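To illustrate the effect outside of caret, here is a minimal sketch that uses only rpart and simulated data. The data, the variable names (raw, scaled, fit) and the use of scale() as a stand-in for preProcess = c("center", "scale") are assumptions made for this illustration; they are not part of the original question.

```r
library(rpart)

set.seed(1)

# Simulated predictor on a "raw" scale far away from zero (illustrative data)
raw <- data.frame(x = runif(200, min = 100, max = 200))
y <- raw$x / 10 + rnorm(200, sd = 0.5)

# Center and scale the predictor by hand, standing in for caret's preProcess
scaled <- data.frame(x = as.numeric(scale(raw$x)))

# Fit a single regression tree on the scaled predictor
fit <- rpart(y ~ x, data = data.frame(scaled, y = y))

# Predicting with raw values: every row lies above all split points,
# so all rows follow the same path through the tree
pred_raw <- predict(fit, newdata = raw)

# Predicting with values on the scale the tree was trained on
pred_scaled <- predict(fit, newdata = scaled)

length(unique(pred_raw))     # number of distinct predictions from raw input
length(unique(pred_scaled))  # number of distinct predictions from scaled input
```

With raw input the tree returns a single value for every row, while the scaled input reaches several terminal nodes. The same mismatch happens to each of the 25 trees extracted from the train object.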

Here is a quick fix:

predict_from_treebag <- function(model) {
  # extract 25 trees from train object
  bagged_trees <- map(.x = model$finalModel$mtrees, .f = pluck, "btree")
  x <- economics[, -c(1, 7)]
  # make a prediction for each tree

  newdata = if(is.null(model$preProcess)) x else predict(model$preProcess, x)
  pred_trees <- map(bagged_trees, .f = predict, newdata = newdata)
  names(pred_trees) <- paste0("tree_", seq_along(pred_trees))

  # aggregate predictions
  pred_trees <- as.data.frame(pred_trees) %>%
    add_column(date = economics$date, .before = 1) %>%
    gather(tree, value, matches("^tree")) %>%
    group_by(date) %>%
    mutate(mean_pred_from_trees = mean(value)) %>%
    ungroup()

  # add prediction from train object
  pred_trees$bagging_model_prediction = predict(model, x)
  pred_trees <- pred_trees %>%
    gather(model, pred_value, 4:5)

  # plot
  p <- ggplot(data = pred_trees, aes(date)) +
    geom_line(aes(y = value, group = tree), alpha = .2) +
    geom_line(aes(y = pred_value, col = model)) +
    theme_minimal() +
    theme(
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )

  p
}

Now, after running:

preproc_opts <- list(unscaled = NULL,
                     scaled = c("center", "scale"))

models <- map(preproc_opts, function(preproc)
  train(
    x = x,
    y = y,
    trControl = trainControl(method = "none"), # since there are no tuning parameters for this model
    metric = "RMSE",
    method = "treebag",
    preProcess = preproc
  ))

map2(.x = models,
     .y = names(models),
     .f = ~{predict_from_treebag(.x) + labs(title = .y)})

the result is as expected:
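The plots can be complemented with a numeric check. The following is a sketch, not part of the original answer: it refits only the scaled model, applies the stored preProcess object before calling predict on each tree, and compares the row-wise mean with the prediction from the train object. The names fit, trees, x_pp, tree_preds and manual_mean are introduced here for illustration, and the conversion with as.data.frame() is a precaution I added.

```r
library(caret)

# Same data preparation as in the question
data("economics", package = "ggplot2")
economics$unemploy_rate <- economics$unemploy / economics$pop * 100
x <- as.data.frame(economics[, -c(1, 7)])
y <- economics[["unemploy_rate"]]

set.seed(1)
fit <- train(
  x = x,
  y = y,
  method = "treebag",
  trControl = trainControl(method = "none"),
  preProcess = c("center", "scale")
)

# Extract the individual rpart trees from the bagged model
trees <- lapply(fit$finalModel$mtrees, function(m) m$btree)

# Apply the same centering and scaling that train() used
x_pp <- predict(fit$preProcess, x)

# One column of predictions per tree, then average across trees
tree_preds <- sapply(trees, predict, newdata = x_pp)
manual_mean <- rowMeans(tree_preds)

# Compare with the prediction of the train object
all.equal(unname(manual_mean), unname(predict(fit, x)))
```

If the preprocessing step is skipped (passing x instead of x_pp to the trees), the comparison should no longer hold, which is the discrepancy described in the question.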

[Figure: predictions of the 25 trees and of the train object, unscaled model]

[Figure: predictions of the 25 trees and of the train object, scaled model]

The original question, "r - Predictions from individual bagged trees depend on preprocessing with caret", can be found on Stack Overflow: https://stackoverflow.com/questions/48111061/