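I am using the gbm package in R to fit a boosted regression tree (BRT) model of the form:
height above ground (HAG) ~ age + season + habitat + time of day
Height above ground is a continuous variable, and so is time of day. Season and habitat are binary (two-level) variables.
My deviance is very high, but I do not know why... Can someone help me set the parameters?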
> M1 <- gbm.step(data=data, gbm.x = 2:5, gbm.y = 1,
+ family = "gaussian", tree.complexity = 4,
+ learning.rate = 0.01, bag.fraction = 0.50,
+ tolerance.method = "fixed",
+ tolerance = 0.01)
GBM STEP - version 2.9
Performing cross-validation optimisation of a boosted regression tree model
for HAG and using a family of gaussian
Using 15439 observations and 4 predictors
creating 10 initial models of 50 trees
folds are unstratified
total mean deviance = 55368.22
tolerance is fixed at 0.01
ntrees resid. dev.
50 51050.65
now adding trees...
100 48935.65
150 47805.14
200 47193.43
250 46841.71
300 46631.33
350 46498.56
400 46418.58
450 46371.7
500 46336.54
550 46317.53
600 46309.25
650 46300.57
700 46296.82
750 46297
800 46299.11
850 46297.7
900 46298.34
950 46292.32
1000 46297.62
1050 46295.78
1100 46301.32
1150 46306.59
1200 46312.55
1250 46314.67
1300 46318.64
1350 46321.38
1400 46324.33
1450 46322.9
fitting final gbm model with a fixed number of 950 trees for HAG
mean total deviance = 55368.21
mean residual deviance = 45913.34
estimated cv deviance = 46292.32 ; se = 1366.501
training data correlation = 0.413
cv correlation = 0.406 ; se = 0.008
elapsed time - 0.02 minutes
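Best answer
For family = "gaussian", the deviance reported by gbm is the mean squared error, so its magnitude depends on the scale of the dependent variable. A large deviance on its own does not mean the parameters are wrong.
For example: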
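library(dismo)    # provides gbm.step()
library(mlbench)  # provides the BostonHousing data
data(BostonHousing)
# Random split: 400 rows for training, the remaining rows for testing
idx = sample(nrow(BostonHousing), 400)
TrnData = BostonHousing[idx, ]
TestData = BostonHousing[-idx, ]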
The dependent variable is the last column, "medv", so we run gbm.step on the original data:
gbm_0 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")
mean total deviance = 84.02
mean residual deviance = 7.871
estimated cv deviance = 13.959 ; se = 1.909
training data correlation = 0.952
cv correlation = 0.916 ; se = 0.012
You can see that the mean residual deviance can also be computed directly from the residuals (i.e. y minus predicted y):
mean(gbm_0$residuals^2)
[1] 7.871158
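Written out, and assuming gbm.step reports the Gaussian deviance as an unweighted mean of squared errors (which is what the check above suggests, but is an assumption about the implementation rather than something shown here), the two deviances in the output would be:

```latex
D_{\text{residual}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
D_{\text{total}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2
```

Here \hat{y}_i is the model prediction and \bar{y} is the mean of the response. Both quantities are in squared units of the response, which is why they grow quickly when the response takes large values.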
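It is always good to evaluate on test data (data the model has not been trained on). You can also check how close the predictions are to the actual values using the correlation or the MAE (mean absolute error):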
pred = predict(gbm_0, TestData, n.trees = 1000)
# Spearman rank correlation; use method = "pearson" if you prefer
cor(pred, TestData$medv, method = "spearman")
[1] 0.8652737
# MAE
mean(abs(TestData$medv-pred))
[1] 2.75325
So you have a good correlation, and your predictions are off by about 3 on average (the MAE is 2.75, in the units of medv).
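One way to put the reported deviances on a more readable footing is to take the square root (an RMSE in the units of the response) and to compare the cross-validated deviance with the total deviance. The sketch below only does arithmetic on numbers copied from the outputs in this thread; `summarise_deviance` is a hypothetical helper, not a function from dismo or gbm.

```r
# Hypothetical helper (not part of dismo or gbm): turn the two deviances
# printed by gbm.step into quantities that are easier to interpret.
summarise_deviance <- function(total_dev, cv_dev) {
  c(
    cv_rmse        = sqrt(cv_dev),             # same units as the response
    prop_explained = 1 - cv_dev / total_dev    # unit-free share of deviance explained
  )
}

# Numbers copied from the HAG model output in the question
hag <- summarise_deviance(total_dev = 55368.22, cv_dev = 46292.32)

# Numbers copied from the BostonHousing model output in this answer
boston <- summarise_deviance(total_dev = 84.02, cv_dev = 13.959)

print(round(hag, 3))
print(round(boston, 3))
```

With these inputs the HAG model explains about 16% of the total deviance under cross-validation and the BostonHousing model about 83%. That ratio is probably a fairer way to compare the two fits than the raw deviances, since it does not depend on the units of the response.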
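Now, if you change the scale of the dependent variable, the deviance changes, while your interpretation based on the correlation or the MAE stays the same. Multiplying medv by 2 roughly quadruples the deviance and doubles the MAE, and leaves the correlation essentially unchanged: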
TrnData$medv = TrnData$medv*2
TestData$medv = TestData$medv*2
gbm_2 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")
mean total deviance = 336.081
mean residual deviance = 30.983
estimated cv deviance = 57.52 ; se = 10.254
training data correlation = 0.953
cv correlation = 0.911 ; se = 0.019
elapsed time - 0.2 minutes
pred = predict(gbm_2,TestData,1000)
cor(pred,TestData$medv,method="spearman")
[1] 0.8676821
mean(abs(TestData$medv-pred))
[1] 5.47673
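The same scaling behaviour can be reproduced without gbm. The sketch below swaps in an ordinary linear model (`lm`) on simulated data, purely because it is deterministic and needs no extra packages; the data, the seed and the `fit_metrics` helper are all made up for illustration.

```r
set.seed(1)  # arbitrary seed, only so the simulated data are reproducible

# Simulated data (hypothetical, not the BostonHousing or HAG data)
n <- 200
x <- runif(n)
y <- 3 * x + rnorm(n)

# Hypothetical helper: MSE (the Gaussian deviance), MAE and correlation
# between fitted and observed values for a simple linear fit.
fit_metrics <- function(y, x) {
  pred <- fitted(lm(y ~ x))
  c(mse = mean((y - pred)^2),
    mae = mean(abs(y - pred)),
    cor = cor(pred, y))
}

m1 <- fit_metrics(y, x)      # response on its original scale
m2 <- fit_metrics(2 * y, x)  # response multiplied by 2

print(rbind(original = m1, doubled = m2))
print(m2 / m1)  # ratios: mse 4, mae 2, cor 1 (up to floating point)
```

For a least-squares fit the ratios are exact. With gbm.step they are only approximate (30.983 / 7.871 is about 3.94 in the output above), most likely because each run draws different random subsamples and cross-validation folds.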
Regarding "r - Boosted regression trees - deviance value", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60488587/