r - Cross-validation for glm() models

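I am trying to run 10-fold cross-validation on some glm models that I built earlier in R. I am a little confused by the cv.glm() function in the boot package, even though I have read a lot of help files. When I make the following call: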

library(boot)
cv.glm(data, glmfit, K=10)

Does the "data" argument here refer to the whole dataset or only to the test set?

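The examples I have seen so far pass the test set as the "data" argument, but that does not really make sense: why run 10 folds on the same test set? They would all give exactly the same result (I assume!).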

Unfortunately, ?cv.glm explains it rather vaguely:

data: A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response

My other question is about the $delta[1] result. Is this the average prediction error over the 10 trials? What if I want the error for each fold?

Here is what my script looks like:

##data partitioning
sub <- sample(nrow(data), floor(nrow(data) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]

##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
        family = "binomial", data = training)

##cross-validation
cv.glm(testing, model, K=10)

Best answer

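I am always a little cautious about the 10-fold cross-validation methods that various packages provide. I have my own simple script that manually creates the test and training partitions for any machine learning package: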

#Randomly shuffle the data
yourData<-yourData[sample(nrow(yourData)),]

#Create 10 equally sized folds
folds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)

#Perform 10 fold cross validation
for(i in 1:10){
    #Segment your data by fold using the which() function
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]
    #Use test and train data partitions however you desire...
}
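
As a rough sketch of how the loop above could be filled in for the glm model from the question, the following refits the model on each set of training folds and records one error per fold. The simulated data frame, the 0.5 classification threshold and the names cost, foldErrors, fullFit and cvResult are assumptions made for illustration; they are not part of the original post. The last few lines call cv.glm() on the full data for comparison.

```r
# Hypothetical simulated data standing in for the asker's data frame;
# the column names follow the question, the values are made up.
set.seed(42)
n <- 200
yourData <- data.frame(var1 = rnorm(n), var2 = rnorm(n), var3 = rnorm(n))
yourData$groupcol <- rbinom(n, 1, plogis(0.5 * yourData$var1 - yourData$var2))

# Shuffle the rows and assign each row to one of 10 folds
yourData <- yourData[sample(nrow(yourData)), ]
folds <- cut(seq_len(nrow(yourData)), breaks = 10, labels = FALSE)

# Assumed cost function: misclassification rate at a 0.5 threshold
cost <- function(y, p) mean(abs(y - p) > 0.5)

foldErrors <- numeric(10)
for (i in 1:10) {
  testIndexes <- which(folds == i)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]

  # Refit on the training folds only, then predict the held-out fold
  fit <- glm(groupcol ~ var1 + var2 + var3, family = binomial, data = trainData)
  p <- predict(fit, newdata = testData, type = "response")
  foldErrors[i] <- cost(testData$groupcol, p)
}

foldErrors        # one error per fold
mean(foldErrors)  # overall 10-fold estimate

# For comparison: cv.glm() given the full data and a model fitted to that data
library(boot)
fullFit <- glm(groupcol ~ var1 + var2 + var3, family = binomial, data = yourData)
cvResult <- cv.glm(yourData, fullFit, cost = cost, K = 10)
cvResult$delta[1]  # raw cross-validation estimate of prediction error
```

As far as the boot documentation goes, glmfit is expected to be a model fitted to data, and cv.glm() refits it on the training folds itself, so passing the full dataset rather than a held-out test set appears to be the intended usage. The returned delta holds the raw cross-validation estimate of prediction error in delta[1] and a bias-adjusted version in delta[2]; per-fold errors are not returned, which is one reason to use a manual loop like the one above.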

For "r - Cross-validation for glm() models", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21380236/
