machine-learning - Parameter selection and k-fold cross-validation

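I have one dataset and need to run cross-validation, for example 10-fold, over the entire dataset. I want to use an SVM with a radial basis function (RBF) kernel and parameter selection (the RBF kernel has two parameters: C and gamma). Usually, people select the SVM hyperparameters on a development set and then apply the best hyperparameters found there to a test set for evaluation. In my case, however, the original dataset is partitioned into 10 subsets, and each subset in turn is tested with a classifier trained on the remaining 9. So there is no fixed training and test data. How should I select the hyperparameters in this situation?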

Best answer

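Is there a specific reason your data is partitioned into exactly those 10 partitions? If not, you can concatenate and shuffle them together again, then run regular (repeated) cross-validation to perform a grid search over the parameters. For example, 10 partitions and 10 repeats give 100 training and evaluation sets in total. Every parameter set is trained and evaluated on all of them, so you get 100 results for each parameter set you tried. The average performance of each parameter set can then be computed from its 100 results.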
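To make the counting concrete, here is a minimal Python sketch of the 10 x 10 scheme described above, using only the standard library. The function names and the `evaluate` callable are hypothetical: `evaluate` stands for whatever code trains a model with the given parameters on the training indices and returns its score on the test indices.

```python
import random
from statistics import mean


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs, reshuffling before each repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Fold i takes every n_folds-th shuffled index, so fold sizes
        # differ by at most one sample.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
            yield train_idx, test_idx


def mean_score_per_param_set(param_sets, splits, evaluate):
    """Average a user-supplied evaluate(params, train_idx, test_idx) over all splits."""
    return [(params, mean(evaluate(params, tr, te) for tr, te in splits))
            for params in param_sets]


splits = list(repeated_kfold(n_samples=150))
print(len(splits))  # 10 folds x 10 repeats = 100
```

Every parameter set is scored on the same 100 splits, which keeps the per-set averages comparable.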

This procedure is already built into most machine learning tools, as in this short R example using the caret library:

library(caret)    # train(), trainControl()
library(lattice)  # levelplot()
library(doMC)     # parallel backend (Unix-like systems only)
registerDoMC(3)   # use 3 worker processes

model <- train(x = iris[, 1:4],
               y = iris[, 5],
               method = 'svmRadial',
               preProcess = c('center', 'scale'),
               # every combination of these C and sigma values gets evaluated
               tuneGrid = expand.grid(C = 3^(-3:3), sigma = 3^(-3:3)),
               trControl = trainControl(method = 'repeatedcv',
                                        number = 10,   # 10 partitions
                                        repeats = 10,  # 10 repeats
                                        # store results of all parameter sets on all partitions and repeats
                                        returnResamp = 'all',
                                        allowParallel = TRUE))

# performance of each parameter set (e.g. mean and standard deviation of accuracy)
print(model$results)
# visualization of the above
levelplot(x = Accuracy ~ C * sigma, data = model$results, col.regions = gray(100:0/100), scales = list(log = 3))
# results of all parameter sets over all partitions and repeats; the metrics above are computed from these
str(model$resample)

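Once you have evaluated a grid of hyperparameters, you can choose a reasonable parameter set ("model selection"), for example by picking a model that performs well but is still reasonably simple.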
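One common way to trade performance against simplicity is a "one-standard-error" heuristic; the answer does not name a specific rule, so the Python sketch below is only one possible reading. It assumes each row is a dict with the hypothetical keys `C`, `sigma`, `accuracy` (mean over resamples), and `sd`, and it assumes that a smaller `sigma`, then a smaller `C`, means a simpler model. Because repeated cross-validation resamples overlap, the standard error here is only a rough guide.

```python
from math import sqrt


def pick_simple_good_model(rows, n_resamples):
    """Return the simplest row whose mean accuracy lies within one
    standard error of the best mean accuracy."""
    best = max(rows, key=lambda r: r["accuracy"])
    threshold = best["accuracy"] - best["sd"] / sqrt(n_resamples)
    candidates = [r for r in rows if r["accuracy"] >= threshold]
    # Assumption: smaller sigma (smoother kernel), then smaller C
    # (stronger regularization), counts as "simpler".
    return min(candidates, key=lambda r: (r["sigma"], r["C"]))
```

With 10 partitions and 10 repeats, `n_resamples` would be 100.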

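By the way: if possible, I would recommend repeated cross-validation over plain cross-validation (possibly with more than 10 repeats, though the details depend on your problem). And, as @christian-cerri has already suggested, it is a good idea to keep an additional, unseen test set for estimating the performance of the final model on new data.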
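For readers working in Python, the same idea (repeated cross-validation for the grid search plus an untouched test set) might look roughly like this with scikit-learn. This is a sketch and not part of the original answer: the 80/20 split, the 3 repeats, and the reuse of the 3^(-3..3) grid for `gamma` are assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Set aside a test set that the hyperparameter search never sees.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = [3.0 ** k for k in range(-3, 4)]
param_grid = {"svc__C": grid, "svc__gamma": grid}

# 10 folds x 3 repeats = 30 train/evaluation splits per parameter set.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_dev, y_dev)

print(search.best_params_, search.best_score_)
# GridSearchCV refits the best pipeline on all of X_dev by default (refit=True).
print("held-out accuracy:", search.score(X_test, y_test))
```

The cross-validated score guides the parameter choice, while the held-out accuracy is the estimate to report for the final model.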

Regarding machine-learning - Parameter selection and k-fold cross-validation, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37063838/
