First, I split the dataset into training and test sets, for example:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=999)
Then I use GridSearchCV to run cross-validation and find the best-performing model:
validator = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="accuracy", cv=cv)
With this setup, I get the following:
A model is trained using k-1 of the folds as training data; the resulting model is validated on the remaining part of the data (scikit-learn.org)
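The snippets above leave clf, param_grid, and cv undefined. As a minimal runnable sketch of the same setup, assuming an SVC classifier, a small illustrative parameter grid, and 5 folds (all three are placeholders, not taken from the question):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=999
)

# Assumed placeholders: the question does not say what clf, param_grid, or cv are.
clf = svm.SVC()
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
cv = 5

validator = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="accuracy", cv=cv)

# Cross-validation only ever sees the training split; the test split stays untouched.
validator.fit(X_train, y_train)

print(validator.best_params_)            # best parameter combination found
print(validator.best_score_)             # mean cross-validated accuracy of that combination
print(validator.score(X_test, y_test))   # accuracy of the refit best model on the test split
```

Each of the 6 parameter combinations is scored on every held-out fold in turn, and the test split is used only once at the end.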
However, when reading about the Keras fit function, I found that the documentation introduces two more terms:
validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.
validation_data: tuple (x_val, y_val) or tuple (x_val, y_val, val_sample_weights) on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. validation_data will override validation_split.
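As I understand it, validation_split (which is overridden by validation_data) is used as a fixed validation set, whereas the held-out set in cross-validation changes at every cross-validation step.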
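To illustrate that contrast, here is a small sketch in plain Python (no Keras or scikit-learn). Both helper functions are hypothetical stand-ins written for this example: the first mimics the documented behavior that validation_split takes the last samples before shuffling, and the second is a simplified, unshuffled k-fold that assumes the sample count divides evenly by k.

```python
def keras_style_validation_split(n_samples, validation_split):
    """Return (train_idx, val_idx); the validation part is the tail of the data."""
    split_at = int(n_samples * (1.0 - validation_split))
    indices = list(range(n_samples))
    return indices[:split_at], indices[split_at:]


def k_fold_indices(n_samples, k):
    """Yield (train_idx, held_out_idx) for k contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k  # simplification: assumes n_samples is divisible by k
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        held_out = indices[start:stop]
        yield indices[:start] + indices[stop:], held_out


n = 10

# The same tail samples are held out for the entire training run.
_, fixed_val = keras_style_validation_split(n, 0.2)
print("validation_split holds out:", fixed_val)

# A different block of samples is held out in each fold.
for fold, (_, held_out) in enumerate(k_fold_indices(n, 5)):
    print("fold", fold, "holds out:", held_out)
```

With validation_split the held-out indices never change across epochs, while with k-fold every sample is held out exactly once.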
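- Question 1: Since I already do cross-validation, is it necessary to use validation_split or validation_data?
- Question 2: If it is not necessary, should I set validation_split and validation_data to 0 and None, respectively?
grid_result = validator.fit(train_images, train_labels, validation_data=None, validation_split=0)
- Question 3: If I do so, what will happen during training? Will Keras simply ignore the validation step?
- Question 4: Does validation_split belong to the k-1 folds or to the hold-out fold, or is it treated as a "test set" (as in the cross-validation case) that is never used to train the model?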
Best answer
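Validation is performed to check that the model does not overfit the dataset and that it generalizes to new data. Since the parameter grid search already performs validation, there is no need for the Keras model itself to run a validation step during training. So, to answer your questions: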
is it necessary to use validation_split or validation_data since I already do cross validation?
No, for the reason given above.
if it is not necessary, then should I set validation_split and validation_data to 0 and None, respectively?
No, there is no need to set them explicitly, because Keras performs no validation by default (that is, the fit() method defaults to validation_split=0.0, validation_data=None).
If I do so, what will happen during the training, would Keras just simply ignore the validation step?
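Yes, Keras will not perform validation while training the model. Note, however, that as mentioned above, the grid search procedure still performs validation to get a better estimate of the model's performance for a given set of parameters.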
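On where a non-zero validation_split would come from: the grid search hands only the k-1 training folds to fit() in each round, so any validation_split can only be carved out of those folds, not out of the hold-out fold or the test set. The sketch below works through the sample counts in plain Python. The function name, the 90 training samples (iris with test_size=0.4), cv=5, and the round-down rule for the split point are assumptions made for illustration.

```python
import math


def nested_sizes(n_train, k, validation_split):
    """Sample counts for one CV round when fit() also receives validation_split."""
    held_out = n_train // k                  # fold scored by the grid search
    passed_to_fit = n_train - held_out       # the k-1 folds handed to fit()
    # Assumed rounding: the split point is rounded down.
    fitted_on = math.floor(passed_to_fit * (1.0 - validation_split))
    keras_val = passed_to_fit - fitted_on    # tail of the k-1 folds, only monitored per epoch
    return {
        "held_out_fold": held_out,
        "passed_to_fit": passed_to_fit,
        "fitted_on": fitted_on,
        "keras_validation": keras_val,
    }


# Default behavior: no Keras validation, all k-1 folds are used for training.
print(nested_sizes(n_train=90, k=5, validation_split=0.0))

# Non-zero split: part of the k-1 folds is set aside and not trained on.
print(nested_sizes(n_train=90, k=5, validation_split=0.2))
```

Under these assumptions, a non-zero validation_split shrinks the data each candidate model is actually trained on, which is one more reason to leave it at its default when the grid search already validates.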
Regarding machine-learning - the difference between cross-validation and validation_data/validation_split in Keras, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53190016/