python - Difference between KFold on the one hand, and KFold with shuffle=True and RepeatedKFold on the other, in sklearn

Tags: python scikit-learn cross-validation

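I am comparing KFold and RepeatedKFold using sklearn version 0.22.
According to the documentation, RepeatedKFold "repeats K-Fold n times with different randomization in each repetition." One would expect the results of running RepeatedKFold with only one repeat (n_repeats = 1) to be nearly identical to those of KFold.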
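To make the quoted sentence concrete, here is a minimal sketch of how n_repeats affects the number of splits. The toy array X_demo and the variable names are made up for illustration and are not part of the original experiment.

```python
import numpy as np
from sklearn.model_selection import KFold, RepeatedKFold

# Hypothetical toy data: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)

kf_demo = KFold(n_splits=5)
rkf_one = RepeatedKFold(n_splits=5, n_repeats=1, random_state=0)
rkf_three = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

# Count how many (train, test) pairs each splitter yields.
n_kf = len(list(kf_demo.split(X_demo)))
n_rkf_one = len(list(rkf_one.split(X_demo)))
n_rkf_three = len(list(rkf_three.split(X_demo)))
print(n_kf, n_rkf_one, n_rkf_three)
```

With n_repeats = 1 the two splitters yield the same number of folds, so the count alone cannot account for a difference in scores.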

I ran a simple comparison:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold, RepeatedStratifiedKFold
from sklearn import metrics

X, y = load_digits(return_X_y=True)

classifier = SGDClassifier(loss='hinge', penalty='elasticnet',  fit_intercept=True)
scorer = metrics.accuracy_score
results = []
n_splits = 5
kf = KFold(n_splits=n_splits)
for train_index, test_index in kf.split(X, y):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]
    classifier.fit(x_train, y_train)
    results.append(scorer(y_test, classifier.predict(x_test)))
print('KFold')
print('mean = ', np.mean(results))
print('std = ', np.std(results))
print()

results = []
n_repeats = 1
rkf = RepeatedKFold(n_splits=n_splits, n_repeats = n_repeats)
for train_index, test_index in rkf.split(X, y):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]
    classifier.fit(x_train, y_train)
    results.append(scorer(y_test, classifier.predict(x_test)))
print('RepeatedKFold')
print('mean = ', np.mean(results))
print('std = ', np.std(results))

The output is
KFold
mean =  0.9082079851439182
std =  0.04697225962068869

RepeatedKFold
mean =  0.9493562364593006
std =  0.017732595698953055

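I repeated this experiment enough times to see that the difference is statistically significant.

I read and re-read the documentation to see whether I was missing something, but to no avail.

By the way, the same holds for StratifiedKFold and RepeatedStratifiedKFold: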
StratifiedKFold
mean =  0.9159935004642525
std =  0.026687786392525545

RepeatedStratifiedKFold
mean =  0.9560476632621479
std =  0.014405630805910506

For this dataset, StratifiedKFold agrees with KFold, and RepeatedStratifiedKFold agrees with RepeatedKFold.

UPDATE: Following the suggestions from @Dan and @SergeyBushmanov, I included shuffle and random_state:



def run_nfold(X,y, classifier, scorer, cv,  n_repeats):
    results = []
    for n in range(n_repeats):
        for train_index, test_index in cv.split(X, y):
            x_train, y_train = X[train_index], y[train_index]
            x_test, y_test = X[test_index], y[test_index]
            classifier.fit(x_train, y_train)
            results.append(scorer(y_test, classifier.predict(x_test)))    
    return results
kf = KFold(n_splits=n_splits)
results_kf = run_nfold(X,y, classifier, scorer, kf, 10)
print('KFold mean = ', np.mean(results_kf))

kf_shuffle = KFold(n_splits=n_splits, shuffle=True, random_state = 11)
results_kf_shuffle = run_nfold(X,y, classifier, scorer, kf_shuffle, 10)
print('KFold Shuffled mean = ', np.mean(results_kf_shuffle))

rkf = RepeatedKFold(n_splits=n_splits, n_repeats = n_repeats, random_state = 111)
results_kf_repeated = run_nfold(X,y, classifier, scorer, rkf, 10)
print('RepeatedKFold mean = ', np.mean(results_kf_repeated))

which produces
KFold mean =  0.9119255648406066
KFold Shuffled mean =  0.9505304859176724
RepeatedKFold mean =  0.950754100897555

In addition, using the Kolmogorov-Smirnov test:

from scipy.stats import ks_2samp

print('Compare KFold with KFold shuffled results')
print(ks_2samp(results_kf, results_kf_shuffle))
print('Compare RepeatedKFold with KFold shuffled results')
print(ks_2samp(results_kf_repeated, results_kf_shuffle))

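shows that shuffled KFold and RepeatedKFold (which appears to shuffle by default, you were right @Dan) are statistically the same, whereas the default non-shuffled KFold produces statistically significantly lower results: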
Compare KFold with KFold shuffled results
Ks_2sampResult(statistic=0.66, pvalue=1.3182765881237494e-10)

Compare RepeatedKFold with KFold shuffled results
Ks_2sampResult(statistic=0.14, pvalue=0.7166468440414822)

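Now, note that I used different random_state values for KFold and RepeatedKFold. So the answer, or rather a partial answer, is that the difference in results is due to shuffling versus not shuffling. This makes sense: using a different random_state can change the exact splits, but it should not change statistical properties such as the mean over multiple runs.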
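As a small illustration of what "shuffled versus non-shuffled" means for the splits themselves, the sketch below compares the test indices of the two KFold variants on a hypothetical 10-sample array (X_demo and is_contiguous are names invented here). It only shows how the folds differ; whether the row order of load_digits is what makes the unshuffled folds harder is an assumption that this snippet does not verify.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples, 1 feature.
X_demo = np.arange(10).reshape(-1, 1)

# Default KFold (shuffle=False): test folds are consecutive blocks of rows.
plain_tests = [test for _, test in KFold(n_splits=5).split(X_demo)]

# KFold with shuffle=True: rows are permuted before being cut into folds.
shuffled_tests = [
    test
    for _, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X_demo)
]

def is_contiguous(idx):
    """Return True if the indices form one unbroken run (e.g. 4, 5, 6)."""
    return bool(np.all(np.diff(idx) == 1))

print([t.tolist() for t in plain_tests])
print([is_contiguous(t) for t in plain_tests])
print([t.tolist() for t in shuffled_tests])
```

If the rows of a dataset are stored in some meaningful order, each unshuffled test fold is a block of neighbouring rows, while a shuffled fold draws from the whole dataset.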

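I am now puzzled as to why shuffling causes this effect. I have changed the title of the question to reflect this confusion (I hope it does not break any Stack Overflow rules, but I do not want to create another question).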

UPDATE: I agree with @SergeyBushmanov's suggestion. I posted it as a new question.

Best answer

To make RepeatedKFold results similar to KFold you have to:

import numpy as np
from sklearn.model_selection import KFold, RepeatedKFold

np.random.seed(42)
n = np.random.choice([0,1],10,p=[.5,.5])
kf = KFold(2,shuffle=True, random_state=42)
list(kf.split(n))
[(array([2, 3, 4, 6, 9]), array([0, 1, 5, 7, 8])),
 (array([0, 1, 5, 7, 8]), array([2, 3, 4, 6, 9]))]
kfr = RepeatedKFold(n_splits=2, n_repeats=1, random_state=42)
list(kfr.split(n))
[(array([2, 3, 4, 6, 9]), array([0, 1, 5, 7, 8])),
 (array([0, 1, 5, 7, 8]), array([2, 3, 4, 6, 9]))]
RepeatedKFold uses KFold to generate the folds; you only need to make sure both have the same random_state and that KFold is created with shuffle=True.
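The comparison above can also be checked programmatically instead of by eye. The following is a minimal sketch under the same setup (2 folds, 1 repeat, a shared integer seed); X_demo, seed and same are illustrative names, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import KFold, RepeatedKFold

# Hypothetical toy data: 10 samples, 1 feature.
X_demo = np.arange(10).reshape(-1, 1)
seed = 42

kf_splits = list(
    KFold(n_splits=2, shuffle=True, random_state=seed).split(X_demo)
)
rkf_splits = list(
    RepeatedKFold(n_splits=2, n_repeats=1, random_state=seed).split(X_demo)
)

# Compare train and test indices fold by fold.
same = len(kf_splits) == len(rkf_splits) and all(
    np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
    for (train_a, test_a), (train_b, test_b) in zip(kf_splits, rkf_splits)
)
print(same)
```

This equivalence is only expected for a single repeat; with n_repeats greater than 1, each further repetition continues from the same random stream and produces a different partition.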

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/60492704/
