python - scikit-learn: RidgeCV seems not to give the best option?

This is my X:

import numpy as np

X = np.array([[  5.,   8.,   3.,   4.,   0.,   5.,   4.,   0.,   2.,   5.,  11.,
              3.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   1.,   4.,   0.,   3.,   5.,  13.,
              4.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   4.,   4.,   0.,   3.,   5.,  12.,
              2.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   1.,   4.,   0.,   4.,   5.,  12.,
              4.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   1.,   4.,   0.,   3.,   5.,  12.,
              5.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   2.,   4.,   0.,   3.,   5.,  13.,
              3.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   2.,   4.,   0.,   4.,   5.,  11.,
              4.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   2.,   4.,   0.,   3.,   5.,  11.,
              5.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   1.,   4.,   0.,   3.,   5.,  12.,
              5.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   1.,   4.,   0.,   3.,   5.,  12.,
              5.,  19.,   2.]])

This is my response y:

y = np.array([ 70.14963195,  70.20937046,  70.20890363,  70.14310389,
        70.18076206,  70.13179977,  70.13536797,  70.10700998,
        70.09194074,  70.09958111])

Ridge regression

    from sklearn.linear_model import Ridge

    # alpha = 0.1
    model = Ridge(alpha = 0.1)
    model.fit(X,y)
    model.score(X,y)   # gives 0.36898424479816627

    # alpha = 0.01
    model1 = Ridge(alpha = 0.01)
    model1.fit(X,y)
    model1.score(X,y)     # gives 0.3690347045143918 > 0.36898424479816627

    # alpha = 0.001
    model2 = Ridge(alpha = 0.001)
    model2.fit(X,y)
    model2.score(X,y)  # gives 0.36903522192901728 > 0.3690347045143918

    # alpha = 0.0001
    model3 = Ridge(alpha = 0.0001)
    model3.fit(X,y)
    model3.score(X,y)  # gives 0.36903522711624259 > 0.36903522192901728

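So from this it should be clear that alpha = 0.0001 is the best option. Indeed, the documentation says that the score is the coefficient of determination, and the model whose coefficient is closest to 1 is the best. Now let's see what RidgeCV tells us.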

RidgeCV regression

from sklearn.linear_model import RidgeCV

modelCV = RidgeCV(alphas = [0.1, 0.01, 0.001,0.0001], store_cv_values = True)
modelCV.fit(X,y)
modelCV.alpha_  #giving 0.1
modelCV.score(X,y)  # giving 0.36898424479812919 which is the same score as ridge regression with alpha = 0.1

What went wrong? We can check manually, as I did, that all the other alphas are better. So not only did it fail to choose the best alpha, it chose the worst one!

Can someone explain to me what is going on here?

Best answer

This is perfectly normal behavior.

Your manual approach does no cross-validation at all, so the training data and the test data are the same!

# alpha = 0.1
model = Ridge(alpha = 0.1)
model.fit(X,y)   #!!
model.score(X,y) #!!

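Under some mild assumptions on the estimator (e.g. a convex optimization problem) and the solver (guaranteed epsilon-convergence), this means that the least-regularized model always gets the lowest training error, and therefore the highest training score (overfitting!). In your case that is alpha = 0.0001. (Have a look at the formula for ridge regression: the penalty term alpha * ||w||^2 is the only thing that keeps the fit away from the plain least-squares solution.)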
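To make the point about training scores concrete, here is a minimal sketch. It uses synthetic data (an assumption for illustration, not the X and y from the question) and simply scores each fitted model on the same data it was trained on. The names `X_demo`, `y_demo` and `train_scores` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data, assumed purely for illustration (not the question's data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=30)

alphas = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]
train_scores = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # R^2 computed on the very data the model was fitted on.
    train_scores.append(model.score(X_demo, y_demo))

for alpha, score in zip(alphas, train_scores):
    print(f"alpha={alpha:<8} training R^2={score:.6f}")
```

The training R^2 should never increase as alpha grows, which is why comparing training scores always favors the smallest alpha and says nothing about generalization.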

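With RidgeCV, however, cross-validation is active by default, and leave-one-out is the scheme that is selected. The scoring procedure used to determine the best parameter does not use the same data for training and testing.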
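The leave-one-out idea can be sketched by hand with `LeaveOneOut` and `cross_val_score`. This is only an illustration of the principle, not a reproduction of RidgeCV's internal (more efficient) computation, and its numbers may differ slightly from what RidgeCV stores. The helper `loo_mse_per_alpha` and the synthetic data are hypothetical; you can pass the question's X and y to the helper instead.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score


def loo_mse_per_alpha(X, y, alphas):
    """Return {alpha: mean leave-one-out squared error} for Ridge models."""
    results = {}
    for alpha in alphas:
        # Each fold trains on n-1 samples and tests on the single held-out one.
        # scikit-learn reports negated MSE (larger is better), so flip the sign.
        neg_mse = cross_val_score(
            Ridge(alpha=alpha), X, y,
            cv=LeaveOneOut(), scoring="neg_mean_squared_error",
        )
        results[alpha] = -neg_mse.mean()
    return results


# Small, noisy synthetic data set, assumed purely for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(12, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=2.0, size=12)

alphas = [0.1, 0.01, 0.001, 0.0001]
loo_mse = loo_mse_per_alpha(X_demo, y_demo, alphas)
best_alpha = min(loo_mse, key=loo_mse.get)  # lowest held-out error wins

for alpha, mse in loo_mse.items():
    print(f"alpha={alpha:<8} LOO MSE={mse:.6f}")
print("best alpha by LOO:", best_alpha)
```

Here the quantity being compared is the error on samples the model did not see during fitting, so the ranking of alphas can differ from the ranking by training score.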

When using store_cv_values = True, you can print the mean of cv_values_:

print(np.mean(modelCV.cv_values_, axis=0))
# [ 0.00226582  0.0022879   0.00229021  0.00229044]
# alpha [0.1, 0.01, 0.001,0.0001]
# by default: mean squared errors!
# left / 0.1 best; right / 0.0001 worst 
# this is only a demo: not sure how sklearn selects best (mean vs. ?)

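This is expected, but it is not a general rule. Since scoring now uses data that is distinct from the training data, you are optimizing against overfitting, and with high probability some regularization is needed!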
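As a rough illustration that the selected alpha depends on the data rather than following a fixed rule, the sketch below fits RidgeCV on two synthetic data sets (both assumptions made up for this example). The helper `chosen_alpha` is hypothetical, and no particular outcome is claimed; which alpha wins will vary with the data and the random seed.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

alphas = [0.0001, 0.001, 0.01, 0.1]
rng = np.random.default_rng(0)


def chosen_alpha(n_samples, n_features, noise_scale):
    """Fit RidgeCV on synthetic data and return the alpha it selects."""
    X_demo = rng.normal(size=(n_samples, n_features))
    true_coef = rng.normal(size=n_features)
    y_demo = X_demo @ true_coef + rng.normal(scale=noise_scale, size=n_samples)
    # With the default cv=None, RidgeCV uses its leave-one-out scheme.
    return RidgeCV(alphas=alphas).fit(X_demo, y_demo).alpha_


alpha_clean = chosen_alpha(n_samples=200, n_features=5, noise_scale=0.01)
alpha_noisy = chosen_alpha(n_samples=12, n_features=8, noise_scale=2.0)
print("large, low-noise data:", alpha_clean)
print("small, noisy data:    ", alpha_noisy)
```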

Regarding python - scikit-learn: RidgeCV seems not to give the best option?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46193900/
