I happened to notice that the OLS implementations in sklearn and statsmodels produce different R^2 values when no intercept is fitted. Otherwise they seem to work the same. The following code:
import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm
np.random.seed(42)
N=1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)
sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))
statsmodelsNoIntercept = sm.OLS(Y, X)
print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)
print(sklearn.__version__, statsmodels.__version__)
prints:
0.78741906105 0.78741906105
-0.950825182861 0.783154483028
0.19.1 0.8.0
Where does the difference come from?
This question is not a duplicate of Different Linear Regression Coefficients with statsmodels and sklearn, because sklearn.linear_model.LinearRegression (with intercept) works fine on the X prepared for statsmodels.api.OLS.
Nor is it a duplicate of Statsmodels: Calculate fitted values and R squared, because this question addresses a difference between two Python packages (statsmodels and scikit-learn), whereas the linked question is about statsmodels versus the common definition of R^2. Both happen to get the same answer, but that has already been discussed here: Does the same answer imply that the questions should be closed as duplicate?
Accepted answer
As @user333700 pointed out in the comments, the definition of R^2 used by the statsmodels OLS implementation differs from the one used by scikit-learn.
From the documentation of the RegressionResults class (emphasis mine):
rsquared
R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
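So when the constant is omitted, statsmodels divides the residual sum of squares by the uncentered total sum of squares, sum(y**2), rather than by sum((y - mean(y))**2). A minimal sketch with plain NumPy (fitting the no-intercept model via least squares, which should match statsmodels' OLS fit) reproduces the 0.783154483028 figure from the question:

```python
import numpy as np

np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# OLS without an intercept, solved directly by least squares
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta
ssr = (resid ** 2).sum()

# statsmodels with no constant: uncentered TSS = sum(y^2)
uncentered_tss = (Y ** 2).sum()
r2_uncentered = 1 - ssr / uncentered_tss
print(r2_uncentered)  # matches statsmodelsNoIntercept.fit().rsquared above
```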
From the documentation of LinearRegression.score():
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
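In other words, scikit-learn always centers the total sum of squares around mean(y), regardless of whether an intercept was fitted. For the no-intercept model that centered denominator is much smaller here, which is why the score goes negative. A sketch under the same setup as above reproduces the -0.950825182861 figure:

```python
import numpy as np

np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# Same no-intercept OLS fit, solved directly by least squares
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta
ssr = (resid ** 2).sum()

# sklearn's score() always uses the centered TSS
centered_tss = ((Y - Y.mean()) ** 2).sum()
r2_centered = 1 - ssr / centered_tss
print(r2_centered)  # matches sklearnNoIntercept.score(X, Y) above
```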
Regarding "python - Why do the `sklearn` and `statsmodels` implementations of OLS regression give different R^2 values?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48832925/