I happened to notice that the OLS implementations in sklearn and statsmodels produce different R^2 values when no intercept is fitted. Otherwise they seem to work the same. The following code:
import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm
np.random.seed(42)
N=1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)
sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))
statsmodelsNoIntercept = sm.OLS(Y, X)
print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)
print(sklearn.__version__, statsmodels.__version__)
prints:
0.78741906105 0.78741906105
-0.950825182861 0.783154483028
0.19.1 0.8.0
Where does the difference come from?
This question is not a duplicate of Different Linear Regression Coefficients with statsmodels and sklearn, because sklearn.linear_model.LinearRegression (with intercept) works fine on the X prepared for statsmodels.api.OLS.
Nor is it a duplicate of Statsmodels: Calculate fitted values and R squared, because this question addresses a difference between two Python packages (statsmodels and scikit-learn), whereas the linked question is about statsmodels versus the common definition of R^2. Both happen to get the same answer, but that has already been discussed here: Does the same answer imply that the questions should be closed as duplicate?
Accepted answer
As @user333700 pointed out in the comments, the definition of R^2 used by the statsmodels OLS implementation differs from the one used by scikit-learn.
From the documentation of the RegressionResults class (emphasis mine):
rsquared
R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
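So when the constant is omitted, statsmodels divides the residual sum of squares by the uncentered total sum of squares, sum(y**2), rather than by sum((y - mean(y))**2). A minimal sketch with plain NumPy (fitting the no-intercept model via least squares, which should match statsmodels' OLS fit) reproduces the 0.783154483028 figure from the question:

```python
import numpy as np

np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# OLS without an intercept, solved directly by least squares
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta
ssr = (resid ** 2).sum()

# statsmodels with no constant: uncentered TSS = sum(y^2)
uncentered_tss = (Y ** 2).sum()
r2_uncentered = 1 - ssr / uncentered_tss
print(r2_uncentered)  # matches statsmodelsNoIntercept.fit().rsquared above
```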
From the documentation of LinearRegression.score():
score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
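In other words, scikit-learn always centers the total sum of squares around mean(y), regardless of whether an intercept was fitted. For the no-intercept model that centered denominator is much smaller here, which is why the score goes negative. A sketch under the same setup as above reproduces the -0.950825182861 figure:

```python
import numpy as np

np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# Same no-intercept OLS fit, solved directly by least squares
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta
ssr = (resid ** 2).sum()

# sklearn's score() always uses the centered TSS
centered_tss = ((Y - Y.mean()) ** 2).sum()
r2_centered = 1 - ssr / centered_tss
print(r2_centered)  # matches sklearnNoIntercept.score(X, Y) above
```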
Regarding "python - Why do the `sklearn` and `statsmodels` implementations of OLS regression give different R^2 values?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48832925/