python - Linear regression not working as I expected

Tags: python machine-learning scikit-learn linear-regression

Training this model in a 200K-iteration for loop, I can get an accuracy of 0.97 (which I guess means 97%?), and I save it to a .pickle file. The problem is that it doesn't look like it is actually learning, because even without training the model I get the same kind of results, with an accuracy of 70-90%. Well, if the accuracy got higher I would say it is learning, but as I said, the results don't change.

Anyway, even with an accuracy of 70-97%, it only gives correct results for about 20-45% of all the data. As you can see, I'm new to this, and I'm following this tutorial: https://www.youtube.com/watch?v=3AQ_74xrch8

Here is the code:

import pandas as pd
import numpy as np
import pickle
import sklearn.model_selection  # needed for sklearn.model_selection.train_test_split below
from sklearn import linear_model

data = pd.read_csv('student-mat.csv', sep=';')
data = data[['G1', 'G2', 'G3', 'studytime', 'failures', 'absences']]

predict = 'G3'

X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)

# comment out this block after training the model #
best_accuracy = 0
array_best_accuracy = []
for _ in range(200000):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)

    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    accuracy = linear.score(x_test, y_test)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        array_best_accuracy.append(best_accuracy)
        with open('student_model.pickle', 'wb') as f:
            pickle.dump(linear, f)

print(max(array_best_accuracy), '\n')
# #

# uncomment after training the model
# pickle_in = open('student_model.pickle', 'rb')
# linear = pickle.load(pickle_in)

print('Coefficient:\n', linear.coef_)
print('Intercept:\n', linear.intercept_, '\n')

predictions = linear.predict(x_test)

total = len(predictions)
correct_predictions = []

for x in range(total):
    print('Predict', predictions[x], '- Correct', y_test[x])

    if int(predictions[x]) == y_test[x]:
        correct_predictions.append(1)

print('\n')
print('Total:', total)
print('Total correct predicts:', len(correct_predictions))

Output:

0.977506233512022 

Coefficient:
 [ 0.14553549  0.98120042 -0.18857019 -0.31539844  0.03324807]
Intercept:
 -1.3929098924365348 

Predict 9.339230104273398 - Correct 9
Predict -1.7999979510132014 - Correct 0
Predict 18.220125096856393 - Correct 18
Predict 3.5669380684894634 - Correct 0
Predict 8.394034346453692 - Correct 10
Predict 11.17472103817094 - Correct 12
Predict 6.877027043616517 - Correct 7
Predict 13.10046638328761 - Correct 14
Predict 8.460530481589299 - Correct 9
Predict 5.619296478409708 - Correct 9
Predict 5.056861318329287 - Correct 6
Predict -0.4602308511632893 - Correct 0
Predict 5.4907111970972124 - Correct 7
Predict 7.098301508597935 - Correct 0
Predict 9.060702343692888 - Correct 11
Predict 14.906413508421672 - Correct 16
Predict 5.337146104521532 - Correct 7
Predict 6.451206767114973 - Correct 6
Predict 12.005846951225159 - Correct 14
Predict 9.181910373164804 - Correct 0
Predict 7.078728252841696 - Correct 8
Predict 12.944012673326714 - Correct 13
Predict 9.296195408827478 - Correct 10
Predict 9.726422674287734 - Correct 10
Predict 5.872952989811228 - Correct 6
Predict 11.714775970606564 - Correct 12
Predict 10.699461464343582 - Correct 11
Predict 8.079501926145412 - Correct 8
Predict 17.050354493553698 - Correct 17
Predict 11.950269035741151 - Correct 12
Predict 11.907234340295231 - Correct 12
Predict 8.394034346453692 - Correct 8
Predict 9.563804949756388 - Correct 10
Predict 15.08795365845874 - Correct 15
Predict 15.197484489040267 - Correct 14
Predict 9.339230104273398 - Correct 10
Predict 6.72710996076076 - Correct 8
Predict 15.778083095387622 - Correct 16
Predict 8.238497037369088 - Correct 9
Predict 11.357208854852361 - Correct 12


Total: 40
Total correct predicts: 8

I know the predictions are floats, but even if I round them up or down I still don't get the result I expected. I know my code is very simplistic, but even if I also count a prediction as correct when it equals (the desired value - 1), with the output above that gives me 27 correct predictions, about 60% of the total. Isn't that too low? I was expecting around 70-80%.
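For reference, here is a minimal sketch of how that looser count could be computed, assuming the predictions and y_test arrays produced by the script above (within_one is a name introduced here for illustration):

import numpy as np

# Count predictions that are at most one grade point away from the true G3 value.
within_one = np.sum(np.abs(predictions - y_test) <= 1)
print('Within +/- 1 grade:', int(within_one), 'of', len(y_test))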

My main doubt is why I get only about 20-45% correct results even though the accuracy is 70-97%. Maybe I have misunderstood how it works; can someone clarify?

The dataset I'm using: https://archive.ics.uci.edu/ml/datasets/Student+Performance

Best Answer

There are several issues with your question.

To start with, in a regression setting (such as yours here) we don't use the terms "precision" and "accuracy"; those terms are reserved for classification problems, where they have very specific meanings and are far from being synonyms.

Having said that, your next step is to work out for yourself what your metric actually is, i.e. what exactly your linear.score(x_test, y_test) returns. Here, as in many other similar settings, the documentation is your best friend:

score(self, X, y, sample_weight=None)

Returns the coefficient of determination R^2 of the prediction.

So, your metric is the coefficient of determination, R^2 (R-squared).
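As a quick sanity check (a minimal sketch, assuming the fitted linear model and the x_test/y_test split from your code), score returns exactly the same value as sklearn.metrics.r2_score:

from sklearn.metrics import r2_score

# Both lines print the coefficient of determination R^2 on the test set.
print(linear.score(x_test, y_test))
print(r2_score(y_test, linear.predict(x_test)))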

Although an R^2 value of 0.97 sounds quite good (and it can sometimes be interpreted as 97%, this does not mean "97% correct predictions"), using this metric in a predictive setting, as you do here, is quite problematic; quoting from my own answer in another SO thread:

the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where the R-squared is used for any kind of performance assessment; neither it's an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it. And, as noted in the Github thread above (emphasis added):

In particular when using a test set, it's a bit unclear to me what the R^2 means.

I certainly concur.

So, you are better off using one of the standard metrics for predictive regression problems, such as the Mean Squared Error (MSE) or the Mean Absolute Error (MAE); the latter has the advantage of being in the same units as your dependent variable. Since both quantities are errors, lower means better. Have a look at the regression metrics available in scikit-learn and how to use them.
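For example, a minimal sketch, again assuming the fitted linear model and the test split from your code:

from sklearn.metrics import mean_absolute_error, mean_squared_error

predictions = linear.predict(x_test)

# Both metrics are errors, so lower is better; the MAE is expressed in
# grade points (the units of G3), which makes it easy to interpret.
print('MSE:', mean_squared_error(y_test, predictions))
print('MAE:', mean_absolute_error(y_test, predictions))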

Last but not least, and independently of the discussion above, I can't see how you actually arrive at your assessment of the results:

Total: 40
Total correct predicts: 8

since, if we apply the usual rounding rule (i.e. 15.49 rounds to 15, but 15.51 rounds to 16), I find that roughly half of your predictions are indeed "correct"...
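A minimal sketch of that check, assuming the predictions and y_test arrays from your script:

import numpy as np

# Round each prediction to the nearest integer grade and count exact matches.
rounded = np.rint(predictions)
print('Correct after rounding:', int(np.sum(rounded == y_test)), 'of', len(y_test))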

Original question on Stack Overflow: https://stackoverflow.com/questions/58135795/
