machine-learning - How to use my own dataset in sklearn - Python3

Closed. This question needs to be more focused. It is not currently accepting answers.

Closed 5 years ago.

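I am new to machine learning and sklearn, so I have the following questions:

I am trying to do a linear regression, but I want to use my own data from a .txt file. My data is a table with 3 columns.

I would like to know how to adapt the example code from http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

I made some changes to the code from that example and invented some data. Is this the right approach, using some X and Y like this? I would also like to know how the expression x_train = x[:2] works: the [:2] clearly has some effect on my program, but I do not really understand this part.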

from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

#X has to be numpy array not list.

x=([0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10])
y=[5,3,8,3,4,5,5,7,8,9,10]

x_train = x [:2]
x_test = x [2:]

y_train = y[:2]
y_test = y[2:]

regr = linear_model.LinearRegression()
regr.fit (x_train,y_train)

y_pred = regr.predict(x_test)

#coefficient
print('Coefficients: \n', regr.coef_)

#the mean square error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

plt.scatter(x_test, y_test,  color='black')
plt.plot(x_test, y_pred, color='blue', linewidth=3)
plt.axis([0, 20, 0, 20])
plt.show()

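Thanks a lot!

Edit 1

With the help I received on this page, I tried to write some code that fits my own data, but I cannot get a correct fit. If anyone has time to give me a bit more information, or to tell me whether I am doing something wrong, I would appreciate it.

The code I am using, based on the help I received: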

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

data = pd.read_csv('data.txt')
#x = data[['col1','col2']]
x = data[['col1']]
y = data['col3']

#convert to array to fit the model
x=np.asarray(x)
y=np.asarray(y)

# define the KFolds 
kf = KFold(n_splits=2)

#define the model
regr = linear_model.LinearRegression()

# use cross validation and return the r2 score for each Fold 
#if you want to return other scores than r2, just change the scoring in cross_val_score
scores = cross_val_score(regr, x, y, cv= kf, scoring= 'r2')

print(scores)

for train_index, test_index in kf.split(x):
  print("TRAIN:", train_index, "TEST:", test_index)
  X_train, X_test = x[train_index], x[test_index]
  y_train, y_test = y[train_index], y[test_index]


plt.scatter (X_test, y_test)
plt.show()

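Here is a picture of what my data looks like and of what I get from the training and test samples.

I then wrote a fitting step, but I am not sure whether it is correct: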

regr.fit (X_train, y_train)
y_pred = regr.predict(X_test)
print(y_pred)
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.show()

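The fit I get looks completely wrong.

I do not understand why I get this, since the fit works when I do it with MINUIT. Any hints would help.

Why does the program apparently not use my data in "y" for the training or test samples?

My data is available here: https://www.dropbox.com/sh/nbbsc0fqznkwxvt/AAD-u6lM4orJOGrgIyz0o8B9a?dl=0

The only columns that matter to me are col1 and col3; col2 should be ignored. I want to fit these data and extract the fitted values. I know that a line fits this data.

Thanks!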

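Best answer

First, the main reason to split the data, using one part to train the model and the other part to evaluate it, is to avoid overfitting. Typically we use KFold or LOO (leave-one-out) for cross-validation.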
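As a minimal sketch of the simplest kind of split (a single hold-out split rather than cross-validation), the invented x and y from the question can be divided with train_test_split from sklearn.model_selection. It also shows what the slice x[:2] in the question does: it keeps only the first two samples for training, so the line is fitted to just two points. The values test_size=0.3 and random_state=0 are arbitrary choices made for this illustration, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Invented data from the question: 11 samples, 1 feature
x = np.arange(11).reshape(-1, 1)
y = np.array([5, 3, 8, 3, 4, 5, 5, 7, 8, 9, 10])

# x[:2] keeps only the first two rows and x[2:] keeps the rest,
# so training on x[:2] means fitting a line through two points.
print(len(x[:2]), len(x[2:]))

# Random hold-out split; test_size and random_state are illustrative choices
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

regr = LinearRegression().fit(x_train, y_train)
print("slope:", regr.coef_[0], "intercept:", regr.intercept_)
print("R^2 on held-out samples:", regr.score(x_test, y_test))
```

With only 11 samples the held-out score depends strongly on which samples end up in the test set, which is one reason cross-validation is often preferred for small data sets.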

The following example uses 30 samples and 3 variables, with KFold for cross-validation.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn import linear_model

#create artificial data with 30 lines (samples) and 3 columns (variables)
x = np.random.rand(30,3)

#create the target variable y
y = range(30)

# convert the list to numpy array (this is needed for fit method of sklearn)
y = np.asarray(y)

# define the KFolds (3 folds in this example)
kf = KFold(n_splits=3)

#define the model
regr = linear_model.LinearRegression()

# use cross validation and return the r2 score for each Fold (here we have 3). 
#if you want to return other scores than r2, just change the scoring in cross_val_score.
scores = cross_val_score(regr, x, y, cv= kf, scoring= 'r2')

print(scores)

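Result:

Here you can see the r2 score of the model for each fold. The data are split 3 times, and 3 different training sets are used to obtain these values. sklearn does this automatically inside cross_val_score.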

 array([-30.36184326,  -0.4149778 , -28.89110233])
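To make the "automatic" part more concrete, here is a hedged sketch of roughly what happens for each fold: a model is fitted on the training indices and scored with r2_score on the test indices. The fixed seed and the names manual_scores and auto_scores are assumptions introduced for this illustration, so the numbers will differ from the output shown here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Same kind of artificial data as in the answer; the seed is an
# assumption added so that the run is repeatable.
rng = np.random.RandomState(0)
x = rng.rand(30, 3)
y = np.arange(30)

kf = KFold(n_splits=3)

# Manual loop: fit on the training part of each fold, score on the test part
manual_scores = []
for train_index, test_index in kf.split(x):
    model = LinearRegression().fit(x[train_index], y[train_index])
    y_pred = model.predict(x[test_index])
    manual_scores.append(r2_score(y[test_index], y_pred))

# The same thing done by cross_val_score
auto_scores = cross_val_score(LinearRegression(), x, y, cv=kf, scoring="r2")

print(manual_scores)
print(auto_scores)
print(np.allclose(manual_scores, auto_scores))
```

Strongly negative r2 values are to be expected with this artificial data: x is random noise unrelated to y, and because y is ordered, each test fold covers a range of y that the model never saw during training.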

To understand what KFold does, you can print the training and test indices as follows:

for train_index, test_index in kf.split(x):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = x[train_index], x[test_index]
   y_train, y_test = y[train_index], y[test_index]

Result:

('TRAIN:', array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
   27, 28, 29]), 'TEST:', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
('TRAIN:', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 20, 21, 22, 23, 24, 25, 26,
   27, 28, 29]), 'TEST:', array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]))
('TRAIN:', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
   17, 18, 19]), 'TEST:', array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29]))

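Now you can see that for the first fold we trained on the following samples: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29.

Next, for the second fold we trained on samples: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29.

Note: these numbers are indices into the x data. For example, 2 means the 3rd sample (row), because in Python we count from 0. As you can see, we do not use exactly the same data (samples) in each fold.

Hope this helps.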

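Edit 1

To answer your question about loading txt data: suppose you have a txt file with 3 columns. The first 2 columns are the features and the last column is y (the target).

In that case, you can do the following with pandas: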

import pandas as pd
import numpy as np

data = pd.read_csv('data.txt')
x = data[['col1','col2']]
y = data['col3']

#convert to array to fit the model
x=np.asarray(x)
y=np.asarray(y)

The text file is here: https://ufile.io/eb5xl (choose slow download).
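pd.read_csv assumes a comma-separated file with a header row by default. If a .txt file is instead whitespace-separated and has no header, the call needs a few extra arguments. The sketch below assumes such a layout; the sample contents are hypothetical and held in memory so that the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical file contents: three whitespace-separated columns, no header row
sample = io.StringIO(
    "1.0 0.5 2.1\n"
    "2.0 0.7 3.9\n"
    "3.0 0.2 6.2\n"
)

# sep=r"\s+" splits on any run of whitespace; header=None plus names=...
# supplies the column names that the file itself does not contain
data = pd.read_csv(sample, sep=r"\s+", header=None, names=["col1", "col2", "col3"])

# Double brackets keep x two-dimensional, which sklearn expects for features
x = data[["col1"]].to_numpy()  # shape (n_samples, 1)
y = data["col3"].to_numpy()    # shape (n_samples,)

print(x.shape, y.shape)
```

For a real file, replace the in-memory sample with the file path, for example 'data.txt', and adjust sep, header and names to match how the file is actually laid out.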

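Edit 2

This is for visualization purposes only. I do not split the data. I use all of the data to fit the model, then predict on the same data, and then plot the predicted values.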

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plt

data = pd.read_csv('data.txt')

x = data[['col1']]
y = data['col3']

#convert to array to fit the model
x=np.asarray(x)
y=np.asarray(y)

regr = linear_model.LinearRegression()
regr.fit(x, y)

y_predicted = regr.predict(x)

plt.scatter(x, y,  color='black')
plt.plot(x, y_predicted, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

Result:

The data do not seem to follow a linear pattern. A different model should be used (for example, an exponential fit).
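One hedged way to try an exponential fit while still using LinearRegression is to fit a line to ln(y): if y = a * exp(b * x), then ln(y) = ln(a) + b * x is linear in x. The sketch below uses synthetic, noise-free data invented for illustration (not the asker's file) and assumes every y value is positive. It also shows how to read the fitted parameters from coef_ and intercept_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2 * exp(0.5 * x), invented for illustration
x = np.linspace(0, 5, 20).reshape(-1, 1)
y = 2.0 * np.exp(0.5 * x.ravel())

# Fit a straight line to ln(y); this requires all y values to be positive
regr = LinearRegression().fit(x, np.log(y))

# Recover the parameters of y = a * exp(b * x)
b = regr.coef_[0]
a = np.exp(regr.intercept_)
print("a = %.3f, b = %.3f" % (a, b))

# Fitted curve on the original scale, e.g. for plotting against the data
y_fit = a * np.exp(b * x.ravel())
```

Fitting in log space weights the errors differently from a direct least-squares fit on y, so with noisy data the result may differ from what a nonlinear fitter such as scipy.optimize.curve_fit or MINUIT returns.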

Regarding machine-learning - How to use my own dataset in sklearn - Python3, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47352526/
