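Python - Running into x_test y_test fit errors

Tags: python arrays pandas scikit-learn

I have built a neural network that works well on a small dataset of about 300,000 rows with 2 categorical variables and 1 dependent variable, but I ran into memory errors when I increased it to 6.5 million rows. So I decided to modify the code, and I am getting closer, but now I am running into fitting errors. I have 2 categorical variables and one column for the dependent variable of 1s and 0s (suspicious or not suspicious). The starting dataset looks like this: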

DBF2
   ParentProcess                   ChildProcess               Suspicious
0  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
1  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
2  C:\Windows\System32\svchost.exe                      ...            1
3  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
4  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
5  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0

My code follows, together with the errors it produces:

import pandas as pd
import numpy as np
import hashlib
import matplotlib.pyplot as plt
import timeit

X = DBF2.iloc[:, 0:2].values
y = DBF2.iloc[:, 2].values#.ravel()

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

onehotencoder = OneHotEncoder(categorical_features = [0,1])
X = onehotencoder.fit_transform(X)

index_to_drop = [0, 2039]
to_keep = list(set(xrange(X.shape[1]))-set(index_to_drop))
X = X[:,to_keep]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)

#ERROR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 517, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/data.py", line 590, in fit
    return self.partial_fit(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/data.py", line 621, in partial_fit
    "Cannot center sparse matrices: pass `with_mean=False` "
ValueError: Cannot center sparse matrices: pass `with_mean=False` instead. See docstring for motivation and alternatives.

X_test = sc.transform(X_test)

#ERROR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/data.py", line 677, in transform
    check_is_fitted(self, 'scale_')
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 768, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

In case it helps, this is what I get when I print X_train and y_train:

X_train
<5621203x7043 sparse matrix of type '<type 'numpy.float64'>'
with 11242334 stored elements in Compressed Sparse Row format>

y_train
array([0, 0, 0, ..., 0, 0, 0])

Best answer

X_train is a sparse matrix, which is very useful when you work with a large dataset, as in your case. The problem is that, as the documentation explains:

with_mean : boolean, True by default

If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

You can try passing with_mean=False:

sc = StandardScaler(with_mean=False)
X_train = sc.fit_transform(X_train)
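
As a minimal sketch of the behaviour described above, the snippet below uses a small random binary matrix as a stand-in for the one-hot encoded data (the toy data and the names `dense`, `centering_failed` and `X_scaled` are assumptions for illustration, not part of the original question). It shows that the default scaler rejects sparse input, while `with_mean=False` scales each column and keeps the result sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the one-hot encoded matrix.
rng = np.random.RandomState(0)
dense = rng.randint(0, 2, size=(100, 8)).astype(float)
X = sparse.csr_matrix(dense)

# The default scaler tries to center the data, which is rejected for sparse input.
try:
    StandardScaler().fit_transform(X)
    centering_failed = False
except ValueError:
    centering_failed = True

# Scaling without centering only divides each column by its standard deviation,
# so zeros stay zeros and the matrix stays sparse.
sc = StandardScaler(with_mean=False)
X_scaled = sc.fit_transform(X)
```

Note that with `with_mean=False` the columns are scaled to unit variance but not shifted to zero mean, which is what allows the sparse structure to be preserved.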

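The following line from your code failed because sc was still an unfitted StandardScaler object (the earlier fit_transform call had raised an error before anything was fitted):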

X_test = sc.transform(X_test)

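To be able to use the transform method, you first have to fit the StandardScaler to a dataset. If your intention is to fit the StandardScaler on your training set and use it to transform both the training set and the test set into the same space, you can proceed as follows: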

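sc = StandardScaler(with_mean=False)
# fit() learns the per-column scale from the training set and returns the scaler itself
sc.fit(X_train)
# apply the same fitted scaler to both sets
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)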
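
For a self-contained illustration of fitting on the training set and reusing that scaler on the test set, here is a hedged sketch on made-up data (the random matrix, the labels and the `*_scaled` names are assumptions for illustration only). It checks nothing about the original 6.5-million-row dataset; it only shows the order of the calls.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: a sparse binary feature matrix and 0/1 labels.
rng = np.random.RandomState(0)
X = sparse.csr_matrix(rng.randint(0, 2, size=(200, 6)).astype(float))
y = rng.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Learn the per-column scale from the training rows only.
sc = StandardScaler(with_mean=False)
sc.fit(X_train)

# Apply the same learned scale to both sets so they share one feature space.
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)
```

Because the scaler is fitted only on the training rows, the test set is divided by the training set's standard deviations (`sc.scale_`) rather than its own, so no information from the test set influences the preprocessing.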

Regarding "Python - Running into x_test y_test fit errors", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52008548/
