python - scikit-learn GaussianProcessClassifier raises MemoryError when calling fit()

Tags: python pandas scikit-learn classification sklearn-pandas

I have X_train and y_train as two numpy.ndarrays of shape (32561, 108) and (32561,), respectively.

Every time I call fit on my GaussianProcessClassifier, I get a MemoryError.

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcessClassifier
>>> from sklearn.gaussian_process.kernels import RBF
>>> X_train.shape
(32561, 108)
>>> y_train.shape
(32561,)
>>> gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
>>> gp_opt.fit(X_train,y_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 613, in fit
    self.base_estimator_.fit(X, y)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 209, in fit
    self.kernel_.bounds)]
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 427, in _constrained_optimization
    fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 199, in fmin_l_bfgs_b
    **opts)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 335, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 285, in func_and_grad
    f = fun(x, *args)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
    return function(*(wrapper_args + args))
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
    fg = self.fun(x, *args)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 201, in obj_func
    theta, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 338, in log_marginal_likelihood
    K, K_gradient = kernel(self.X_train_, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 753, in __call__
    K1, K1_gradient = self.k1(X, Y, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 1002, in __call__
    K = self.constant_value * np.ones((X.shape[0], Y.shape[0]))
  File "/home/retsim/.local/lib/python2.7/site-packages/numpy/core/numeric.py", line 188, in ones
    a = empty(shape, dtype, order)
MemoryError
>>> 

Why does this error occur, and how can I fix it?

Best answer

According to the Scikit-Learn documentation, the estimator GaussianProcessClassifier (like GaussianProcessRegressor) has a parameter copy_X_train, which defaults to True:

class sklearn.gaussian_process.GaussianProcessClassifier(kernel=None, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=1)

The description of the copy_X_train parameter states:

If True, a persistent copy of the training data is stored in the object. Otherwise, just a reference to the training data is stored, which might cause predictions to change if the data is modified externally.

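I tried fitting the estimator on a PC with 32 GB of RAM, using a training dataset similar in size (observations and features) to the one the OP mentions. With copy_X_train set to True, the "persistent copy of the training data" may have exhausted my RAM and caused the MemoryError. Setting this parameter to False solved the problem.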
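A minimal sketch of that setting, run on a small synthetic dataset so it finishes quickly. The random data, its size, and the names rng, X_small, y_small and gp_ref are assumptions for illustration and are not taken from the question. The final check relies on the fitted attributes base_estimator_ and X_train_, which hold the binary estimator and its training data in the scikit-learn versions I am aware of.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Small synthetic stand-in for the real data (assumption: float64 features,
# binary labels; sizes are tiny so the fit runs in about a second).
rng = np.random.RandomState(0)
X_small = rng.rand(200, 5)
y_small = (X_small[:, 0] > 0.5).astype(int)

gp_ref = GaussianProcessClassifier(
    kernel=1.0 * RBF(length_scale=1.0),
    copy_X_train=False,  # store a reference instead of a persistent copy
)
gp_ref.fit(X_small, y_small)

# With copy_X_train=False the fitted estimator should point at the caller's
# array rather than at a private copy of it.
shares = np.shares_memory(gp_ref.base_estimator_.X_train_, X_small)
print("estimator shares memory with X_small:", shares)

# Predictions still work as usual.
preds = gp_ref.predict(X_small[:3])
print(preds)
```

Because only a reference is kept, X_small should not be modified in place between fit and predict.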

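Scikit-Learn's description states that with this setting "just a reference to the training data is stored, which might cause predictions to change if the data is modified externally". My interpretation of this statement is: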

Instead of storing the whole training dataset (in the form of a matrix of size n x n based on n observations) in the fitted estimator, only a reference to this dataset is stored, hence avoiding the high RAM usage. As long as the dataset stays intact externally (i.e., not within the fitted estimator), it can be reliably fetched when a prediction has to be made. Modification of the dataset affects the predictions.

There may be better interpretations and theoretical explanations.
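As one possible supplement, here is a back-of-the-envelope estimate of the array sizes involved, using the shapes from the question. It assumes float64 storage (8 bytes per value) and ignores temporaries. The traceback ends in np.ones((X.shape[0], Y.shape[0])), which is an n x n allocation, so the sketch compares that kernel matrix with the training data itself. The helper to_gib and the variable names are hypothetical.

```python
# Rough memory estimate from the shapes in the question.
n_samples, n_features = 32561, 108
bytes_per_float64 = 8  # assumption: all arrays are float64


def to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# The training data itself, i.e. what copy_X_train duplicates.
x_train_bytes = n_samples * n_features * bytes_per_float64

# One n x n kernel matrix, as allocated in the failing np.ones call.
kernel_bytes = n_samples * n_samples * bytes_per_float64

# With eval_gradient=True the kernel also returns a gradient of shape
# (n, n, n_hyperparameters); 1.0 * RBF(length_scale=1.0) has two of them.
n_hyperparams = 2
gradient_bytes = kernel_bytes * n_hyperparams

print("X_train:         %.3f GiB" % to_gib(x_train_bytes))
print("kernel matrix:   %.3f GiB" % to_gib(kernel_bytes))
print("kernel gradient: %.3f GiB" % to_gib(gradient_bytes))
```

Under these assumptions the training data is tens of megabytes, while a single kernel matrix is several gigabytes, so the n x n arrays built during fitting are likely to dominate memory use whatever copy_X_train is set to.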

This question about the scikit-learn GaussianProcessClassifier MemoryError when calling fit() comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/49524761/
