python - sklearn GridSearchCV : ValueError: X has 21 features per sample; expecting 19

标签 python scikit-learn

我尝试在 sklearn 中运行 GridSearchCV 进行逻辑回归,但代码给出以下错误:

ValueError: X has 21 features per sample; expecting 19  

训练和测试数据的形状为

X_train.shape
(891L, 21L)  

X_test.shape
(418L, 21L)

我用来运行 GridSearchCV 的代码是

from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
logistic = LogisticRegression()

parameters = [{'C' : [1.0, 10.0, 100.0, 1000.0],
               'fit_intercept' : ['True', 'False'],
               'intercept_scaling' : [0, 1, 10, 100, 1000],
               'class_weight' : ['auto'],
               'random_state' : [26],
               'tol' : [0.001, 0.01, 0.1, 1, 10, 100]
               }]

logistic = GridSearchCV(LogisticRegression(),
                        parameters,
                        cv=3,
                        refit=True,
                        verbose=1)

logistic = logistic.fit(X_train, y_train)
logit_pred = logistic.predict(X_test)

我得到的回溯是:

ValueError                                Traceback (most recent call last)
C:\Code\kaggle\titanic\titanic.py in <module>()
    351 
    352 
--> 353     logistic = logistic.fit(X_train, y_train)
    354 
    355     logit_pred = logistic.predict(X_test)

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.pyc in fit(self, X, y)
    594 
    595         """
--> 596         return self._fit(X, y, ParameterGrid(self.param_grid))
    597 
    598 

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.pyc in _fit(self, X, y, parameter_iterable)
    376                                     train, test, self.verbose, parameters,
    377                                     self.fit_params, return_parameters=True)
--> 378             for parameters in parameter_iterable
    379             for train, test in cv)
    380 

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __call__(self, iterable)
    651             self._iterating = True
    652             for function, args, kwargs in iterable:
--> 653                 self.dispatch(function, args, kwargs)
    654 
    655             if pre_dispatch == "all" or n_jobs == 1:

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in dispatch(self, func, args, kwargs)
    398         """
    399         if self._pool is None:
--> 400             job = ImmediateApply(func, args, kwargs)
    401             index = len(self._jobs)
    402             if not _verbosity_filter(index, self.verbose):

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __init__(self, func, args, kwargs)
    136         # Don't delay the application, to avoid keeping the input
    137         # arguments in memory
--> 138         self.results = func(*args, **kwargs)
    139 
    140     def get(self):

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters)
   1238     else:
   1239         estimator.fit(X_train, y_train, **fit_params)
-> 1240     test_score = _score(estimator, X_test, y_test, scorer)
   1241     if return_train_score:
   1242         train_score = _score(estimator, X_train, y_train, scorer)

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.pyc in _score(estimator, X_test, y_test, scorer)
   1294         score = scorer(estimator, X_test)
   1295     else:
-> 1296         score = scorer(estimator, X_test, y_test)
   1297     if not isinstance(score, numbers.Number):
   1298         raise ValueError("scoring must return a number, got %s (%s) instead."

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\metrics\scorer.pyc in _passthrough_scorer(estimator, *args, **kwargs)
    174 def _passthrough_scorer(estimator, *args, **kwargs):
    175     """Function that wraps estimator.score"""
--> 176     return estimator.score(*args, **kwargs)
    177 
    178 

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\base.pyc in score(self, X, y, sample_weight)
    289         """
    290         from .metrics import accuracy_score
--> 291         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    292 
    293 

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\base.pyc in predict(self, X)
    213             Predicted class label per sample.
    214         """
--> 215         scores = self.decision_function(X)
    216         if len(scores.shape) == 1:
    217             indices = (scores > 0).astype(np.int)

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\base.pyc in decision_function(self, X)
    194         if X.shape[1] != n_features:
    195             raise ValueError("X has %d features per sample; expecting %d"
--> 196                              % (X.shape[1], n_features))
    197 
    198         scores = safe_sparse_dot(X, self.coef_.T,

ValueError: X has 21 features per sample; expecting 19

为什么 GridSearchCV 期望的特征数量与数据集包含的特征数量不同?

更新:

感谢安迪的回复。数据集均为 numpy.ndarray 类型,dtype 为 float64。

type(X_Train)    type(y_train)    type(X_test)
numpy.ndarray    numpy.ndarray    numpy.ndarray 

将它们引入 sklearn 之前的步骤:

train_data = traindf.values
test_data = testdf.values

X_train = train_data[0::, 1::]  # training features
y_train = train_data[0::, 0]    # training targets
X_test = test_data[0::, 0::]    # test features

下一步是我在上面输入的 GridSearchCV 代码...

更新 2:链接到数据

这里是 datasets 的链接

最佳答案

该错误是由 intercept_scaling=0 引起的。看起来像是 scikit-learn 中的一个错误。

关于python - sklearn GridSearchCV : ValueError: X has 21 features per sample; expecting 19,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/28888070/

相关文章:

php - 数据库之间的重复值

python - 如何获得遗传算法分类器绘制 ROC 曲线的分数?

python - 如何在Python随机森林模型中删除可预测值(y)

python - 使用 GridSearch 时使用 Scikit-learn 的模型帮助

scikit-learn - 指定质心的 kmeans 聚类变换方法

c# - 从 C# 运行多个 python 脚本

python - 修复 x 轴缩放和自动缩放 y 轴

python - 无法安装 librosa python,如何卸载 llvmlite?

python - Ident 连接通过 psycopg2 失败,但通过命令行工作

python - 非加权类别标签 - 准确性不会改变