python - Error during feature selection

Tags: python machine-learning scikit-learn feature-selection

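I am trying to do feature selection for multi-label classification. I extracted the features the model will be trained on into X. The model is tested on the same X. I am using a Pipeline and selecting the best 100 features: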

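import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# arrFinal contains all the features and the labels. The last 16 columns are labels and the features are columns 1 to 521. The 17th column from the end is not used.
X = np.array(arrFinal[:, 1:-17])
Xtest = np.array(X)
Y = np.array(arrFinal[:, 522:]).astype(int)
clf = Pipeline([('chi2', SelectKBest(chi2, k=100)), ('rbf', SVC())])
clf = OneVsRestClassifier(clf)
clf.fit(X, Y)
ans = clf.predict(Xtest)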

But I get the following error:

Traceback (most recent call last):
  File "C:\Users\50004182\Documents\\callee.py", line 10, in <module>
    combine.combine_main(dict_ids,inv_dict_ids,noOfIDs)
  File "C:\Users\50004182\Documents\combine.py", line 201, in combine_main
    clf.fit(X, Y)
  File "C:\Python34\lib\site-packages\sklearn\multiclass.py", line 287, in fit
    for i, column in enumerate(columns))
  File "C:\Python34\lib\site-packages\sklearn\externals\joblib\parallel.py", line 804, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Python34\lib\site-packages\sklearn\externals\joblib\parallel.py", line 662, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Python34\lib\site-packages\sklearn\externals\joblib\parallel.py", line 570, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "C:\Python34\lib\site-packages\sklearn\externals\joblib\parallel.py", line 183, in __init__
    self.results = batch()
  File "C:\Python34\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Python34\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Python34\lib\site-packages\sklearn\multiclass.py", line 74, in _fit_binary
    estimator.fit(X, y)
  File "C:\Python34\lib\site-packages\sklearn\pipeline.py", line 164, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "C:\Python34\lib\site-packages\sklearn\pipeline.py", line 145, in _pre_transform
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "C:\Python34\lib\site-packages\sklearn\base.py", line 458, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "C:\Python34\lib\site-packages\sklearn\feature_selection\univariate_selection.py", line 331, in fit
    self.scores_, self.pvalues_ = self.score_func(X, y)
  File "C:\Python34\lib\site-packages\sklearn\feature_selection\univariate_selection.py", line 213, in chi2
    if np.any((X.data if issparse(X) else X) < 0):
TypeError: unorderable types: numpy.ndarray() < int()

Best answer

So, after a debugging session in the comments above with @JamieBull and @Joker, the solution we came up with is:

Make sure the types are correct (the values were originally strings):

X=np.array(arrFinal[:,1:-17]).astype(np.float64)
Xtest=np.array(X)
Y=np.array(arrFinal[:,522:]).astype(int)
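
To illustrate why the cast matters, here is a minimal sketch. It assumes (this is not confirmed by the question) that the array picked up a string dtype because one column holds text, such as an ID. The names arr_demo, X_demo and Y_demo are hypothetical stand-ins, not the asker's data.

```python
import numpy as np

# Hypothetical stand-in for arrFinal: a text ID column forces NumPy
# to store every cell as a string, including the numeric ones.
arr_demo = np.array([
    ["doc1", "0.5", "2", "1"],
    ["doc2", "1.5", "0", "0"],
])
print(arr_demo.dtype.kind)  # 'U' means a Unicode string dtype

# Slicing keeps the string dtype, so cast explicitly before calling chi2,
# which compares the feature matrix against 0.
X_demo = arr_demo[:, 1:-1].astype(np.float64)
Y_demo = arr_demo[:, -1:].astype(int)
print(X_demo.dtype, Y_demo.dtype)
```

Checking X.dtype right after slicing is a quick way to confirm whether this is the cause in a given dataset.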

First, use VarianceThreshold to remove the constant (0) columns before chi2:

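from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([
      ('vt', VarianceThreshold()),        # drop zero-variance (constant) columns
      ('chi2', SelectKBest(chi2, k=100)),
      ('rbf', SVC())
])
clf = OneVsRestClassifier(clf)
clf.fit(X, Y)
ans = clf.predict(Xtest)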
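
For anyone who wants to try the pipeline without the original data, the following is a self-contained sketch on synthetic data. The shapes, the random features, the way the labels are derived, and k=10 are all assumptions chosen to keep the example small. They do not reproduce the asker's dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_features, n_labels = 60, 30, 3

# Non-negative float features (chi2 requires non-negative values).
X = rng.integers(0, 5, size=(n_samples, n_features)).astype(np.float64)
X[:, :5] = 0.0  # five constant all-zero columns, as in the answer's scenario

# Synthetic multi-label targets: one binary indicator column per label.
Y = (X[:, 5:5 + n_labels] > 2).astype(int)

pipe = Pipeline([
    ("vt", VarianceThreshold()),        # removes the zero-variance columns
    ("chi2", SelectKBest(chi2, k=10)),  # k must not exceed the columns left
    ("rbf", SVC()),
])
clf = OneVsRestClassifier(pipe)
clf.fit(X, Y)

pred = clf.predict(X)  # evaluated on the training data, as in the question
print(pred.shape)
```

Because OneVsRestClassifier fits one copy of the pipeline per label, each label gets its own feature selection. The fitted copies are available as clf.estimators_.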

Regarding python - Error during feature selection, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34478319/
