python - ValueError: Found arrays with inconsistent numbers of samples [ 6 1786]

Tags: python machine-learning scikit-learn text-analysis

Here is my code:

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import datasets
import numpy as np

newsgroups = datasets.fetch_20newsgroups(
                subset='all',
                categories=['alt.atheism', 'sci.space']
         )
X = newsgroups.data
y = newsgroups.target

TD_IF = TfidfVectorizer()
y_scaled = TD_IF.fit_transform(newsgroups, y)
grid = {'C': np.power(10.0, np.arange(-5, 6))}
cv = KFold(y_scaled.size, n_folds=5, shuffle=True, random_state=241) 
clf = SVC(kernel='linear', random_state=241)

gs = GridSearchCV(estimator=clf, param_grid=grid, scoring='accuracy', cv=cv)
gs.fit(X, y_scaled) 

I get an error and I don't understand why. Traceback:

Traceback (most recent call last):
  File "C:/Users/Roman/PycharmProjects/week_3/assignment_2.py", line 23, in <module>
    gs.fit(X, y_scaled) #TODO: check this line
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\grid_search.py", line 804, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\grid_search.py", line 525, in _fit
    X, y = indexable(X, y)
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 201, in indexable
    check_consistent_length(*result)
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 6 1786]

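Can anyone explain why this error occurs?

Accepted answer

I think you are mixing up X and y here. You want to transform X into tf-idf vectors and train on those against y. In your code, fit_transform receives the whole newsgroups object instead of the list of documents, so the result (which you named y_scaled) has only 6 rows, presumably one per field of the dataset object, while X still holds the 1786 raw documents. gs.fit(X, y_scaled) then compares 1786 samples against 6 and raises the error. See below: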

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import datasets
import numpy as np

newsgroups = datasets.fetch_20newsgroups(
                subset='all',
                categories=['alt.atheism', 'sci.space']
         )
X = newsgroups.data    # raw documents
y = newsgroups.target  # class labels

TD_IF = TfidfVectorizer()
# Vectorize the documents (X), not the whole dataset object
X_scaled = TD_IF.fit_transform(X, y)
grid = {'C': np.power(10.0, np.arange(-1, 1))}
# The folds are built over the number of samples, i.e. the length of y
cv = KFold(y.size, n_folds=5, shuffle=True, random_state=241)
clf = SVC(kernel='linear', random_state=241)

gs = GridSearchCV(estimator=clf, param_grid=grid, scoring='accuracy', cv=cv)
# Features first, labels second
gs.fit(X_scaled, y)
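
To see the mismatch without downloading the dataset, here is a minimal sketch on a made-up toy corpus. The documents, labels, and the three-field `Bunch` are assumptions for illustration only; the real `fetch_20newsgroups` result has different fields. The sketch also assumes a newer scikit-learn release, where `GridSearchCV` and `KFold` live in `sklearn.model_selection` and `KFold` takes `n_splits` rather than a sample count, which differs from the older modules used in the code above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC
from sklearn.utils import Bunch

# Hypothetical toy corpus standing in for the two newsgroups categories.
docs = [
    "atheism religion belief argument",
    "god belief faith debate",
    "religion faith argument church",
    "atheism debate god church",
    "belief religion debate atheism",
    "space orbit rocket launch",
    "nasa shuttle orbit mission",
    "rocket launch mission moon",
    "space nasa moon shuttle",
    "orbit moon rocket space",
]
labels = np.array([0] * 5 + [1] * 5)

# Dict-like container with three fields (an assumption for this demo).
toy = Bunch(data=docs, target=labels, target_names=["atheism", "space"])

# Passing the container itself iterates over its keys, not its documents,
# so the matrix gets one row per field.
wrong = TfidfVectorizer().fit_transform(toy)
print(wrong.shape[0], "rows vs", len(toy.data), "documents")

# Passing the document list gives one row per document, matching the labels.
X_tfidf = TfidfVectorizer().fit_transform(toy.data)
y = toy.target

grid = {"C": np.power(10.0, np.arange(-1, 1))}
cv = KFold(n_splits=5, shuffle=True, random_state=241)
clf = SVC(kernel="linear", random_state=241)

gs = GridSearchCV(estimator=clf, param_grid=grid, scoring="accuracy", cv=cv)
gs.fit(X_tfidf, y)  # features first, labels second
print(gs.best_params_)
```

Comparing `X_tfidf.shape[0]` with `len(y)` before calling `fit` is a quick way to catch this kind of mix-up early.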

For more on python - ValueError: Found arrays with inconsistent numbers of samples [ 6 1786], see the similar question on Stack Overflow: https://stackoverflow.com/questions/35379200/
