python - ValueError: Input has n_features=10 while the model has been trained with n_features=4261

Tags: python machine-learning scikit-learn

I am trying to make a prediction with a trained BoW, tf-idf and SVM model:

def bagOfWords(files_data):
    count_vector = sklearn.feature_extraction.text.CountVectorizer()
    return count_vector.fit_transform(files_data)

files = sklearn.datasets.load_files(dir_path)
word_counts = util.bagOfWords(files.data)
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True).fit(word_counts)
X = tf_transformer.transform(word_counts)
clf = sklearn.svm.LinearSVC()
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y, test_size=test_size)

I can run the following without problems:

clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)

But the following raises the error:

clf.fit(X_train, y_train)
new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"]) 
ready_to_be_predicted = tf_transformer.transform(new_word_counts)
predicted = clf.predict(ready_to_be_predicted)

I thought I was already reusing the previously fitted tf_transformer, so I don't understand why I still get the error. Any help is much appreciated!

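Best answer

You are not keeping the CountVectorizer that was originally fit on your data.

This bagOfWords call fits a separate, brand-new CountVectorizer inside its own scope, so its vocabulary (and therefore its number of features) comes only from the text you pass in.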

new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"]) 

You want to use the one that was fit on your training set.
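To make the mismatch concrete, here is a minimal sketch on a made-up toy corpus (the documents and variable names are illustrative assumptions, not the asker's data). It compares the number of columns produced by a freshly fitted CountVectorizer with the number produced by reusing the one fit on the training documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the real training files
train_docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "music is playing in the park",
]
new_doc = ["a place to listen to music"]

# Vectorizer fit once on the training documents
fitted_vectorizer = CountVectorizer().fit(train_docs)
n_train_features = fitted_vectorizer.transform(train_docs).shape[1]

# Fitting a fresh vectorizer on the new document builds a vocabulary
# from that document alone, so the column count differs
n_refit_features = CountVectorizer().fit_transform(new_doc).shape[1]

# Reusing the fitted vectorizer keeps the training vocabulary,
# so the column count matches what the classifier expects
n_reused_features = fitted_vectorizer.transform(new_doc).shape[1]

print(n_train_features, n_refit_features, n_reused_features)
```

Only the reused vectorizer yields a matrix whose width matches the training matrix, which is what the classifier checks at predict time.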

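You are also fitting your transformer on the whole of X, including X_test. You want to exclude your test set from any fitting, including the fitting of the transformers.

Try something like this.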

import sklearn.datasets
import sklearn.feature_extraction.text
import sklearn.model_selection
import sklearn.svm

files = sklearn.datasets.load_files(dir_path)

# Split into train/test before fitting anything
# (train_test_split lives in sklearn.model_selection in current releases)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(files.data, files.target)

# Fit and transform with X_train only
count_vector = sklearn.feature_extraction.text.CountVectorizer()
word_counts = count_vector.fit_transform(X_train)
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
X_train = tf_transformer.fit_transform(word_counts)

clf = sklearn.svm.LinearSVC()

clf.fit(X_train, y_train)

# Transform X_test with the already-fitted vectorizer and transformer
test_word_counts = count_vector.transform(X_test)
X_test_tfidf = tf_transformer.transform(test_word_counts)
y_predicted = clf.predict(X_test_tfidf)

# Test example: reuse the same fitted objects for new text
new_word_counts = count_vector.transform(["a place to listen to music it s making its way to the us"])

ready_to_be_predicted = tf_transformer.transform(new_word_counts)
predicted = clf.predict(ready_to_be_predicted)

Of course, it is much simpler to combine these transformers into a Pipeline.
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
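As a rough sketch of what that could look like, the snippet below chains the same three steps in a Pipeline. The tiny corpus, the labels and the step names ("counts", "tfidf", "svm") are placeholders chosen for illustration; in the asker's setup the training input would be the raw text and targets from load_files.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; replace with the training split of files.data / files.target
docs = [
    "a new music service is making its way to the us",
    "listen to music and podcasts in one place",
    "the team won the match in the final minute",
    "the striker scored twice in the second half",
]
labels = [0, 0, 1, 1]

# Each step is fit on the training text only, in order
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("svm", LinearSVC()),
])
text_clf.fit(docs, labels)

# Raw strings go straight in; the pipeline reuses the fitted vocabulary
predicted = text_clf.predict(["a place to listen to music it s making its way to the us"])
print(predicted)
```

Because the pipeline owns the fitted vectorizer, there is no separate object to lose track of, and new text is always transformed with the training vocabulary.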

Regarding python - ValueError: Input has n_features=10 while the model has been trained with n_features=4261, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/49946462/
