python - Error when predicting test data with xgboost in Python

I am performing text classification with xgboost in Python.

Below is the training set I am working with:

itemid       description                                            category
11802974     SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters    Architectural Diffusers
10688548     ANTIQUE BRONZE FINISH PUSHBUTTON  switch           Door Bell Pushbuttons
9836436     Descente pour Cable tray fitting and accessories    Tray Cable Drop Outs

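I am using scikit-learn's CountVectorizer to build a document-term matrix from the descriptions. It produces a SciPy sparse matrix (my dataset is large, 1.1 million rows, so I use a sparse representation to reduce space complexity). The code is below: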

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
documenttermmatrix = countvec.fit_transform(trainset['description'])

After that, I apply feature selection to the matrix above:

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=40)
documenttermmatrix_train = fs.fit_transform(documenttermmatrix, y1_train)

I am using the xgboost classifier to train the model:

model = XGBClassifier(silent=False)

model.fit(documenttermmatrix_train, y_train,verbose=True)

Below is the test set I am working with:

itemid      description                       category
9836442     TRIPLE Space heaters              Architectural Diffusers
13863918    pushbutton switch                  Door Bell Pushbuttons

I build a separate matrix for the test set, the same way I did for the training set, using the code below:

documenttermmatrix_test=countvec.fit_transform(testset['description'])

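When predicting on the test set, XGBoost expects every feature from the training set to be present in the test set, but that is not possible (a sparse matrix only represents non-zero entries).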

I cannot merge the training and test sets into a single dataset, because feature selection must be done on the training set only.

Can anyone tell me how to proceed?

Best answer

Do not call countvec.fit_transform() on the test set; call only transform().

Change this line:

documenttermmatrix_test=countvec.fit_transform(testset['description'])

to this:

documenttermmatrix_test=countvec.transform(testset['description'])

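This ensures that only the features present in the training set are taken from the test set; any training feature that does not occur in a test document gets a 0 in its column.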
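To make that behaviour concrete, here is a toy, pure-Python sketch of a count vectorizer. The helpers `fit_vocabulary` and `transform` are hypothetical names written for this illustration, and the whitespace tokenization is a simplification; this is not how scikit-learn implements CountVectorizer internally, only the same idea of a fixed, fitted vocabulary.

```python
def fit_vocabulary(docs):
    """Assign one column index to each distinct lowercase token (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count tokens per document using only the columns of a fitted vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            if tok in vocabulary:  # tokens unseen during fitting are dropped
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows


train_docs = ["TRIPLE Space heaters", "PUSHBUTTON switch"]
test_docs = ["pushbutton switch cover"]

# Fit once on the training data, then reuse the same vocabulary for the test data.
vocab = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocab)
test_rows = transform(test_docs, vocab)

# Refitting on the test data yields a different vocabulary and column count.
refit_vocab = fit_vocabulary(test_docs)
refit_rows = transform(test_docs, refit_vocab)
```

With the shared vocabulary, the test row has the same five columns as the training rows, and "cover" (never seen in training) is ignored. The refitted version has only three columns, which is the kind of mismatch the model complains about.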

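fit_transform() discards what was learned from the training data and builds a new matrix, which can have different features from the earlier output. Hence the error.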
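A minimal end-to-end sketch of the fix, using the sample rows from the question. Two things here are assumptions rather than part of the original answer: the variable names are illustrative, and the test matrix is also passed through the already-fitted selector with fs.transform(), since the model in the question was trained on the selected columns and not on the raw counts. The XGBClassifier step is left out so the sketch depends on scikit-learn only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

# Sample rows copied from the question; variable names are illustrative.
train_descriptions = [
    "SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters",
    "ANTIQUE BRONZE FINISH PUSHBUTTON switch",
    "Descente pour Cable tray fitting and accessories",
]
train_labels = [
    "Architectural Diffusers",
    "Door Bell Pushbuttons",
    "Tray Cable Drop Outs",
]
test_descriptions = ["TRIPLE Space heaters", "pushbutton switch"]

# Fit the vectorizer and the feature selector on the training data only.
countvec = CountVectorizer()
X_train_counts = countvec.fit_transform(train_descriptions)
fs = SelectPercentile(chi2, percentile=40)
X_train = fs.fit_transform(X_train_counts, train_labels)

# Reuse the fitted objects on the test data: transform(), never fit_transform().
X_test_counts = countvec.transform(test_descriptions)
X_test = fs.transform(X_test_counts)

# For comparison: refitting on the test set produces a different column count.
refit_counts = CountVectorizer().fit_transform(test_descriptions)

# A model fitted on X_train can now be given X_test, e.g. model.predict(X_test).
```

X_test has exactly the same columns as X_train, so the classifier sees the feature layout it was trained on. Wrapping the vectorizer, selector, and classifier in a scikit-learn Pipeline is another way to get the same guarantee.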

Regarding "python - Error when predicting test data with xgboost in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47407996/
