python - How to add another text feature to an existing bag-of-words classification in scikit-learn?

Tags: python machine-learning scikit-learn nlp text-classification

My sample code:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline, FeatureUnion

# Split the text column and the target label into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    data['Extract'], data['Expense Account code Description'], random_state=0)

# Bag of words -> term-frequency normalization -> random forest.
text_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', RandomForestClassifier(n_estimators=100, max_features='log2',
                                   criterion='entropy')),
])
text_clf = text_clf.fit(X_train, y_train)

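Here I apply a bag-of-words model to the 'Extract' column to classify 'Expense Account code Description', and I get about 92% accuracy. If I want to include 'Vendor name' as an additional input feature, how can I do that? Is there a way to use it together with the bag of words?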

Best answer

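You can use FeatureUnion. You also need to create a new Transformer class that performs the necessary operations, i.e. selecting the vendor name and converting it to dummy variables.

FeatureUnion fits into your pipeline.

For reference: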

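from sklearn.base import BaseEstimator, TransformerMixin

class get_Vendor(BaseEstimator, TransformerMixin):
    # Skeleton only: the vendor-name selection and dummy encoding go in transform().
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return

# tfidf is assumed to be a text vectorizer (e.g. TfidfVectorizer) defined elsewhere.
lr_tfidf = Pipeline([('features', FeatureUnion([('other', get_Vendor()),
        ('vect', tfidf)])), ('clf', RandomForestClassifier())])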
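
To make the skeleton above concrete, here is a minimal runnable sketch of one way to fill it in. It is not the answerer's code: the toy DataFrame, the `VendorDummies` class, and the `select_text` helper are hypothetical names invented for illustration, and the sketch assumes the whole pipeline receives a DataFrame containing both the 'Extract' and 'Vendor name' columns (rather than only the 'Extract' column, as in the question).

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Toy data, made up for illustration; column names follow the question.
data = pd.DataFrame({
    "Extract": [
        "monthly office rent payment", "laptop purchase for staff",
        "office rent for march", "printer and laptop accessories",
        "rent invoice april", "new monitors purchase",
    ],
    "Vendor name": ["Acme Realty", "Tech Depot"] * 3,
    "Expense Account code Description": ["Rent", "IT Equipment"] * 3,
})


class VendorDummies(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one-hot (dummy) encodes a single column."""

    def __init__(self, column="Vendor name"):
        self.column = column

    def fit(self, X, y=None):
        # Learn the vendor categories from the training data only;
        # vendors unseen at fit time become all-zero rows at transform time.
        self.encoder_ = OneHotEncoder(handle_unknown="ignore")
        self.encoder_.fit(X[[self.column]])
        return self

    def transform(self, X):
        return self.encoder_.transform(X[[self.column]])


def select_text(df):
    # Hypothetical helper: pass only the text column to the vectorizer.
    return df["Extract"]


features = FeatureUnion([
    ("text", Pipeline([
        ("select", FunctionTransformer(select_text)),
        ("vect", CountVectorizer(ngram_range=(1, 1))),
        ("tfidf", TfidfTransformer(use_idf=False)),
    ])),
    ("vendor", VendorDummies(column="Vendor name")),
])

model = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

X = data[["Extract", "Vendor name"]]
y = data["Expense Account code Description"]
model.fit(X, y)

predictions = model.predict(X)
# Bag-of-words columns and vendor dummy columns, stacked side by side.
combined = model.named_steps["features"].transform(X)
```

With real data, `train_test_split` would be called on the two-column DataFrame instead of on `data['Extract']` alone, so that both branches of the FeatureUnion see the columns they need. The encoder is fitted inside `fit` rather than calling `pd.get_dummies` in `transform`, so that train and test sets end up with the same dummy columns.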

Regarding python - How to add another text feature to an existing bag-of-words classification in scikit-learn, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50164737/
