python - A single pipeline to fit both text and categorical features

Tags: python python-3.x machine-learning scikit-learn

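I am trying to find a way to use a single pipeline that transforms both a text feature and a categorical feature and then fits a classifier on them.

The working example below (simplified for readability) is the approach I am currently using.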

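I had to split the work into 3 mini-pipelines (or variables):

  1. the first encodes the categorical feature,
  2. the second applies a TfidfVectorizer to the raw_text feature,
  3. the third fits the classifier on the combined data (after joining the two feature matrices with hstack).

import pandas as pd
import scipy.sparse
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LogisticRegression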

raw_text_tr = ["kjndn ndoabn mbba odb ob b dboa \n onbn abf  ppfjpfap",
            "ùodnaionf àjùfnàehna nbn obeùfoenen",
            "ùodnaionf àjùfnàehna nbn obeùfoenen dfa e g aze",
            "fjp ,fj)jea ghàhàhà àhàtgjjaz çujàh e ghghàugàh çàéhg \n\n\n\n oddn duhodd"]
categorie_tr = ["cat1","cat2","cat2","cat4"]
target_tr = ["no","no","no","yes"]

raw_text_te = ["ldkdl jaoldldj doizd test yes ok manufajddk p",
            "\n\n\n dopj pdjj pdjaj ada  ohdha hdçh dmamad ldidl h dohdodz"]
categorie_te = ["cat3","cat5"]

train_df = pd.DataFrame(data=list(zip(raw_text_tr, categorie_tr, target_tr)),columns=["raw_text_ft","categorical_ft","target"])
test_df = pd.DataFrame(data=list(zip(raw_text_te, categorie_te)),columns=["raw_text_ft","categorical_ft"])
print(train_df)
#                                          raw_text_ft categorical_ft target
# 0  kjndn ndoabn mbba odb ob b dboa \n onbn abf  p...           cat1     no
# 1                ùodnaionf àjùfnàehna nbn obeùfoenen           cat2     no
# 2    ùodnaionf àjùfnàehna nbn obeùfoenen dfa e g aze           cat2     no
# 3  fjp ,fj)jea ghàhàhà àhàtgjjaz çujàh e ghghàugà...           cat4    yes

print(test_df)
#                                          raw_text_ft categorical_ft
# 0      ldkdl jaoldldj doizd test yes ok manufajddk p           cat3
# 1  \n\n\n dopj pdjj pdjaj ada  ohdha hdçh dmamad ...           cat5

pipeline_tfidf = Pipeline([("tfidf",TfidfVectorizer())])
pipeline_enc = Pipeline([("enc",OneHotEncoder(handle_unknown="ignore"))])
pipeline_clf = Pipeline([("clf",LogisticRegression())])

A_tr = pipeline_tfidf.fit_transform(train_df["raw_text_ft"])
B_tr = pipeline_enc.fit_transform(train_df["categorical_ft"].values.reshape(-1,1))
X_train = scipy.sparse.hstack([A_tr,B_tr])

A_te = pipeline_tfidf.transform(test_df["raw_text_ft"])
B_te = pipeline_enc.transform(test_df["categorical_ft"].values.reshape(-1,1))
X_test = scipy.sparse.hstack([A_te,B_te])

pipeline_clf.fit(X_train, train_df["target"])

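Is there a cleaner way to put all of these steps into just one Pipeline?

Below is the pipeline I have in mind, but it does not work as written. I am using FeatureUnion to combine the two transformed features before classification: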

pipeline_tot = Pipeline([
  ('features', FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('enc', OneHotEncoder(handle_unknown="ignore"))
  ])),
  ('clf', LogisticRegression())
])

The hard part is how to split the text and categorical features when fitting the pipeline (I can only pass a single input to pipeline_tot.fit()).

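Best answer

FeatureUnion applies every transformer to the entire feature set and concatenates the results, whereas ColumnTransformer applies each transformer only to the subset of features you specify for it: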

>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import OneHotEncoder
>>> preprocessor = ColumnTransformer(
...     transformers=[
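...         # TfidfVectorizer needs 1-D input, so pass the column name as a plain string, not a list
...         ('text', TfidfVectorizer(), 'raw_text_ft'),
...         # OneHotEncoder needs 2-D input, so pass a list of column names;
...         # handle_unknown="ignore" matches the question, whose test set has unseen categories
...         ('category', OneHotEncoder(handle_unknown="ignore"), ['categorical_ft']),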
...     ],
... )
>>> pipe = Pipeline(
...     steps=[
...         ('preprocessor', preprocessor),
...         ('classifier', LogisticRegression()),
...     ],
... )
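
To show the pipeline above being used end to end, here is a minimal sketch that fits it on the toy data from the question and predicts on the test rows. The `feature_cols` list and the final shape check are additions for illustration and are not part of the original answer. The encoder uses `handle_unknown="ignore"`, as in the question, because the test set contains categories that never appear in training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data copied from the question.
train_df = pd.DataFrame({
    "raw_text_ft": [
        "kjndn ndoabn mbba odb ob b dboa \n onbn abf  ppfjpfap",
        "ùodnaionf àjùfnàehna nbn obeùfoenen",
        "ùodnaionf àjùfnàehna nbn obeùfoenen dfa e g aze",
        "fjp ,fj)jea ghàhàhà àhàtgjjaz çujàh e ghghàugàh çàéhg \n\n\n\n oddn duhodd",
    ],
    "categorical_ft": ["cat1", "cat2", "cat2", "cat4"],
    "target": ["no", "no", "no", "yes"],
})
test_df = pd.DataFrame({
    "raw_text_ft": [
        "ldkdl jaoldldj doizd test yes ok manufajddk p",
        "\n\n\n dopj pdjj pdjaj ada  ohdha hdçh dmamad ldidl h dohdodz",
    ],
    "categorical_ft": ["cat3", "cat5"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # A plain string selects a 1-D column, which TfidfVectorizer expects.
        ("text", TfidfVectorizer(), "raw_text_ft"),
        # A list selects a 2-D frame, which OneHotEncoder expects.
        ("category", OneHotEncoder(handle_unknown="ignore"), ["categorical_ft"]),
    ],
)
pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ],
)

# Illustrative name: the columns handed to the pipeline as X.
feature_cols = ["raw_text_ft", "categorical_ft"]

# One fit call handles both feature types; the whole DataFrame is the single input.
pipe.fit(train_df[feature_cols], train_df["target"])
predictions = pipe.predict(test_df[feature_cols])
print(predictions)

# Check that the preprocessor output is the two blocks placed side by side.
fitted = pipe.named_steps["preprocessor"]
n_text = len(fitted.named_transformers_["text"].vocabulary_)
n_cat = len(fitted.named_transformers_["category"].categories_[0])
X_train = fitted.transform(train_df[feature_cols])
print(X_train.shape, n_text, n_cat)
```

The width of the transformed matrix should equal the TF-IDF vocabulary size plus the number of categories seen in training, which is the same result the manual hstack version in the question produces.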

Source: this question about a single pipeline for text and categorical features comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/57867974/
