python - Making a ColumnTransformer with numeric, categorical, and text pipelines

Tags: python scikit-learn

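I am trying to create a pipeline that handles numeric, categorical, and text variables. I want the data to be output to a new DataFrame before I run a classifier. I get the following error: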

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2499 and the array at index 2 has size 1.

Note that 2499 is the size of my training data. If I remove the text_preprocessing part of the pipeline, my code works. Any ideas on how I can get this to work? Thanks!

# Categorical pipeline
categorical_preprocessing = Pipeline(
[
    ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
    ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
]
)

# Numeric pipeline
numeric_preprocessing = Pipeline(
[
     ('Imputation', SimpleImputer(strategy='mean')),
     ('Scaling', StandardScaler())
]
)

text_preprocessing = Pipeline(
[
     ('Text',TfidfVectorizer())       
]
)

# Creating preprocessing pipeline
preprocessing = make_column_transformer(
     (numeric_features, numeric_preprocessing),
     (categorical_features, categorical_preprocessing),
     (text_features,text_preprocessing),
)

# Final pipeline
pipeline = Pipeline(
[('Preprocessing', preprocessing)]
)

test = pipeline.fit_transform(x_train)

Best answer

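I think you tried swapping the features and the pipelines in make_column_transformer, but did not change them back when posting the question.

Assuming they are in the right order, (estimator, columns), this error occurs when the vectorizer is given a list of column names in the ColumnTransformer. A list selects a 2D DataFrame, whereas all the text vectorizers in sklearn accept only 1D input (an iterable of strings, such as a pd.Series), so a vectorizer cannot be applied to several columns at once, and even a single text column has to be selected with a scalar name rather than a one-element list.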
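The difference between 1D and 2D input can be seen outside the ColumnTransformer. The following is a minimal sketch, assuming a small toy DataFrame (the name `df` and its contents are made up for illustration); it only compares the number of rows the vectorizer produces in each case.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, assumed for illustration only
df = pd.DataFrame({'summary': ['Great performance',
                               'fantastic performance',
                               'Could have been better']})

# 1D input (a Series): the vectorizer iterates over the strings,
# so it produces one output row per document
shape_1d = TfidfVectorizer().fit_transform(df['summary']).shape

# 2D input (a DataFrame): iterating over a DataFrame yields its column
# labels, so the vectorizer sees a single "document", the string 'summary'
shape_2d = TfidfVectorizer().fit_transform(df[['summary']]).shape

print(shape_1d[0])  # number of rows for the Series input
print(shape_2d[0])  # number of rows for the DataFrame input
```

The single row produced in the 2D case is what later fails to concatenate with the other transformers' outputs, which have one row per training sample.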

Example:

import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

x_train = pd.DataFrame({'fruit': ['apple','orange', np.nan],
                        'score': [np.nan, 12, 98],
                        'summary': ['Great performance', 
                                    'fantastic performance',
                                    'Could have been better']}
                        )

# Categorical pipeline
categorical_preprocessing = Pipeline(
[
    ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
    ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
]
)

# Numeric pipeline
numeric_preprocessing = Pipeline(
[
     ('Imputation', SimpleImputer(strategy='mean')),
     ('Scaling', StandardScaler())
]
)

text_preprocessing = Pipeline(
[
     ('Text',TfidfVectorizer())       
]
)

# Creating preprocessing pipeline
preprocessing = make_column_transformer(
     (numeric_preprocessing, ['score']),
     (categorical_preprocessing, ['fruit']),
     (text_preprocessing, 'summary'),
)

# Final pipeline
pipeline = Pipeline(
[('Preprocessing', preprocessing)]
)

test = pipeline.fit_transform(x_train)

If I change

    (text_preprocessing, 'summary'),

to

    (text_preprocessing, ['summary']),

it throws:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 3 and the array at index 2 has size 1
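If more than one text column needs vectorizing, one possible approach is to give each column its own vectorizer entry, each selected by a scalar column name. This is only a sketch under assumptions: the `reviews` frame, the extra `comment` column, and the `text_columns` list are hypothetical and do not come from the question.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data with two free-text columns
reviews = pd.DataFrame({
    'summary': ['Great performance',
                'fantastic performance',
                'Could have been better'],
    'comment': ['would buy again',
                'fast delivery',
                'arrived late'],
})

text_columns = ['summary', 'comment']

# One (vectorizer, column) pair per text column; each column is passed
# as a scalar name so the vectorizer receives 1D input
multi_text = make_column_transformer(
    *[(TfidfVectorizer(), col) for col in text_columns]
)

features = multi_text.fit_transform(reviews)
print(features.shape[0])  # one output row per input row
```

Each vectorizer learns its own vocabulary, and the ColumnTransformer stacks the resulting blocks side by side.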

For more on python - Making a ColumnTransformer with numeric, categorical, and text pipelines, see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/62391670/
