python - How do I combine LabelBinarizer and OneHotEncoder in a pipeline in Python to handle categorical variables?

Tags: python machine-learning scikit-learn preprocessor feature-extraction

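Over the past few days I have searched Stack Overflow for tutorials and Q&As, but I have not found a proper guide. The examples that show a use case for LabelBinarizer or OneHotEncoder do not show how to incorporate it into a pipeline, and the pipeline examples do not show the encoders.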

I have a dataset with 4 variables:

num1    num2    cate1    cate2
3       4       Cat      1
9       23      Dog      0
10      5       Dog      1

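num1 and num2 are numerical variables; cate1 and cate2 are categorical variables. I understand that I need to encode the categorical variables somehow before fitting a machine learning algorithm, but after several attempts I am still not sure how to do that inside a pipeline.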
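To make "encoding" concrete, here is a minimal sketch on the three sample rows above. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly; the DataFrame is rebuilt by hand from the table and is not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The three sample rows from the table above
df = pd.DataFrame({
    "num1": [3, 9, 10],
    "num2": [4, 23, 5],
    "cate1": ["Cat", "Dog", "Dog"],
    "cate2": [1, 0, 1],
})

# cate2 is stored as integers, so a dtype check alone treats it as numeric
numeric_by_dtype = df.select_dtypes(include="number").columns.tolist()
print(numeric_by_dtype)  # ['num1', 'num2', 'cate2']

# One indicator column per category: cate1=Cat, cate1=Dog, cate2=0, cate2=1
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["cate1", "cate2"]]).toarray()
print(encoded)
# [[1. 0. 0. 1.]
#  [0. 1. 1. 0.]
#  [0. 1. 0. 1.]]
```

Note that cate2 shows up in the dtype-based list, which is why categorical columns coded as 0/1 have to be picked by hand.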

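import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer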

# Transformer that selects the named columns from a DataFrame
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; column selection is stateless
        return self

    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# from the selected X, further choose categorical only
X_selected_cat = X_selected.filter(['cate1', 'cate2']) # hand selected since some cat var has value 0, 1

# Find the numerical columns by dtype (automated).
# Note: an integer-coded categorical such as cate2 also passes this check.
X_num_cols = X_selected.columns[X_selected.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
X_cat_cols = X_selected_cat.columns  # categorical column names, previously hand-selected

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, 
                                                    test_size=0.5, 
                                                    random_state=567, 
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols)))
    ])),
    ('LR_model', LogisticRegression()),
])

This gives me the error ValueError: could not convert string to float: 'Cat'

Replacing the fourth line from the end with this

('categorical', make_pipeline(Columns(names=X_cat_cols),OneHotEncoder()))

gives me the same ValueError: could not convert string to float: 'Cat'

Replacing the fourth line from the end with this

('categorical', make_pipeline(Columns(names=X_cat_cols),LabelBinarizer(),OneHotEncoder()))
])),

gives me a different error, TypeError: fit_transform() takes 2 positional arguments but 3 were given

Replacing the fourth line from the end with this

('numeric', make_pipeline(Columns(names=X_num_cols),LabelBinarizer())),

gives me this error, TypeError: fit_transform() takes 2 positional arguments but 3 were given
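This TypeError is consistent with how LabelBinarizer is defined: its fit_transform accepts only a target array y, whereas a Pipeline calls fit_transform(X, y) on each step, so one argument too many is passed. LabelBinarizer is meant for labels rather than feature columns. One possible workaround, sketched here for a single column only, is a thin wrapper with the usual fit(X, y) / transform(X) signature. ColumnLabelBinarizer is a hypothetical name chosen for illustration and is not part of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelBinarizer


class ColumnLabelBinarizer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: applies LabelBinarizer to one feature column."""

    def fit(self, X, y=None):
        # Flatten the single-column input and ignore y, which
        # LabelBinarizer itself cannot accept as a second argument
        self.binarizer_ = LabelBinarizer().fit(np.asarray(X).ravel())
        return self

    def transform(self, X):
        return self.binarizer_.transform(np.asarray(X).ravel())


df = pd.DataFrame({"cate1": ["Cat", "Dog", "Dog"]})
y = [0, 1, 1]  # made-up target, only here so the pipeline passes a y

encoded = make_pipeline(ColumnLabelBinarizer()).fit_transform(df[["cate1"]], y)
print(encoded.ravel())  # [0 1 1]
```

With two categories LabelBinarizer produces a single 0/1 column; with three or more it produces one column per category.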

Best answer

Following Marcus's suggestion, I tried to install the scikit-learn dev version but could not. I did find something similar, though, called category_encoders.

Changing the code to this works:

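import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer
import category_encoders as CateEncoder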

# Transformer that selects the named columns from a DataFrame
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; column selection is stateless
        return self

    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# from the selected X, further choose categorical only
X_selected_cat = X_selected.filter(['cate1', 'cate2']) # hand selected since some cat var has value 0, 1

# Find the numerical columns by dtype (automated).
# Note: an integer-coded categorical such as cate2 also passes this check.
X_num_cols = X_selected.columns[X_selected.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
X_cat_cols = X_selected_cat.columns  # categorical column names, previously hand-selected

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, 
                                                    test_size=0.5, 
                                                    random_state=567, 
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols),CateEncoder.BinaryEncoder()))
    ])),
    ('LR_model', LogisticRegression()),
])
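For readers on scikit-learn 0.20 or later, here is a sketch of an alternative that needs no extra package. It assumes that OneHotEncoder in those versions accepts string categories and that ColumnTransformer (from sklearn.compose) routes columns by name, which stands in for the custom Columns selector plus FeatureUnion. Note that it produces one-hot columns rather than the binary encoding used above. The extra rows and the MED values are made up for illustration; this is not the code from the answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: the three rows from the question plus made-up rows and a made-up target
df = pd.DataFrame({
    "num1": [3, 9, 10, 4, 8, 12],
    "num2": [4, 23, 5, 6, 20, 7],
    "cate1": ["Cat", "Dog", "Dog", "Cat", "Dog", "Cat"],
    "cate2": [1, 0, 1, 0, 1, 0],
    "MED": [0, 1, 1, 0, 1, 0],
})
X = df.drop("MED", axis=1)
y = df["MED"]

# Scale the numeric columns and one-hot encode the hand-picked categorical ones
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["num1", "num2"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["cate1", "cate2"]),
])

pipe = Pipeline([
    ("features", preprocess),
    ("LR_model", LogisticRegression()),
])
pipe.fit(X, y)

features = pipe.named_steps["features"].transform(X)
predictions = pipe.predict(X)
print(features.shape)  # (6, 6): 2 scaled columns + 2 + 2 indicator columns
```

Listing the columns explicitly keeps an integer-coded categorical such as cate2 out of the numeric branch.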

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/49018652/
