python - How can I combine LabelBinarizer and OneHotEncoder in a pipeline in Python to handle categorical variables?

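I have spent the past few days searching Stack Overflow for a suitable tutorial or Q&A, but found no proper guide, mainly because examples that show LabelBinarizer or OneHotEncoder use cases do not show how they fit into a pipeline, and vice versa.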

I have a dataset with 4 variables:

num1    num2    cate1    cate2
3       4       Cat      1
9       23      Dog      0
10      5       Dog      1

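num1 and num2 are numeric variables; cate1 and cate2 are categorical variables. I know I need to encode the categorical variables somehow before fitting a machine learning algorithm, but after several attempts I am still not sure how to do that inside a pipeline.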

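import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer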

# Transformer that selects the named columns from a DataFrame
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# From the selected X, further choose the categorical columns only
X_selected_cat = X_selected.filter(['cate1', 'cate2'])  # hand-selected because some categorical variables hold 0/1 values
X_cat_cols = list(X_selected_cat.columns)  # categorical column names, hand-selected above

# Find the numeric columns automatically, excluding the hand-selected categorical ones
X_num_cols = [c for c in X_selected.select_dtypes(include=np.number).columns
              if c not in X_cat_cols]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, 
                                                    test_size=0.5, 
                                                    random_state=567, 
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols)))
    ])),
    ('LR_model', LogisticRegression()),
])

This gives me the error ValueError: could not convert string to float: 'Cat'

Replacing the fourth-to-last line with this

('categorical', make_pipeline(Columns(names=X_cat_cols),OneHotEncoder()))

gives me the same ValueError: could not convert string to float: 'Cat'.
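
One plausible reason for the identical error: as far as I know, OneHotEncoder in scikit-learn releases before 0.20 only accepted integer input, so a string column such as cate1 failed during conversion. The following is a minimal sketch, assuming scikit-learn 0.20 or later and the toy values from the table above; the variable names df_cat, encoder, and encoded are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Categorical columns copied from the example table in the question
df_cat = pd.DataFrame({"cate1": ["Cat", "Dog", "Dog"], "cate2": [1, 0, 1]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; convert it for inspection
encoded = encoder.fit_transform(df_cat).toarray()

# One output column per (feature, category) pair: Cat, Dog, 0, 1
print(encoder.categories_)
print(encoded)
```

On such a version the string column is encoded directly, so no LabelBinarizer step is needed in front of the encoder.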

Replacing the fourth-to-last line with this

('categorical', make_pipeline(Columns(names=X_cat_cols),LabelBinarizer(),OneHotEncoder()))
])),

gives me a different error, TypeError: fit_transform() takes 2 positional arguments but 3 were given.

Replacing the fourth-to-last line with this

('numeric', make_pipeline(Columns(names=X_num_cols),LabelBinarizer())),

gives me this error, TypeError: fit_transform() takes 2 positional arguments but 3 were given.
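
The TypeError most likely arises because LabelBinarizer is meant for target labels: its fit_transform accepts only y, whereas Pipeline calls fit_transform(X, y). A possible workaround, sketched below, wraps it in a small transformer with the signature Pipeline expects. The class name PipelineLabelBinarizer is hypothetical, and the sketch handles a single column only.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelBinarizer


class PipelineLabelBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper giving LabelBinarizer an (X, y) interface."""

    def fit(self, X, y=None):
        # LabelBinarizer expects 1-D input, so flatten the single column
        self.binarizer_ = LabelBinarizer().fit(np.asarray(X).ravel())
        return self

    def transform(self, X):
        return self.binarizer_.transform(np.asarray(X).ravel())


# Single categorical column taken from the question's example table
cate1 = pd.DataFrame({"cate1": ["Cat", "Dog", "Dog"]})
binarized = make_pipeline(PipelineLabelBinarizer()).fit_transform(cate1)
print(binarized)
```

With only two categories LabelBinarizer emits a single 0/1 column, so a following OneHotEncoder step would add nothing here.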

Best answer

Following Marcus's suggestion, I tried to install the scikit-learn dev version but could not. I did, however, find something similar called category_encoders.

Changing the code to the following works:

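import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer
import category_encoders as CateEncoder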

# Transformer that selects the named columns from a DataFrame
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# From the selected X, further choose the categorical columns only
X_selected_cat = X_selected.filter(['cate1', 'cate2'])  # hand-selected because some categorical variables hold 0/1 values
X_cat_cols = list(X_selected_cat.columns)  # categorical column names, hand-selected above

# Find the numeric columns automatically, excluding the hand-selected categorical ones
X_num_cols = [c for c in X_selected.select_dtypes(include=np.number).columns
              if c not in X_cat_cols]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, 
                                                    test_size=0.5, 
                                                    random_state=567, 
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols),CateEncoder.BinaryEncoder()))
    ])),
    ('LR_model', LogisticRegression()),
])
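
For readers on scikit-learn 0.20 or later, ColumnTransformer combined with OneHotEncoder may express the same idea without a custom selector class or an extra package. This is an alternative sketch, not the approach used in this answer; the toy DataFrame, the 'MED' values, and the name preprocess are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data shaped like the question's table; the 'MED' target is invented
df = pd.DataFrame({
    "num1": [3, 9, 10, 4, 8, 11],
    "num2": [4, 23, 5, 6, 20, 7],
    "cate1": ["Cat", "Dog", "Dog", "Cat", "Dog", "Cat"],
    "cate2": [1, 0, 1, 1, 0, 0],
    "MED": [0, 1, 1, 0, 1, 0],
})
X, y = df.drop("MED", axis=1), df["MED"]

# Scale the numeric columns and one-hot encode the categorical ones
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["num1", "num2"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["cate1", "cate2"]),
])

pipe = Pipeline([("features", preprocess), ("LR_model", LogisticRegression())])
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Listing the column names explicitly keeps a 0/1 column such as cate2 on the categorical branch even though its dtype is numeric.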

Regarding "python - How can I combine LabelBinarizer and OneHotEncoder in a pipeline in Python to handle categorical variables?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49018652/