python - Combining features in a Sklearn Pipeline

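I want to use a pipeline that contains a TfidfVectorizer and an SVC. Between those two steps, however, I want to concatenate some features extracted from non-text data onto the output of the TfidfVectorizer.

I have tried creating a custom class (based on the approach in this tutorial) to do this, but it does not seem to work.

Here is what I have tried so far: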

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('transformer', CustomTransformer(one_hot_feats)),
    ('clf', MultinomialNB()),
])

parameters = {
    'tfidf__min_df': (5, 10, 15, 20, 25, 30),
    'tfidf__max_df': (0.8, 0.9, 1.0),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': np.linspace(0.1, 1.5, 15),
    'clf__fit_prior': [True, False],
}

grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(df["short description"], labels)

This is the CustomTransformer class:

class CustomTransformer(TransformerMixin):
    """Class that concatenates the one hot encode category feature with the tfidf data."""

    def __init__(self, one_hot_features):
        """Initializes an instance of our custom transformer."""
        self.one_hot_features = one_hot_features

    def fit(self, X, y=None, **kwargs):
        """Dummy fit function that does nothing particular."""

        return self

    def transform(self, X, y=None, **kwargs):
        """Adds our external features"""
        return numpy.hstack((one_hot_feats, X))

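This approach works as long as X does not change dimensions inside the custom class (possibly a restriction related to TransformerMixin), but in my case I will be appending extra features to my data. Should my custom class inherit from a different base class, or is there a different way to solve this problem?

Best answer

You can use sklearn's FeatureUnion to combine multiple sets of features, and ColumnTransformer to transform specific columns:

From the documentation: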

FeatureUnion

Concatenates results of multiple transformer objects.

This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.
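As a rough illustration of the FeatureUnion description above, the sketch below applies two text vectorizers to the same input and concatenates their outputs column-wise. The sample sentences and the step names `word_tfidf` and `char_tfidf` are made up for this example and are not part of the original question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical sample documents, used only for illustration.
docs = ["printer is broken", "cannot log in", "printer out of paper"]

# Both transformers receive the same input; their outputs are stacked side by side.
union = FeatureUnion([
    ("word_tfidf", TfidfVectorizer(analyzer="word")),
    ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])

combined = union.fit_transform(docs)

# Look up the fitted vectorizers by name to count the features each one produced.
fitted = dict(union.transformer_list)
n_word = len(fitted["word_tfidf"].vocabulary_)
n_char = len(fitted["char_tfidf"].vocabulary_)
print(combined.shape, n_word, n_char)
```

Because FeatureUnion hands the same input to every transformer, it suits cases where all features come from one source. When text and non-text features live in different columns, ColumnTransformer is usually the more direct fit.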

ColumnTransformer

Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

In your case, you can do this with make_column_transformer:

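from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

pipeline = Pipeline([
    # TfidfVectorizer expects a 1-D sequence of strings, so its column is given as a string, not a list
    ('transformer', make_column_transformer((TfidfVectorizer(), 'text_column'),
                                            (OneHotEncoder(), ['categorical_column']))),
    ('clf', MultinomialNB()),
])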
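The following is a minimal end-to-end sketch of this idea on a toy DataFrame. The data, the labels, and the column names `text_column` and `categorical_column` are placeholders, not the asker's real data. Note that the text column is passed as a plain string rather than a list, because TfidfVectorizer expects a one-dimensional sequence of documents, and that the pipeline is fitted on the whole DataFrame rather than on a single column.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the column names are placeholders.
df = pd.DataFrame({
    "text_column": [
        "printer is broken",
        "cannot log in",
        "printer out of paper",
        "password reset needed",
    ],
    "categorical_column": ["hardware", "account", "hardware", "account"],
})
labels = [0, 1, 0, 1]

pipeline = Pipeline([
    ("transformer", make_column_transformer(
        (TfidfVectorizer(), "text_column"),  # string selector -> 1-D input
        (OneHotEncoder(handle_unknown="ignore"), ["categorical_column"]),  # list -> 2-D input
    )),
    ("clf", MultinomialNB()),
])

# Fit on the DataFrame so each transformer can pick out its own column.
pipeline.fit(df, labels)

features = pipeline.named_steps["transformer"].transform(df)
predictions = pipeline.predict(df)
print(features.shape, predictions)
```

The column transformer slices the rows it is given, so the text features and the one-hot features stay aligned when cross-validation splits the data.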

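Edit:

Set remainder to 'passthrough' in make_column_transformer so that all remaining columns not named in any transformer are passed through automatically.

Regarding "python - Combining features in a Sklearn Pipeline", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58076004/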
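A sketch of how remainder='passthrough' might look in practice, again on made-up data: the hypothetical `numeric_column` is not listed in any transformer, so it is appended unchanged to the output. The example also shows how grid-search parameter names could be written for this setup. make_column_transformer names each transformer after its lowercased class name, so the TF-IDF options sit under `transformer__tfidfvectorizer__...`. The grid values here are arbitrary.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; numeric_column is kept non-negative for MultinomialNB.
df = pd.DataFrame({
    "text_column": [
        "printer is broken", "cannot log in",
        "printer out of paper", "password reset needed",
        "screen is flickering", "account locked out",
        "keyboard not working", "forgot my password",
    ],
    "categorical_column": ["hardware", "account"] * 4,
    "numeric_column": [3, 1, 2, 1, 4, 2, 3, 1],
})
labels = [0, 1, 0, 1, 0, 1, 0, 1]

pipeline = Pipeline([
    ("transformer", make_column_transformer(
        (TfidfVectorizer(), "text_column"),
        (OneHotEncoder(handle_unknown="ignore"), ["categorical_column"]),
        remainder="passthrough",  # numeric_column is appended as-is
    )),
    ("clf", MultinomialNB()),
])

# Parameter names follow <pipeline step>__<transformer name>__<parameter>.
parameters = {
    "transformer__tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

grid_search = GridSearchCV(pipeline, parameters, cv=2)
grid_search.fit(df, labels)

best_transformer = grid_search.best_estimator_.named_steps["transformer"]
features = best_transformer.transform(df)
print(grid_search.best_params_, features.shape)
```

Calling `pipeline.get_params().keys()` lists every valid parameter name, which is a quick way to check the prefixes before building a larger grid.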
