scikit-learn - Sklearn Pipeline: pass a parameter to a custom Transformer?

Tags: scikit-learn pipeline transformer

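I have a custom Transformer in my sklearn Pipeline, and I would like to know how to pass a parameter to my Transformer:

In the code below, you can see that I use a dictionary "weight" inside my Transformer. I would rather not define this dictionary inside the Transformer but pass it in from the Pipeline, so that I can include it in a grid search. Is it possible to pass the dictionary as a parameter to my Transformer?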

# My custom Transformer
class TextExtractor(BaseEstimator, TransformerMixin):
    """Concat the 'title', 'body' and 'code' from the results of
    Stackoverflow query
    Keys are 'title', 'body' and 'code'.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # here is the parameter I want to pass to my transformer
        weight = {'title': 10, 'body': 1, 'code': 1}
        x['text'] = (weight['title'] * x['Title'] +
                     weight['body'] * x['Body'] +
                     weight['code'] * x['Code'])

        return x['text']

param_grid = {
    'min_df' : [10],
    'max_df' : [0.01],
    'max_features': [200],
    'clf' : [sgd],
    # here is the parameter I want to pass to my transformer
    'weight' : [{'title' : 10, 'body': 1, 'code' : 1},
                {'title' : 1, 'body': 1, 'code' : 1}]
}

for g in ParameterGrid(param_grid):
    classifier_pipe = Pipeline(
        steps=[
            # is it possible to pass my parameter?
            ('textextractor', TextExtractor()),
            ('vectorizer', TfidfVectorizer(max_df=g['max_df'],
                                           min_df=g['min_df'],
                                           max_features=g['max_features'])),
            ('clf', g['clf']),
        ],
    )

Best answer

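To do this, you only need to add an __init__() method at the beginning of the class definition. In this step, you define your class TextExtractor as taking one argument, which you call weight.

Here is how it can be done (for reproducibility, I added quite a few lines of code beforehand; since you did not specify any data, I made up some fake data. I also assumed that what you are trying to do with the weights is multiply strings?):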

# import all the necessary packages
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid, GridSearchCV
from sklearn.linear_model import SGDClassifier

import pandas as pd
import numpy as np

#Sample data
X = pd.DataFrame({"Title" : ["T1","T2","T3","T4","T5"], "Body": ["B1","B2","B3","B4","B5"], "Code": ["C1","C2","C3","C4","C5"]})
y = np.array([0,0,1,1,1])

#Define the SGDClassifier
sgd = SGDClassifier()

Below, I only added the __init__ step:
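# My custom Transformer

class TextExtractor(BaseEstimator, TransformerMixin):
    """Concat the 'title', 'body' and 'code' from the results of
    Stackoverflow query
    Keys are 'title', 'body' and 'code'.
    """

    def __init__(self, weight=None):
        # store the argument unchanged so that get_params() and
        # set_params() (used by GridSearchCV) can read and update it
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # fall back to the default weights when none were given
        weight = self.weight
        if weight is None:
            weight = {'title': 10, 'body': 1, 'code': 1}

        # build the text without modifying the input DataFrame
        return (weight['title'] * x['Title']
                + weight['body'] * x['Body']
                + weight['code'] * x['Code'])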

Note that I gave the parameter a default value in case you do not specify one. That is up to you. You can then call your transformer as follows:
textextractor = TextExtractor(weight = {'title' : 5, 'body': 2, 'code' : 1})
textextractor.transform(X)

This should return:
0    T1T1T1T1T1B1B1C1
1    T2T2T2T2T2B2B2C2
2    T3T3T3T3T3B3B3C3
3    T4T4T4T4T4B4B4C4
4    T5T5T5T5T5B5B5C5
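
The reason an __init__ argument is enough: BaseEstimator derives get_params() and set_params() from the constructor signature, and grid search relies on those two methods to read and replace parameters. Here is a minimal sketch of that mechanism. WeightHolder is a hypothetical stand-in class used only for illustration, not part of the answer's code, and the mixin-first base order is my choice rather than the answer's.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


# Mixin listed before BaseEstimator (an assumption of this sketch; the
# answer's class uses the opposite order).
class WeightHolder(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for TextExtractor; it only stores `weight`."""

    def __init__(self, weight=None):
        # keep the argument under the same attribute name, unchanged
        self.weight = weight

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


holder = WeightHolder(weight={'title': 5, 'body': 2, 'code': 1})

# get_params() is built from the __init__ signature
params = holder.get_params()

# set_params() updates the attribute and returns the estimator itself
holder.set_params(weight={'title': 1, 'body': 1, 'code': 1})

# clone() builds a new, unfitted estimator with a copy of the parameters
duplicate = clone(holder)
```

If the attribute name differed from the argument name (say self.w = weight), get_params() would fail to find it, which is why the argument is stored as self.weight.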

Then you can define your parameter grid:
param_grid = {
    'vectorizer__min_df' : [0.1],
    'vectorizer__max_df' : [0.9],
    'vectorizer__max_features': [200],
    # here is the parameter I want to pass to my transformer
    'textextractor__weight' : [{'title' : 10, 'body': 1, 'code' : 1},
                               {'title' : 1, 'body': 1, 'code' : 1}]
}

Finally, do:
for g in ParameterGrid(param_grid):
    classifier_pipe = Pipeline(
        steps=[
            ('textextractor', TextExtractor(weight=g['textextractor__weight'])),
            ('vectorizer', TfidfVectorizer(max_df=g['vectorizer__max_df'],
                                           min_df=g['vectorizer__min_df'],
                                           max_features=g['vectorizer__max_features'])),
            ('clf', sgd),
        ]
    )
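
Because the grid keys already use the step__parameter form, each combination g can also be applied to a pipeline in a single call with set_params(**g), instead of unpacking the keys by hand as in the loop above. A minimal sketch of that variant follows. Several things here are assumptions made so the block runs on its own: TextExtractor is restated with a None default, str.repeat(n) stands in for multiplying a string column by n, the fitted list is a hypothetical helper, and random_state is fixed.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline


class TextExtractor(TransformerMixin, BaseEstimator):
    """Restated from the answer so that this block is self-contained."""

    def __init__(self, weight=None):
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        w = self.weight
        if w is None:
            w = {'title': 10, 'body': 1, 'code': 1}
        # str.repeat(n) repeats every string in the column n times
        return (x['Title'].str.repeat(w['title'])
                + x['Body'].str.repeat(w['body'])
                + x['Code'].str.repeat(w['code']))


# Same fake data as in the answer
X = pd.DataFrame({
    "Title": ["T1", "T2", "T3", "T4", "T5"],
    "Body": ["B1", "B2", "B3", "B4", "B5"],
    "Code": ["C1", "C2", "C3", "C4", "C5"],
})
y = np.array([0, 0, 1, 1, 1])

param_grid = {
    'vectorizer__min_df': [0.1],
    'vectorizer__max_df': [0.9],
    'vectorizer__max_features': [200],
    'textextractor__weight': [{'title': 10, 'body': 1, 'code': 1},
                              {'title': 1, 'body': 1, 'code': 1}],
}

base_pipe = Pipeline(steps=[
    ('textextractor', TextExtractor()),
    ('vectorizer', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])

fitted = []  # hypothetical container for (parameters, fitted pipeline) pairs
for g in ParameterGrid(param_grid):
    # clone() gives a fresh, unfitted copy; set_params(**g) routes each
    # 'step__parameter' key to the matching step
    pipe = clone(base_pipe).set_params(**g)
    pipe.fit(X, y)
    fitted.append((g, pipe))
```

Cloning the base pipeline for each combination keeps the fitted pipelines independent of one another, which is roughly what GridSearchCV does internally for every candidate.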

Instead, you may want to run a grid search, in which case you would write:
pipe = Pipeline(
    steps=[
        ('textextractor', TextExtractor()),
        ('vectorizer', TfidfVectorizer()),
        ('clf', sgd),
    ]
)
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)
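
Once the search has been fitted, the selected weights can be read back from the search object. The sketch below illustrates this under several assumptions that are not in the original answer: TextExtractor is restated, a space is appended before repeating so that words stay separate tokens, the 12-row sample is made up so that 3-fold cross-validation has enough rows per class, and random_state is fixed.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class TextExtractor(TransformerMixin, BaseEstimator):
    """Restated from the answer so that this block is self-contained."""

    def __init__(self, weight=None):
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        w = self.weight
        if w is None:
            w = {'title': 10, 'body': 1, 'code': 1}
        # Assumption: add a trailing space so repeated words stay separate
        return ((x['Title'] + ' ').str.repeat(w['title'])
                + (x['Body'] + ' ').str.repeat(w['body'])
                + (x['Code'] + ' ').str.repeat(w['code']))


# Made-up sample: six "python" questions (class 0), six "sql" questions (class 1)
X = pd.DataFrame({
    "Title": ["python loop", "python list", "python dict",
              "python class", "python string", "python file",
              "sql join", "sql index", "sql table",
              "sql query", "sql view", "sql key"],
    "Body": ["how to iterate"] * 6 + ["how to select"] * 6,
    "Code": ["for x in y"] * 6 + ["select a from b"] * 6,
})
y = np.array([0] * 6 + [1] * 6)

param_grid = {
    'vectorizer__min_df': [0.1],
    'vectorizer__max_df': [0.9],
    'vectorizer__max_features': [200],
    'textextractor__weight': [{'title': 10, 'body': 1, 'code': 1},
                              {'title': 1, 'body': 1, 'code': 1}],
}

pipe = Pipeline(steps=[
    ('textextractor', TextExtractor()),
    ('vectorizer', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)

# The winning combination, including the weight dictionary
best_weight = grid.best_params_['textextractor__weight']

# One row per candidate, with its mean cross-validated score
results = pd.DataFrame(grid.cv_results_)[['params', 'mean_test_score']]
```

By default GridSearchCV refits the best pipeline on the full data, so grid.predict(...) uses the selected weights directly.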

Regarding "scikit-learn - Sklearn Pipeline: pass a parameter to a custom Transformer?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55142677/
