machine-learning - Putting a custom function into a Sklearn pipeline

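My classification scheme has several steps:

1. SMOTE (Synthetic Minority Over-sampling Technique)
2. Fisher criterion for feature selection
3. Standardization (Z-score normalization)
4. SVC (Support Vector Classifier)

The main parameters to tune in this scheme are the percentile (step 2) and the SVC hyperparameters (step 4), and I want to tune them with a grid search.

The current solution builds a "partial" pipeline covering steps 3 and 4 of the scheme, `clf = Pipeline([('normal', preprocessing.StandardScaler()), ('svc', svm.SVC(class_weight='auto'))])`, and splits the scheme into two parts:

First, tune the percentile of features to keep with a first grid search: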

skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for percentile in percentiles:
        # Fisher returns the indices of the selected features specified by the parameter 'percentile'
        selected_ind = Fisher(X_train, y_train, percentile)
        # Features are columns, so index along axis 1
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_test, y_predict)

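The f1 scores are stored, then averaged over all fold partitions for each percentile, and the percentile with the best CV score is returned. The "percentile for loop" is the inner loop so that the comparison is fair: every percentile sees the same training data (including the synthetic data) in every fold partition.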
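The averaging step described above could be sketched roughly as follows. The function name `best_percentile` and the F1 values are made up for illustration; they are not results from the question.

```python
from statistics import mean


def best_percentile(scores_by_percentile):
    """Return the percentile whose mean F1 across folds is highest."""
    mean_scores = {p: mean(f1s) for p, f1s in scores_by_percentile.items()}
    return max(mean_scores, key=mean_scores.get)


# Hypothetical F1 scores for three folds per percentile (illustration only).
example_scores = {
    10: [0.60, 0.62, 0.58],
    20: [0.70, 0.68, 0.72],
    50: [0.65, 0.66, 0.64],
}
chosen = best_percentile(example_scores)
```

In the loop from the question, each `f1` would be appended to the list stored under its `percentile` key before this function is called.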

Second, once the percentile is fixed, tune the hyperparameters with a second grid search:

skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for parameters in parameter_comb:
        # Select the features based on the tuned percentile
        selected_ind = Fisher(X_train, y_train, best_percentile)
        # Features are columns, so index along axis 1
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_test, y_predict)

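It is done in much the same way, except that we tune the SVC hyperparameters instead of the percentile of features to select.

My questions are:

1. In the current solution, only steps 3 and 4 are part of clf, and steps 1 and 2 are done "manually" in the two nested loops described above. Is there a way to include all four steps in one pipeline and run the whole process at once?
2. If it is acceptable to keep the first nested loop, is it possible (and how) to simplify the second nested loop with a single pipeline such as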
```
clf_all = Pipeline([('smote', SMOTE()),
                    ('fisher', Fisher(percentile=best_percentile)),
                    ('normal', preprocessing.StandardScaler()),
                    ('svc', svm.SVC(class_weight='auto'))])
```
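and then simply tune it with GridSearchCV(clf_all, parameter_comb)?

Note that both SMOTE and Fisher (the ranking criterion) must be applied only to the training data in each fold partition.

Any comments would be greatly appreciated.

SMOTE and Fisher are shown below: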

import numpy as np

def Fscore(X, y, percentile=None):
    # Split the samples by class label (1 = positive, 0 = negative)
    X_pos, X_neg = X[y == 1], X[y == 0]
    X_mean = X.mean(axis=0)
    X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    deno = ((1.0 / (X_pos.shape[0] - 1)) * X_pos.var(axis=0)
            + (1.0 / (X_neg.shape[0] - 1)) * X_neg.var(axis=0))
    num = (X_pos_mean - X_mean) ** 2 + (X_neg_mean - X_mean) ** 2
    F = num / deno
    # Rank the features from highest to lowest score
    sort_F = np.argsort(F)[::-1]
    n_feature = (float(percentile) / 100) * X.shape[1]
    # A slice bound must be an integer
    ind_feature = sort_F[:int(np.ceil(n_feature))]
    return ind_feature

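SMOTE comes from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py, which returns the synthetic data. I modified it to return the original input data stacked with the synthetic data, together with the corresponding labels.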

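def smote(X, y):
    # Count the positive (minority) and negative (majority) samples
    n_pos, n_neg = sum(y == 1), sum(y == 0)
    n_syn = (n_neg - n_pos) / float(n_pos)
    X_pos = X[y == 1]
    # SMOTE is the function from the smote.py module linked above
    X_syn = SMOTE(X_pos, int(round(n_syn)) * 100, 5)
    y_syn = np.ones(X_syn.shape[0])
    X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
    return X, y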

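Best answer

scikit-learn added FunctionTransformer to its preprocessing module in version 0.17. It is used in much the same way as David's implementation of the Fisher class in the answer above, but it is less flexible. If the function's input and output are set up correctly, the transformer provides the fit/transform/fit_transform methods for the function, which allows it to be used in a scikit-learn pipeline.

With FunctionTransformer, for example, if the input to the pipeline is a series, the transformer would look like this: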
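For comparison, the class-based route can be sketched as below. This is a minimal sketch and not David's implementation: the class name `FisherSelector`, the synthetic data from `make_classification`, and the grid values are assumptions made for illustration. The scoring mirrors the Fscore function from the question, and `class_weight="balanced"` stands in for the older `'auto'` value, which later scikit-learn releases no longer accept. SMOTE is left out because, as far as I know, a step in a plain scikit-learn Pipeline can only transform X and cannot add samples or labels.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class FisherSelector(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: keep the top `percentile` percent of features."""

    def __init__(self, percentile=50):
        self.percentile = percentile

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        X_pos, X_neg = X[y == 1], X[y == 0]
        X_mean = X.mean(axis=0)
        # Same numerator and denominator as the Fscore function in the question
        num = (X_pos.mean(axis=0) - X_mean) ** 2 + (X_neg.mean(axis=0) - X_mean) ** 2
        deno = (X_pos.var(axis=0) / (X_pos.shape[0] - 1)
                + X_neg.var(axis=0) / (X_neg.shape[0] - 1))
        scores = num / deno
        n_keep = max(1, int(np.ceil(self.percentile / 100.0 * X.shape[1])))
        # Indices of the highest-scoring features, learned from training data only
        self.selected_ = np.argsort(scores)[::-1][:n_keep]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.selected_]


pipe = Pipeline([
    ("fisher", FisherSelector()),
    ("normal", StandardScaler()),
    ("svc", SVC(class_weight="balanced")),
])

# Illustrative grid: the percentile and the SVC hyperparameters are tuned together
param_grid = {
    "fisher__percentile": [25, 50, 100],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1],
}

# Synthetic, imbalanced two-class data (assumption, for demonstration only)
X, y = make_classification(n_samples=120, n_features=12, n_informative=4,
                           weights=[0.7, 0.3], random_state=0)

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=StratifiedKFold(n_splits=3))
search.fit(X, y)
```

Because GridSearchCV refits the whole pipeline on each fold's training split, the feature ranking never sees the held-out data, which is the constraint stated in the question.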


from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def trans_func(input_series):
    # Placeholder: compute output_series from input_series here
    return output_series

transformer = FunctionTransformer(trans_func)

sk_pipe = Pipeline([("trans", transformer), ("vect", tf_1k), ("clf", clf_1k)])
sk_pipe.fit(train.desc, train.tag)

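Here vect is a tf-idf transformer, clf is a classifier, and train is the training dataset. "train.desc" is the series of text that is fed into the pipeline.

This question, "machine-learning - Putting a custom function into a Sklearn pipeline", comes from Stack Overflow: https://stackoverflow.com/questions/31259891/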
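A self-contained variant of the same idea might look like the sketch below. The function `drop_digits`, the four toy documents, their tags, and the choice of LogisticRegression are all assumptions for illustration; the original answer does not say what `trans_func`, `tf_1k`, or `clf_1k` actually were.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def drop_digits(texts):
    """Stateless step: remove every run of digits from each document."""
    return [re.sub(r"\d+", "", text) for text in texts]


# Toy data, made up for this example
docs = [
    "cheap watches for sale 2024",
    "meeting moved to monday",
    "buy 3 cheap watches now",
    "monday lunch meeting at 12",
]
tags = [1, 0, 1, 0]

text_pipe = Pipeline([
    # validate=False lets the function receive the raw list of strings
    ("trans", FunctionTransformer(drop_digits, validate=False)),
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
text_pipe.fit(docs, tags)
predicted = text_pipe.predict(["cheap watches 99"])
```

A plain function works here because the step has nothing to learn from the training data. A step like the Fisher ranking, which must remember which features it selected during fit, needs a class with its own fit and transform methods instead.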
