python-3.x - Does oversampling happen before or after cross-validation when using an imblearn pipeline?

Tags: python-3.x scikit-learn xgboost imblearn

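I split my data into train/test sets before running cross-validation on the training data to validate my hyperparameters. My dataset is imbalanced and I want to perform SMOTE oversampling in each iteration, so I set up a pipeline using imblearn.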

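My understanding is that oversampling should be done after the data is split into k folds, to prevent information leakage. Is this order of operations (split the data into k folds, oversample the k-1 training folds, predict on the remaining fold) preserved when using Pipeline in the setup below?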

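import numpy as np
import xgboost as xgb
from scipy import stats
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# SMOTE runs as the first pipeline step, followed by the XGBoost classifier
model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', xgb.XGBClassifier())
    ])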


param_dist = {'classification__n_estimators': stats.randint(50, 500),
              'classification__learning_rate': stats.uniform(0.01, 0.3),
              'classification__subsample': stats.uniform(0.3, 0.6),
              'classification__max_depth': [3, 4, 5, 6, 7, 8, 9],
              'classification__colsample_bytree': stats.uniform(0.5, 0.5),
              'classification__min_child_weight': [1, 2, 3, 4],
              # SMOTE's 'ratio' parameter was renamed 'sampling_strategy' in imblearn 0.4
              'sampling__sampling_strategy': np.linspace(0.25, 0.5, 10)
             }

random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)
random_search.fit(X_train.values, y_train)

Best answer

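Your understanding is correct. When you pass the pipeline as the model, each cross-validation split calls .fit() on the k-1 training folds and tests on the k-th fold. The sampling happens inside .fit(), so it is applied only to the training folds. At prediction time the imblearn pipeline skips the sampler step, so the held-out fold is scored without being resampled.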

The documentation for imblearn.pipeline's .fit() says:

Fit the model

Fit all the transforms/samplers one after the other and transform/sample the data, then fit the transformed/sampled data using the final estimator.
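
The order of operations described above can be sketched in plain Python. This is only an illustration, not imblearn's implementation: it assumes simple random duplication as a stand-in for SMOTE, and the helper names (stratified_folds, random_oversample, cv_with_resampling) are hypothetical and do not exist in imblearn or scikit-learn.

```python
import random
from collections import Counter


def stratified_folds(y, n_splits, seed=0):
    """Assign sample indices to n_splits folds, class by class (hypothetical helper)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_splits].append(i)
    return folds


def random_oversample(indices, y, seed=0):
    """Duplicate minority-class indices until classes balance (stand-in for SMOTE)."""
    rng = random.Random(seed)
    counts = Counter(y[i] for i in indices)
    target = max(counts.values())
    resampled = list(indices)
    for label, n in counts.items():
        pool = [i for i in indices if y[i] == label]
        resampled.extend(rng.choice(pool) for _ in range(target - n))
    return resampled


def cv_with_resampling(y, n_splits=5):
    """Yield (resampled training indices, untouched validation indices) per split."""
    folds = stratified_folds(y, n_splits)
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # Resampling sees only the k-1 training folds, never the held-out fold
        yield random_oversample(train_idx, y, seed=k), val_idx


# Toy imbalanced labels: 90 negatives, 10 positives
y = [0] * 90 + [1] * 10
splits = list(cv_with_resampling(y, n_splits=5))
for train_idx, val_idx in splits:
    print(Counter(y[i] for i in train_idx), Counter(y[i] for i in val_idx))
```

In each split the training folds are balanced after resampling, while the validation fold keeps its original class ratio and shares no samples with the resampled training set. Oversampling the whole dataset before splitting would break that guarantee.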

Source: the original Stack Overflow question, "Does oversampling happen before or after cross-validation using imblearn pipelines?": https://stackoverflow.com/questions/56011706/
