python - How to track categorical indices for CatBoost with an sklearn pipeline

Tags: python scikit-learn catboost

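I want to keep track of the categorical feature indices in an sklearn pipeline so that I can pass them to CatBoostClassifier.

I start with a set of categorical features before calling fit() on the pipeline. The pipeline itself changes the structure of the data and drops features in the feature selection step.

How can I know in advance which categorical features will be removed from or added to the pipeline? I need the updated list of indices when I call fit(). The problem is that my dataset may change after the transformation.

Here is an example of my dataframe: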

import numpy as np
import pandas as pd

data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', np.nan, 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, np.nan, 2, 3, 5, 4],
                     'salary':   [90., 24, np.nan, 27, 32, 59, 36, 27],
                     'gender':   ['male', 'male', 'male', 'male', 'male', 'male', 'male', 'male'],
                     'happy':    [0, 1, 1, 0, 1, 1, 0, 0]})

categorical_features = ['pet', 'gender']
numerical_features = ['children', 'salary']
target = 'happy'

print(data)

     pet    children    salary  gender  happy
0    cat    4.0         90.0    male    0
1    dog    6.0         24.0    male    1
2    dog    3.0         NaN     male    1
3    fish   NaN         27.0    male    0
4    NaN    2.0         32.0    male    1
5    dog    3.0         59.0    male    1
6    cat    5.0         36.0    male    0
7    fish   4.0         27.0    male    0

Now I want to run a pipeline with several steps. One of these steps is VarianceThreshold(), which in my case causes "gender" to be dropped from the dataframe.

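from catboost import CatBoostClassifier
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

X, y = data.drop(columns=[target]), data[target]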

pipeline = Pipeline(steps=[
    (
        'preprocessing',
        ColumnTransformer(transformers=[
            (
                'categoricals',
                Pipeline(steps=[
                    ('fillna_with_frequent', SimpleImputer(strategy='most_frequent')),
                    ('ordinal_encoder', OrdinalEncoder())
                ]),
                categorical_features
            ),
            (
                'numericals',
                Pipeline(steps=[
                    ('fillna_with_mean', SimpleImputer(strategy='mean'))
                ]),
                numerical_features
            )
        ])
    ),
    (
        'feature_selection',
        VarianceThreshold()
    ),
    (
        'estimator',
        CatBoostClassifier()
    )
])

Now, when I try to get the list of categorical feature indices for CatBoost, I have no way of telling that "gender" is no longer part of my dataframe.

cat_features = [data.columns.get_loc(col) for col in categorical_features]
print(cat_features)
[0, 3]

Indices 0 and 3 are wrong, because after VarianceThreshold feature 3 (gender) will have been removed.

pipeline.fit(X, y, estimator__cat_features=cat_features)
---------------------------------------------------------------------------
CatBoostError                             Traceback (most recent call last)
<ipython-input-230-527766a70b4d> in <module>
----> 1 pipeline.fit(X, y, estimator__cat_features=cat_features)

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    265         Xt, fit_params = self._fit(X, y, **fit_params)
    266         if self._final_estimator is not None:
--> 267             self._final_estimator.fit(Xt, y, **fit_params)
    268         return self
    269 

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in fit(self, X, y, cat_features, sample_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
   2801         self._fit(X, y, cat_features, None, sample_weight, None, None, None, None, baseline, use_best_model,
   2802                   eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period,
-> 2803                   silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
   2804         return self
   2805 

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in _fit(self, X, y, cat_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
   1231         _check_train_params(params)
   1232 
-> 1233         train_pool = _build_train_pool(X, y, cat_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, column_description)
   1234         if train_pool.is_empty_:
   1235             raise CatBoostError("X is empty.")

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in _build_train_pool(X, y, cat_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, column_description)
    689             raise CatBoostError("y has not initialized in fit(): X is not catboost.Pool object, y must be not None in fit().")
    690         train_pool = Pool(X, y, cat_features=cat_features, pairs=pairs, weight=sample_weight, group_id=group_id,
--> 691                           group_weight=group_weight, subgroup_id=subgroup_id, pairs_weight=pairs_weight, baseline=baseline)
    692     return train_pool
    693 

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in __init__(self, data, label, cat_features, column_description, pairs, delimiter, has_header, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
    318                         )
    319 
--> 320                 self._init(data, label, cat_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names)
    321         super(Pool, self).__init__()
    322 

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in _init(self, data, label, cat_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names)
    638             cat_features = _get_cat_features_indices(cat_features, feature_names)
    639             self._check_cf_type(cat_features)
--> 640             self._check_cf_value(cat_features, features_count)
    641         if pairs is not None:
    642             self._check_pairs_type(pairs)

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in _check_cf_value(self, cat_features, features_count)
    360                 raise CatBoostError("Invalid cat_features[{}] = {} value type={}: must be int().".format(indx, feature, type(feature)))
    361             if feature >= features_count:
--> 362                 raise CatBoostError("Invalid cat_features[{}] = {} value: must be < {}.".format(indx, feature, features_count))
    363 
    364     def _check_pairs_type(self, pairs):

CatBoostError: Invalid cat_features[1] = 3 value: must be < 3.

I expected cat_features to be [0], but the actual output is [0, 3].
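The mismatch can be reproduced without CatBoost by running only the preprocessing and feature selection steps. The sketch below rebuilds the question's data and transformers; the variable names `order_after_preprocessing` and `kept_mask` are illustrative and not part of the original post. It assumes the ColumnTransformer has no `remainder` passthrough, so its output columns follow the order in which the transformers are listed rather than the original DataFrame order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

categorical_features = ['pet', 'gender']
numerical_features = ['children', 'salary']

# Same toy data as in the question; object dtype keeps strings and NaN as-is.
X = pd.DataFrame({
    'pet': pd.Series(['cat', 'dog', 'dog', 'fish', np.nan, 'dog', 'cat', 'fish'],
                     dtype=object),
    'children': [4., 6, 3, np.nan, 2, 3, 5, 4],
    'salary': [90., 24, np.nan, 27, 32, 59, 36, 27],
    'gender': pd.Series(['male'] * 8, dtype=object),
})

preprocessing = ColumnTransformer(transformers=[
    ('categoricals',
     Pipeline(steps=[
         ('fillna_with_frequent', SimpleImputer(strategy='most_frequent')),
         ('ordinal_encoder', OrdinalEncoder()),
     ]),
     categorical_features),
    ('numericals', SimpleImputer(strategy='mean'), numerical_features),
])

# The transformed array has categoricals first, then numericals,
# which differs from the column order of the original DataFrame.
order_after_preprocessing = categorical_features + numerical_features

# Fit the selector on the transformed data and ask which columns survive.
selector = VarianceThreshold()
selector.fit(preprocessing.fit_transform(X))
kept_mask = selector.get_support()

for name, kept in zip(order_after_preprocessing, kept_mask):
    print(f"{name}: {'kept' if kept else 'dropped'}")
```

On this data the constant "gender" column has zero variance and is dropped, so the only remaining categorical column, "pet", sits at position 0 of a three-column array. That is why CatBoost rejects index 3.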

Best answer

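You get the error because your current cat_features is derived from the untransformed dataset. To fix this, you have to derive cat_features after the dataset has been transformed. This is how I tracked mine: I fit the transformer to the dataset, retrieved the transformed data and converted it to a pandas dataframe, and then retrieved the categorical indices: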

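from sklearn.preprocessing import MinMaxScaler

# numerical_idx is not defined in the original answer; it is assumed to hold
# the numerical columns of X (for example, numerical_features from the question)
column_transform = ColumnTransformer([('n', MinMaxScaler(), numerical_idx)], remainder='passthrough')
scaled_X = column_transform.fit_transform(X)
new_df = pd.DataFrame(scaled_X)
new_df = new_df.infer_objects()  # converts each column to its most specific dtype
cat_features_new = [new_df.columns.get_loc(col) for col in new_df.select_dtypes(include=['object', 'bool']).columns]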
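The same idea can be applied to the question's own pipeline, where OrdinalEncoder turns the categorical columns numeric and dtype inference no longer identifies them. The sketch below is one possible approach, not the answer author's method: it fits the preprocessing steps on their own and reads the selector's `get_support()` mask. The helper `surviving_cat_indices` is a hypothetical name, and it assumes the ColumnTransformer lists the categorical transformer first, the numerical one second, and uses no `remainder` passthrough.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder


def surviving_cat_indices(preprocessing_pipeline, X, y, categorical, numerical):
    """Hypothetical helper: fit the preprocessing steps and return the
    positions of the categorical columns that survive feature selection."""
    preprocessing_pipeline.fit(X, y)
    # Assumed column order after the ColumnTransformer: categoricals, then numericals.
    names_in = list(categorical) + list(numerical)
    mask = preprocessing_pipeline.named_steps['feature_selection'].get_support()
    names_out = [name for name, keep in zip(names_in, mask) if keep]
    return [i for i, name in enumerate(names_out) if name in categorical]


categorical_features = ['pet', 'gender']
numerical_features = ['children', 'salary']

data = pd.DataFrame({
    'pet': pd.Series(['cat', 'dog', 'dog', 'fish', np.nan, 'dog', 'cat', 'fish'],
                     dtype=object),
    'children': [4., 6, 3, np.nan, 2, 3, 5, 4],
    'salary': [90., 24, np.nan, 27, 32, 59, 36, 27],
    'gender': pd.Series(['male'] * 8, dtype=object),
    'happy': [0, 1, 1, 0, 1, 1, 0, 0],
})
X, y = data.drop(columns=['happy']), data['happy']

# The question's pipeline without its final estimator step.
preprocessing_only = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categoricals',
         Pipeline(steps=[
             ('fillna_with_frequent', SimpleImputer(strategy='most_frequent')),
             ('ordinal_encoder', OrdinalEncoder()),
         ]),
         categorical_features),
        ('numericals', SimpleImputer(strategy='mean'), numerical_features),
    ])),
    ('feature_selection', VarianceThreshold()),
])

cat_features = surviving_cat_indices(
    preprocessing_only, X, y, categorical_features, numerical_features)
print(cat_features)

# The result could then be passed to the full pipeline from the question:
# pipeline.fit(X, y, estimator__cat_features=cat_features)
```

This approach fits the preprocessing steps twice, once to compute the indices and once inside `pipeline.fit()`, which is acceptable for small data but may be costly for large datasets. CatBoost also generally expects categorical columns to hold integers or strings, so the float output of OrdinalEncoder may need casting before the final fit; that step is not shown here.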

Regarding "python - How to track categorical indices for CatBoost with an sklearn pipeline", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56742441/
