python - sklearn.preprocessing.OneHotEncoder: using drop and handle_unknown='ignore'

Tags: python machine-learning scikit-learn

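I have a pandas.Series, s (below), that I want to one-hot encode. Through research I have found that the 'b' level is not important for my predictive modeling task. I can exclude it from my analysis like this: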

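import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2-D input, so reshape the series into a single column
s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)

# Written against an older scikit-learn release; newer releases rename
# sparse to sparse_output and get_feature_names to get_feature_names_out
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='error')
enc.fit_transform(s)
# array([[1., 0.],
#        [0., 0.],
#        [0., 1.]])
enc.get_feature_names()
# array(['x0_a', 'x0_c'], dtype=object)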

But when I go to transform a new series, one containing both 'b' and a new level, 'd', I get an error:

new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
enc.transform(new_s)

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['d'] in column 0 during transform



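This is to be expected, since I set handle_unknown='error' above. However, I would like to completely ignore all classes other than ['a', 'c'], in both the fitting and the subsequent transforming steps. I tried this: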

enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='ignore')
enc.fit_transform(s)
enc.transform(new_s)

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 371, in fit_transform
    self._validate_keywords()
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 289, in _validate_keywords
    "handle_unknown must be 'error' when the drop parameter is "
ValueError: handle_unknown must be 'error' when the drop parameter is specified, as both would create categories that are all zero.



It seems this pattern is not supported in scikit-learn. Does anyone know a scikit-learn-compatible pattern to accomplish this task?

Best answer

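It looks like sklearn.preprocessing.LabelBinarizer could work for this use case, since it has no parameter specifying whether to raise an error on, or ignore, new classes. Unseen classes such as 'd' are simply encoded as a row of zeros: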

>>> import pandas as pd
>>> from sklearn.preprocessing import LabelBinarizer
>>> s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
>>> enc = LabelBinarizer()
>>> enc.fit_transform(s)
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
>>> enc.classes_
array(['a', 'b', 'c'], dtype='<U1')
>>> new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
>>> enc.transform(new_s)
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]])
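
Note that the LabelBinarizer output above still contains a column for 'b'. The following is a minimal sketch, not a definitive recipe, of two ways to end up with only the 'a' and 'c' columns while encoding 'b' and unseen levels such as 'd' as all zeros: masking the unwanted column out of the LabelBinarizer result, or listing the levels to keep in OneHotEncoder's categories parameter together with handle_unknown='ignore' instead of using drop. The names keep, lb, ohe and col_mask are illustrative, and the sketch assumes the levels to keep are known up front.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

s = pd.Series(['a', 'b', 'c']).to_numpy(dtype=object)
new_s = pd.Series(['a', 'b', 'c', 'd']).to_numpy(dtype=object)
keep = ['a', 'c']  # assumption: the levels that should get their own column

# Option 1: binarize every fitted class, then keep only the wanted columns.
lb = LabelBinarizer().fit(s)
col_mask = np.isin(lb.classes_, keep)
lb_encoded = lb.transform(new_s)[:, col_mask]

# Option 2: declare the categories explicitly and ignore everything else.
# The default output is sparse, so toarray() is used to get a dense array.
ohe = OneHotEncoder(categories=[keep], handle_unknown='ignore')
ohe.fit(s.reshape(-1, 1))
ohe_encoded = ohe.transform(new_s.reshape(-1, 1)).toarray()

print(lb_encoded)
print(ohe_encoded)
```

Both results have one row per input value and two columns ('a' and 'c'), with the rows for 'b' and 'd' all zero. Later scikit-learn releases may also accept drop together with handle_unknown='ignore'; it is worth checking the OneHotEncoder documentation for the installed version.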

Regarding python - sklearn.preprocessing.OneHotEncoder: using drop and handle_unknown='ignore', we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60008477/
