I have a Pandas DataFrame with some NaN values in a particular column:
1291 NaN
1841 NaN
2049 NaN
Name: some column, dtype: float64
I built the following pipeline to handle it:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

scaler = StandardScaler(with_mean=True)
imputer = SimpleImputer(strategy='median')
logistic = LogisticRegression()
# Impute first so the scaler and the classifier never see NaN
pipe = Pipeline([('imputer', imputer),
                 ('scaler', scaler),
                 ('logistic', logistic)])
Now, when I pass this pipeline to RandomizedSearchCV, I get the following error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
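The actual traceback is much longer; I can post the whole thing in an edit if needed. In any case, I am fairly sure this column is the only one that contains NaN. Moreover, if I switch from SimpleImputer to the (now deprecated) Imputer in the pipeline, the pipeline works fine in my RandomizedSearchCV. I checked the documentation, and SimpleImputer is supposed to behave (almost) exactly like Imputer. What is the difference in behavior? How can I use an imputer in my pipeline without relying on the deprecated Imputer?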
Best answer
SimpleImputer in make_pipeline
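from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# ColumnSelector is not part of scikit-learn; it is assumed here to be a
# user-defined transformer that returns the named DataFrame columns.
preprocess_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ('Handle numeric columns', make_pipeline(
            ColumnSelector(columns=['Amount']),
            SimpleImputer(strategy='constant', fill_value=0),
            StandardScaler()
        )),
        ('Handle categorical data', make_pipeline(
            ColumnSelector(columns=['Type', 'Name', 'Changes']),
            # Treat a single-space string as the missing-value marker
            SimpleImputer(strategy='constant', missing_values=' ', fill_value='missing_value'),
            # scikit-learn >= 1.2 renames this argument to sparse_output
            OneHotEncoder(sparse=False)
        ))
    ])
)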
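The snippet above depends on a ColumnSelector class that is not shown. As a minimal sketch, assuming it is simply a custom transformer that picks DataFrame columns by name (the class body, the toy DataFrame and the variable names below are illustrative assumptions, not the original author's code), the numeric branch could be made runnable like this:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the named DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn, selection is stateless

    def transform(self, X):
        return X[self.columns]


# Toy data (made up for illustration) with one missing Amount
df = pd.DataFrame({"Amount": [1.0, np.nan, 3.0], "Type": ["a", "b", "a"]})

numeric = make_pipeline(
    ColumnSelector(columns=["Amount"]),
    SimpleImputer(strategy="constant", fill_value=0),  # NaN becomes 0
    StandardScaler(),
)
numeric_out = numeric.fit_transform(df)
```

Because the selector inherits from BaseEstimator, it can be cloned by scikit-learn, so it should also work inside a cross-validated search.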
SimpleImputer in a Pipeline
# One step of a larger Pipeline. TypeSelector is not part of scikit-learn;
# it is assumed to be a user-defined transformer that selects columns by dtype.
('features', FeatureUnion([
    ('Numerics', Pipeline([
        ('Numeric Extractor', TypeSelector(np.number)),
        ('Impute Zero', SimpleImputer(strategy="constant", fill_value=0))
    ])),
    ('Cat Columns', Pipeline([
        ('Category Extractor', TypeSelector("category")),
        ('Impute Missing', SimpleImputer(strategy="constant", fill_value='missing'))
    ]))
]))
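To tie this back to the pipeline in the question, here is a small self-contained sketch of SimpleImputer inside a Pipeline passed to RandomizedSearchCV. The synthetic data, the parameter grid and the search settings are assumptions made up for illustration; they are not taken from the question. Note also that SimpleImputer with its default missing_values=np.nan only fills NaN, so it may be worth checking the data for infinite values as well, since the error message covers both.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 200 rows, 3 features, NaN in one column
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.choice(200, size=10, replace=False), 2] = np.nan

# Infinite values are not treated as missing by the default imputer
has_inf = bool(np.isinf(X).any())

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=True)),
    ("logistic", LogisticRegression()),
])

# Illustrative search space; step name + "__" + parameter name
search = RandomizedSearchCV(
    pipe,
    param_distributions={"logistic__C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    n_iter=3,
    cv=3,
    random_state=0,
)
search.fit(X, y)
```

If a search like this runs cleanly on NaN-only data but fails on the real DataFrame, that would suggest looking for infinities or for missing values encoded in some other way.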
This question about scikit-learn, "Sklearn SimpleImputer not working in a pipeline?", was found on Stack Overflow: https://stackoverflow.com/questions/51741873/