python - Getting "ValueError: could not convert string to float: ..." for an sklearn pipeline

Tags: python scikit-learn pipeline

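I'm a beginner trying to learn sklearn pipelines. When I run the code below, I get ValueError: could not convert string to float. I'm not sure what causes it, since OneHotEncoder should have no trouble converting the strings of categorical variables to floats.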

import json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer


df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv', skipinitialspace=True)
x_cols = [c for c in df.columns if c!='income']
X = df[x_cols]
y = df['income']
y = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)

preprocessor = ColumnTransformer(
    transformers=[
        ('imputer', SimpleImputer(strategy='most_frequent'),
         ['workclass', 'education', 'native-country']),
        ('onehot', OneHotEncoder(),
         ['workclass', 'education', 'marital-status', 'occupation',
          'relationship', 'race', 'sex', 'native-country']),
    ]
)

clf = Pipeline([('preprocessor', preprocessor),
                ('classifier', RandomForestClassifier())])
clf.fit(X_train, y_train)

Best answer

Unfortunately, scikit-learn's SimpleImputer runs into problems when it is used to impute string variables. There is an open issue about this on the project's GitHub page.
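A quick check on made-up data (the toy frame below is an assumption, not the adult dataset) hints at where the friction comes from: with strategy='most_frequent' the gaps are filled, but the column comes back as text in an object array, so whatever consumes the output next still receives strings rather than floats.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({"workclass": ["Private", np.nan, "Private", "State-gov"]})

# Fill the gap with the most frequent value in the column.
imputed = SimpleImputer(strategy="most_frequent").fit_transform(toy)

# The result is a NumPy array of Python strings, not numbers.
print(imputed.dtype)
print(imputed.ravel().tolist())
```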

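To work around this, I suggest splitting your pipeline into two steps: 1) one that only replaces the null values, and 2) everything else, as follows: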

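cols_with_null = ['workclass', 'education', 'native-country']

# Step 1: a ColumnTransformer that only imputes the columns listed above.
# With the default remainder='drop', its output holds just these columns.
preprocessor = ColumnTransformer(
    transformers=[
        (
            'imputer',
            SimpleImputer(missing_values=np.nan, strategy='most_frequent'),
            cols_with_null),
    ])

preprocessor.fit(X_train)
X_train_new = preprocessor.transform(X_train)

# Work on a copy to avoid a possible SettingWithCopyWarning from pandas,
# then write the imputed values back column by column.
X_train = X_train.copy()
for icol, col in enumerate(cols_with_null):
    X_train.loc[:, col] = X_train_new[:, icol]

# confirm no null values in these columns:
for col in cols_with_null:
    print('{}, null values: {}'.format(col, pd.isnull(X_train[col]).sum()))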

Now that your X_train has no null values in those columns, the rest should work without SimpleImputer:

preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(),
         ['workclass', 'education', 'marital-status', 'occupation',
          'relationship', 'race', 'sex', 'native-country']),
    ])

clf = Pipeline([('preprocessor', preprocessor),
                ('classifier', RandomForestClassifier())])

clf.fit(X_train, y_train)
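As a possible alternative to imputing by hand (a hedged sketch, not the approach this answer takes): ColumnTransformer applies its transformers side by side to the original columns rather than one after another, so listing an imputer and an encoder as two separate entries does not chain them. Chaining can be done by nesting a Pipeline inside the ColumnTransformer. The toy data, the names X_toy, y_toy, categorical_steps, preprocessor_nested and clf_nested, and the choices handle_unknown='ignore' and remainder='passthrough' are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the adult dataset (values are made up).
X_toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private", np.nan, "Private"],
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "age": [39, 50, 38, 53, 28, 37],
})
y_toy = np.array([0, 1, 0, 1, 0, 1])

categorical_cols = ["workclass", "sex"]

# Inner pipeline: impute first, then one-hot encode the imputed strings.
categorical_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# The inner pipeline is one entry, so both steps run in sequence
# on the same columns. Numeric columns are passed through untouched.
preprocessor_nested = ColumnTransformer(
    transformers=[("categorical", categorical_steps, categorical_cols)],
    remainder="passthrough",
)

clf_nested = Pipeline([
    ("preprocessor", preprocessor_nested),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

clf_nested.fit(X_toy, y_toy)
predictions = clf_nested.predict(X_toy)
```

With this layout the imputer is fitted on the training data only and then reused automatically at predict time, so a test set does not need a separate manual imputation pass.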

This question, python - Getting "ValueError: could not convert string to float: ..." for an sklearn pipeline, comes from Stack Overflow: https://stackoverflow.com/questions/69107032/
