python - Can we make an ML model (pickle file) more robust by accepting (or ignoring) new features?

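I trained an ML model and stored it in a pickle file.

In my new script, I read new "real-world data" that I want to make predictions on.

However, I am struggling. I have a column (containing string values), for example: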

Sex       
Male       
Female
# This is just an example; the real column has many more unique values

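Now here is the problem. I received a new (unique) value, for example 'Neutral' was added, and now I can no longer make predictions.
Because I convert the 'Sex' column into dummy variables, my model no longer accepts the input: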

Number of features of the model must match the input. Model n_features is 2 and input n_features is 3
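
The mismatch can be reproduced with a minimal sketch. It assumes the dummy variables are created with pd.get_dummies (the question does not show that step), and the two small DataFrames are hypothetical stand-ins for the training data and the new data:

```python
import pandas as pd

# Hypothetical stand-ins for the training data and the new real-world data.
train = pd.DataFrame({"Sex": ["Male", "Female", "Female"]})
new = pd.DataFrame({"Sex": ["Male", "Neutral", "Female"]})

# Assumption: dummies are built with pd.get_dummies on each dataset separately.
train_dummies = pd.get_dummies(train, columns=["Sex"])
new_dummies = pd.get_dummies(new, columns=["Sex"])

# The unseen value 'Neutral' produces an extra column in the new data.
print(list(train_dummies.columns))  # ['Sex_Female', 'Sex_Male']
print(list(new_dummies.columns))    # ['Sex_Female', 'Sex_Male', 'Sex_Neutral']
```

A model fitted on the two training columns then sees three input columns at prediction time, which is what the error message reports.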

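Hence my question: is there a way to make my model robust so that it simply ignores this class and still makes a prediction without that specific information?
What I have tried: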

import pickle

import pandas as pd

df = pd.read_csv('dataset_that_i_want_to_predict.csv')
with open("model_trained.sav", 'rb') as f:
    model = pickle.load(f)

# I have an 'example_df' containing just 1 row of training data (this is exactly what the model needs)
example_df = pd.read_csv('reading_one_row_of_trainings_data.csv')

# Checking for missing columns, and adding that to the new dataset 
missing_cols = set(example_df.columns) - set(df.columns)
for column in missing_cols:
    df[column] = 0  # add the missing column filled with 0 (fine, since every column is a dummy)

# make sure that we have the same order 
df = df[example_df.columns] 

# The prediction will lead to an error!
results = model.predict(df)

# ValueError: Number of features of the model must match the input. Model n_features is X and n_features is Y

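Note that I searched but could not find any helpful solution (not here, here or here).

Update
I also found this article. But it is the same issue: we can make the test set have the same columns as the training set, but what about new real-world data (for example, the new value 'Neutral')?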

Best answer

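Right: once training is done, you cannot add a new category or feature to the data without updating (retraining) the model. However, OneHotEncoder can handle new categories that show up in a feature of the test data.
With handle_unknown='ignore' it keeps the columns of the training and test data consistent for categorical variables: a category it did not see during fit is encoded as all zeros instead of creating a new column.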

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True)
df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(np.random.choice(['yes', 'no'], (20,)))

model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                       remainder='passthrough')),
                  ('lr', LogisticRegression())])

model.fit(df, target)

# let us introduce new categories in feature_2 in test data
test_df = pd.DataFrame({'feature_1': np.random.rand(20),
                        'feature_2': np.random.choice(['male', 'female', 'neutral', 'unknown'], (20,))})
model.predict(test_df)
# array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes'], dtype=object)
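
Since the question loads the model from a pickle file, the same idea can be sketched end to end: pickle the whole fitted pipeline, so the encoder travels with the model. This is only a sketch under assumptions. The 'Age' column, the toy data and all variable names are hypothetical, pickle.dumps/pickle.loads stand in for writing and reading "model_trained.sav", and sparse_threshold=0 is set only so the encoded output is a dense array that is easy to inspect.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data; 'Sex' follows the question, 'Age' is made up.
train = pd.DataFrame({"Age": [22, 35, 47, 51, 29, 63],
                      "Sex": ["Male", "Female", "Female", "Male", "Female", "Male"]})
target = pd.Series(["no", "yes", "yes", "no", "no", "yes"])

pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), ["Sex"])],
        remainder="passthrough",
        sparse_threshold=0)),  # always return a dense array
    ("lr", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train, target)

# Pickle the whole pipeline (in memory here, instead of a .sav file).
blob = pickle.dumps(pipeline)
loaded = pickle.loads(blob)

# New data containing a category that was not present during training.
new = pd.DataFrame({"Age": [40, 40], "Sex": ["Male", "Neutral"]})

# Encoded columns are [Sex_Female, Sex_Male, Age];
# the 'Neutral' row gets zeros in both one-hot columns.
encoded = loaded.named_steps["preprocess"].transform(new)
print(encoded)

# Prediction works without a feature-count error.
predictions = loaded.predict(new)
print(predictions)
```

With this layout the prediction script passes the raw columns to the loaded pipeline and does not build dummies itself, so the number of features cannot drift from what the model was trained on.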

Regarding "python - Can we make an ML model (pickle file) more robust by accepting (or ignoring) new features?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/64910582/
