However, I am struggling. I have a column (containing string values), for example:
Sex
Male
Female
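# This is just an example; the real column has many more unique values
Now the problem: I received a new (unique) value, and I can no longer make predictions (for example, 'Neutral' was added). Because I convert the 'Sex' column into dummy variables, the model no longer accepts the input: Number of features of the model must match the input. Model n_features is 2 and input n_features is 3
So my question is: is there a way to make my model robust so that it simply ignores this class and still makes a prediction, just without that specific information?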
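To make the mismatch concrete, here is a minimal sketch. It assumes the dummy columns are produced with pd.get_dummies (the question does not say which function was used), and the two small frames are made up for illustration:

```python
import pandas as pd

# Hypothetical training data: only two categories were seen.
train = pd.DataFrame({"Sex": ["Male", "Female", "Female"]})
# Hypothetical new data: a third category shows up.
new = pd.DataFrame({"Sex": ["Male", "Female", "Neutral"]})

# get_dummies creates one column per category present in *that* frame,
# so the two results end up with different numbers of columns.
train_dummies = pd.get_dummies(train, columns=["Sex"])
new_dummies = pd.get_dummies(new, columns=["Sex"])

print(list(train_dummies.columns))  # ['Sex_Female', 'Sex_Male']
print(list(new_dummies.columns))    # ['Sex_Female', 'Sex_Male', 'Sex_Neutral']
```

A model fitted on the two training columns would then be handed three columns, which is the situation the error message describes.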
What I tried:
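import pickle
import pandas as pd

df = pd.read_csv('dataset_that_i_want_to_predict.csv')
# Load the trained model; the with-block closes the file afterwards
with open("model_trained.sav", 'rb') as f:
    model = pickle.load(f)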
# I have an 'example_df' containing just 1 row of training data (this is exactly what the model needs)
example_df = pd.read_csv('reading_one_row_of_trainings_data.csv')
# Checking for missing columns, and adding that to the new dataset
missing_cols = set(example_df.columns) - set(df.columns)
for column in missing_cols:
    df[column] = 0  # adding the missing columns with 0 values (which is OK, since everything is a dummy)
# make sure that we have the same order
df = df[example_df.columns]
# The prediction will lead to an error!
results = model.predict(df)
# ValueError: Number of features of the model must match the input. Model n_features is X and n_features is Y
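Please note that I searched but could not find any useful solution (not here, here, or here).
Update: I also found this article. But it has the same problem... we can make the test set have the same columns as the training set... but what about new real-world data (e.g. the new value 'Neutral')?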
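Best answer
Right: once training is done, you cannot add a new category or feature to the dataset without updating the model. OneHotEncoder, however, can handle new categories that appear in a feature of the test data.
With handle_unknown='ignore', it keeps the columns of the training and test data consistent with respect to the categorical variables.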
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True)
df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(np.random.choice(['yes', 'no'], (20,)))
model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                     remainder='passthrough')),
                  ('lr', LogisticRegression())])
model.fit(df, target)
# let us introduce new categories in feature_2 in the test data
test_df = pd.DataFrame({'feature_1': np.random.rand(20),
                        'feature_2': np.random.choice(['male', 'female', 'neutral', 'unknown'], (20,))})
model.predict(test_df)
# Example output (varies from run to run, since the data is random):
# array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#        'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#        'yes', 'yes'], dtype=object)
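Since the question is about a pickled model, it may help to see what the encoder does with an unseen value and that the behaviour survives pickling. The following is only a sketch under assumptions: the 'Age' column, the toy data, and the in-memory pickle round trip (instead of a .sav file) are made up for illustration.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# 1) With handle_unknown='ignore', an unseen category becomes an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(pd.DataFrame({"Sex": ["Male", "Female"]}))
encoded = encoder.transform(pd.DataFrame({"Sex": ["Male", "Neutral"]})).toarray()
print(encoded)
# [[0. 1.]
#  [0. 0.]]

# 2) Pickle the whole pipeline, so the fitted encoder travels with the model.
rng = np.random.default_rng(0)
train = pd.DataFrame({"Age": rng.random(20),            # hypothetical numeric feature
                      "Sex": ["Male", "Female"] * 10})
target = pd.Series(["yes", "no"] * 10)

pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), ["Sex"])],
        remainder="passthrough")),
    ("lr", LogisticRegression()),
])
pipeline.fit(train, target)

# Round trip through pickle in memory (a file would work the same way).
restored = pickle.loads(pickle.dumps(pipeline))

# New data containing a category the pipeline never saw during training.
new = pd.DataFrame({"Age": [0.3, 0.7], "Sex": ["Neutral", "Male"]})
predictions = restored.predict(new)
print(predictions)
```

The point of pickling the pipeline rather than only the classifier is that raw data can be passed straight to predict, and the number of encoded columns is fixed at fit time. For an unseen value, the prediction is then based on the remaining features alone.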
Regarding python - Can we make the ML model (pickle file) more robust by accepting (or ignoring) new features?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64910582/