I am trying the random forest algorithm on the Boston dataset to predict house prices (medv) with sklearn's RandomForestRegressor. In all I tried 3 iterations, as follows.
Iteration 1: Using the model with default hyperparameters
#1. import the class/model
from sklearn.ensemble import RandomForestRegressor
#2. Instantiate the estimator
RFReg = RandomForestRegressor(random_state = 1, n_jobs = -1)
#3. Fit the model with data aka model training
RFReg.fit(X_train, y_train)
#4. Predict the response for a new observation
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)
Results for iteration 1
{'RMSE Test': 2.9850839211419435, 'RMSE Train': 1.2291604936401441}
Iteration 2: I used RandomizedSearchCV to get the optimal values of the hyperparameters
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1)
# Note: 'auto' was removed in scikit-learn 1.3; for regressors it meant "use all features"
param_grid = {
    'max_features' : ["auto", "sqrt", "log2"],
    'min_samples_split' : np.linspace(0.1, 1.0, 10),
    'max_depth' : [x for x in range(1,20)]
}
CV_rfc = RandomizedSearchCV(estimator=RFReg, param_distributions =param_grid, n_jobs = -1, cv= 10, n_iter = 50)
CV_rfc.fit(X_train, y_train)
So I got the following best hyperparameters
CV_rfc.best_params_
#{'min_samples_split': 0.1, 'max_features': 'auto', 'max_depth': 18}
CV_rfc.best_score_
#0.8021713812777814
So I trained a new model with these best hyperparameters
#1. import the class/model
from sklearn.ensemble import RandomForestRegressor
#2. Instantiate the estimator
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1, min_samples_split = 0.1, max_features = 'auto', max_depth = 18)
#3. Fit the model with data aka model training
RFReg.fit(X_train, y_train)
#4. Predict the response for a new observation
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)
Results for iteration 2
{'RMSE Test': 3.2836794902147926, 'RMSE Train': 2.71230367772569}
Iteration 3: I used GridSearchCV to get the optimal values of the hyperparameters
from sklearn.ensemble import RandomForestRegressor
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1)
param_grid = {
'max_features' : ["auto", "sqrt", "log2"],
'min_samples_split' : np.linspace(0.1, 1.0, 10),
'max_depth' : [x for x in range(1,20)]
}
from sklearn.model_selection import GridSearchCV
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv= 10, n_jobs = -1)
CV_rfc.fit(X_train, y_train)
So I got the following best hyperparameters
CV_rfc.best_params_
#{'max_depth': 12, 'max_features': 'auto', 'min_samples_split': 0.1}
CV_rfc.best_score_
#0.8021820114800677
Results for iteration 3
{'RMSE Test': 3.283690568225705, 'RMSE Train': 2.712331014201783}
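For a sense of scale, the size of the search in iterations 2 and 3 can be worked out from the param_grid itself. The short sketch below only counts combinations; it does not fit any model, and the variable names are illustrative rather than part of the original code.

```python
from itertools import product

# The three lists from param_grid (min_samples_split written out as plain floats).
max_features = ["auto", "sqrt", "log2"]
min_samples_split = [round(0.1 * i, 1) for i in range(1, 11)]  # 0.1, 0.2, ..., 1.0
max_depth = list(range(1, 20))                                 # 1, 2, ..., 19

candidates = list(product(max_features, min_samples_split, max_depth))
n_candidates = len(candidates)        # 3 * 10 * 19 = 570

cv_folds = 10
grid_fits = n_candidates * cv_folds   # GridSearchCV: 5700 fits of a 500-tree forest
random_fits = 50 * cv_folds           # RandomizedSearchCV with n_iter=50: 500 fits
coverage = 50 / n_candidates          # fraction of the grid that random search samples

print(n_candidates, grid_fits, random_fits, round(coverage, 3))
```

Random search therefore evaluates under a tenth of the grid, yet both searches ended up with min_samples_split = 0.1, the smallest value on offer.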
My function to evaluate RMSE
import numpy as np
from sklearn.metrics import mean_squared_error

def model_evaluate(y_train, y_test, y_pred, y_pred_train):
    # RMSE on the held-out test set
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
    # RMSE on the training set
    rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
    metrics = {
        'RMSE Test': rmse_test,
        'RMSE Train': rmse_train}
    return metrics
So after 3 iterations I have the following questions
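- Why are the results of the tuned model worse than those of the model with default parameters, even when I am using RandomizedSearchCV and GridSearchCV? Ideally the model should give good results when tuned with cross-validation.
- I know that cross-validation only takes place for the combinations of values present in param_grid. There could be values that are good but not included in my param_grid. How do I deal with this kind of situation?
- How do I decide what range of values I should try for max_features, min_samples_split, max_depth, or for that matter any hyperparameter in a machine learning model, to increase its accuracy? (So that I can at least get a tuned model that is better than the model with default hyperparameters.)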
Best answer
Why are the results of the tuned model worse than the model with default parameters even when I am using RandomizedSearchCV and GridSearchCV? Ideally the model should give good results when tuned with cross-validation.
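Your second question partly answers your first, but I tried to reproduce your results on the Boston dataset and got {'test_rmse':3.987, 'train_rmse':1.442} with default parameters, {'test_rmse':3.98, 'train_rmse':3.426} with random search, and {'test_rmse':3.993, 'train_rmse':3.481} with grid search. I then used hyperopt with the following parameter space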
{'max_depth': hp.choice('max_depth', range(1, 100)),
'max_features': hp.choice('max_features', range(1, x_train.shape[1])),
'min_samples_split': hp.uniform('min_samples_split', 0.1, 1)}
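After about 200 runs the results looked like this (plot not reproduced here), so I widened the space to 'min_samples_split', 0.01, 1, which got me a best result of {'test_rmse':3.278, 'train_rmse':1.716} with min_samples_split equal to 0.01. According to the docs, when min_samples_split is a float the minimum number of samples required for a split is ceil(min_samples_split * n_samples), which in our case gives np.ceil(0.1 * len(x_train)) = 34. That may be a bit large for a small dataset like this one.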
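The arithmetic behind that formula can be checked with a few lines of plain Python. This is a minimal sketch, not scikit-learn's own code: the helper name is hypothetical, n_train = 339 is an assumed training-set size chosen to be consistent with the value 34 quoted above, and the clamp to a minimum of 2 reflects my understanding of how scikit-learn treats fractional values.

```python
import math

def effective_min_samples_split(value, n_samples):
    """Sample count a node needs before it may be split (hypothetical helper)."""
    if isinstance(value, int):
        return value                                 # integers are used as-is
    return max(2, math.ceil(value * n_samples))      # floats are a fraction of n_samples

n_train = 339  # assumption: roughly two thirds of the 506 Boston rows

# The ten fractions searched in the question: 0.1, 0.2, ..., 1.0
fractions = [round(0.1 * i, 1) for i in range(1, 11)]
thresholds = [effective_min_samples_split(f, n_train) for f in fractions]

print(thresholds[0])                                   # 34: smallest threshold in the grid
print(effective_min_samples_split(0.01, n_train))      # 4: after widening the space to 0.01
print(effective_min_samples_split(2, n_train))         # 2: the library default
```

Under these assumptions every value in the original grid forces a node to hold at least 34 samples before it can split, so the default of 2 was never among the candidates. That is one plausible reason the tuned models could not match the default model.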
I know that cross-validation will take place only for the combinations of values present in param_grid. There could be values which are good but not included in my param_grid. So how do I deal with this kind of situation?
How do I decide what range of values I should try for max_features, min_samples_split, max_depth, or for that matter any hyperparameter in a machine learning model, to increase its accuracy? (So that I can at least get a better tuned model than the model with default hyperparameters.)
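You can't know this in advance, so you have to research each algorithm to see what kind of parameter space is usually searched (a good source for this is Kaggle, e.g. google "kaggle kernel random forest"), merge those spaces, take the characteristics of your dataset into account, and optimize over them with some kind of Bayesian optimization algorithm (there are multiple existing libraries for this) that tries to choose each new set of parameter values optimally.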
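As a rough illustration of widening the space, here is a sketch that stays within scikit-learn. It is not Bayesian optimization: plain random search is used as a stand-in, the data is synthetic (make_regression, not the Boston dataset), and the ranges are illustrative assumptions rather than recommendations. It shows a space written as integer counts so that it contains the defaults (min_samples_split=2, max_depth=None), scored by RMSE instead of the default R^2, alongside a cross-validated baseline for the untuned model.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic stand-in data (assumption: 13 features, like the Boston dataset).
X_train, y_train = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=1)

# Search space that contains the library defaults.
param_distributions = {
    "max_features": randint(1, X_train.shape[1] + 1),  # 1 .. n_features inclusive
    "min_samples_split": randint(2, 41),               # integer counts, 2 .. 40
    "max_depth": [None] + list(range(2, 21)),          # None means unlimited depth
}

search = RandomizedSearchCV(
    forest,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring="neg_root_mean_squared_error",  # so best_score_ is a (negated) RMSE
    random_state=1,
)
search.fit(X_train, y_train)

# Cross-validated RMSE of the untuned model, measured the same way.
baseline_rmse = -cross_val_score(
    forest, X_train, y_train, cv=3, scoring="neg_root_mean_squared_error"
).mean()
tuned_rmse = -search.best_score_

print(search.best_params_)
print("tuned CV RMSE:", tuned_rmse, "default CV RMSE:", baseline_rmse)
```

Comparing the two cross-validated RMSE values before looking at the test set makes it easier to see whether the search space is helping at all. A library such as hyperopt could replace the random sampling once the space looks sensible.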
Regarding python - hyperparameter tuning in random forest, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53544996/