python - Why is the accuracy of the GridSearchCV method lower than the standard method?

Tags: python decision-tree grid-search hyperparameters train-test-split

I modeled my data using train_test_split (random_state = 0) and a decision tree without any parameter tuning, and I ran it about 50 times to reach the best accuracy.

import pandas as pd

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the first sheet of the workbook
data = pd.read_excel(r"D:\Laptop.xlsx", sheet_name=0)

# No random_state is passed here, so every run produces a different split
train, test = train_test_split(data, test_size=0.15)
print("Training size: {}; Test size: {}".format(len(train), len(test)))

c = DecisionTreeClassifier()

features = ["Brand", "Size", "CPU", "RAM", "Resolution", "Class"]

x_train = train[features]
y_train = train["K=20"]
x_test = test[features]
y_test = test["K=20"]

# Fit an untuned tree and score it on the held-out test set
c.fit(x_train, y_train)
y_pred = c.predict(x_test)

score = accuracy_score(y_test, y_pred) * 100
print("Accuracy using Decision Tree:", round(score, 1), "%")

As a second step, I decided to use the GridSearchCV method to set the tree's parameters.

import pandas as pd

from scipy.stats import randint
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the first sheet of the workbook
data = pd.read_excel(r"D:\Laptop.xlsx", sheet_name=0)

# Fixed random_state, so the split is identical on every run
train, test = train_test_split(data, test_size=0.15, random_state=0)
print("Training size: {}; Test size: {}".format(len(train), len(test)))

features = ["Brand", "Size", "CPU", "RAM", "Resolution", "Class"]

x_train = train[features]
y_train = train["K=20"]
x_test = test[features]
y_test = test["K=20"]

# max_depth is sampled from the list, min_samples_leaf from the integers 10..59
param_dist = {"max_depth": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
              "min_samples_leaf": randint(10, 60)}

# Note: this runs RandomizedSearchCV (default n_iter=10), not GridSearchCV
tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(x_train, y_train)

print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is: {}".format(tree_cv.best_score_))

# Predict with the best estimator, refit on the full training set
y_pred = tree_cv.predict(x_test)

score = accuracy_score(y_test, y_pred) * 100
print("Accuracy using Decision Tree:", round(score, 1), "%")

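The best accuracy from my first method is much better than the accuracy from the GridSearchCV method.

Why is this happening?

Do you know the best way to get the best tree with the highest accuracy?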

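Best answer

Why is this happening?

Without seeing your code, I can only speculate. It may come down to the granularity of your grid. If you are trying 50 combinations out of billions of possible ones, the search covers too little of the space to be meaningful. Is there a way to narrow down the parameters you are searching over?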
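
To make the granularity point concrete, here is a minimal sketch that counts the candidates in the question's search space. It assumes the same two parameters as the question's param_dist, with the randint(10, 60) distribution written out as the integers 10 through 59. RandomizedSearchCV draws only n_iter candidates, and its default n_iter is 10, so the search in the question visits a small fraction of this space.

```python
from sklearn.model_selection import ParameterGrid

# Same ranges as the question's param_dist; randint(10, 60) covers 10..59.
param_grid = {
    "max_depth": list(range(10, 21)),
    "min_samples_leaf": list(range(10, 60)),
}

# ParameterGrid enumerates the full Cartesian product of the lists above.
n_combinations = len(ParameterGrid(param_grid))
default_n_iter = 10  # RandomizedSearchCV's default number of sampled candidates

print("Possible combinations:", n_combinations)
print("Fraction sampled by default: {:.1%}".format(default_n_iter / n_combinations))
```

Raising n_iter, or shrinking the ranges to values that make sense for the dataset, are two ways to cover more of the space.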

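Do you know the best way to get the best tree with the highest accuracy?

That is a hard question, because you first need to define accuracy. You can build a model that overfits the test data. Technically, the way to get the best tree is to try every possible combination of hyperparameters, but for any reasonable number of parameters that would take forever. Usually your best bet is a Bayesian approach to searching the hyperparameter space, but you will get back a distribution for each parameter. My suggestion is to start with RandomSearch rather than GridSearch. If you are a big fan of Skopt, you can use BayesSearch. I recommend reading the code, because I think the documentation is sparse.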

import xgboost as xgb
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold

# SETTINGS - CHANGE THESE TO GET SOMETHING MEANINGFUL
ITERATIONS = 10 # 1000
TRAINING_SIZE = 100000 # 20000000
TEST_SIZE = 25000

# Classifier
bayes_cv_tuner = BayesSearchCV(
    estimator = xgb.XGBClassifier(
        n_jobs = 1,
        objective = 'binary:logistic',
        eval_metric = 'auc',
        verbosity = 0,  # replaces the deprecated silent=1
        tree_method='approx'
    ),
    # Each tuple is (low, high[, prior]); integer bounds give an integer dimension
    search_spaces = {
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'min_child_weight': (0, 5),  # the key was listed twice; the later value (0, 5) was the effective one
        'max_depth': (0, 50),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        'n_estimators': (50, 100),
        'scale_pos_weight': (1e-6, 500, 'log-uniform')
    },
    scoring = 'roc_auc',
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=42
    ),
    n_jobs = 3,
    n_iter = ITERATIONS,
    verbose = 0,
    refit = True,
    random_state = 42
)

# X and y are not defined in this snippet; they are assumed to be a pandas
# DataFrame of features and a Series of binary labels
result = bayes_cv_tuner.fit(X.values, y.values)

Skopt: https://scikit-optimize.github.io/

Code: https://github.com/scikit-optimize/scikit-optimize/blob/master/skopt/searchcv.py
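
As a starting point for the RandomSearch suggestion, here is a minimal sketch that compares an untuned tree with a randomized search under the same cross-validation. The synthetic data from make_classification, the parameter ranges, and n_iter=50 are assumptions chosen for illustration, not values taken from the question's dataset. Both models are scored with 5-fold cross-validation on the training split, and the test split is used only once at the end, which avoids picking a model because it happened to do well on one particular test set.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    RandomizedSearchCV,
    cross_val_score,
    train_test_split,
)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the question's Laptop.xlsx is not available).
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)

# Baseline: untuned tree, scored by cross-validation on the training split.
baseline = DecisionTreeClassifier(random_state=0)
baseline_cv = cross_val_score(baseline, X_train, y_train, cv=5).mean()

# Randomized search over a space that also contains unlimited depth and small leaves.
param_dist = {
    "max_depth": [None] + list(range(2, 21)),
    "min_samples_leaf": randint(1, 60),  # integers 1..59
}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_dist,
    n_iter=50,
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)

print("Baseline CV accuracy: {:.3f}".format(baseline_cv))
print("Tuned CV accuracy:    {:.3f}".format(search.best_score_))
print("Tuned parameters:", search.best_params_)
# The test split is touched only once, after model selection is finished.
print("Tuned test accuracy:  {:.3f}".format(search.score(X_test, y_test)))
```

The question's space forces min_samples_leaf to be at least 10, while an untuned tree uses min_samples_leaf=1. Widening the space so that it includes settings near the defaults gives the search a chance to find them, although a randomized search is not guaranteed to sample any particular setting.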

Regarding "python - Why is the accuracy of the GridSearchCV method lower than the standard method?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57003251/
