python - RandomizedSearchCV gives different results with the same random_state

Tags: python machine-learning scikit-learn random-seed grid-search

I am using a pipeline to perform feature selection and hyperparameter optimization with RandomizedSearchCV. Here is a summary of the code:

# sklearn.cross_validation and sklearn.grid_search are the pre-0.18 module paths
# (removed in 0.20); newer releases provide both names in sklearn.model_selection.
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

rng = 44  # one constant seed, passed to every random_state argument below

# `data` and `features` are not defined in this summary
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data['target'], random_state=rng)

clf = RandomForestClassifier(random_state=rng)
kbest = SelectKBest()
pipe = make_pipeline(kbest, clf)

upLim = X_train.shape[1]  # total number of features
param_dist = {
    # integer division keeps the lower bound of k an int on Python 3 as well
    'selectkbest__k': sp_randint(upLim // 2, upLim + 1),
    'randomforestclassifier__n_estimators': sp_randint(5, 150),
    'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
    'randomforestclassifier__criterion': ["gini", "entropy"],
    # 'auto' behaves like 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'randomforestclassifier__max_features': ['auto', 'sqrt', 'log2']}
clf_opt = RandomizedSearchCV(pipe, param_distributions=param_dist,
                             scoring='roc_auc', n_jobs=1, cv=3, random_state=rng)
clf_opt.fit(X_train, y_train)
y_pred = clf_opt.predict(X_test)

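I use a constant random_state for train_test_split, RandomForestClassifier, and RandomizedSearchCV. However, if I run the code above several times, the results differ slightly. More specifically, my code has several unit tests, and these slightly different results cause them to fail. Shouldn't I get the same results, since I use the same random_state? Am I missing something in my code that introduces randomness somewhere?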

Best answer

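I usually end up answering my own questions! I'll leave this here for others who run into a similar problem:

To rule out any remaining randomness, I set a global random seed for NumPy instead of passing random_state to each component. The code is as follows: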

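import numpy as np
# sklearn.cross_validation and sklearn.grid_search are the pre-0.18 module paths
# (removed in 0.20); newer releases provide both names in sklearn.model_selection.
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

# Seed NumPy's global RNG once. np.random.seed returns None, so there is nothing
# to assign; components left at random_state=None draw from this global RNG.
np.random.seed(22)

# `data` and `features` are not defined in this summary
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data['target'])

clf = RandomForestClassifier()
kbest = SelectKBest()
pipe = make_pipeline(kbest, clf)

upLim = X_train.shape[1]  # total number of features
param_dist = {
    # integer division keeps the lower bound of k an int on Python 3 as well
    'selectkbest__k': sp_randint(upLim // 2, upLim + 1),
    'randomforestclassifier__n_estimators': sp_randint(5, 150),
    'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
    'randomforestclassifier__criterion': ["gini", "entropy"],
    # 'auto' behaves like 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'randomforestclassifier__max_features': ['auto', 'sqrt', 'log2']}
clf_opt = RandomizedSearchCV(pipe, param_distributions=param_dist,
                             scoring='roc_auc', n_jobs=1, cv=3)
clf_opt.fit(X_train, y_train)
y_pred = clf_opt.predict(X_test)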
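
A likely reason the global seed helps, stated here as an assumption rather than something the answer confirms: as far as I know, the old sklearn.grid_search sampler called rvs() on scipy.stats distributions without forwarding random_state, so sp_randint drew from NumPy's global RNG no matter what was passed to RandomizedSearchCV. The sketch below only shows the scipy side of that: a frozen distribution is reproducible either with an explicit random_state or after reseeding the global RNG. The variable names are made up for illustration.

```python
import numpy as np
from scipy.stats import randint as sp_randint

# Same distribution the question uses for n_estimators.
dist = sp_randint(5, 150)

# Option 1: pass an explicit seed to every draw.
explicit_a = dist.rvs(size=5, random_state=44)
explicit_b = dist.rvs(size=5, random_state=44)

# Option 2: reseed NumPy's global RNG before each unseeded draw,
# which is what np.random.seed(22) in the answer relies on.
np.random.seed(22)
global_a = dist.rvs(size=5)
np.random.seed(22)
global_b = dist.rvs(size=5)

print("explicit seeds match:", (explicit_a == explicit_b).all())
print("global seeds match:  ", (global_a == global_b).all())
```

Without either option, two consecutive dist.rvs(size=5) calls will generally return different values, which is the kind of run-to-run drift the question describes.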

Hope this helps everyone!
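
For readers on a current scikit-learn, here is a minimal sketch of the same pipeline using sklearn.model_selection, wrapped so that a unit test can check reproducibility by fitting twice. Everything not in the original post is an assumption: the synthetic data from make_classification stands in for the poster's `data`, run_search is a hypothetical helper, and the smaller search space and n_iter=4 are only there to keep the check fast.

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

SEED = 44


def run_search(seed=SEED):
    """Fit the randomized search once; return best params and test predictions."""
    # Synthetic stand-in for the poster's dataset (assumption).
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

    pipe = make_pipeline(SelectKBest(), RandomForestClassifier(random_state=seed))
    up_lim = X_train.shape[1]
    param_dist = {
        "selectkbest__k": sp_randint(up_lim // 2, up_lim + 1),
        "randomforestclassifier__n_estimators": sp_randint(5, 30),
        "randomforestclassifier__max_depth": [5, 10, None],
    }
    search = RandomizedSearchCV(
        pipe,
        param_distributions=param_dist,
        n_iter=4,
        scoring="roc_auc",
        n_jobs=1,
        cv=3,
        random_state=seed,  # seeds the parameter sampling
    )
    search.fit(X_train, y_train)
    return search.best_params_, search.predict(X_test)


# Two independent runs with the same seed, compared the way a unit test would.
first = run_search()
second = run_search()
print("same best params:", first[0] == second[0])
print("same predictions:", (first[1] == second[1]).all())
```

If the two runs disagree in your environment, some component is still drawing from an unseeded source, and seeding the global NumPy RNG as in the answer is a reasonable fallback.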

Regarding python - RandomizedSearchCV gives different results with the same random_state, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41516150/
