python - Which features of xgboost are affected by the seed (random_state)?

The Python API documentation says little about the seed= parameter beyond the fact that it is passed to numpy.random.seed:

seed (int) – Seed used to generate the folds (passed to numpy.random.seed).

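But which features of xgboost actually use numpy.random.seed?

Running xgboost with all default settings still produces the same performance even when the seed is changed.

I have already verified that colsample_bytree uses it: different seeds yield different performance.

I have been told that it is also used by subsample and the other colsample_* features, which seems plausible since any form of sampling requires randomness.

Which other features of xgboost rely on numpy.random.seed?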

Best answer

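Boosted trees are grown sequentially, and the growth of the tree within one iteration is distributed among threads. To avoid overfitting, randomness is introduced through the following parameters: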

colsample_bytree

colsample_bylevel

colsample_bynode

subsample (note the *sample* pattern)

shuffle when creating the CV folds for cross-validation
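The observation in the question (defaults ignore the seed, sampling parameters do not) can be illustrated with a toy model. This is only a sketch using Python's standard random module, not xgboost's actual implementation; the helper sample_indices and its arguments are hypothetical names chosen for this example.

```python
import random

def sample_indices(n_items, fraction, seed):
    """Keep round(fraction * n_items) indices, drawn without replacement
    from a generator seeded with `seed` (hypothetical helper, not xgboost code)."""
    rng = random.Random(seed)
    k = max(1, round(fraction * n_items))
    return sorted(rng.sample(range(n_items), k))

# fraction = 1.0 plays the role of the defaults (subsample=1, colsample_*=1):
# every index is kept, so the seed cannot change the outcome.
full_a = sample_indices(10, 1.0, seed=0)
full_b = sample_indices(10, 1.0, seed=1)

# fraction < 1.0: the kept subset now depends on the seed.
half_a = sample_indices(1000, 0.5, seed=0)
half_b = sample_indices(1000, 0.5, seed=1)
half_a_again = sample_indices(1000, 0.5, seed=0)

print("fraction 1.0, different seeds, same subset:", full_a == full_b)
print("fraction 0.5, different seeds, same subset:", half_a == half_b)
print("fraction 0.5, same seed, same subset:", half_a == half_a_again)
```

The same reasoning applies to rows (subsample), columns (colsample_*) and shuffled CV folds: the seed only matters once something is actually drawn at random.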

In addition, you may encounter non-determinism that is not controlled by the random state in the following places:

[GPU] histogram building is not deterministic due to the nonassociative aspect of floating point summation.

Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm

when using GPU ranking objective, the result is not deterministic due to the non-associative aspect of floating point summation.
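The "non-associative floating point summation" mentioned above can be reproduced in plain Python. This is a minimal sketch of the arithmetic effect only, with no GPU and no xgboost involved; naive_sum is a hypothetical helper that adds values strictly left to right.

```python
def naive_sum(values):
    """Add the values strictly left to right, with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total

# Grouping changes the rounded result of the same three additions.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)
print(grouped_left == grouped_right)

# The same three numbers, accumulated in two different orders.
order_a = naive_sum([1e16, 1.0, -1e16])   # the 1.0 is absorbed by 1e16
order_b = naive_sum([1e16, -1e16, 1.0])   # the large terms cancel first
print(order_a, order_b)
```

When many threads accumulate gradient statistics in whatever order they happen to finish, the order of additions can differ from run to run, so the sums (and therefore the chosen splits) can differ slightly even with a fixed seed.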

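Reply to a comment: how do you know this?
To know this, it helps to:

Be aware of how trees are grown: Demystify Modern Gradient Boosting Trees (its references may be helpful as well)

Scan the full text of the documentation for the terms of interest: random, sample, deterministic, determinism, etc.

Lastly (firstly?), understand why sampling is needed in the first place. Similar cases in counterparts such as bagged trees (RANDOM FORESTS by Leo Breiman) and neural networks (Deep learning with Python by François Chollet, the chapter on overfitting) may also be helpful.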

Source: the Stack Overflow question "python - Which features of xgboost are affected by the seed (random_state)?", https://stackoverflow.com/questions/65523909/