python - XGBoost and scikit-optimize: BayesSearchCV and XGBRegressor are incompatible - why?

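I have a very large dataset (7 million rows, 54 features) to which I would like to fit a regression model using XGBoost. To train the best possible model, I want to use BayesSearchCV from scikit-optimize to run the fit repeatedly over different hyperparameter combinations until the best-performing set is found.
For a given set of hyperparameters, XGBoost takes a very long time to train a model. To find the best hyperparameters without spending days on every permutation of training folds, hyperparameters, and so on, I want to multithread both XGBoost and BayesSearchCV. The relevant part of my code looks like this: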

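# xgb_params (the search space), X_train, y_train, X_val and y_val are defined elsewhere.
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from xgboost import XGBRegressor

xgb_pipe = Pipeline([('clf', XGBRegressor(random_state = 42, objective='reg:squarederror', n_jobs = 1))])

# Early stopping is evaluated on a separate validation set.
xgb_fit_params = {'clf__early_stopping_rounds': 5, 'clf__eval_metric': 'mae', 'clf__eval_set': [[X_val.values, y_val.values]]}

# random_state only takes effect with shuffle = True; recent scikit-learn versions raise an error otherwise.
xgb_kfold = KFold(n_splits = 5, shuffle = True, random_state = 42)

xgb_unsm_cv = BayesSearchCV(xgb_pipe, xgb_params, cv = xgb_kfold, n_jobs = 2, n_points = 1, n_iter = 15, random_state = 42, verbose = 4, scoring = 'neg_mean_absolute_error', fit_params = xgb_fit_params)

xgb_unsm_cv.fit(X_train.values, y_train.values)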

However, I found that when n_jobs > 1 in the BayesSearchCV call, the fit crashes and I get the following error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}

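This error persists whenever I use more than one thread in the BayesSearchCV call, regardless of how much memory I provide.
Is this some fundamental incompatibility between XGBoost and scikit-optimize, or can the two packages somehow be made to work together? Without some way of multithreading the optimization, I fear that fitting my model will take weeks. What can I do to fix this?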

Best answer

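I don't think the error is caused by an incompatibility between the libraries. Rather, because you are requesting two separate parallel operations, you are running out of memory: your program tries to hold the full dataset in RAM not once but several times, one copy per parallel instance (depending on the number of threads).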

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}

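Strictly speaking, a segmentation fault is an invalid memory access, not an out-of-memory condition. The SIGKILL(-9) exit code here points to the other cause the message names: the operating system most likely killed the worker for using too much memory.
Note that XGBoost is a RAM-hungry beast, and combining it with another parallel operation is bound to take its toll (personally, I would not recommend doing this on a daily-driver machine).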
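To make the memory argument concrete, here is a back-of-the-envelope sketch using the dataset size from the question. It assumes dense float64 storage and a deliberately simple model of one copy in the parent process plus one per worker; both function names are hypothetical. XGBoost's internal structures, the validation set, and any memory-mapping done by joblib are ignored, so treat the result as a rough order of magnitude only.

```python
def dense_array_bytes(n_rows, n_cols, bytes_per_value=8):
    """Size in bytes of a dense numeric array (float64 by default)."""
    return n_rows * n_cols * bytes_per_value


def rough_peak_bytes(n_rows, n_cols, n_workers, bytes_per_value=8):
    """Assumed model: one copy in the parent process plus one per worker."""
    return (1 + n_workers) * dense_array_bytes(n_rows, n_cols, bytes_per_value)


GIB = 2 ** 30

# Dataset size from the question: 7 million rows, 54 features.
one_copy = dense_array_bytes(7_000_000, 54)
with_two_workers = rough_peak_bytes(7_000_000, 54, n_workers=2)

print(f"one float64 copy:   {one_copy / GIB:.2f} GiB")
print(f"parent + 2 workers: {with_two_workers / GIB:.2f} GiB")
```

Even under these optimistic assumptions, moving from one process to two workers roughly triples the footprint of the raw feature matrix alone.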
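The most workable solution is probably either to use Google's TPUs or some other cloud service (mind the cost), or to shrink the dataset with statistical techniques before processing it, such as those described in this Kaggle notebook and this Data Science StackExchange article.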
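As one illustration of shrinking the data, the sketch below combines two generic steps: drawing a random subsample of rows and downcasting the features from float64 to float32. These are not necessarily the techniques from the linked notebook or article (those links are not reproduced here); the `shrink` helper, the 10% fraction, and the synthetic demo arrays are all assumptions made for the example.

```python
import numpy as np


def shrink(X, y, fraction=0.1, seed=42):
    """Return a random row subsample of (X, y) with X downcast to float32.

    Hypothetical helper: `fraction` is the share of rows to keep.
    """
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(X) * fraction))
    # Sample row indices without replacement and keep the original row order.
    idx = np.sort(rng.choice(len(X), size=n_keep, replace=False))
    return X[idx].astype(np.float32), y[idx]


# Synthetic stand-in for the real data (same number of features, far fewer rows).
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(10_000, 54))
y_demo = demo_rng.normal(size=10_000)

X_small, y_small = shrink(X_demo, y_demo, fraction=0.1)
print(X_demo.nbytes, "->", X_small.nbytes, "bytes")
```

A subsample like this is usually meant for the hyperparameter search only; the final model can still be refit on the full data with the chosen settings, and whether the subsample is representative enough has to be checked for the dataset at hand.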
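The idea is to either upgrade the hardware (a cost in money), run BayesSearchCV single-threaded (a cost in time), or shrink the data with whichever technique suits you best.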
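One way to read the single-threaded option is sketched below: keep BayesSearchCV at n_jobs=1 so that only one fit runs at a time, and let XGBoost use its own threads within that single process. This swaps the two n_jobs values from the question and is a suggestion, not something stated in the original answer. The `build_search` function, the search space, and the choice of `tree_method="hist"` are assumptions for illustration, since the original `xgb_params` is not shown.

```python
def build_search(n_threads=-1):
    """Build a BayesSearchCV that runs one fit at a time.

    Hypothetical helper. Imports are local so that merely defining this
    function does not require the libraries to be installed.
    """
    from sklearn.model_selection import KFold
    from sklearn.pipeline import Pipeline
    from skopt import BayesSearchCV
    from skopt.space import Integer, Real
    from xgboost import XGBRegressor

    pipe = Pipeline([
        ("clf", XGBRegressor(
            random_state=42,
            objective="reg:squarederror",
            tree_method="hist",   # histogram-based tree construction
            n_jobs=n_threads,     # XGBoost threads run inside one process
        )),
    ])

    # Assumed search space; replace with the real xgb_params.
    space = {
        "clf__max_depth": Integer(3, 10),
        "clf__learning_rate": Real(0.01, 0.3, prior="log-uniform"),
    }

    return BayesSearchCV(
        pipe,
        space,
        cv=KFold(n_splits=5, shuffle=True, random_state=42),
        n_jobs=1,       # no extra worker processes for the search itself
        n_points=1,
        n_iter=15,
        random_state=42,
        scoring="neg_mean_absolute_error",
    )
```

Calling `build_search().fit(X_train.values, y_train.values)` would then run the search sequentially. Whether this is faster than the crashing configuration would have been depends on how well XGBoost scales across the available cores.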
In the end, the answer remains that the libraries are most likely compatible; the data is simply too large for the available RAM.

Regarding "python - XGBoost and scikit-optimize: BayesSearchCV and XGBRegressor are incompatible - why?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68340752/
