tensorflow - Hyperparameter tuning with ML Engine: NaN error when running with parallel trials

Tags: tensorflow machine-learning nan google-cloud-ml

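In my hyperparameter tuning job on Google ML Engine, some training configurations produce a NaN loss, which causes an error. I would like to ignore those trials and continue tuning with different parameters.

I use a NanTensorHook with fail_on_nan_loss=False. This runs successfully on ML Engine when trials are not run in parallel (maxParallelTrials: 1), but it fails with multiple parallel trials (maxParallelTrials: 3).

Has anyone run into this error before? Any ideas on how to resolve it?

Here is my config file: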

trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  parameterServerType: standard
  workerCount: 4
  parameterServerCount: 1
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 5
    maxParallelTrials: 3
    enableTrialEarlyStopping: False
    hyperparameterMetricTag: auc
    params:
    - parameterName: learning_rate
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.01
      scaleType: UNIT_LOG_SCALE
    - parameterName: optimizer
      type: CATEGORICAL
      categoricalValues:
      - Adam
      - Adagrad
      - Momentum
      - SGD
    - parameterName: batch_size
      type: DISCRETE
      discreteValues:
      - 128
      - 256
      - 512
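
As an aside on the learning_rate entry: UNIT_LOG_SCALE means the range between minValue and maxValue is treated logarithmically rather than linearly. The sketch below only illustrates that idea with the two bounds from the config; the function name unit_log_scale is hypothetical, and it does not reproduce how the tuning service actually picks values.

```python
import math


def unit_log_scale(u, min_value, max_value):
    """Map u in [0, 1] onto [min_value, max_value], evenly spaced in log space.

    Hypothetical helper for illustration; both bounds must be strictly positive.
    """
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    if min_value <= 0 or max_value <= 0:
        raise ValueError("log scaling needs strictly positive bounds")
    log_min, log_max = math.log(min_value), math.log(max_value)
    return math.exp(log_min + u * (log_max - log_min))


if __name__ == "__main__":
    # Bounds copied from the learning_rate parameter in the config.
    for u in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(u, unit_log_scale(u, 0.0001, 0.01))
```

On a log scale the midpoint of the range is the geometric mean of the bounds, so small learning rates get as much of the search space as large ones.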

This is how I set up the NanTensorHook:

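import tensorflow as tf  # TensorFlow 1.x API (tf.contrib, tf.train)

# Inside the model function: stop training on a NaN loss instead of raising.
hook = tf.train.NanTensorHook(loss, fail_on_nan_loss=False)

train_op = tf.contrib.layers.optimize_loss(
    loss=loss, global_step=tf.train.get_global_step(),
    learning_rate=lr, optimizer=optimizer)

# Attach the hook so it runs on every training step.
model_fn = tf.estimator.EstimatorSpec(mode=mode, loss=loss,
    eval_metric_ops=eval_metric_ops, train_op=train_op,
    training_hooks=[hook])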
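
To make the two settings of fail_on_nan_loss concrete, here is a simplified pure-Python model of the behaviour: with the flag set, a NaN loss raises an error; without it, training is stopped quietly. This is a sketch, not TensorFlow's implementation. The names NanLossError and run_training are hypothetical stand-ins, and the list of losses replaces a real training loop.

```python
import math


class NanLossError(RuntimeError):
    """Hypothetical stand-in for TensorFlow's NanLossDuringTrainingError."""


def run_training(losses, fail_on_nan_loss):
    """Consume one loss per step and return how many steps completed.

    On a NaN loss, either raise (fail_on_nan_loss=True) or stop training
    without an exception (fail_on_nan_loss=False).
    """
    steps_done = 0
    for loss in losses:
        if math.isnan(loss):
            if fail_on_nan_loss:
                raise NanLossError("NaN loss during training.")
            break  # models a stop request: end the loop, raise nothing
        steps_done += 1
    return steps_done


if __name__ == "__main__":
    losses = [0.9, 0.7, float("nan"), 0.5]
    print(run_training(losses, fail_on_nan_loss=False))
    try:
        run_training(losses, fail_on_nan_loss=True)
    except NanLossError as err:
        print("raised:", err)
```

Only the raising branch makes the process exit with a non-zero status, which is what the tuning service reports as a failed trial.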

The error message I get is:

Hyperparameter Tuning Trial #4 Failed before any other successful trials were completed. The failed trial had parameters: optimizer=SGD, batch_size=128, learning_rate=0.00075073617775056709, . The trial's error message was:

The replica worker 1 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
    executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 532, in run_worker
    return self._start_distributed_training()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 715, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 891, in _train_model
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1178, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 617, in after_run
    raise NanLossDuringTrainingError
NanLossDuringTrainingError: NaN loss during training.

The replica worker 3 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
    executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 532, in run_worker
    return self._start_distributed_training()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 715, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 891, in _train_model
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1178, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 617, in after_run
    raise NanLossDuringTrainingError
NanLossDuringTrainingError: NaN loss during training.

Thanks in advance!

Best answer

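Different trials in a hyperparameter tuning job are isolated from one another at runtime. A hook added for one trial is therefore not affected by the hooks in other trials.

I suspect the problem is caused by the particular combination of hyperparameters in that trial. To confirm this, I suggest running a regular training job with the hyperparameter values of the failed trial and checking whether the error occurs again.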
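
One way to prepare such a reproduction run is to turn the parameter list quoted in the error message into command-line flags for the trainer. The sketch below assumes the trainer accepts each hyperparameter as a flag named after its parameterName (for example --learning_rate); the helper trial_params_to_args is hypothetical and not part of any Google Cloud tool.

```python
def trial_params_to_args(param_text):
    """Convert 'name=value, name=value, ...' into ['--name=value', ...].

    Hypothetical helper: fragments without '=' (such as the stray ' .'
    at the end of the logged parameter list) are skipped.
    """
    args = []
    for item in param_text.split(","):
        item = item.strip()
        if "=" not in item:
            continue
        name, value = item.split("=", 1)
        args.append("--{}={}".format(name.strip(), value.strip()))
    return args


if __name__ == "__main__":
    # Parameter list copied from the failed trial in the error message.
    failed_trial = ("optimizer=SGD, batch_size=128, "
                    "learning_rate=0.00075073617775056709, .")
    print(" ".join(trial_params_to_args(failed_trial)))
```

The printed flags can then be passed as user arguments to a single, non-tuning training job. If that job also ends with a NaN loss, the hyperparameter combination is the likely cause rather than the parallelism.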

Could you send your project number and job ID to cloudml-feedback@google.com so that we can investigate further?

Regarding "tensorflow - Hyperparameter tuning with ML Engine: NaN error when running with parallel trials", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49737396/
