tensorflow - Keras multi-GPU example gives ResourceExhaustedError

Tags: tensorflow keras multi-gpu

So I am trying to use multiple GPUs with Keras. When I run the example program (given as a comment in the training_utils.py code), I end up with a ResourceExhaustedError. nvidia-smi tells me that almost none of the four GPUs are doing any work. Using a single GPU works fine for other programs.

  • TensorFlow 1.3.0
  • Keras 2.0.8
  • Ubuntu 16.04
  • CUDA/cuDNN 8.0/6.0

Question: Does anyone know what is going on here?

Console output:

(...)

2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************x
2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256]
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    parallel_model.fit(x, y, epochs=20, batch_size=256)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit
    validation_steps=validation_steps)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in __call__
    **self.session_kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256]
	 [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
	 [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d', defined at:
  File "test.py", line 19, in <module>
    parallel_model = multi_gpu_model(model, gpus=2)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
    outputs = model(inputs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2061, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 1221, in call
    dilation_rate=self.dilation_rate)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3279, in separable_conv2d
    data_format=tf_data_format)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 497, in separable_conv2d
    name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 397, in conv2d
    data_format=data_format, name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,55,55,256]
	 [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
	 [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

Edit (sample code added):

import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np

num_samples = 1000
height = 224
width = 224
num_classes = 100

with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

parallel_model.fit(x, y, epochs=20, batch_size=128)

Best answer

When you hit an OOM/ResourceExhaustedError on a GPU, I believe reducing the batch size is the right first thing to try.

Different GPUs may need different batch sizes, depending on how much GPU memory you have.

I ran into a similar problem recently and did a lot of tuning to run different kinds of experiments.

Here is a link to the question (it also includes some useful tricks).

However, while reducing the batch size, you may find that training becomes slower.
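Note that multi_gpu_model splits each global batch evenly across the replicas, so it is the per-GPU slice that has to fit in memory: the OOM tensor shape [128, 55, 55, 256] in the traceback matches a global batch of 256 split over 2 GPUs. A minimal sketch of that arithmetic, plus an illustrative halve-until-it-fits strategy (the `fits` predicate and the figures in the comments are assumptions for illustration, not part of the original answer):

```python
# Per-replica batch arithmetic for keras.utils.multi_gpu_model:
# the global batch is split evenly across the replicas, so only
# global_batch // gpus samples must fit on each GPU at once.

def per_replica_batch(global_batch, gpus):
    """Number of samples each GPU receives per step under multi_gpu_model."""
    return global_batch // gpus

def halve_until(global_batch, fits):
    """Illustrative strategy: halve the global batch until `fits` accepts it.
    In practice `fits` could be a trial fit() on a few batches wrapped in
    try/except tf.errors.ResourceExhaustedError."""
    while global_batch > 1 and not fits(global_batch):
        global_batch //= 2
    return global_batch

# The traceback above: batch_size=256 on gpus=2 -> 128 samples per replica,
# matching the OOM tensor shape [128, 55, 55, 256].
print(per_replica_batch(256, 2))            # 128
# The posted sample code: batch_size=128 on gpus=4 -> 32 per replica.
print(per_replica_batch(128, 4))            # 32
# Hypothetical predicate: suppose only global batches <= 64 actually fit.
print(halve_until(256, lambda b: b <= 64))  # 64
```

Halving keeps the per-replica batch a round number and usually converges to a workable size in a few tries, at the cost of the slower training mentioned above.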

Regarding "tensorflow - Keras multi-GPU example gives ResourceExhaustedError", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46954991/
