python - Memory issues when running Keras in Docker on a Mac

Tags: python python-2.7 docker keras docker-machine

Running a Keras training algorithm in a Docker machine on a Mac produces a variety of memory problems.

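  • The training algorithm runs fine on the same machine outside Docker

  • Raising the Docker memory from 1 GB to 8 GB (the machine's limit) did not help

  • Maximum video memory: 128 MB

  • Different TensorFlow versions (0.10.0, 0.11.0) and the Theano backend, all pulled from Docker, show similar errors

  • The list of other Docker processes that might conflict, docker ps -a, is empty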
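
One way to confirm that the 8 GB setting actually reaches the container is to read the memory total from inside it. The sketch below is a minimal check, assuming a Linux container where /proc/meminfo is readable; the helper names mem_total_bytes and read_mem_total are made up for illustration. With docker-machine on a Mac, containers run inside a Linux virtual machine, so the value reported is the VM's memory rather than the Mac's.

```python
import os


def mem_total_bytes(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, in bytes (None if absent)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The line looks like: "MemTotal:        8175432 kB"
            kib = int(line.split()[1])
            return kib * 1024
    return None


def read_mem_total(path="/proc/meminfo"):
    """Read MemTotal from the given file, or return None if it does not exist."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return mem_total_bytes(f.read())


if __name__ == "__main__":
    total = read_mem_total()
    if total is None:
        print("no /proc/meminfo found (not a Linux environment?)")
    else:
        print("MemTotal: %.2f GiB" % (total / float(1024 ** 3)))
```

Saved under a hypothetical name such as check_mem.py and run with docker run <image> python check_mem.py, a figure well below 8 GiB would suggest that the memory setting has not taken effect for the VM the container runs in.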

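The problem is that when I run the same training algorithm under Docker on the same machine, performance is far worse. All of the errors point to a memory management problem: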

1) The initial error was a MemoryError: when the training script was run during the container's docker build, the process exited before training started.

2) Now, running docker run 058785edc11d python train.py --run once the container has been built, I get an OOM when allocating a tensor with shape [64,64,254,254] (it seems to get further):

Training..
Train on 385 samples, validate on 40 samples
Epoch 1/1
    sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1046, in fit
    callback_metrics=callback_metrics)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 784, in _fit_loop
    outs = f(ins_batch)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 641, in __call__
    updated = session.run(self.outputs + self.updates, feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,64,254,254]
     [[Node: transpose_2 = Transpose[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](Conv2D, transpose_2/perm)]]
Caused by op u'transpose_2', defined at:
  File "train.py", line 138, in <module>
    run(extract=extract_mode, cont=continue_)
  File "train.py", line 79, in run
    model = m.get_model(n_outputs=num_categories, input_size=size)
  File "/tmp/model.py", line 24, in get_model
    conv.add(Convolution2D(64, 3, 3, activation='relu', input_shape=(3, input_size, input_size)))
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 110, in add
    layer.create_input_layer(batch_input_shape, input_dtype)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 341, in create_input_layer
    self(x)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 485, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 543, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 148, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/convolutional.py", line 341, in call
    filter_shape=self.W_shape)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 997, in conv2d
    x = tf.transpose(x, (0, 3, 1, 2))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1051, in transpose
    ret = gen_array_ops.transpose(a, perm, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2489, in transpose
    result = _op_def_lib.apply_op("Transpose", x=x, perm=perm, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()

3) After removing the exited Docker containers and reducing the training batch size, I get std::bad_alloc:

Training..
Train on 404 samples, validate on 21 samples
Epoch 1/1
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

4) Another common error is resource exhaustion: OOM when allocating a tensor with shape [25088,4096]

$ docker run f825faab715c python train.py --run --continue
libdc1394 error: Failed to initialize libdc1394
Using TensorFlow backend.
/tmp/data.py:134: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  val = np.random.choice(dataset_indx, size=number_of_samples)
/tmp/data.py:127: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  train = np.random.choice(dataset_indx, size=number_of_samples)
Loading data..
Number of categories: 2
Number of samples 425
Building and Compiling model..
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[25088,4096]
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[4096,4096]
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[25088,4096]
     [[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[25088,4096]
     [[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
Training..
Train on 404 samples, validate on 21 samples
Epoch 1/1
Traceback (most recent call last):
  File "train.py", line 138, in <module>
    run(extract=extract_mode, cont=continue_)
  File "train.py", line 100, in run
    sample_weight=None)
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
    sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1046, in fit
    callback_metrics=callback_metrics)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 784, in _fit_loop
    outs = f(ins_batch)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 641, in __call__
    updated = session.run(self.outputs + self.updates, feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[25088,4096]
     [[Node: gradients/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](cond_5/Merge, gradients/add_43_grad/Reshape)]]
Caused by op u'gradients/MatMul_grad/MatMul_1', defined at:
  File "train.py", line 138, in <module>
    run(extract=extract_mode, cont=continue_)
  File "train.py", line 100, in run
    sample_weight=None)
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 405, in fit
    sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1031, in fit
    self._make_train_function()
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 658, in _make_train_function
    training_updates = self.optimizer.get_updates(trainable_weights, self.constraints, self.total_loss)
  File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 307, in get_updates
    grads = self.get_gradients(loss, params)
  File "/usr/local/lib/python2.7/dist-packages/keras/optimizers.py", line 48, in get_gradients
    grads = K.gradients(loss, params)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 666, in gradients
    return tf.gradients(loss, variables)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 478, in gradients
    in_grads = _AsList(grad_fn(op, *out_grads))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_grad.py", line 637, in _MatMulGrad
    math_ops.matmul(op.inputs[0], grad, transpose_a=True))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1346, in matmul
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1271, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'MatMul', defined at:
  File "train.py", line 138, in <module>
    run(extract=extract_mode, cont=continue_)
  File "train.py", line 79, in run
    model = m.get_model(n_outputs=num_categories, input_size=size)
  File "/tmp/model.py", line 70, in get_model
    conv.add(Dense(4096))
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 142, in add
    output_tensor = layer(self.outputs[0])
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 485, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 543, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 148, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 628, in call
    output = K.dot(x, self.W)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 214, in dot
    out = tf.matmul(x, y)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1346, in matmul
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1271, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()
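
The failing op in this trace, gradients/MatMul_grad/MatMul_1, is the gradient with respect to the weight matrix of the Dense(4096) layer, so its [25088,4096] shape does not depend on the batch size. As a rough sketch of what that layer alone costs, assuming 4-byte DT_FLOAT elements as shown in the log (the function name dense_weight_bytes is made up for illustration):

```python
def dense_weight_bytes(n_inputs, n_units, bytes_per_element=4):
    """Bytes for one Dense layer's weight matrix plus its bias vector."""
    return (n_inputs * n_units + n_units) * bytes_per_element


MIB = 1024.0 ** 2
weights = dense_weight_bytes(25088, 4096)

# Training holds at least the weights and an equally sized gradient;
# any extra optimizer state is not counted here.
print("weights:            %.1f MiB" % (weights / MIB))
print("weights + gradient: %.1f MiB" % (2 * weights / MIB))
```

This is only a lower bound for one layer. Whatever state the optimizer keeps per weight, and the memory used by every other layer, come on top of it.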

Best answer

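Your training algorithm probably needs more than 8 GB of memory. I have run into this kind of problem before, and increasing the memory always solved it. Your error, ResourceExhaustedError: OOM when allocating tensor with shape[64,64,254,254], clearly indicates that you have exhausted the available resources and need more memory to run your application.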
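
To put a number on that, the size of the tensor named in the error can be estimated from its shape. The sketch below assumes 4 bytes per element (the log reports DT_FLOAT) and assumes that the leading dimension of [64,64,254,254] is the batch size; tensor_bytes is a hypothetical helper, not part of Keras or TensorFlow.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


MIB = 1024.0 ** 2

# Output of the first convolution: (batch, filters, height, width).
for batch_size in (64, 32, 16, 8):
    shape = (batch_size, 64, 254, 254)
    print("batch %2d -> %7.1f MiB" % (batch_size, tensor_bytes(shape) / MIB))
```

At a batch size of 64 this single tensor is about 1 GiB, and training keeps several tensors of comparable size alive at once, so the total can grow well past what one allocation suggests. Under the same assumption, halving the batch size halves this figure.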

This post, python - Memory issues when running Keras in Docker on a Mac, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/40818579/
