tensorflow - Tensorflow multi-GPU training (data parallelism) when using feed_dict

I would like to use multiple GPUs to train my Tensorflow model with data parallelism.

I am currently training a Tensorflow model with the following approach:

x_ = tf.placeholder(...)
y_ = tf.placeholder(...)
y = model(x_)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=y)
optimizer = tf.train.AdamOptimizer()
train_op = tf.contrib.training.create_train_op(loss, optimizer)
for i in epochs:
   for b in data:
      _ = sess.run(train_op, feed_dict={x_: b.x, y_: b.y})

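I would like to take advantage of multiple GPUs to train this model in a data-parallel fashion. That is, I want to split each batch in half and run each half-batch on one of two GPUs.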

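cifar10_multi_gpu_train seems to provide a good example of building a loss that is gathered from graphs running on multiple GPUs, but I haven't found a good example of this style of training that uses feed_dict and placeholder rather than data-loading queues.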

Update

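It looks like https://timsainb.github.io/multi-gpu-vae-gan-in-tensorflow.html may provide a good example. They appear to take average_gradients from cifar10_multi_gpu_train.py and create one placeholder, which they then slice across the GPUs. I think you also need to split create_train_op into three stages: compute_gradients, average_gradients, and then apply_gradients.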
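The three stages described above could be sketched roughly as follows. This is only a minimal sketch, not the code from the linked post: the tiny linear `model`, the `[None, 4]` input shape, `NUM_GPUS = 2`, and the helpers `build_graph` and `run_steps` are all assumptions made for illustration, and the TF1 API is reached through `tensorflow.compat.v1`. Because each tower's loss is a mean over an equal-sized shard, averaging the tower gradients gives the same gradient as the full batch.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # TF1-style graphs and sessions

NUM_GPUS = 2  # assumption: two towers, as in the question


def model(x):
    # Hypothetical stand-in for the question's model(): a single linear layer.
    w = tf.get_variable("w", [4, 3], initializer=tf.random_normal_initializer(stddev=0.1))
    b = tf.get_variable("b", [3], initializer=tf.zeros_initializer())
    return tf.matmul(x, w) + b


def average_gradients(tower_grads):
    # tower_grads holds one [(grad, var), ...] list per tower, in the same variable order.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        mean_grad = tf.reduce_mean(tf.stack(grads, axis=0), axis=0)
        averaged.append((mean_grad, grads_and_vars[0][1]))  # variables are shared
    return averaged


def build_graph():
    graph = tf.Graph()
    with graph.as_default():
        with tf.device("/cpu:0"):
            x_ = tf.placeholder(tf.float32, [None, 4], name="x")
            y_ = tf.placeholder(tf.int32, [None], name="y")
            xs = tf.split(x_, NUM_GPUS, axis=0)  # batch size must divide evenly
            ys = tf.split(y_, NUM_GPUS, axis=0)
        optimizer = tf.train.AdamOptimizer()
        tower_grads, tower_losses = [], []
        with tf.variable_scope(tf.get_variable_scope()):
            for i in range(NUM_GPUS):
                with tf.device("/gpu:%d" % i), tf.name_scope("tower_%d" % i):
                    logits = model(xs[i])
                    loss = tf.losses.sparse_softmax_cross_entropy(labels=ys[i], logits=logits)
                    tower_grads.append(optimizer.compute_gradients(loss))  # stage 1
                    tower_losses.append(loss)
                    tf.get_variable_scope().reuse_variables()  # later towers reuse w, b
        grads = average_gradients(tower_grads)       # stage 2
        train_op = optimizer.apply_gradients(grads)  # stage 3
        mean_loss = tf.reduce_mean(tf.stack(tower_losses))
        init_op = tf.global_variables_initializer()
    return graph, x_, y_, train_op, mean_loss, init_op


def run_steps(num_steps=5, batch_size=8):
    graph, x_, y_, train_op, mean_loss, init_op = build_graph()
    rng = np.random.RandomState(0)
    batch_x = rng.rand(batch_size, 4).astype(np.float32)
    batch_y = rng.randint(0, 3, size=batch_size).astype(np.int32)
    # allow_soft_placement lets the sketch fall back to CPU when fewer GPUs exist.
    config = tf.ConfigProto(allow_soft_placement=True)
    losses = []
    with tf.Session(graph=graph, config=config) as sess:
        sess.run(init_op)
        for _ in range(num_steps):
            _, value = sess.run([train_op, mean_loss], feed_dict={x_: batch_x, y_: batch_y})
            losses.append(float(value))
    return losses
```

The training loop still feeds one full batch per step through feed_dict, exactly as in the single-GPU version; only the graph construction changes.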

Best answer

I know of three ways to feed data to a multi-GPU model.

If all inputs have the same shape, you can build the placeholder x on the CPU and then use tf.split to split x into xs. Then, on each GPU tower, take xs[i] as the input.

with tf.device("/cpu:0"):
    encoder_inputs = tf.placeholder(tf.int32, [None, None], name="encoder_inputs")
    encoder_length = tf.placeholder(tf.int32, [None,], name="encoder_length")

    # make sure batch % num_gpus == 0
    inputs = tf.split(encoder_inputs, num_gpus, axis=0)  # axis=0, split on batch dimension
    lens = tf.split(encoder_length, num_gpus, axis=0)

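with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_gpus):
        with tf.device("/gpu:%d"%i):
            with tf.name_scope("tower_%d"%i):
                loss = compute_loss(inputs[i], lens[i])
                # share variables across towers (needed if compute_loss uses tf.get_variable)
                tf.get_variable_scope().reuse_variables()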
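Since tf.split with an integer number of splits requires the batch dimension to divide evenly, one way to satisfy the "batch % num_gpu == 0" condition is to trim each batch before feeding it. The helper below is a hypothetical illustration (the name `trim_to_multiple` is not from the answer) and uses only NumPy.

```python
import numpy as np


def trim_to_multiple(batch_x, batch_y, num_gpus):
    """Drop trailing examples so the batch size divides evenly by num_gpus."""
    usable = (len(batch_x) // num_gpus) * num_gpus
    return batch_x[:usable], batch_y[:usable]


# Example: a leftover batch of 7 examples trimmed for 2 GPUs.
x = np.zeros((7, 3), dtype=np.float32)
y = np.arange(7)
trimmed_x, trimmed_y = trim_to_multiple(x, y, num_gpus=2)
halves = np.split(trimmed_x, 2, axis=0)  # mirrors what tf.split does in the graph
```

Trimming discards at most num_gpus - 1 examples per batch; padding the batch instead is another option if every example must be used.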

If your inputs have different shapes, you need to build a placeholder x on each GPU, inside a scope.


def init_placeholder(self):
    with tf.variable_scope("inputs"):   # use a scope
        encoder_inputs = tf.placeholder(tf.int32, [None, None], name="encoder_inputs")
        encoder_length = tf.placeholder(tf.int32, [None,], name="encoder_length")
    return encoder_inputs, encoder_length

with tf.variable_scope(tf.get_variable_scope()):
    for g, gpu in enumerate(GPUS):
        with tf.device("/gpu:%d"%gpu):
            with tf.name_scope("tower_%d"%g):
                x, x_len = model.init_placeholder()  # these placeholder tensors are on the GPU
                loss = model.compute_loss(x, x_len)
                # share variables across towers (needed if compute_loss uses tf.get_variable)
                tf.get_variable_scope().reuse_variables()
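With one pair of placeholders per tower, the feed_dict needs one entry per placeholder, and each tower's shard can be padded to its own maximum length. The sketch below assumes you collect the (x, x_len) pairs from the loop above into a list; `pad_shard` and `build_feed_dict` are hypothetical helper names, and plain strings stand in for the placeholder tensors so the example runs without TensorFlow. It also assumes there are at least as many sequences as towers.

```python
import numpy as np


def pad_shard(sequences, pad_id=0):
    """Pad a list of integer sequences to the longest one in this shard."""
    lengths = np.array([len(s) for s in sequences], dtype=np.int32)
    padded = np.full((len(sequences), lengths.max()), pad_id, dtype=np.int32)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded, lengths


def build_feed_dict(tower_placeholders, sequences):
    """Give every tower its own shard, padded independently of the others."""
    num_towers = len(tower_placeholders)
    feed = {}
    for i, (x_ph, len_ph) in enumerate(tower_placeholders):
        shard = sequences[i::num_towers]  # round-robin assignment to towers
        feed[x_ph], feed[len_ph] = pad_shard(shard)
    return feed


# Stand-in keys; in real code these would be the tensors returned by init_placeholder().
towers = [("x0", "len0"), ("x1", "len1")]
seqs = [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]]
feed = build_feed_dict(towers, seqs)
```

The resulting dictionary can be passed as feed_dict to sess.run; here tower 0 receives a [2, 3] array and tower 1 a [2, 4] array.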

Use tf.data.Dataset to feed the data. Google's official cifar10_multi_gpu_train.py uses a Queue, which is similar to this approach.
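A rough sketch of the tf.data.Dataset approach, under the assumption that each tower pulls its own batch by calling get_next() once per tower on a shared one-shot iterator. The toy data, `NUM_GPUS`, `PER_TOWER_BATCH`, and `fetch_tower_batches` are illustrative names, and the per-tower computation is only a stand-in for real model code.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

NUM_GPUS = 2         # assumption: two towers
PER_TOWER_BATCH = 4  # hypothetical per-tower batch size


def fetch_tower_batches():
    graph = tf.Graph()
    with graph.as_default():
        # Toy data: the label is the example index, so batches are easy to tell apart.
        features = np.random.RandomState(0).rand(32, 4).astype(np.float32)
        labels = np.arange(32, dtype=np.int32)
        with tf.device("/cpu:0"):
            dataset = tf.data.Dataset.from_tensor_slices((features, labels))
            dataset = dataset.repeat().batch(PER_TOWER_BATCH)
            iterator = tf.data.make_one_shot_iterator(dataset)
            # Each get_next() call adds a separate op that pulls its own batch.
            batches = [iterator.get_next() for _ in range(NUM_GPUS)]
        tower_outputs = []
        for i, (x, y) in enumerate(batches):
            with tf.device("/gpu:%d" % i), tf.name_scope("tower_%d" % i):
                # Stand-in for model and loss code; each tower only sums its inputs.
                tower_outputs.append((tf.reduce_sum(x), y))
    # allow_soft_placement lets the sketch run on machines with fewer GPUs.
    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(graph=graph, config=config) as sess:
        return sess.run(tower_outputs)
```

No feed_dict is needed here because the data enters the graph through the iterator. Which tower receives which batch within a single step is not guaranteed, only that the towers receive different batches.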

Regarding tensorflow - Tensorflow multi-GPU training (data parallelism) when using feed_dict, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43241829/
