python - Distributed TensorFlow [async, between-graph replication]: what exactly is the interaction between workers and servers regarding variable updates?

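I've read the Distributed TensorFlow Doc and this question on StackOverflow, but I still have doubts about the dynamics behind the distributed training that can be done with TensorFlow and its parameter server architecture. This is a snippet of code from the Distributed TensorFlow doc: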

if FLAGS.job_name == "ps":
  server.join()
elif FLAGS.job_name == "worker":

  # Assigns ops to the local worker by default.
  with tf.device(tf.train.replica_device_setter(
      worker_device="/job:worker/task:%d" % FLAGS.task_index,
      cluster=cluster)):

    # Build model...
    loss = ...
    global_step = tf.contrib.framework.get_or_create_global_step()

    train_op = tf.train.AdagradOptimizer(0.01).minimize(
        loss, global_step=global_step)

And here is part of the answer to the StackOverflow question that I read:

1. The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).

2. The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.

3. The worker sends the gradients for each variable to the appropriate PS task, and applies the gradients to their respective variable, using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received.

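I have to reproduce this kind of parameter server architecture in another environment, so I need to understand in depth how workers and PS tasks interact inside the TensorFlow framework. My question is: does the PS task perform some kind of merge or update operation after receiving values from the workers, or does it just store the newest value? Can storing only the newest value be reasonable? Looking at the code from the TensorFlow doc, I see that the PS task just calls join(), and I wonder what the complete behaviour of the PS task is behind this method call.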

One more question: what is the difference between computing gradients and applying gradients?

Best answer

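Let's go in reverse order and start with your last question: what is the difference between computing gradients and applying gradients?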

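Computing the gradients means running the backward pass on the network after having computed the loss. For gradient descent, this means estimating the gradients value in the formula below (note: this is a huge simplification of what computing gradients actually involves; look up more about backpropagation and gradient descent for a proper explanation of how this works). Applying the gradients means updating the parameters according to the gradients you just computed. For gradient descent, this (roughly) means executing the following: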

weights = weights - (learning_step * gradients)

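Note that the new value of weights depends on both the previous value and the computed gradients, with learning_step controlling how much the gradients contribute.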
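To make the distinction concrete, here is a minimal sketch in plain Python rather than TensorFlow. It assumes a single scalar weight and a squared-error loss; the names compute_gradient and apply_gradient are made up for this illustration and are not TensorFlow functions.

```python
def compute_gradient(weight, x, y):
    """Gradient of the loss (weight * x - y) ** 2 with respect to weight."""
    return 2.0 * (weight * x - y) * x


def apply_gradient(weight, gradient, learning_step):
    """One gradient-descent update: weights - (learning_step * gradients)."""
    return weight - learning_step * gradient


weight = 0.0

# Computing: reads the current weight, but does not change it.
gradient = compute_gradient(weight, x=1.0, y=2.0)

# Applying: produces the updated weight from the old weight and the gradient.
new_weight = apply_gradient(weight, gradient, learning_step=0.25)

print(gradient, new_weight)
```

In TensorFlow 1.x the same two steps are exposed as Optimizer.compute_gradients() and Optimizer.apply_gradients(); minimize() simply calls one after the other.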

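With this in mind, it's easier to understand the PS/worker architecture. Let's make the simplifying assumption that there is only one PS (we'll see later how to extend this to multiple PSs).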

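A PS (parameter server) keeps the weights (i.e. the parameters) in memory and receives gradients, running the update step I wrote in the code above. It does this every time it receives gradients from a worker.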

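A worker, on the other hand, looks up the current value of weights in the PS, makes a local copy of it, runs a forward and a backward pass of the network on a batch of data to get new gradients, and then sends those gradients back to the PS.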
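The exchange described above can be sketched as a small in-process simulation. This is not how TensorFlow implements it (there the PS and the workers are separate processes talking over the network); the ParameterServer class, its pull and push methods, and worker_step are hypothetical names, and the model is assumed to be a tiny linear one so the gradient can be written by hand.

```python
class ParameterServer:
    """Holds the weights and applies every gradient it receives."""

    def __init__(self, weights, learning_step):
        self.weights = list(weights)
        self.learning_step = learning_step

    def pull(self):
        # A worker gets a copy of whatever the weights are right now.
        return list(self.weights)

    def push(self, gradients):
        # The PS does not store the gradients or overwrite the weights:
        # it runs the update rule on the values it already holds.
        for i, g in enumerate(gradients):
            self.weights[i] -= self.learning_step * g


def worker_step(ps, batch):
    """One worker iteration: read weights, compute gradients, send them back."""
    weights = ps.pull()
    gradients = [0.0] * len(weights)
    for features, target in batch:
        prediction = sum(w * f for w, f in zip(weights, features))
        error = prediction - target
        for i, f in enumerate(features):
            # Gradient of the mean squared error over the batch.
            gradients[i] += 2.0 * error * f / len(batch)
    ps.push(gradients)


ps = ParameterServer(weights=[0.0, 0.0], learning_step=0.1)
batch_a = [([1.0, 0.0], 1.0)]   # data seen by a first worker
batch_b = [([0.0, 1.0], -1.0)]  # data seen by a second worker

for _ in range(50):
    worker_step(ps, batch_a)
    worker_step(ps, batch_b)

print(ps.weights)  # close to [1.0, -1.0]
```

The point relevant to the question is in push: the PS keeps a running aggregate of all the updates it has received, rather than just the latest value sent by some worker.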

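Note the emphasis on "current": there is no locking or inter-process synchronization between the workers and the PS. If a worker reads weights in the middle of an update (so that, for example, half of them already have the new value and the other half are still being updated), those are the weights it will use for its next iteration. This keeps things fast.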
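A deterministic sketch of such a partial read, assuming two weights and made-up gradient values. The interleaving is written out by hand here; in a real cluster it would depend on timing.

```python
learning_step = 0.25
weights = [0.0, 0.0]       # state held by the PS
gradients = [-4.0, -4.0]   # gradients just received from some worker

# The PS starts applying the update, one element at a time.
weights[0] -= learning_step * gradients[0]

# Another worker reads the weights at this moment. No lock stops it.
snapshot = list(weights)

# The PS finishes the update.
weights[1] -= learning_step * gradients[1]

print(snapshot)  # first element updated, second still old
print(weights)   # both elements updated
```

The worker that took the snapshot computes its next gradients from that mixed state, and the PS applies them like any others.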

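What if there are more PSs? No problem! The parameters of the network are partitioned among the PSs; a worker simply contacts all of them to get the new values for each chunk of the parameters, and sends back only the gradients relevant to each specific PS.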
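A sketch of that partitioning, again as an in-process simulation with hypothetical names (ParameterServer, assign_round_robin) and made-up variable names and gradient values. Round-robin placement is an assumption of this sketch; any scheme works as long as each variable lives on exactly one PS.

```python
class ParameterServer:
    """Holds one shard of the variables and updates it on push."""

    def __init__(self, learning_step):
        self.learning_step = learning_step
        self.variables = {}

    def pull(self):
        return dict(self.variables)

    def push(self, gradients):
        for name, g in gradients.items():
            self.variables[name] -= self.learning_step * g


def assign_round_robin(variables, servers):
    """Place each variable on one PS and remember where it went."""
    placement = {}
    for index, (name, value) in enumerate(variables.items()):
        server = servers[index % len(servers)]
        server.variables[name] = value
        placement[name] = server
    return placement


servers = [ParameterServer(0.5), ParameterServer(0.5)]
placement = assign_round_robin({"w1": 1.0, "b1": 0.0, "w2": 2.0}, servers)

# The worker contacts every PS to assemble a full local copy of the model.
local_copy = {}
for server in servers:
    local_copy.update(server.pull())

# Gradients the worker computed (made-up values for illustration).
gradients = {"w1": 2.0, "b1": -1.0, "w2": 4.0}

# Each PS receives only the gradients for the variables it owns.
for server in servers:
    shard = {name: g for name, g in gradients.items()
             if placement[name] is server}
    server.push(shard)

print(servers[0].variables, servers[1].variables)
```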

This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/49150587/
