c++ - TensorFlow C++ SetDefaultDevice in multi-GPU mode

Tags: c++ tensorflow machine-learning deep-learning

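I want to load the same graph on multiple GPUs for inference, but I cannot use graph::SetDefaultDevice to associate the graph with a device. The failure does not occur in SetDefaultDevice itself but later, when a session is created from the graph. Here is a simple example, adapted from TensorFlow's example_trainer.cc: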

#include <tensorflow/core/platform/env.h>
#include <tensorflow/core/public/session.h>
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/graph/default_device.h"

int main() {
  using namespace tensorflow;
  using namespace tensorflow::ops;
  Scope root = Scope::NewRootScope();
  auto A = Const(root, { {3.f, 2.f}, {-1.f, 0.f} }); 
  auto b = Const(root, { {3.f, 5.f} }); 
  auto v = MatMul(root.WithOpName("v"), A, b, MatMul::TransposeB(true));

  GraphDef def;
  TF_CHECK_OK(root.ToGraphDef(&def));

  graph::SetDefaultDevice(false ? "/device:GPU:0" : "/cpu:0", &def);
  /*
  for (auto &node: *def.mutable_node()) {
        node.set_device("/cpu:0");
        std::cout << node.name() << " = '" << node.device() <<"'"<< std::endl;
  }
  std::cout << "=======================\n";
  */
  SessionOptions options;
  std::unique_ptr<Session> session(NewSession(options));
  TF_CHECK_OK(session->Create(def));
  return 0;
}

Running it produces the following error:

2018-09-06 18:18:13.853316: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-09-06 18:18:13.856079: F /home/daniel/tensorflow_cc/example/example.cpp:27] Non-OK-status: session->Create(def) status: Not found: No attr named '/cpu:0' in NodeDef:
     [[Node: Const = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2,2] values: [3 2][-1]...>, _device="/cpu:0"]()]]
     [[Node: Const = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2,2] values: [3 2][-1]...>, _device="/cpu:0"]()]]
Aborted (core dumped)

If I remove the SetDefaultDevice call, it works fine. I also tried this on a machine with a GPU, without success.

I know the problem is not in SetDefaultDevice, because manually setting the device of each node ends up with the same failure when the session is created:

Const = '/cpu:0'
Const_1 = '/cpu:0'
v = '/cpu:0'
=======================
2018-09-06 18:15:05.966337: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-09-06 18:15:05.969048: F /home/daniel/tensorflow_cc/example/example.cpp:26] Non-OK-status: session->Create(def) status: Not found: No attr named '/cpu:0' in NodeDef:
     [[Node: Const = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2,2] values: [3 2][-1]...>, _device="/cpu:0"]()]]
     [[Node: Const = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2,2] values: [3 2][-1]...>, _device="/cpu:0"]()]]
Aborted (core dumped)
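
For reference, the per-device assignment the question is aiming for can be sketched without TensorFlow. This is only an illustration under stated assumptions: `FakeNode`, `FakeGraph`, `SetDefaultDeviceSketch`, and `ReplicatePerGpu` are hypothetical stand-ins, not TensorFlow types or functions. The sketch assumes that a default-device helper fills in the device only for nodes that have none, and that one copy of the graph is made per GPU.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for a graph node and a graph definition.
// They model only the two fields this sketch needs.
struct FakeNode {
  std::string name;
  std::string device;  // empty means "no device assigned yet"
};
using FakeGraph = std::vector<FakeNode>;

// Assigns `device` to every node that has no device yet.
// Nodes that were placed explicitly keep their placement.
void SetDefaultDeviceSketch(const std::string& device, FakeGraph* graph) {
  for (FakeNode& node : *graph) {
    if (node.device.empty()) {
      node.device = device;
    }
  }
}

// Returns one copy of `graph` per GPU, each defaulted to "/device:GPU:<i>".
// The input graph is left unchanged.
std::vector<FakeGraph> ReplicatePerGpu(const FakeGraph& graph, int num_gpus) {
  std::vector<FakeGraph> replicas;
  for (int i = 0; i < num_gpus; ++i) {
    FakeGraph copy = graph;
    SetDefaultDeviceSketch("/device:GPU:" + std::to_string(i), &copy);
    replicas.push_back(copy);
  }
  return replicas;
}
```

In the real program, each replica would then be passed to its own session->Create call, which is the step that fails in the logs above.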

Best answer

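This seems to be a problem only with monolithic builds (--config=monolithic), that is, when building libtensorflow_cc.so. I'm not sure, but it may be related to these issues:

https://github.com/tensorflow/tensorflow/issues/5379 https://github.com/tensorflow/tensorflow/issues/16291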

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/52207439/
