linux - Why do slurm jobs hang indefinitely when the job is a TensorFlow script?

Tags: linux tensorflow slurm

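I ran into this error while using the slurm ( http://slurm.schedmd.com/ ) workload manager. When I run some tensorflow python scripts, they sometimes fail with an error (attached). It looks like the installed cuda libraries cannot be found, but the scripts I am running do not need a GPU. So I am confused about why cuda is an issue at all. Why would the cuda installation matter if I don't need it?

The only useful information I got from the slurm-job_id file is the following: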

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov  7 21:25:42 PST 2015
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.

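I always thought tensorflow does not require a GPU. So I assume the last message, which says there is no GPU, is not what causes the failure (correct me if I'm wrong).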
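
To make the distinction between informational and error lines concrete, here is a minimal sketch that groups log lines like the ones above by their leading severity letter. It assumes the usual TensorFlow C++ log prefix (`I`, `W`, `E` or `F`, then `file:line]`); the helper name `split_by_severity` and the three-line sample are illustrative only, not part of TensorFlow or slurm.

```python
import re

# Assumed log layout: "<severity letter> <source file>:<line>] <message>"
LOG_LINE = re.compile(r"^([IWEF]) (\S+):(\d+)\] (.*)$")

def split_by_severity(lines):
    """Group (source, message) pairs by severity letter; other lines are skipped."""
    groups = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            severity, source, _lineno, message = match.groups()
            groups.setdefault(severity, []).append((source, message))
    return groups

# Three lines copied from the job output above.
sample = [
    "I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally",
    "E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE",
    "I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.",
]
groups = split_by_severity(sample)
for source, message in groups.get("E", []):
    print(source, "->", message)
```

In this sample only the `cuInit` line carries the `E` severity; the "No GPU devices available" line is logged at `I` level.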

I don't understand why I need the CUDA libraries. I am not trying to run my jobs on GPUs, so why would I need the cuda libraries if my jobs are CPU jobs?


I tried logging in to the node directly and starting tensorflow, and saw no obvious errors:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally

whereas I expected this error:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov  7 21:25:42 PST 2015
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.

I also opened an official issue in the tensorflow repository on GitHub:

https://github.com/tensorflow/tensorflow/issues/3632

Best answer

There are some bugs in tensorflow when it is run through a batch job submitted with slurm.

For now I work around it by running the script with srun on slurm.
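
As a rough sketch of that workaround, the snippet below builds an `srun` command line for a script and can launch it from Python. The script name `train.py`, the helper names, and the choice of options are assumptions for illustration; the answer does not say which `srun` options were used.

```python
import subprocess

def build_srun_command(script, cpus_per_task=1, partition=None):
    """Build the argument list for launching `script` through srun."""
    cmd = ["srun", "--cpus-per-task={}".format(cpus_per_task)]
    if partition is not None:
        cmd.append("--partition={}".format(partition))
    cmd += ["python", script]
    return cmd

def run_with_srun(script, **options):
    """Run the script under srun and return its exit code (needs a Slurm cluster)."""
    return subprocess.call(build_srun_command(script, **options))

# "train.py" is a hypothetical script name.
print(" ".join(build_srun_command("train.py")))
```

Calling `run_with_srun("train.py")` on a login node would then start the script interactively through `srun` instead of submitting it as a batch job.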

In your case, you have also installed the GPU version of tensorflow and are running it on a machine without a GPU. That is what causes the other error in your case.
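
One thing that could be tried when a GPU build has to run on a CPU-only node is hiding all CUDA devices through the `CUDA_VISIBLE_DEVICES` environment variable before tensorflow is imported. This is only a sketch under that assumption: the answer does not confirm that it avoids the error or the hang, and the helper name `cpu_only_env` is made up for illustration.

```python
import os

def cpu_only_env(env=None):
    """Return a copy of the environment with every CUDA device hidden."""
    new_env = dict(os.environ if env is None else env)
    # An empty value tells the CUDA runtime that no devices are visible.
    new_env["CUDA_VISIBLE_DEVICES"] = ""
    return new_env

# Inside a script, the variable has to be set before tensorflow is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
# import tensorflow as tf  # left commented so the sketch runs without tensorflow installed

print(repr(cpu_only_env({"PATH": "/usr/bin"})["CUDA_VISIBLE_DEVICES"]))
```

The dictionary returned by `cpu_only_env()` can also be passed as the `env` argument of `subprocess` calls when the script is launched from another Python process.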

Regarding "linux - Why do slurm jobs hang indefinitely when the job is a TensorFlow script?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38727375/
