python - JupyterHub single-user server cannot use TensorFlow GPU support via systemdspawner

Tags: python tensorflow gpu jupyter jupyterhub

(This is cross-posted to the jupyterhub and jupyterhub/systemdspawner issue trackers.)

I have a private JupyterHub setup using SystemdSpawner, and I am trying to run TensorFlow with GPU support.

I either followed the TensorFlow instructions, or used a pre-configured AWS AMI (Deep Learning Base AMI (Ubuntu 18.04) Version 21.0) with the NVIDIA drivers, on an AWS EC2 g4 instance.

On both setups I can use TensorFlow with GPU support from an (i)python 3.6 shell:

>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
2020-02-12 10:57:13.670937: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-12 10:57:13.698230: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.699066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2020-02-12 10:57:13.699286: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-12 10:57:13.700918: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-12 10:57:13.702512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-12 10:57:13.702814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-12 10:57:13.704561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-12 10:57:13.705586: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-12 10:57:13.709171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-12 10:57:13.709278: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.710120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.710893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

(a few warnings about NUMA nodes, but the GPU is found)

nvidia-smi and deviceQuery also show the GPU:
$ nvidia-smi
Wed Feb 12 10:39:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ /usr/local/cuda/extras/demo_suite/deviceQuery
/usr/local/cuda/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla T4"
  CUDA Driver Version / Runtime Version          10.1 / 10.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 15080 MBytes (15812263936 bytes)
  (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1590 MHz (1.59 GHz)
  Memory Clock rate:                             5001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 30
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.0, NumDevs = 1, Device0 = Tesla T4
Result = PASS

Now when I start JupyterHub, log in, and open a terminal, I get:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


$ /usr/local/cuda/extras/demo_suite/deviceQuery
cuda/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

and

[Screenshot 2020-02-12 12:08:46]

I suspect some kind of "sandboxing", missing environment variables, or the like: the GPU driver cannot be found from within the single-user environment, so TensorFlow's GPU support does not work there.

Any ideas? It may be either a small configuration tweak, or fundamentally unsolvable because of the architecture ;)

Best answer

Set c.SystemdSpawner.isolate_devices = False in your jupyterhub_config.py.
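For reference, a minimal jupyterhub_config.py sketch; the spawner_class line is an assumption about your existing setup, the isolate_devices line is the actual fix:

```python
# jupyterhub_config.py
c.JupyterHub.spawner_class = 'systemdspawner.SystemdSpawner'

# Do NOT give each single-user server a private /dev. With the default
# (False) the server sees the host's /dev/nvidia* device nodes, so
# nvidia-smi and TensorFlow can find the GPU again.
c.SystemdSpawner.isolate_devices = False
```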

Here is an excerpt from the documentation:

Setting this to true provides a separate, private /dev for each user. This prevents the user from directly accessing hardware devices, which could be a potential source of security issues. /dev/null, /dev/zero, /dev/random and the ttyp pseudo-devices will be mounted already, so most users should see no change when this is enabled.

c.SystemdSpawner.isolate_devices = True

This requires systemd version > 227. If you enable this in earlier versions, spawning will fail.



NVIDIA uses devices (i.e., files in /dev); refer to their documentation for more information. There should be files named /dev/nvidia* there. Isolating devices with SystemdSpawner blocks access to these NVIDIA devices.
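You can verify this directly from a terminal in the spawned notebook server. A small sketch that lists the device nodes visible to the current process:

```python
import glob

def visible_nvidia_devices():
    """List the /dev/nvidia* device nodes visible to the current process.

    Inside a unit started with PrivateDevices=yes this returns an empty
    list, while a plain shell on the same host should show entries such
    as /dev/nvidia0, /dev/nvidiactl and /dev/nvidia-uvm.
    """
    return sorted(glob.glob("/dev/nvidia*"))

print(visible_nvidia_devices())
```

An empty list from inside the single-user server, combined with a non-empty list from an SSH shell on the same host, confirms the device isolation is what hides the GPU.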

Is there a way to still isolate the devices and enable GPU support?



I am not sure... but I can offer some pointers to the documentation. Setting c.SystemdSpawner.isolate_devices = True adds PrivateDevices=yes to the final systemd-run call (source). Refer to the systemd documentation for more information on the PrivateDevices option.

You might be able to keep isolate_devices = True and then explicitly mount the NVIDIA devices, though I do not know exactly how to do that...
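One possible direction, sketched but untested: SystemdSpawner has a unit_extra_properties option for passing extra systemd unit properties to systemd-run. In principle, BindPaths could bind the NVIDIA device nodes into the unit's private /dev and DeviceAllow could grant cgroup access to them. Whether this actually restores GPU access alongside PrivateDevices=yes is an assumption, not something I have verified:

```python
# jupyterhub_config.py -- untested sketch. unit_extra_properties is a
# SystemdSpawner option, but whether these particular systemd properties
# work together with PrivateDevices=yes is an open question.
c.SystemdSpawner.isolate_devices = True
c.SystemdSpawner.unit_extra_properties = {
    # Bind the host's NVIDIA device nodes into the unit's private /dev.
    'BindPaths': '/dev/nvidiactl /dev/nvidia-uvm /dev/nvidia0',
    # Allow the unit's device cgroup to read/write one of them; multiple
    # DeviceAllow entries may be needed, one per device node.
    'DeviceAllow': '/dev/nvidia0 rw',
}
```

If that does not work, running with isolate_devices = False (as above) remains the known-good configuration.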

Regarding "python - JupyterHub single-user server cannot use TensorFlow GPU support via systemdspawner", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60187732/
