python - Setting up TensorFlow 2.4 with GPU on Ubuntu 20.04 without sudo

Tags: python tensorflow ubuntu ubuntu-20.04

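I have access to a virtual machine with Ubuntu 20.04 and GPUs. The sysadmin has already installed the latest CUDA driver, but unfortunately that is not enough to use the GPUs in TensorFlow, because each TF version can be very picky about the specific CUDA Toolkit + cuDNN version combination. I do not have sudo privileges, so I need to install everything locally.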

nvidia-smi

returns Driver Version: 465.19.01, CUDA Version: 11.3

python -c "import tensorflow as tf, logging; logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s'); tf.config.list_physical_devices('GPU');"

returns

2021-05-11 10:56:26.737279: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:26.737338: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-05-11 10:56:28.313896: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-11 10:56:28.315540: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-11 10:56:28.324232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-11 10:56:28.324707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:05.0 name: NVIDIA Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-11 10:56:28.324867: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-11 10:56:28.325293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:00:06.0 name: NVIDIA Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-11 10:56:28.325438: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325563: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325706: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325820: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325931: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326028: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326117: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326215: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326230: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

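which indicates that the GPUs will not be used by TF applications: the driver library (libcuda.so.1) loads and both devices are found, but none of the CUDA 11.0 toolkit libraries or cuDNN 8 can be opened.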
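As a quick way to see which of these libraries the dynamic loader can find, without starting TensorFlow, here is a minimal sketch. The library names are copied from the warnings in the log above; the function name `find_missing_libs` is hypothetical. It only checks that each library can be opened in the current environment, not that the versions are mutually compatible.

```python
import ctypes

# Library names taken from the dso_loader warnings in the log above.
REQUIRED_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def find_missing_libs(names):
    """Return the subset of `names` that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the same search path as dlopen()
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in find_missing_libs(REQUIRED_LIBS):
        print("missing:", name)
```

Because LD_LIBRARY_PATH is read when the process starts, the script has to be run from a shell in which the variable is already set.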

I had to spend some time setting up the VM, so I am posting my solution step by step below.

Best answer

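Instructions for setting up TensorFlow 2.4.x (tested with 2.4.1) on Ubuntu 20.04 without admin rights. It is assumed that the sysadmin has already installed the latest CUDA driver. The setup consists of installing the CUDA 11.0 toolkit + cuDNN 8.2.0.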

The instructions below install CUDA 11.0 (tested to work with TensorFlow 2.4.1) under the directory /home/pherath/cuda_toolkits/cuda-11.0, with no sudo privileges required.

Step 1. Download CUDA 11.0

wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
chmod +x cuda_11.0.2_450.51.05_linux.run

Step 2, option 1: for a quick, automated install, use the following

./cuda_11.0.2_450.51.05_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-11.0

Step 2, option 2: a visual step-by-step guide

./cuda_11.0.2_450.51.05_linux.run

Continue, then accept the EULA.

Leave only Cuda Toolkit checked, uncheck everything else. Then go to Options.

Go into Toolkit Options.

Uncheck everything, then go to Change Toolkit Install Path and replace it with /home/pherath/cuda_toolkits/cuda-11.0. After this step, proceed with Install.

Step 3. Download the CUDA 11.0 patch

wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda_11.0.3_450.51.06_linux.run
chmod +x cuda_11.0.3_450.51.06_linux.run

Step 4, option 1: quick silent mode

./cuda_11.0.3_450.51.06_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-11.0

Step 4, option 2: GUI mode. Repeat the exact steps of Step 2, option 2.

The installation may end with an error. When I checked the logs, the error suggested that there might be a bug in the installation script: the only offending item is the symbolic link for a single file.

[ERROR]: boost::filesystem::create_symlink: File exists: "libcuinj64.so.11.0", "/home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib/libcuinj64.so"

I ran into the same kind of single-file error for a few other libraries in attempts on other distributions (e.g. on Ubuntu 16.04):
libcuinj64.so.11.0, libaccinj64.so.11.0, libnvrtc-builtins.so.11.0

This error can be fixed with the following two lines

cd /home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib # move to the dir named in the error
ln -sf libaccinj64.so.11.0 libaccinj64.so # recreate the link the right way round (libaccinj64.so -> libaccinj64.so.11.0); -f replaces the existing entry; substitute the library named in your own log
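If several of the libraries listed above are affected, the same repair can be applied in a loop. The following is a minimal sketch, not part of the installer: the function name `ensure_unversioned_symlinks` is hypothetical, and it assumes each affected library follows the libfoo.so.X.Y naming shown in the error message. It leaves regular files untouched and only replaces existing symbolic links.

```python
import os


def ensure_unversioned_symlinks(lib_dir, versioned_names):
    """For each 'libfoo.so.X.Y' present in lib_dir, make 'libfoo.so' point at it.

    Returns the list of link names that were (re)created.
    """
    created = []
    for target in versioned_names:
        if not os.path.exists(os.path.join(lib_dir, target)):
            continue  # this library is not installed; nothing to link to
        link_name = target.split(".so")[0] + ".so"
        link_path = os.path.join(lib_dir, link_name)
        if os.path.islink(link_path):
            os.remove(link_path)  # drop a stale or wrongly ordered link
        elif os.path.exists(link_path):
            continue  # a regular file is in the way; leave it alone
        os.symlink(target, link_path)  # relative link, like `ln -s` run inside lib_dir
        created.append(link_name)
    return created


if __name__ == "__main__":
    # Path and names as used in this answer; adjust them to your own log.
    lib_dir = "/home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib"
    names = ["libcuinj64.so.11.0", "libaccinj64.so.11.0", "libnvrtc-builtins.so.11.0"]
    if os.path.isdir(lib_dir):
        print(ensure_unversioned_symlinks(lib_dir, names))
```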

Step 5. Download cuDNN 8.2.0

cd /home/pherath/cuda_toolkits # move back to the parent of previous dir

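You need to download the cuDNN .tgz file from the cuDNN archive; I used v8.2.0. This step requires creating an account with NVIDIA and downloading through the web interface. If the machine you are setting TensorFlow up on has no GUI, I suggest using the "Link Redirect Trace" add-on (available for Google Chrome) to trace the exact link the file is downloaded from. You can trace the link on a local machine with a GUI and then download the traced link on the VM with wget. Note that the traced link has a relatively short lifetime.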

The downloaded file may still have an obfuscated name; rename it back to .tgz

mv $some_ambiguous_name cudnn-11.3-linux-x64-v8.2.0.53.tgz

Now extract it in the parent directory of the CUDA installation

tar -xvzf cudnn-11.3-linux-x64-v8.2.0.53.tgz # this will extract things under a dir called 'cuda'

Now copy everything in lib64 and include into the corresponding directories of the CUDA toolkit installation

cp -fv cuda/lib64/*.* cuda-11.0/lib64/.
cp -fv cuda/include/*.* cuda-11.0/include/.
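To confirm that the headers landed in the toolkit directory, one option is to read the version macros from cudnn_version.h, which cuDNN 8 ships in its include directory. This is a minimal sketch; the function name `read_cudnn_version` is hypothetical, and the path in the usage part is the one used in this answer.

```python
import os
import re


def read_cudnn_version(header_path):
    """Return (major, minor, patch) from a cuDNN version header, or None if unavailable."""
    if not os.path.isfile(header_path):
        return None
    with open(header_path) as fh:
        text = fh.read()
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+%s\s+(\d+)" % key, text)
        if match is None:
            return None  # header does not look like a cuDNN version header
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    header = "/home/pherath/cuda_toolkits/cuda-11.0/include/cudnn_version.h"
    print(read_cudnn_version(header))
```

A result of None means the header is missing or unreadable, which usually points at the copy step rather than at TensorFlow.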

Step 6. Create/append/prepend the PATH and LD_LIBRARY_PATH environment variables.

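Add the following lines to the end of your ~/.bashrc (otherwise, make sure to extend the corresponding environment variables in every bash session from which you will run TF scripts).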

export CUDA11=/home/pherath/cuda_toolkits/cuda-11.0
export PATH=$CUDA11/bin:$PATH
export LD_LIBRARY_PATH=$CUDA11/lib64:$CUDA11/extras/CUPTI/lib64:$LD_LIBRARY_PATH

Start a new terminal, or run

source ~/.bashrc

in every existing terminal.
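A small check that the current shell actually carries the new entries can save a confusing debugging session later. The sketch below assumes the CUDA11 variable from the export lines above; the function name `missing_path_entries` is hypothetical.

```python
import os


def missing_path_entries(var_value, required_dirs):
    """Return the directories from required_dirs that are absent from a PATH-style string."""
    entries = [os.path.normpath(p) for p in var_value.split(os.pathsep) if p]
    return [d for d in required_dirs if os.path.normpath(d) not in entries]


if __name__ == "__main__":
    # CUDA11 is the variable exported in ~/.bashrc above.
    cuda_home = os.environ.get("CUDA11", "")
    if not cuda_home:
        print("CUDA11 is not set in this shell")
    else:
        required = [
            os.path.join(cuda_home, "lib64"),
            os.path.join(cuda_home, "extras", "CUPTI", "lib64"),
        ]
        ld_path = os.environ.get("LD_LIBRARY_PATH", "")
        print("missing from LD_LIBRARY_PATH:", missing_path_entries(ld_path, required))
```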

Check that the installation succeeded

You can run the following lines to test that TF 2.4.1 + profiler work:

conda create -n tf python==3.7 -y  # create a python environment
conda activate tf #activate the virtual environment (here conda)
pip install tensorflow==2.4.1 # install tf 2.4.1
python -c "import tensorflow as tf, logging; logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s'); tf.config.list_physical_devices('GPU'); tf.profiler.experimental.start('.'); tf.profiler.experimental.stop()" # test to see if TF with GPU works
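The one-liner above relies on the log output to show whether the libraries loaded; it does not print the device list itself. A minimal sketch that prints the detected GPUs explicitly follows. The function name `gpu_names` is hypothetical, and it returns None when TensorFlow is not importable in the active environment.

```python
def gpu_names():
    """Return the names of the GPUs TensorFlow can see, or None if TF is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [dev.name for dev in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    names = gpu_names()
    if names is None:
        print("tensorflow is not installed in this environment")
    elif not names:
        print("no GPU registered; check the dso_loader warnings in the log")
    else:
        print("GPUs visible to TensorFlow:", names)
```

An empty list together with "Could not load dynamic library" warnings means that LD_LIBRARY_PATH still does not point at the toolkit and cuDNN libraries.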

######################################################################

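If you want to install CUDA Toolkit 10.2 on Ubuntu 20.04 LTS instead, the one-line install command changes accordingly (you need to add --librarypath and override the complaint about the gcc version mismatch).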

./cuda_10.2.89_440.33.01_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-10.2 --librarypath=/home/pherath/cuda_toolkits/cuda-10.2 --override

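Keep in mind that you also need to repeat this process for the CUDA Toolkit 10.2 patches. After that, download the matching cuDNN and copy lib64 and include into the CUDA toolkit directory (same as in the instructions above).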

######################################################################

If you still get errors, most likely the correct CUDA/NVIDIA driver is not installed. Fixing that requires sudo privileges!

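1. First, purge everything CUDA/NVIDIA related (I can't add a reference because of limited reputation..); basically, run the following with sudo privileges: apt clean; apt update; apt purge cuda; apt purge nvidia-*; apt autoremove; apt install cuda

2. Follow Google's instructions at https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#ubuntu-driver-steps

3. Reboot the machine.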

Regarding "python - Setting up TensorFlow 2.4 with GPU on Ubuntu 20.04 without sudo", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67483626/
