python - How to solve dist.init_process_group hanging (or deadlocking)?

Tags: python machine-learning pytorch gpu multi-gpu

I am trying to set up DDP (distributed data parallel) on a DGX A100, but it doesn't work. Whenever I try to run it, it hangs. My code is very simple: it just spawns 4 processes for 4 GPUs (for debugging I immediately destroy the group, but it doesn't even get that far):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def find_free_port():
    """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
    import socket
    from contextlib import closing

    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])

def setup_process(rank, world_size, backend='gloo'):
    """
    Initialize the distributed environment (for each process).

    gloo: is a collective communications library (https://github.com/facebookincubator/gloo). My understanding is that
    it's a library/API for processes to communicate/coordinate with each other and with the master. It's a backend library.

    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_IB_DISABLE=1

    https://stackoverflow.com/questions/61075390/about-pytorch-nccl-error-unhandled-system-error-nccl-version-2-4-8

    https://pytorch.org/docs/stable/distributed.html#common-environment-variables
    """
    if rank != -1:  # -1 rank indicates serial code
        print(f'setting up rank={rank} (with world_size={world_size})')
        # MASTER_ADDR = 'localhost'
        MASTER_ADDR = '127.0.0.1'
        MASTER_PORT = find_free_port()
        # set up the master's ip address so this child process can coordinate
        os.environ['MASTER_ADDR'] = MASTER_ADDR
        print(f"{MASTER_ADDR=}")
        os.environ['MASTER_PORT'] = MASTER_PORT
        print(f"{MASTER_PORT}")

        # - use NCCL if you are using gpus: https://pytorch.org/tutorials/intermediate/dist_tuto.html#communication-backends
        if torch.cuda.is_available():
            # unsure if this is really needed
            # os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'
            # os.environ['NCCL_IB_DISABLE'] = '1'
            backend = 'nccl'
        print(f'{backend=}')
        # Initializes the default distributed process group, and this will also initialize the distributed package.
        dist.init_process_group(backend, rank=rank, world_size=world_size)
        # dist.init_process_group(backend, rank=rank, world_size=world_size)
        # dist.init_process_group(backend='nccl', init_method='env://', world_size=world_size, rank=rank)
        print(f'--> done setting up rank={rank}')
        dist.destroy_process_group()

mp.spawn(setup_process, args=(4,), world_size=4)
Why does this hang?
nvidia-smi output:
$ nvidia-smi
Fri Mar  5 12:47:17 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   26C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   25C    P0    52W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   25C    P0    51W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   25C    P0    51W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
| N/A   30C    P0    52W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                    0 |
| N/A   29C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                    0 |
| N/A   48C    P0   231W / 400W |   7500MiB / 40537MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    7   N/A  N/A    147243      C   python                           7497MiB |
+-----------------------------------------------------------------------------+
How do I set up DDP on this new machine?

Update
By the way, I have successfully installed APEX, since some other links suggested doing so, but it still fails. What I did: I went to https://github.com/NVIDIA/apex and followed their instructions:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
But before the above, I had to update gcc:
conda install -c psi4 gcc-5
It did install, and I can import it successfully, but it didn't help.

Now it actually prints an error message:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
KeyboardInterrupt
Process SpawnProcess-3:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 252, in train
    setup_process(rank, world_size=opts.world_size)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/distributed.py", line 85, in setup_process
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.

During handling of the above exception, another exception occurred:

Related:
  • https://github.com/pytorch/pytorch/issues/9696
  • https://discuss.pytorch.org/t/dist-init-process-group-hangs-silently/55347/2
  • https://forums.developer.nvidia.com/t/imagenet-hang-on-dgx-1-when-using-multiple-gpus/61919
  • apex suggestion: https://discourse.mozilla.org/t/hangs-on-dist-init-process-group-in-distribute-py/44686
  • https://github.com/pytorch/pytorch/issues/15638
  • https://github.com/pytorch/pytorch/issues/53395

Best answer:

    The following fix is based on Writing Distributed Applications with PyTorch, Initialization Methods.
    Issue 1:
    mp.spawn() will hang unless you pass it nprocs=world_size. In other words, it is waiting for the "whole world" to show up, process-wise.
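
    For example, applied to the spawn call from the question, a minimal sketch of the fix (keeping the question's setup_process signature, where mp.spawn supplies the rank as the first argument automatically):

    # nprocs tells mp.spawn how many processes to launch; args only carries the
    # arguments after rank, here world_size=4.
    mp.spawn(setup_process, args=(4,), nprocs=4)
    # Note: Issue 2 below still applies -- setup_process must also see the same
    # MASTER_ADDR/MASTER_PORT in every process, or the rendezvous will still hang.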

    Issue 2:
    MASTER_ADDR and MASTER_PORT need to be the same in every process's environment, and they need to be a free address:port combination on the machine that runs the rank 0 process. (Note that the question's code calls find_free_port() inside each spawned process, so every rank would end up with a different MASTER_PORT; the port should be chosen once in the parent and passed to the children, as the code below does.)

    Both of these are implied by, or read directly from, the following quote from the link above (emphasis added):

    Environment Variable

    We have been using the environment variable initialization method throughout this tutorial. By setting the following four environment variables on all machines, all processes will be able to properly connect to the master, obtain information about the other processes, and finally handshake with them.

    MASTER_PORT: A free port on the machine that will host the process with rank 0.

    MASTER_ADDR: IP address of the machine that will host the process with rank 0.

    WORLD_SIZE: The total number of processes, so that the master knows how many workers to wait for.

    RANK: Rank of each process, so they will know whether it is the master or a worker.



    Here is some code to demonstrate both:
    import torch
    import torch.multiprocessing as mp
    import torch.distributed as dist
    import os
    
    def find_free_port():
        """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
        import socket
        from contextlib import closing
    
        with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
            s.bind(('', 0))
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            return str(s.getsockname()[1])
    
    
    def setup_process(rank, master_addr, master_port, world_size, backend='gloo'):
        print(f'setting up {rank=} {world_size=} {backend=}')
    
        # set up the master's ip address so this child process can coordinate
        os.environ['MASTER_ADDR'] = master_addr
        os.environ['MASTER_PORT'] = master_port
        print(f"{master_addr=} {master_port=}")
    
        # Initializes the default distributed process group, and this will also initialize the distributed package.
        dist.init_process_group(backend, rank=rank, world_size=world_size)
        print(f"{rank=} init complete")
        dist.destroy_process_group()
        print(f"{rank=} destroy complete")
            
    if __name__ == '__main__':
        world_size = 4
        master_addr = '127.0.0.1'
        master_port = find_free_port()
        mp.spawn(setup_process, args=(master_addr, master_port, world_size), nprocs=world_size)
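
    Since the question targets a multi-GPU machine, here is a minimal sketch of the same setup with the nccl backend. The setup_nccl name, the torch.cuda.set_device(rank) call, the fixed port '29500', and the all_reduce sanity check are illustrative additions, not part of the fix above:

    import os
    import torch
    import torch.multiprocessing as mp
    import torch.distributed as dist

    def setup_nccl(rank, master_addr, master_port, world_size):
        # Same rendezvous variables as above, identical in every process.
        os.environ['MASTER_ADDR'] = master_addr
        os.environ['MASTER_PORT'] = master_port

        # nccl is the recommended backend when each process drives one GPU.
        dist.init_process_group('nccl', rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)  # bind this process to GPU `rank`

        # Sanity check: after all_reduce (default op is SUM), every rank holds world_size.
        t = torch.ones(1, device=f'cuda:{rank}')
        dist.all_reduce(t)
        print(f'{rank=} all_reduce result: {t.item()}')

        dist.destroy_process_group()

    if __name__ == '__main__':
        world_size = 4  # or torch.cuda.device_count()
        # '29500' is just a placeholder; any free port works (e.g. find_free_port() above).
        mp.spawn(setup_nccl, args=('127.0.0.1', '29500', world_size), nprocs=world_size)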
    

    Regarding "python - How to solve dist.init_process_group hanging (or deadlocking)?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66498045/
