python - How to solve dist.init_process_group hanging (or deadlocking)?

Tags: python machine-learning pytorch gpu multi-gpu

I am trying to set up DDP (distributed data parallel) on a DGX A100, but it doesn't work. Whenever I try to run it, it hangs. My code is very simple: it just spawns 4 processes for 4 GPUs (for debugging I immediately destroy the group, but it doesn't even get that far):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def find_free_port():
    """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
    import socket
    from contextlib import closing

    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])

def setup_process(rank, world_size, backend='gloo'):
    """
    Initialize the distributed environment (for each process).

    gloo: is a collective communications library (https://github.com/facebookincubator/gloo). My understanding is that
    it's a library/API for processes to communicate/coordinate with each other and with the master. It's a backend library.

    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_IB_DISABLE=1

    https://stackoverflow.com/questions/61075390/about-pytorch-nccl-error-unhandled-system-error-nccl-version-2-4-8

    https://pytorch.org/docs/stable/distributed.html#common-environment-variables
    """
    if rank != -1:  # -1 rank indicates serial code
        print(f'setting up rank={rank} (with world_size={world_size})')
        # MASTER_ADDR = 'localhost'
        MASTER_ADDR = '127.0.0.1'
        MASTER_PORT = find_free_port()
        # set up the master's ip address so this child process can coordinate
        os.environ['MASTER_ADDR'] = MASTER_ADDR
        print(f"{MASTER_ADDR=}")
        os.environ['MASTER_PORT'] = MASTER_PORT
        print(f"{MASTER_PORT}")

        # - use NCCL if you are using gpus: https://pytorch.org/tutorials/intermediate/dist_tuto.html#communication-backends
        if torch.cuda.is_available():
            # unsure if this is really needed
            # os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'
            # os.environ['NCCL_IB_DISABLE'] = '1'
            backend = 'nccl'
        print(f'{backend=}')
        # Initializes the default distributed process group, and this will also initialize the distributed package.
        dist.init_process_group(backend, rank=rank, world_size=world_size)
        # dist.init_process_group(backend, rank=rank, world_size=world_size)
        # dist.init_process_group(backend='nccl', init_method='env://', world_size=world_size, rank=rank)
        print(f'--> done setting up rank={rank}')
        dist.destroy_process_group()

mp.spawn(setup_process, args=(4,), world_size=4)
Why does this hang?
nvidia-smi output:
$ nvidia-smi
Fri Mar  5 12:47:17 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   26C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   25C    P0    52W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   25C    P0    51W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   25C    P0    51W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
| N/A   30C    P0    52W / 400W |      3MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                    0 |
| N/A   29C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                    0 |
| N/A   48C    P0   231W / 400W |   7500MiB / 40537MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    7   N/A  N/A    147243      C   python                           7497MiB |
+-----------------------------------------------------------------------------+
How do I set up DDP on this new machine?

Update
By the way, I have successfully installed APEX, since some other links suggested doing so, but it still fails. What I did: I went to https://github.com/NVIDIA/apex and followed their instructions:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
But before the above, I had to update gcc:
conda install -c psi4 gcc-5
It did install, and I can import it successfully, but it didn't help.

Now it actually prints an error message:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
KeyboardInterrupt
Process SpawnProcess-3:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 252, in train
    setup_process(rank, world_size=opts.world_size)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/distributed.py", line 85, in setup_process
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.

During handling of the above exception, another exception occurred:

Related:
  • https://github.com/pytorch/pytorch/issues/9696
  • https://discuss.pytorch.org/t/dist-init-process-group-hangs-silently/55347/2
  • https://forums.developer.nvidia.com/t/imagenet-hang-on-dgx-1-when-using-multiple-gpus/61919
  • apex suggestion: https://discourse.mozilla.org/t/hangs-on-dist-init-process-group-in-distribute-py/44686
  • https://github.com/pytorch/pytorch/issues/15638
  • https://github.com/pytorch/pytorch/issues/53395

Best answer:

    The following fix is based on Writing Distributed Applications with PyTorch, Initialization Methods.
    Issue 1:
    mp.spawn() will hang unless you pass it nprocs=world_size. In other words, it is waiting for the "whole world" to show up, process-wise.
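
    For example, applied to the spawn call from the question, a minimal sketch of the fix (keeping the question's setup_process signature, where mp.spawn supplies the rank as the first argument automatically):

    # nprocs tells mp.spawn how many processes to launch; args only carries the
    # arguments after rank, here world_size=4.
    mp.spawn(setup_process, args=(4,), nprocs=4)
    # Note: Issue 2 below still applies -- setup_process must also see the same
    # MASTER_ADDR/MASTER_PORT in every process, or the rendezvous will still hang.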

    Issue 2:
    MASTER_ADDR and MASTER_PORT need to be the same in every process's environment, and they need to be a free address:port combination on the machine that runs the rank 0 process. (Note that the question's code calls find_free_port() inside each spawned process, so every rank would end up with a different MASTER_PORT; the port should be chosen once in the parent and passed to the children, as the code below does.)

    Both of these are implied by, or read directly from, the following quote from the link above (emphasis added):

    Environment Variable

    We have been using the environment variable initialization method throughout this tutorial. By setting the following four environment variables on all machines, all processes will be able to properly connect to the master, obtain information about the other processes, and finally handshake with them.

    MASTER_PORT: A free port on the machine that will host the process with rank 0.

    MASTER_ADDR: IP address of the machine that will host the process with rank 0.

    WORLD_SIZE: The total number of processes, so that the master knows how many workers to wait for.

    RANK: Rank of each process, so they will know whether it is the master or a worker.



    Here is some code to demonstrate both:
    import torch
    import torch.multiprocessing as mp
    import torch.distributed as dist
    import os
    
    def find_free_port():
        """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
        import socket
        from contextlib import closing
    
        with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
            s.bind(('', 0))
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            return str(s.getsockname()[1])
    
    
    def setup_process(rank, master_addr, master_port, world_size, backend='gloo'):
        print(f'setting up {rank=} {world_size=} {backend=}')
    
        # set up the master's ip address so this child process can coordinate
        os.environ['MASTER_ADDR'] = master_addr
        os.environ['MASTER_PORT'] = master_port
        print(f"{master_addr=} {master_port=}")
    
        # Initializes the default distributed process group, and this will also initialize the distributed package.
        dist.init_process_group(backend, rank=rank, world_size=world_size)
        print(f"{rank=} init complete")
        dist.destroy_process_group()
        print(f"{rank=} destroy complete")
            
    if __name__ == '__main__':
        world_size = 4
        master_addr = '127.0.0.1'
        master_port = find_free_port()
        mp.spawn(setup_process, args=(master_addr, master_port, world_size), nprocs=world_size)
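
    Since the question targets a multi-GPU machine, here is a minimal sketch of the same setup with the nccl backend. The setup_nccl name, the torch.cuda.set_device(rank) call, the fixed port '29500', and the all_reduce sanity check are illustrative additions, not part of the fix above:

    import os
    import torch
    import torch.multiprocessing as mp
    import torch.distributed as dist

    def setup_nccl(rank, master_addr, master_port, world_size):
        # Same rendezvous variables as above, identical in every process.
        os.environ['MASTER_ADDR'] = master_addr
        os.environ['MASTER_PORT'] = master_port

        # nccl is the recommended backend when each process drives one GPU.
        dist.init_process_group('nccl', rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)  # bind this process to GPU `rank`

        # Sanity check: after all_reduce (default op is SUM), every rank holds world_size.
        t = torch.ones(1, device=f'cuda:{rank}')
        dist.all_reduce(t)
        print(f'{rank=} all_reduce result: {t.item()}')

        dist.destroy_process_group()

    if __name__ == '__main__':
        world_size = 4  # or torch.cuda.device_count()
        # '29500' is just a placeholder; any free port works (e.g. find_free_port() above).
        mp.spawn(setup_nccl, args=('127.0.0.1', '29500', world_size), nprocs=world_size)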
    

    Regarding "python - How to solve dist.init_process_group hanging (or deadlocking)?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66498045/
