python - PyTorch multiprocessing error with Hogwild

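I've run into a mysterious error while trying to implement Hogwild with torch.multiprocessing. One version of the code runs fine, but when I add a seemingly unrelated bit of code before the multiprocessing step, the multiprocessing step somehow fails with this error:

RuntimeError: Unable to handle autograd's threading in combination with fork-based multiprocessing. See https://github.com/pytorch/pytorch/wiki/Autograd-and-Fork

I've reproduced the error in the minimal code example pasted below. If I comment out the two lines m0 = Model(); train(m0), which perform a non-parallel training run on a separate model instance, everything runs fine. I can't figure out how these lines cause the problem.
I'm running PyTorch 1.5.1 and Python 3.7.6 on a Linux machine, training on CPU only.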

import torch
import torch.multiprocessing as mp
from torch import nn

def train(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(10000):
        opt.zero_grad()
        # We train the model to output the value 4 (arbitrarily)
        loss = (model(0) - 4)**2
        loss.backward()
        opt.step()

# Toy model with one parameter tensor of size 3.
# Output is always the sum of the elements in the tensor,
# independent of the input
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.x = nn.Parameter(torch.ones(3))

    def forward(self, x):
        return torch.sum(self.x)

############################################
# Create a separate Model instance and run
# a non-parallel training run.
# For some reason, this code causes the 
# subsequent parallel run to fail.
m0 = Model()
train(m0)
print('Done with preliminary run')
############################################

num_processes = 2
model = Model()
model.share_memory()
processes = []
for rank in range(num_processes):
    p = mp.Process(target=train, args=(model,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
    
print(model.x)

Best answer

If you change the code so that the new processes are created like this:

processes = []
ctx = mp.get_context('spawn')  # start workers with 'spawn' instead of the default fork
for rank in range(num_processes):
    p = ctx.Process(target=train, args=(model,))
    p.start()
    processes.append(p)

it seems to run fine (the rest of the code is the same as yours; tested on PyTorch 1.5.0 / Python 3.6 / NVIDIA T4 GPU).
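For reference, here is a minimal sketch of the whole script with that change applied. It makes two assumptions that are not in the original posts: the steps parameter on train (the question hard-codes 10000 iterations) and the main() function behind an if __name__ == '__main__': guard. Python's spawn start method imports the main module in every worker, so the guard is there to stop each worker from re-running the launch code.

```python
import torch
import torch.multiprocessing as mp
from torch import nn


class Model(nn.Module):
    """Toy model from the question: the output is the sum of three parameters."""

    def __init__(self):
        super().__init__()
        self.x = nn.Parameter(torch.ones(3))

    def forward(self, x):
        return torch.sum(self.x)


def train(model, steps=1000):
    # `steps` is an assumption added for this sketch; the question uses 10000.
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(steps):
        opt.zero_grad()
        loss = (model(0) - 4) ** 2
        loss.backward()
        opt.step()


def main(num_processes=2):
    # Preliminary non-parallel run: this calls .backward() in the parent process.
    train(Model())
    print('Done with preliminary run')

    model = Model()
    model.share_memory()
    # 'spawn' starts each worker in a fresh interpreter instead of forking the parent.
    ctx = mp.get_context('spawn')
    processes = []
    for rank in range(num_processes):
        p = ctx.Process(target=train, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    print(model.x)


if __name__ == '__main__':
    # Guard so that spawned workers importing this file do not call main() again.
    main()
```

Because the model's parameters are placed in shared memory before the workers start, both workers update the same tensor, which is the Hogwild setup the question is after.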
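I'm not entirely sure what carries over from the non-parallel run to the parallel run; I tried creating a completely new model for the two runs (with its own class), and/or deleting anything from the original model, and/or making sure to delete any tensors and free memory, but none of that made any difference.
What did make a difference was making sure that .backward() was never called outside mp.Process() before it was called by a function inside mp.Process(). I think what may carry over is an autograd thread: if the thread exists before multiprocessing with the default fork method, it fails; if the thread is created after the fork, it seems to work fine; and if spawn is used, it also works fine.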
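The general behaviour this hypothesis relies on can be shown without PyTorch. The sketch below uses only the standard library and is an illustration, not PyTorch's actual code: a forked child contains only the thread that called fork, so any background thread started earlier in the parent does not exist there. The helper names count_threads and threads_seen_by_child are made up for this example, and it assumes a platform where the fork start method is available (such as Linux).

```python
import multiprocessing as mp
import threading


def count_threads(result):
    # Runs in the child process: record how many threads are alive there.
    result.value = threading.active_count()


def threads_seen_by_child(start_method):
    # Launch one child with the given start method and return its thread count.
    ctx = mp.get_context(start_method)
    result = ctx.Value('i', -1)
    p = ctx.Process(target=count_threads, args=(result,))
    p.start()
    p.join()
    return result.value


if __name__ == '__main__':
    # Start a background thread in the parent, standing in for a worker thread
    # that a library might create before the fork.
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()

    print('threads in parent:', threading.active_count())
    print('threads in forked child:', threads_seen_by_child('fork'))

    stop.set()
    worker.join()
```

A library that expects its worker threads to still be running after the fork cannot rely on them in the child, which is consistent with the error only appearing when .backward() has already run in the parent.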
By the way: this is a really interesting question - thank you especially for distilling it down to a minimal example!

Regarding "python - PyTorch multiprocessing error with Hogwild", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63081486/
