python - Why aren't the gradients equal when using loss.backward() vs. torch.autograd.grad()?

Tags: python pytorch gradient-descent autograd stochastic-gradient

I ran into this strange behavior while trying to "manually" optimize a network's parameters via SGD. Updating the model's parameters in the following way works just fine:

for _ in trange(epochs):
    for x, y in train_loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        loss = F.cross_entropy(m(x), y)
        grad = torch.autograd.grad(loss, m.parameters())
        with torch.no_grad():
            for p, g in zip(m.parameters(), grad):
                p -= 0.1 * g

However, doing the following throws the model completely off:

for _ in trange(epochs):
    for x, y in train_loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        loss = F.cross_entropy(m(x), y)
        loss.backward()
        with torch.no_grad():
            for p in m.parameters():
                p -= 0.1 * p.grad

But to me the two methods should be equivalent. On closer inspection, comparing the values of g from grad with the values of p.grad from m.parameters(), it turned out that the gradient values are the same! I also tried removing with torch.no_grad(): and doing the following instead, but it didn't work either:

        for p in m.parameters():
            p.data -= 0.1 * p.grad

Can someone explain why this is happening? Shouldn't the gradients in both methods have the same values (keeping in mind that both models m are identical)?

Reproducible example:
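
The post never shows its imports; the snippets below assume roughly the following. The metric object used in the evaluation loops is also never defined, so the torchmetrics line is only a guess:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from copy import deepcopy
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import trange

# `metric` is never defined in the post; presumably something like (hypothetical):
# metric = torchmetrics.Accuracy(task='multiclass', num_classes=10).to(device)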

Ensuring reproducibility:

device = torch.device('cuda')
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
np.random.seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.cuda.empty_cache()

Loading the data:

T = transforms.ToTensor()
train_data = datasets.MNIST(root='data', transform=T, download=True)
test_data = datasets.MNIST(root='data', transform=T, train=False, download=True)

BS = 300
epochs = 5
LR = 0.1

train_loader = DataLoader(train_data, batch_size=BS, pin_memory=True)
test_loader = DataLoader(test_data, batch_size=1000, pin_memory=True)

Defining the model to optimize:

class Model(nn.Module):
    def __init__(self, out_dims):
        super().__init__()
        self.conv1 = nn.Conv2d(1, out_dims, 3, stride=3, padding=1)
        self.conv2 = nn.Sequential(nn.Conv2d(out_dims, out_dims * 2, 3), nn.BatchNorm2d(out_dims * 2), nn.ReLU())
        self.conv3 = nn.Sequential(nn.Conv2d(out_dims * 2, out_dims * 4, 4, stride=2, padding=1), nn.BatchNorm2d(out_dims * 4), nn.ReLU(), nn.Flatten())
        self.fc = nn.Linear(out_dims * 4 * 16, 10)

    def forward(self, x):
        # chain the registered submodules (conv1 -> conv2 -> conv3 -> fc) in definition order
        return nn.Sequential(*tuple(self.children()))(x)


m1 = Model(5).to(device)
m2 = deepcopy(m1)  # "m2.load_state_dict(m1.state_dict())" doesn't work either

Training and evaluation:

# M1's training:
for _ in trange(epochs):
    for x, y in train_loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        loss = F.cross_entropy(m1(x), y)
        grad = torch.autograd.grad(loss, m1.parameters())
        with torch.no_grad():
            for p, g in zip(m1.parameters(), grad):
                p -= LR * g
                
# M1's evaluation:
m1.eval()
acc1 = []
with torch.no_grad():
    for x, y in test_loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        _, pred = m1(x).max(1)
        acc1.append(metric(pred, y).item())

print(f'Accuracy: {np.mean(acc1) * 100:.4}%')


# M2's training:
for _ in trange(epochs):
    for x, y in train_loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        loss = F.cross_entropy(m2(x), y)
        loss.backward()
        with torch.no_grad():
            for p in m2.parameters():
                p -= LR * p.grad

# M2's evaluation:
m2.eval()
acc2 = []
with torch.no_grad():
    for x, y in test_loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        _, pred = m2(x).max(1)
        acc2.append(metric(pred, y).item())

print(f'Accuracy: {np.mean(acc2) * 100:.4}%')

Best answer

It took me a while to figure out, but the problem lies in loss.backward(). Unlike autograd.grad(), which computes and returns the gradients, the in-place backward() computes and accumulates the gradients of the participating nodes in the computation graph. In other words, the two have the same effect when used to back-propagate once, but every repeated call to backward() adds the currently computed gradients to all the previous ones (hence the divergence). Resetting the gradients with model.zero_grad() fixes the issue. (This is also why the p.data variant didn't help: it only bypasses autograd's tracking of the update itself, while the stale, accumulated values in p.grad are still read.)
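
The accumulation is easy to see on a toy scalar (a quick illustration, not from the original answer):

w = torch.tensor(1.0, requires_grad=True)
(2 * w).backward()
print(w.grad)  # tensor(2.)
(2 * w).backward()
print(w.grad)  # tensor(4.) -- the second backward() added onto the first

Applied to the question's code, the fix is a single added line; a minimal sketch of the corrected M2 loop:

# M2's training, fixed: reset gradients before each backward pass
for _ in trange(epochs):
    for x, y in train_loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        loss = F.cross_entropy(m2(x), y)
        m2.zero_grad()   # clear the gradients accumulated in p.grad so far
        loss.backward()  # p.grad now holds exactly this batch's gradients
        with torch.no_grad():
            for p in m2.parameters():
                p -= LR * p.grad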

Regarding "python - Why aren't the gradients equal when using loss.backward() vs. torch.autograd.grad()?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70668522/
