python - PyTorch model returns NaN after the first round

Tags: python machine-learning deep-learning pytorch backpropagation

This is my first CNN written with PyTorch. I finally got the code running to the point where it generates output for the first batch of data, but it produces nan on the second batch. I have simplified the model considerably for debugging purposes, but it still fails. The model shown here is just a few fully connected layers with a linear output.

I suspect the problem is in the backpropagation step, but I am not sure where it goes wrong or why.

Here is a heavily simplified version of the model that still produces the error:

The data loader:

batch_size = 36
device = 'cuda'
# note "rollaxis" to move channel from last to first dimension
# X_train is n input images x 70 width x 70 height x 3 channels
# Y_train is n doubles
torch_train = utils.TensorDataset(torch.from_numpy(np.rollaxis(X_train, 3, 1)).float(), torch.from_numpy(Y_train).float())
train_loader = utils.DataLoader(torch_train, batch_size=batch_size, shuffle=True)
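
As a quick sanity check on that channel move, here is a minimal sketch (using a dummy array with the shapes stated in the comments above) of what np.rollaxis does to the layout:

import numpy as np

# dummy batch with the layout described above: n x 70 width x 70 height x 3 channels
X_demo = np.zeros((4, 70, 70, 3))
X_moved = np.rollaxis(X_demo, 3, 1)   # move the channel axis to position 1
print(X_moved.shape)                  # (4, 3, 70, 70) -- channels-first, as PyTorch expects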

Define and create the model:

def MyCNN(**kwargs):
    return MyCNN_model_simple(**kwargs)

# switched from Sequential() style to assist debugging
class MyCNN_model_simple(nn.Module):
    def __init__(self, **kwargs):
        super(MyCNN_model_simple, self).__init__()
        self.fc1 = FullyConnected( 3 * 70 * 70, 100)
        self.fc2 = FullyConnected( 100, 100)
        self.last = nn.Linear(100, 1)
#         self.net = nn.Sequential(
#             self.fc1,
#             self.fc2,
#             self.last,
#             nn.Flatten()
#         )
    def forward(self, x):
        print(f"x shape A: {x.shape}")
        x = torch.flatten(x, 1)
        print(f"x shape B: {x.shape}")
        x = self.fc1(x)
        print(f"x shape C: {x.shape}")
        x = self.fc2(x)
        print(f"x shape D: {x.shape}")
        x = self.last(x)
        print(f"x shape E: {x.shape}")
        x = torch.flatten(x)
        print(f"x shape F: {x.shape}")
        return x
#        return self.net(x)

class FullyConnected(nn.Module):
    def __init__(self, in_channels, out_channels, dropout=None):
        super(FullyConnected, self).__init__()       
        layers = []
        layers.append(nn.Linear(in_channels, out_channels, bias=True))
        layers.append(nn.ReLU())
        if dropout is not None:
            layers.append(nn.Dropout(p=dropout)) 
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MyCNN()
# convert to 16-bit half-precision to save memory
model.half()
model.to(torch.device('cuda'))

Run the model:

loss_fn = nn.MSELoss()
dev = torch.device('cuda')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
losses = []
max_batches = 2

def process_batch():
    inputs = images.half().to(dev)
    values = scores.half().to(dev)
    
    # clear accumulated gradients
    optimizer.zero_grad()
    # make predictions
    outputs = model(inputs)
    # calculate and save the loss
    model_out = torch.flatten(outputs)
    print(f"Outputs: {model_out}")
    loss = loss_fn(model_out.half(), torch.flatten(values))
    losses.append( loss.item() )
    # backpropagate the loss
    loss.backward()
    # adjust parameters to computed gradients
    optimizer.step()


model.train()
i = 0
for images, scores in train_loader:
    process_batch()
    i += 1
    if i > max_batches: break

Standard output:

x shape A: torch.Size([36, 3, 70, 70])
x shape B: torch.Size([36, 9800])
x shape C: torch.Size([36, 100])
x shape D: torch.Size([36, 100])
x shape E: torch.Size([36, 1])
x shape F: torch.Size([36])
Outputs: tensor([0.0406, 0.0367, 0.0446, 0.0529, 0.0406, 0.0391, 0.0397, 0.0391, 0.0415,
        0.0443, 0.0410, 0.0406, 0.0349, 0.0396, 0.0368, 0.0401, 0.0343, 0.0419,
        0.0428, 0.0385, 0.0345, 0.0431, 0.0287, 0.0328, 0.0309, 0.0416, 0.0473,
        0.0352, 0.0422, 0.0375, 0.0428, 0.0345, 0.0368, 0.0319, 0.0365, 0.0382],
       device='cuda:0', dtype=torch.float16, grad_fn=<AsStridedBackward>)

x shape A: torch.Size([36, 3, 70, 70])
x shape B: torch.Size([36, 9800])
x shape C: torch.Size([36, 100])
x shape D: torch.Size([36, 100])
x shape E: torch.Size([36, 1])
x shape F: torch.Size([36])
Outputs: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       device='cuda:0', dtype=torch.float16, grad_fn=<AsStridedBackward>)

x shape A: torch.Size([36, 3, 70, 70])
x shape B: torch.Size([36, 9800])
x shape C: torch.Size([36, 100])
x shape D: torch.Size([36, 100])
x shape E: torch.Size([36, 1])
x shape F: torch.Size([36])
Outputs: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       device='cuda:0', dtype=torch.float16, grad_fn=<AsStridedBackward>)

You can see the nan values coming out of the model starting with the second batch. Is there something obviously wrong in what I am doing? If anyone has tips on best practices for debugging PyTorch module runs that I could use to track down the problem, that would be very helpful.

Thanks.

Best answer

Switch to full precision when updating the gradients, and back to half precision for training:

loss.backward()
model.float() # add this here
optimizer.step()

Switch back to half precision:

for images, scores in train_loader:
    model.half()  # add this here
    process_batch()
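
Manually toggling model.half() / model.float() every batch works but is fragile. As an alternative sketch (not part of the original answer; it assumes PyTorch 1.6+ with a CUDA device and reuses the model, optimizer, loss_fn and dev objects defined above), automatic mixed precision keeps the weights in float32 and scales the loss so fp16 gradients do not overflow or vanish:

import torch

scaler = torch.cuda.amp.GradScaler()        # scales the loss to keep fp16 gradients finite

def process_batch_amp(images, scores):      # hypothetical variant of process_batch
    inputs = images.to(dev)                 # keep inputs and model in float32; autocast casts per op
    values = scores.to(dev)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():         # runs ops in fp16 where safe, fp32 elsewhere
        outputs = model(inputs)
        loss = loss_fn(torch.flatten(outputs), torch.flatten(values))
    scaler.scale(loss).backward()           # backprop through the scaled loss
    scaler.step(optimizer)                  # unscales gradients; skips the step if they are inf/nan
    scaler.update()

With this approach the model stays in float32 (no model.half() call), so Adam's update runs on full-precision weights. While debugging, torch.autograd.set_detect_anomaly(True) can also help pinpoint which operation first produces a NaN gradient.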

Regarding python - PyTorch model returns NaN after the first round, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58457901/
