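python - Handling a large slowdown when moving PyTorch code to the GPU

I have a graph neural network model written in PyTorch. Performance on my CPU was not great, so I tried porting it to a V100 GPU that I have access to. In the process, performance dropped sharply (about 10x slower).

I have two ideas about where the problem might be, but I need some input to get the best performance out of my model. The first possible problem is my custom graph convolution layer: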

import torch
import torch.nn as nn


class GraphConvLayer(torch.nn.Module):
    """
    Based, basically, on https://arxiv.org/abs/1609.02907
    Have some modifications:
    https://towardsdatascience.com/how-to-do-deep-learning-on-graphs-with-graph-convolutional-networks-7d2250723780
    This helped:
    https://pytorch.org/docs/master/notes/extending.html

    """
    def __init__(self, input_features, output_features, device, bias=True):
        super(GraphConvLayer, self).__init__()

        self.input_features = input_features
        self.output_features = output_features
        self.device = device

        self.weight = nn.Parameter(torch.FloatTensor(self.input_features, self.output_features))

        if bias:
            self.bias = nn.Parameter(torch.FloatTensor(self.output_features))
        else:
            self.register_parameter('bias', None)

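        # Not a very smart way to initialize weights
        self.weight.data.uniform_(-0.1, 0.1)
        # Check the registered parameter, not the boolean flag: with
        # bias=False the flag is still "not None" and self.bias is None.
        if self.bias is not None:
            self.bias.data.uniform_(-0.1, 0.1)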

    def forward(self,input, adj):
        # Here, we put in the forward pass:
        # Our forward pass needs to be:
        # D^-1 * (A + I) * X * weights
        input, adj = input.float(), adj.float()

        Identity = torch.eye( len(adj[0]), device = self.device)
        A_hat = adj + Identity

        D = torch.sum(A_hat, dim=0)
        len_D = len(D)
        zero = torch.zeros(len_D,len_D, device = self.device)
        mask = torch.diag(torch.ones_like(D, device = self.device))
        D = mask*torch.diag(D) + (1. - mask)*zero

        D_inv = torch.inverse(D)
        out = torch.mm(input, self.weight)
        out = torch.spmm(A_hat,out)
        out = torch.spmm(D_inv, out)

        if self.bias is not None:
            return out + self.bias
        else:
            return out

    def extra_repr(self):
        # (Optional) Set the extra information about this module. You can test
        # it by printing an object of this class.
        return 'input_features={}, output_features={}, bias={}'.format(
            self.input_features, self.output_features, self.bias is not None
        )

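Specifically, in the forward pass I perform the series of transformations described in the Towards Data Science link in the class docstring. Is there anything here that would cause such a large slowdown? As far as I can tell, the tensors are all created on the GPU.

Second, because my graphs all have different sizes, I am forced to use a batch size of 1. My training loop looks like this: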

        for batch in tqdm(train_loader):
            opt.zero_grad()
            adjacency, features, _, nodes = batch
            adjacency = adjacency.to(device)
            features = features.to(device)
            nodes = nodes.to(device)

            output = model(features[0], adjacency[0])

            loss = F.nll_loss(output, nodes[0])
            loss.backward()
            opt.step()

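This means (as I understand it) that every sample is moved to the GPU individually on each iteration. That seems like an obvious source of inefficiency. Is there a way to move all the data into GPU memory once, outside the training loop, so that I can remove the adjacency = adjacency.to(device) line?

Any help would be much appreciated.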

Best answer

Your problem is almost certainly related to moving memory to and from the GPU, especially since you mention your single-sample batches.
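As a rough sketch of the one-time transfer the question asks about: if the whole dataset fits in GPU memory, each sample can be copied to the device once before training and reused every epoch. The helper name preload_to_device is hypothetical, and the 4-tuple batch layout with a leading batch dimension of 1 is assumed from the training loop in the question.

```python
import torch


def preload_to_device(loader, device):
    """Copy every sample to `device` once and keep the results in a list.

    Assumes each batch is (adjacency, features, _, nodes) with a leading
    batch dimension of size 1, as in the question's training loop.
    """
    cached = []
    for adjacency, features, _, nodes in loader:
        cached.append((
            adjacency[0].to(device),  # drop the batch dimension of size 1
            features[0].to(device),
            nodes[0].to(device),
        ))
    return cached
```

The training loop would then iterate over the cached list (for adjacency, features, nodes in cached: ...) with no .to(device) calls inside it. This only helps when the combined size of all graphs is smaller than the available GPU memory.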

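Possibly the only thing that could speed up your current implementation is looking into memory maps; from the code provided, we cannot tell whether you are already using them.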

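Beyond that, even though the adjacency matrices differ in size, padding could be a valid strategy if you manage to sort the batches so that the graphs in each one are of roughly equal size.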
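A minimal sketch of what such padding could look like, assuming each graph comes as a dense (n, n) adjacency matrix and an (n, f) feature matrix. The function name pad_graph_batch and the returned mask are illustrative choices, not something from the question's code.

```python
import torch


def pad_graph_batch(adjs, feats):
    """Zero-pad graphs to a common node count and stack them into one batch.

    adjs:  list of (n_i, n_i) adjacency tensors
    feats: list of (n_i, f) feature tensors with the same f
    Returns the padded adjacency (B, n_max, n_max), padded features
    (B, n_max, f), and a boolean mask (B, n_max) marking real nodes.
    """
    n_max = max(a.shape[0] for a in adjs)
    n_feat = feats[0].shape[1]
    batch_adj = torch.zeros(len(adjs), n_max, n_max)
    batch_x = torch.zeros(len(adjs), n_max, n_feat)
    mask = torch.zeros(len(adjs), n_max, dtype=torch.bool)
    for i, (a, x) in enumerate(zip(adjs, feats)):
        n = a.shape[0]
        batch_adj[i, :n, :n] = a
        batch_x[i, :n] = x
        mask[i, :n] = True  # padded rows stay False
    return batch_adj, batch_x, mask
```

The mask is there so that padded nodes can be excluded from the loss. Sorting graphs by node count before grouping them keeps the amount of padding per batch small.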

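Your forward() function is also clearly not optimized and could likely provide some speedup, but I would expect optimizing for better batching to bring a much larger improvement.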
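One possible simplification of the forward pass, offered as a sketch rather than a measured fix: D is diagonal, so multiplying by its inverse is the same as dividing each row by the corresponding degree, and there is no need to build a dense D or call torch.inverse. The function name normalized_propagate is hypothetical, and the code assumes non-negative adjacency entries so that every degree is at least 1 after the identity is added.

```python
import torch


def normalized_propagate(adj, x, weight):
    """Compute D^-1 (A + I) X W without building or inverting a dense D."""
    adj = adj.float()
    x = x.float()
    n = adj.shape[0]
    a_hat = adj + torch.eye(n, device=adj.device, dtype=adj.dtype)
    # Same degree definition as the question's code (sum over dim=0).
    deg = a_hat.sum(dim=0)
    out = a_hat @ (x @ weight)
    # Left-multiplying by diag(1 / deg) scales row i by 1 / deg[i].
    return out / deg.unsqueeze(1)
```

This removes the mask, the zero matrix, the matrix inverse, and one matrix product per call. Because the adjacency matrix is fixed for each graph, a_hat and deg could also be precomputed once per graph instead of on every forward pass.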

This is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/60189355/
