python-3.x - Proper way to log things when using Pytorch Lightning DDP

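I was wondering what the proper way of logging metrics is when using DDP. I noticed that if I print something inside validation_epoch_end, it is printed twice when using 2 GPUs. I was expecting validation_epoch_end to be called only on rank 0 and to receive the outputs from all GPUs, but I am no longer sure this is correct. Therefore, I have several questions:

validation_epoch_end(self, outputs) - When using DDP, does every subprocess receive the data processed by the current GPU or the data processed by all GPUs, i.e. does the input parameter outputs contain the outputs of the entire validation set, from all GPUs?

If outputs is GPU/process specific, what is the proper way to calculate any metric on the entire validation set in validation_epoch_end when using DDP?

I understand that I can solve the printing problem by checking self.global_rank == 0 and printing/logging only in that case. However, I am trying to get a deeper understanding of what I am actually printing/logging in this case.
Here is a code snippet from my use case. I would like to be able to report f1, precision and recall on the entire validation dataset, and I am wondering what the correct way of doing that is when using DDP.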

    def _process_epoch_outputs(self,
                               outputs: List[Dict[str, Any]]
                               ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Creates and returns tensors containing all labels and predictions

        Goes over the outputs accumulated from every batch, detaches the
        necessary tensors and stacks them together.

        Args:
            outputs (List[Dict])
        """
        all_labels = []
        all_predictions = []

        for output in outputs:
            for labels in output['labels'].detach():
                all_labels.append(labels)

            for predictions in output['predictions'].detach():
                all_predictions.append(predictions)

        all_labels = torch.stack(all_labels).long().cpu()
        all_predictions = torch.stack(all_predictions).cpu()

        return all_predictions, all_labels

    def validation_epoch_end(self, outputs: List[Dict[str, Any]]) -> None:
        """Logs f1, precision and recall on the validation set."""

        if self.global_rank == 0:
            print(f'Validation Epoch: {self.current_epoch}')

        predictions, labels = self._process_epoch_outputs(outputs)
        for i, name in enumerate(self.label_columns):

            f1, prec, recall, t = metrics.get_f1_prec_recall(predictions[:, i],
                                                             labels[:, i],
                                                             threshold=None)
            self.logger.experiment.add_scalar(f'{name}_f1/Val',
                                              f1,
                                              self.current_epoch)
            self.logger.experiment.add_scalar(f'{name}_Precision/Val',
                                              prec,
                                              self.current_epoch)
            self.logger.experiment.add_scalar(f'{name}_Recall/Val',
                                              recall,
                                              self.current_epoch)

            if self.global_rank == 0:
                print((f'F1: {f1}, Precision: {prec}, '
                       f'Recall: {recall}, Threshold {t}'))

Best answer

Questions

validation_epoch_end(self, outputs) - When using DDP does every subprocess receive the data processed from the current GPU or data processed from all GPUs, i.e. does the input parameter outputs contains the outputs of the entire validation set, from all GPUs?

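Only the data processed by the current GPU. Outputs are not synchronized; only the backward pass is synchronized (gradients are synced during training and distributed to the model replicas residing on each GPU).
Imagine that all of the outputs were passed from 1000 GPUs to this poor master process: it could very easily run out of memory (OOM).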
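To make the per-process view concrete, here is a minimal pure-Python sketch with no GPUs or Lightning involved. It assumes a strided split of sample indices across ranks, similar to what `torch.utils.data.DistributedSampler` does when shuffling is off and the dataset size divides evenly. The helper names `shard_indices` and `accuracy`, and the toy data, are made up for this illustration.

```python
def shard_indices(num_samples, rank, world_size):
    """Indices one rank would see under a strided, unshuffled split (assumption)."""
    return list(range(rank, num_samples, world_size))


# Toy validation set: 8 binary labels and 8 binary predictions.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 1]


def accuracy(indices):
    """Fraction of correct predictions over the given sample indices."""
    correct = sum(labels[i] == predictions[i] for i in indices)
    return correct / len(indices)


world_size = 2

# Each simulated rank only ever sees its own shard of the validation set.
per_rank = {
    rank: shard_indices(len(labels), rank, world_size)
    for rank in range(world_size)
}

# A metric computed from one rank's outputs describes that shard only.
per_rank_accuracy = {rank: accuracy(idx) for rank, idx in per_rank.items()}

# The metric over the full set needs information from every rank.
global_accuracy = accuracy(range(len(labels)))
```

In this toy setup the two simulated ranks arrive at different accuracies, and neither matches the value over the full set, which is why printing from every rank shows different numbers.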

If outputs is GPU/process specific what is the proper way to calculate any metric on the entire validation set in validation_epoch_end when using DDP?

According to the documentation (emphasis mine):

When validating using a accelerator that splits data from each batch across GPUs, sometimes you might need to aggregate them on the master GPU for processing (dp, or ddp2).

Here is the accompanying code (in this case, validation_epoch_end would receive the data accumulated across multiple GPUs from each single step; also see the comments):

import torch.nn.functional as F

# Done per-process (GPU)
def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.model(x)
    loss = F.cross_entropy(y_hat, y)
    pred = ...
    return {'loss': loss, 'pred': pred}

# Gathered data from all processes (per single step)
# Allows for accumulation so the whole data at the end of epoch
# takes less memory
def validation_step_end(self, batch_parts):
    # batch_parts is indexed per GPU, matching the loss lookup below
    gpu_0_prediction = batch_parts[0]['pred']
    gpu_1_prediction = batch_parts[1]['pred']

    # do something with both outputs
    return (batch_parts[0]['loss'] + batch_parts[1]['loss']) / 2

def validation_epoch_end(self, validation_step_outputs):
    for out in validation_step_outputs:
        # do something with preds
        pass

Tips

Focus on per-device calculations and keep the number of between-GPU transfers as small as possible.

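Inside validation_step (or training_step if that is what you are after; this is general), calculate f1, precision, recall and whatever else on a per-batch basis.

Return those values (say, as a dict). You now return 3 numbers from each device instead of (batch, outputs), which could be much larger.

Inside validation_step_end, take those 3 values (actually (2, 3) if you have 2 GPUs), sum or average them, and return 3 values.

Now validation_epoch_end will receive (steps, 3) values that you can use for accumulation.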
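The steps above can be sketched in plain Python, without Lightning or GPUs, to show how the shapes shrink from (2, 3) per step to 3 per epoch. This is only an illustration under assumptions: the helpers `batch_metrics` and `mean_columns` and the toy `steps` data are hypothetical, labels and predictions are binary 0/1 lists, and a metric with an empty denominator is defined as 0.0.

```python
def batch_metrics(preds, labels):
    """Per-batch [precision, recall, f1] for binary 0/1 lists (hypothetical helper)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return [precision, recall, f1]


def mean_columns(rows):
    """Average a list of equal-length rows column by column."""
    return [sum(col) / len(rows) for col in zip(*rows)]


# steps[step][device] = (preds, labels) for that device's part of the batch.
steps = [
    [([1, 0], [1, 0]), ([1, 1], [1, 0])],
    [([0, 1], [1, 1]), ([1, 0], [1, 0])],
]

# "validation_step": 3 numbers per device.
# "validation_step_end": reduce the (2, 3) values of one step to 3 values.
step_results = [
    mean_columns([batch_metrics(preds, labels) for preds, labels in devices])
    for devices in steps
]

# "validation_epoch_end": reduce the (steps, 3) values to 3 values.
epoch_precision, epoch_recall, epoch_f1 = mean_columns(step_results)
```

Note that a mean of per-batch scores is generally not identical to the score computed over the whole validation set at once, so this trades some exactness for very small transfers.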

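It would be even better if, instead of operating on a list of values during validation_epoch_end, you could accumulate them into another 3 values (if you have a lot of validation steps, the list could grow too large), but this should be enough.
AFAIK PyTorch-Lightning does not do this (i.e. apply some accumulator directly instead of appending to a list), but I might be mistaken, so any correction would be great.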
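One way to picture such an accumulator is a small object that keeps three running counts and is updated after every batch. The sketch below is plain Python and is not taken from PyTorch-Lightning. `RunningF1` and its methods are hypothetical names, inputs are assumed to be binary 0/1 lists, and `merge` merely stands in for a sum-reduction across processes (in a real DDP run this would be a collective such as `torch.distributed.all_reduce`).

```python
class RunningF1:
    """Keeps three running counts instead of a growing list of step outputs."""

    def __init__(self):
        self.tp = 0  # true positives seen so far
        self.fp = 0  # false positives seen so far
        self.fn = 0  # false negatives seen so far

    def update(self, preds, labels):
        """Fold one batch of binary predictions and labels into the counts."""
        for p, y in zip(preds, labels):
            self.tp += int(p == 1 and y == 1)
            self.fp += int(p == 1 and y == 0)
            self.fn += int(p == 0 and y == 1)

    def merge(self, other):
        """Add another rank's counts (stand-in for a cross-process sum)."""
        self.tp += other.tp
        self.fp += other.fp
        self.fn += other.fn

    def compute(self):
        """Return (precision, recall, f1); empty denominators give 0.0."""
        precision = self.tp / (self.tp + self.fp) if self.tp + self.fp else 0.0
        recall = self.tp / (self.tp + self.fn) if self.tp + self.fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1


# Two simulated ranks, each updating its own accumulator batch by batch.
rank0, rank1 = RunningF1(), RunningF1()
rank0.update([1, 0], [1, 0])
rank0.update([0, 1], [1, 1])
rank1.update([1, 1], [1, 0])
rank1.update([1, 0], [1, 0])

# At the end of the epoch only three integers per rank need to be combined.
rank0.merge(rank1)
precision, recall, f1 = rank0.compute()
```

Because raw counts are summed before the ratios are taken, the result corresponds to the metric over all samples seen by both simulated ranks, and memory use stays constant regardless of the number of validation steps.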

Regarding python-3.x - Proper way to log things when using Pytorch Lightning DDP, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66854148/
