python - Is there a way to plot training and validation loss on the same chart using the HuggingFace Trainer API?

Tags: python pytorch huggingface-transformers tensorboard

I am fine-tuning a HuggingFace transformer model (PyTorch version) using the HF Seq2SeqTrainingArguments and Seq2SeqTrainer, and I would like to display the training and validation losses in TensorBoard, on the same chart.

As I understand it, in order to plot the two losses together I need to use the SummaryWriter. The HF Callbacks documentation describes a TensorBoardCallback class that can receive a tb_writer argument:

https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/callback#transformers.integrations.TensorBoardCallback

However, I cannot figure out the right way to use it, if it is even supposed to be used with the Trainer API.

My code looks something like this:

args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    evaluation_strategy='epoch',
    learning_rate= 1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    report_to='tensorboard',
    push_to_hub=False,  
)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_val_data,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

I think I should include a callback to TensorBoard in the trainer, e.g.

callbacks = [TensorBoardCallback(tb_writer=tb_writer)]

but I cannot find a comprehensive example of how to use it, nor what to import in order to use it.

I also found this feature request on GitHub,

https://github.com/huggingface/transformers/pull/4020

but there is no usage example, so I am confused...

Any insight would be appreciated.

Best Answer

As far as I know, the only way to plot two values on the same TensorBoard chart is to use two separate SummaryWriters with the same root directory. For example, the logging directories might be: log_dir/train and log_dir/eval.
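To illustrate the idea, here is a minimal sketch of that directory layout (the log_dir name is just a placeholder). When TensorBoard is pointed at the common root, it treats train/ and eval/ as two runs and overlays any scalars that share the same tag on one chart:

```shell
# Two writers log to sibling subdirectories of the same root:
#   log_dir/train  <- SummaryWriter(log_dir="log_dir/train")
#   log_dir/eval   <- SummaryWriter(log_dir="log_dir/eval")
# Launching TensorBoard on the root shows both runs overlaid on one chart
# for every tag they share (e.g. "combined/loss"):
tensorboard --logdir log_dir
```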

This approach is used in this answer, but for TensorFlow rather than PyTorch.

In order to do this with the 🤗 Trainer API, a custom callback is needed that takes two SummaryWriters. Below is the code for my custom callback CombinedTensorBoardCallback, which I made by modifying the code of TensorBoardCallback:

import logging
import os

from transformers import TrainerCallback
from transformers.integrations import is_tensorboard_available

logger = logging.getLogger(__name__)

def custom_rewrite_logs(d, mode):
    new_d = {}
    eval_prefix = "eval_"
    eval_prefix_len = len(eval_prefix)
    test_prefix = "test_"
    test_prefix_len = len(test_prefix)
    for k, v in d.items():
        if mode == 'eval' and k.startswith(eval_prefix):
            if k[eval_prefix_len:] == 'loss':
                new_d["combined/" + k[eval_prefix_len:]] = v
        elif mode == 'test' and k.startswith(test_prefix):
            if k[test_prefix_len:] == 'loss':
                new_d["combined/" + k[test_prefix_len:]] = v
        elif mode == 'train':
            if k == 'loss':
                new_d["combined/" + k] = v
    return new_d
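A quick standalone check of what custom_rewrite_logs produces (the function is copied from above; the input dicts mimic the kind of logs dict the Trainer passes to on_log):

```python
def custom_rewrite_logs(d, mode):
    # Copied from above: keep only the loss and re-key it under "combined/"
    # so that the train and eval writers emit the same tag.
    new_d = {}
    eval_prefix = "eval_"
    eval_prefix_len = len(eval_prefix)
    test_prefix = "test_"
    test_prefix_len = len(test_prefix)
    for k, v in d.items():
        if mode == 'eval' and k.startswith(eval_prefix):
            if k[eval_prefix_len:] == 'loss':
                new_d["combined/" + k[eval_prefix_len:]] = v
        elif mode == 'test' and k.startswith(test_prefix):
            if k[test_prefix_len:] == 'loss':
                new_d["combined/" + k[test_prefix_len:]] = v
        elif mode == 'train':
            if k == 'loss':
                new_d["combined/" + k] = v
    return new_d

# Training logs: only 'loss' survives, re-keyed under the shared tag.
train_logs = {'loss': 2.31, 'learning_rate': 1e-5, 'epoch': 1.0}
print(custom_rewrite_logs(train_logs, mode='train'))  # {'combined/loss': 2.31}

# Evaluation logs: 'eval_loss' loses its prefix and gets the same tag.
eval_logs = {'eval_loss': 1.87, 'eval_rouge1': 0.41, 'epoch': 1.0}
print(custom_rewrite_logs(eval_logs, mode='eval'))  # {'combined/loss': 1.87}
```

Because both writers end up logging under the identical tag combined/loss, TensorBoard draws the two curves on the same chart.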


class CombinedTensorBoardCallback(TrainerCallback):
    """
    A [`TrainerCallback`] that sends the logs to [TensorBoard](https://www.tensorflow.org/tensorboard)
    via two writers, one for training logs and one for evaluation logs, so that both losses can be
    drawn on the same chart.
    Args:
        tb_writers (`dict` of `SummaryWriter`, *optional*):
            The writers to use, keyed by `'train'` and `'eval'`. Will instantiate both if not set.
    """

    def __init__(self, tb_writers=None):
        if not is_tensorboard_available():
            raise RuntimeError(
                "CombinedTensorBoardCallback requires tensorboard to be installed. Either update your PyTorch"
                " version or install tensorboardX."
            )
        try:
            from torch.utils.tensorboard import SummaryWriter  # noqa: F401

            self._SummaryWriter = SummaryWriter
        except ImportError:
            try:
                from tensorboardX import SummaryWriter

                self._SummaryWriter = SummaryWriter
            except ImportError:
                self._SummaryWriter = None
        self.tb_writers = tb_writers

    def _init_summary_writer(self, args, log_dir=None):
        log_dir = log_dir or args.logging_dir
        if self._SummaryWriter is not None:
            self.tb_writers = dict(train=self._SummaryWriter(log_dir=os.path.join(log_dir, 'train')),
                                   eval=self._SummaryWriter(log_dir=os.path.join(log_dir, 'eval')))

    def on_train_begin(self, args, state, control, **kwargs):
        if not state.is_world_process_zero:
            return

        log_dir = None

        if state.is_hyper_param_search:
            trial_name = state.trial_name
            if trial_name is not None:
                log_dir = os.path.join(args.logging_dir, trial_name)

        if self.tb_writers is None:
            self._init_summary_writer(args, log_dir)

        for k, tbw in self.tb_writers.items():
            tbw.add_text("args", args.to_json_string())
            if "model" in kwargs:
                model = kwargs["model"]
                if hasattr(model, "config") and model.config is not None:
                    model_config_json = model.config.to_json_string()
                    tbw.add_text("model_config", model_config_json)
            # Version of TensorBoard coming from tensorboardX does not have this method.
            if hasattr(tbw, "add_hparams"):
                tbw.add_hparams(args.to_sanitized_dict(), metric_dict={})

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not state.is_world_process_zero:
            return

        if self.tb_writers is None:
            self._init_summary_writer(args)

        for tbk, tbw in self.tb_writers.items():
            logs_new = custom_rewrite_logs(logs, mode=tbk)
            for k, v in logs_new.items():
                if isinstance(v, (int, float)):
                    tbw.add_scalar(k, v, state.global_step)
                else:
                    logger.warning(
                        "Trainer is attempting to log a value of "
                        f'"{v}" of type {type(v)} for key "{k}" as a scalar. '
                        "This invocation of Tensorboard's writer.add_scalar() "
                        "is incorrect so we dropped this attribute."
                    )
            tbw.flush()

    def on_train_end(self, args, state, control, **kwargs):
        for tbw in self.tb_writers.values():
            tbw.close()
        self.tb_writers = None

If you want to combine training and evaluation for metrics other than the loss, you should modify custom_rewrite_logs accordingly.
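For example, a variant that forwards every logged metric (not just the loss) under the shared combined/ namespace might look like the sketch below. The function name is my own and it has not been checked against every logs format the Trainer can emit:

```python
def custom_rewrite_logs_all(d, mode):
    # Variant sketch: re-key every train/eval/test metric under "combined/"
    # so each metric gets its own chart with both curves overlaid.
    new_d = {}
    prefix = {'eval': 'eval_', 'test': 'test_'}.get(mode)
    for k, v in d.items():
        if prefix is not None and k.startswith(prefix):
            new_d["combined/" + k[len(prefix):]] = v
        elif mode == 'train' and not k.startswith(('eval_', 'test_')):
            new_d["combined/" + k] = v
    return new_d
```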

As usual, the callback goes in the Trainer constructor. In my test example it was:

trainer = Trainer(
    model=rnn,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[CombinedTensorBoardCallback]
)

Additionally, you will probably want to remove the default TensorBoardCallback; otherwise, in addition to the combined loss chart, the training and validation losses will also be displayed separately, as they are by default.

trainer.remove_callback(TensorBoardCallback)

Here is the resulting TensorBoard view:

(screenshot: the combined/loss chart with the train and eval curves overlaid)

Regarding python - Is there a way to plot training and validation loss on the same chart using the HuggingFace Trainer API?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/73281901/
