python - Using the Huggingface Trainer with Distributed Data Parallel

Tags: python pytorch huggingface-transformers

To speed up training, I looked into PyTorch's DistributedDataParallel and tried to apply it to the transformers Trainer.
The PyTorch examples for DDP state that this should at least be faster:

DataParallel is single-process, multi-thread, and only works on a single machine, while DistributedDataParallel is multi-process and works for both single- and multi- machine training. DataParallel is usually slower than DistributedDataParallel even on a single machine due to GIL contention across threads, per-iteration replicated model, and additional overhead introduced by scattering inputs and gathering outputs.
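For illustration only, here is a minimal plain-PyTorch sketch of the two wrappers the quote contrasts; the toy model, the nccl backend, and the setup helper are assumptions and not part of the original post:

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# DataParallel: a single process that replicates the model to every visible
# GPU each iteration and scatters inputs / gathers outputs across threads.
dp_model = nn.DataParallel(nn.Linear(10, 10).cuda())

# DistributedDataParallel: one process per GPU; each process first joins a
# process group (MASTER_ADDR/MASTER_PORT must be set, as in the code further
# below) and then wraps its own model replica on its own device.
def ddp_setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    return DDP(nn.Linear(10, 10).cuda(rank), device_ids=[rank])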


My DataParallel trainer looks like this:
import os
from datetime import datetime
import sys
import torch
from transformers import Trainer, TrainingArguments, BertConfig

training_args = TrainingArguments(
        output_dir=os.path.join(path_storage, 'results', "mlm"),  # output directory
        num_train_epochs=1,  # total # of training epochs
        gradient_accumulation_steps=2,  # for accumulation over multiple steps
        per_device_train_batch_size=4,  # batch size per device during training
        per_device_eval_batch_size=4,  # batch size for evaluation
        logging_dir=os.path.join(path_storage, 'logs', "mlm"),  # directory for storing logs
        evaluate_during_training=False,
        max_steps=20,
    )

mlm_train_dataset = ProteinBertMaskedLMDataset(
        path_vocab, os.path.join(path_storage, "data", "uniparc", "uniparc_train_sorted.h5"),
)

mlm_config = BertConfig(
        vocab_size=mlm_train_dataset.tokenizer.vocab_size,
        max_position_embeddings=mlm_train_dataset.input_size
)
mlm_model = ProteinBertForMaskedLM(mlm_config)
trainer = Trainer(
   model=mlm_model,  # the instantiated 🤗 Transformers model to be trained
   args=training_args,  # training arguments, defined above
   train_dataset=mlm_train_dataset,  # training dataset
   data_collator=mlm_train_dataset.collate_fn,
)
print("build trainer with on device:", training_args.device, "with n gpus:", training_args.n_gpu)
start = datetime.now()
trainer.train()
print(f"finished in {datetime.now() - start} seconds")
Output:
build trainer with on device: cuda:0 with n gpus: 4
finished in 0:02:47.537038 seconds
My DistributedDataParallel trainer is built like this:
def create_transformer_trainer(rank, world_size, train_dataset, model):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)

    training_args = TrainingArguments(
        output_dir=os.path.join(path_storage, 'results', "mlm"),  # output directory
        num_train_epochs=1,  # total # of training epochs
        gradient_accumulation_steps=2,  # for accumulation over multiple steps
        per_device_train_batch_size=4,  # batch size per device during training
        per_device_eval_batch_size=4,  # batch size for evaluation
        logging_dir=os.path.join(path_storage, 'logs', "mlm"),  # directory for storing logs
        local_rank=rank,
        max_steps=20,
    )

    trainer = Trainer(
        model=model,  # the instantiated 🤗 Transformers model to be trained
        args=training_args,  # training arguments, defined above
        train_dataset=train_dataset,  # training dataset
        data_collator=train_dataset.collate_fn,
    )
    print("build trainer with on device:", training_args.device, "with n gpus:", training_args.n_gpu)
    start = datetime.now()
    trainer.train()
    print(f"finished in {datetime.now() - start} seconds")


mlm_train_dataset = ProteinBertMaskedLMDataset(
    path_vocab, os.path.join(path_storage, "data", "uniparc", "uniparc_train_sorted.h5"))

mlm_config = BertConfig(
    vocab_size=mlm_train_dataset.tokenizer.vocab_size,
    max_position_embeddings=mlm_train_dataset.input_size
)
mlm_model = ProteinBertForMaskedLM(mlm_config)
torch.multiprocessing.spawn(create_transformer_trainer,
     args=(4, mlm_train_dataset, mlm_model),
     nprocs=4,
     join=True)
Output:
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
build trainer with on device: cuda:1 with n gpus: 1
build trainer with on device: cuda:2 with n gpus: 1
build trainer with on device: cuda:3 with n gpus: 1
build trainer with on device: cuda:0 with n gpus: 1
finished in 0:04:15.937331 seconds
finished in 0:04:16.899411 seconds
finished in 0:04:16.938141 seconds
finished in 0:04:17.391887 seconds
Regarding the initial fork warning: what exactly is being forked, and is this expected?
Regarding the resulting timings: is the Trainer being used incorrectly, since it seems to be much slower than the DataParallel approach?
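For comparison, the transformers examples of that era run the Trainer under DDP by launching one process per GPU with the torch.distributed launcher (python -m torch.distributed.launch --nproc_per_node=4 train_mlm.py) rather than spawning processes inside the script; the launcher passes --local_rank to each process. A minimal sketch, where the script name train_mlm.py and the simplified output path are assumptions:

# Launched with: python -m torch.distributed.launch --nproc_per_node=4 train_mlm.py
import argparse
from transformers import TrainingArguments

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)  # injected by the launcher
args = parser.parse_args()

training_args = TrainingArguments(
    output_dir="results/mlm",        # hypothetical, simplified output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    max_steps=20,
    local_rank=args.local_rank,      # != -1 makes the Trainer wrap the model in DDP
)
# The Trainer itself is then built exactly as in the question:
# trainer = Trainer(model=mlm_model, args=training_args,
#                   train_dataset=mlm_train_dataset,
#                   data_collator=mlm_train_dataset.collate_fn)
# trainer.train()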

Best answer

A bit late to the party, but anyway. I'll leave this comment here to help anyone wondering whether it is possible to keep parallelism in the tokenizer.
According to this comment on github, the FastTokenizers seem to be the problem.
Also, according to another comment on gitmemory, you should not use the tokenizer before forking the process (which basically means before iterating through your dataloader).
So the solution is to not use FastTokenizers before training/fine-tuning, and to use the normal Tokenizers instead.
Check the Huggingface documentation to find out whether you really need the FastTokenizer.
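A minimal sketch of the two workarounds described above; the model name bert-base-uncased and the exact placement of the environment variable are assumptions, not taken from the answer:

import os

# Option 1: set this before any tokenizer is used and before the worker
# processes fork, as the warning itself suggests, to avoid the parallelism
# conflict (and silence the warning).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer

# Option 2: use the slow (pure-Python) tokenizer instead of the Rust-backed
# fast one, so there is no internal tokenizer parallelism to lose on fork.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)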

Regarding "python - Using the Huggingface Trainer with Distributed Data Parallel", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63017931/
