I am trying to train a model using Trainer. According to the documentation ( https://huggingface.co/transformers/master/main_classes/trainer.html#transformers.Trainer ) I can specify a tokenizer:
tokenizer (PreTrainedTokenizerBase, optional) – The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs the maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.
So padding should be handled automatically, but when I try to run it I get this error:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
The tokenizer is created like this:
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model)
and the Trainer like this:
trainer = Trainer(
    tokenizer=tokenizer,
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=dev,
    compute_metrics=compute_metrics
)
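I have tried passing the padding and truncation parameters to the tokenizer, to the Trainer, and in the training arguments. None of it has any effect. Any ideas?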
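Accepted answer
Look at the columns your tokenized dataset returns. You probably want to restrict them to only the columns the model needs.
For example: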
def preprocess_function(examples):
    # Tokenize the dataset; sentence1_key and sentence2_key name the text columns.
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True, padding=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True, padding=True)

encoded_dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=False)

# The key step: return only the columns the model needs, as torch tensors.
columns_to_return = ['input_ids', 'label', 'attention_mask']
encoded_dataset.set_format(type='torch', columns=columns_to_return)
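To make the idea concrete, here is a minimal plain-Python sketch of the two steps involved: dropping columns that cannot become tensors (such as raw strings) and then padding each batch to its longest sequence. It is only an illustration, not how transformers or datasets implement it. The names MODEL_COLUMNS, keep_model_columns and pad_batch are hypothetical, the token ids are made up, and pad_token_id=0 is an assumption (in real code you would take it from tokenizer.pad_token_id).

```python
from typing import Dict, List, Sequence

# Hypothetical helper names for illustration; not part of transformers or datasets.
MODEL_COLUMNS = ("input_ids", "attention_mask", "label")


def keep_model_columns(example: Dict, columns: Sequence[str] = MODEL_COLUMNS) -> Dict:
    """Drop every field not listed in `columns`, e.g. the raw text string."""
    return {name: value for name, value in example.items() if name in columns}


def pad_batch(examples: List[Dict], pad_token_id: int = 0) -> Dict[str, List]:
    """Pad input_ids and attention_mask to the longest sequence in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "label": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get attention 0 so the model ignores them.
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
        batch["label"].append(ex["label"])
    return batch


# Made-up examples: the "sentence" column is a string and cannot be batched as a tensor.
raw_examples = [
    {"sentence": "short text", "input_ids": [101, 7592, 102],
     "attention_mask": [1, 1, 1], "label": 0},
    {"sentence": "a longer piece of text", "input_ids": [101, 1037, 2936, 3538, 102],
     "attention_mask": [1, 1, 1, 1, 1], "label": 1},
]

batch = pad_batch([keep_model_columns(ex) for ex in raw_examples])
print(batch["input_ids"])
```

After filtering and padding, every row in the batch has the same length, which is the condition the ValueError in the question is complaining about.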
Hope this helps.
Regarding "python - How to make Trainer pad inputs in a batch with Huggingface-Transformers?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64047261/