bert-language-model - What is the normal inference speed of a pretrained Bert model in PyTorch?

Tags: bert-language-model huggingface-transformers transformer-model huggingface-tokenizers

I am testing the Bert base model and the distilled Bert model from Huggingface in 4 speed scenarios, with batch_size = 1:

1) bert-base-uncased: 154ms per request
2) bert-base-uncased with quantization: 94ms per request
3) distilbert-base-uncased: 86ms per request
4) distilbert-base-uncased with quantization: 69ms per request

I use IMDB texts as the experimental data and set max_length=512, so the inputs are fairly long. The CPU information on Ubuntu 18.04 is as follows:

cat /proc/cpuinfo  | grep 'name'| uniq
model name  : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz

The machine also has 3 GPUs available:

Tesla V100-SXM2

This seems quite slow for a real-time application. Are these speeds normal for the Bert base model?

The test code is as follows:

import time
from datetime import timedelta

import pandas as pd
import torch.quantization

from transformers import AutoTokenizer, AutoModel, DistilBertTokenizer, DistilBertModel

def get_embedding(model, tokenizer, text):
    # Tokenize a single text, truncating to the model's 512-token limit.
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model(**inputs)
    # outputs[0] is the last hidden state with shape (1, seq_len, hidden);
    # take the single sequence and keep the first token ([CLS]) as the embedding.
    output_tensors = outputs[0][0]
    output_numpy = output_tensors.detach().numpy()
    embedding = output_numpy.tolist()[0]
    return embedding

def process_text(model, tokenizer, text_lines):
    # Run every text through the model one at a time (batch size 1).
    for index, line in enumerate(text_lines):
        embedding = get_embedding(model, tokenizer, line)
        if index % 100 == 0:
            print('Current index: {}'.format(index))

if __name__ == "__main__":

    # Load the first 1000 IMDB reviews as the benchmark corpus.
    df = pd.read_csv('../data/train.csv', sep='\t')
    df = df.head(1000)
    text_lines = df['review']
    text_line_count = len(text_lines)
    print('Text size: {}'.format(text_line_count))

    start = time.time()

    # Benchmark 1: bert-base-uncased (FP32 on CPU); note that this timing
    # also includes loading the tokenizer and model.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    process_text(model, tokenizer, text_lines)

    end = time.time()
    print('Total time spent with bert base: {}'.format(str(timedelta(seconds=end - start))))

    # Benchmark 2: the same model with all Linear layers dynamically quantized to int8.
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    process_text(model, tokenizer, text_lines)

    end2 = time.time()
    print('Total time spent with bert base quantization: {}'.format(str(timedelta(seconds=end2 - end))))

    # Benchmark 3: distilbert-base-uncased (FP32).
    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    process_text(model, tokenizer, text_lines)

    end3 = time.time()
    print('Total time spent with distilbert: {}'.format(str(timedelta(seconds=end3 - end2))))

    # Benchmark 4: distilbert-base-uncased with dynamic int8 quantization.
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    process_text(model, tokenizer, text_lines)

    end4 = time.time()
    print('Total time spent with distilbert quantization: {}'.format(str(timedelta(seconds=end4 - end3))))

Edit: following the suggestion, I changed the code to the following:

inputs = tokenizer(text_batch, padding=True, return_tensors="pt")
outputs = model(**inputs)

where text_batch is a list of texts passed as input.
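
To make the snippet above self-contained, here is a minimal sketch of the full batched call; the example texts, the torch.no_grad() context, and taking the [CLS] token as each text's embedding are assumptions added for illustration, not part of the original post:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

text_batch = ["first review ...", "second review ..."]  # illustrative inputs

# Tokenize the whole batch at once; padding makes all rows the same length.
inputs = tokenizer(text_batch, padding=True, truncation=True, max_length=512,
                   return_tensors="pt")

# no_grad skips building the autograd graph, saving time and memory at inference.
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch, seq_len, hidden);
# take the [CLS] token (position 0) of each sample as its embedding.
embeddings = outputs.last_hidden_state[:, 0, :]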

Best Answer

No, you can speed this up.

First of all, why are you testing with a batch size of 1?

Both the tokenizer and the model accept batched input. Basically, you can pass a 2D array/list where each row contains a single sample. See the tokenizer documentation: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__ The same applies to the model.

In addition, even if you use a batch size larger than 1, your for loop still processes the data sequentially. You can create a test dataset and then use the Trainer class together with trainer.predict().

Also see this discussion of mine on the HF forums: https://discuss.huggingface.co/t/urgent-trainer-predict-and-model-generate-creates-totally-different-predictions/3426
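
For illustration, a rough sketch of the Trainer-based route might look like the following; the dataset construction, column name, batch size, and output directory are assumptions for the sketch, not the answerer's exact code:

from datasets import Dataset
from transformers import (AutoModel, AutoTokenizer, DataCollatorWithPadding,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Build a datasets.Dataset from the raw review texts (column name "text" is illustrative).
texts = ["first review ...", "second review ..."]
dataset = Dataset.from_dict({"text": texts})

# Tokenize in batches; each example keeps only input_ids / attention_mask.
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(output_dir="tmp_eval", per_device_eval_batch_size=32)
trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch dynamically
)

# Runs padded, batched inference; .predictions holds the model outputs as numpy arrays.
predictions = trainer.predict(dataset).predictions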

This question was originally asked on Stack Overflow: https://stackoverflow.com/questions/67699354/
