python - How to use cross-encoders with the Hugging Face transformers pipeline?

Tags: python nlp huggingface-transformers sentence-transformers large-language-model

There is a set of models from the sentence_transformers library on the Hugging Face Hub, e.g. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

The suggested usage example is:

# Using sentence_transformers

from sentence_transformers import CrossEncoder

model_name = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'
model = CrossEncoder(model_name)
scores = model.predict([
  ['How many people live in Berlin?', 'How many people live in Berlin?'], 
  ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.']
])
scores

[out]:

array([ 0.36782095, -4.2674575 ], dtype=float32)
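Note that the list above pairs (question, question) and (passage, passage). To score the same (query, passage) pairs as the transformers snippet below, pair each query with one passage, as in this sketch (the scores it returns should match the raw logits below, up to the model's default activation function):

# Each inner pair is one (query, passage) input for the cross-encoder.
scores = model.predict([
    ('How many people live in Berlin?',
     'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'),
    ('How many people live in Berlin?',
     'New York City is famous for the Metropolitan Museum of Art.'),
])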

Or:

# From transformers.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch

# Load the cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 reranker as a sequence-classification model
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], 
                     ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'],  
                     padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)

[输出]:

tensor([[10.7615],
        [-8.1277]])
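Note that tokenizer(texts_a, texts_b) pairs the two lists element-wise, so here the model scores (query, passage) pairs, unlike the sentence_transformers snippet above. To double-check what the model actually receives, decoding the first encoded example is a quick sanity check (the exact special tokens depend on the tokenizer):

# Decode the first encoded pair to verify the element-wise pairing; roughly:
# <s> How many people live in Berlin? </s></s> Berlin has a population ... </s>
print(tokenizer.decode(features["input_ids"][0]))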

But if a user wants to use transformers.pipeline with these cross-encoder models:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch

# Load the cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 reranker as a sequence-classification model
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

pipe = pipeline(model=model, tokenizer=tokenizer)

it throws an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_108/785368641.py in <module>
----> 1 pipe = pipeline(model=model, tokenizer=tokenizer)

/opt/conda/lib/python3.7/site-packages/transformers/pipelines/__init__.py in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
    711         if not isinstance(model, str):
    712             raise RuntimeError(
--> 713                 "Inferring the task automatically requires to check the hub with a model_id defined as a `str`."
    714                 f"{model} is not a valid model_id."
    715             )

RuntimeError: Inferring the task automatically requires to check the hub with a model_id defined as a `str`.

Q: How can cross-encoders be used with the Hugging Face transformers pipeline?

Q: If a model_id is required, can the model_id be passed to the pipeline as args/kwargs?

There is a similar question, Error: Inferring the task automatically requires to check the hub with a model_id defined as a `str`. AraBERT model, but I'm not sure it is the same issue, since that one is about 'aubmindlab/bert-base-arabertv02' rather than the cross-encoder class of sentence_transformers models.

Best Answer

After much trial and error, and digging through the code:

  • This part of the code contains the list of available pre-defined pipeline tasks: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py

  • How to feed text pairs to the model is documented here: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py#L121

  • The general usage for the text-classification task is coded there too, with usage examples in the docstring of the function: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py

Now, here goes...

TL;DR

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch

# Load the cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 reranker as a sequence-classification model
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)


pipe([{"text": 'How many people live in Berlin?', "text_pair": 'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'},
      {"text": 'How many people live in Berlin?', "text_pair": 'New York City is famous for the Metropolitan Museum of Art.'},
      {"text": 'Hello how are you?', "text_pair": "I'm fine, thank you"},
     ])

[out]:

[{'label': 'LABEL_0', 'score': 0.99997878074646},
 {'label': 'LABEL_0', 'score': 0.0002951461647171527},
 {'label': 'LABEL_0', 'score': 0.027012893930077553}]

But the output is different from using sentence_transformers!

Yes and no: that's because an activation function has been applied to the logits. For this model, which has a single-label head, the pipeline defaults to a sigmoid rather than returning the raw scores.

There is a classification function applied after model inference, defined at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py#L27

class ClassificationFunction(ExplicitEnum):
    SIGMOID = "sigmoid"
    SOFTMAX = "softmax"
    NONE = "none"

Specifically, it is applied in the postprocess function at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py#L184:

    def postprocess(self, model_outputs, function_to_apply=None, top_k=1, _legacy=True):
        # `_legacy` is used to determine if we're running the naked pipeline and in backward
        # compatibility mode, or if running the pipeline with `pipeline(..., top_k=1)` we're running
        # the more natural result containing the list.
        # Default value before `set_parameters`
        if function_to_apply is None:
            if self.model.config.problem_type == "multi_label_classification" or self.model.config.num_labels == 1:
                function_to_apply = ClassificationFunction.SIGMOID
            elif self.model.config.problem_type == "single_label_classification" or self.model.config.num_labels > 1:
                function_to_apply = ClassificationFunction.SOFTMAX
            elif hasattr(self.model.config, "function_to_apply") and function_to_apply is None:
                function_to_apply = self.model.config.function_to_apply
            else:
                function_to_apply = ClassificationFunction.NONE

TL;DR (for real this time)

To replicate the results of rolling your own tokenize + forward function, you have to explicitly set the classification function so that the postprocess default is overridden, i.e.:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
from transformers.pipelines.text_classification import ClassificationFunction


model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, function_to_apply=ClassificationFunction.NONE)


pipe([{"text": 'How many people live in Berlin?', "text_pair": 'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'},
      {"text": 'How many people live in Berlin?', "text_pair": 'New York City is famous for the Metropolitan Museum of Art.'},
      {"text": 'Hello how are you?', "text_pair": "I'm fine, thank you"},
     ])

[out]:

[{'label': 'LABEL_0', 'score': 10.761542320251465},
 {'label': 'LABEL_0', 'score': -8.127744674682617},
 {'label': 'LABEL_0', 'score': -3.5840566158294678}]
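Finally, to answer the second question above: once the task is passed explicitly, pipeline no longer needs to infer it from the hub, so the model id can simply be passed as a string and the matching tokenizer is loaded automatically. A minimal sketch:

from transformers import pipeline
from transformers.pipelines.text_classification import ClassificationFunction

# With an explicit task, a plain string model id works; the tokenizer
# is fetched from the hub automatically.
pipe = pipeline(
    "text-classification",
    model="cross-encoder/mmarco-mMiniLMv2-L12-H384-v1",
    function_to_apply=ClassificationFunction.NONE,
)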

Regarding python - How to use cross-encoders with the Hugging Face transformers pipeline?, there is a similar question on Stack Overflow: https://stackoverflow.com/questions/76079388/
