There is a set of models from the sentence_transformers library on the Hugging Face Hub, e.g. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
The suggested usage example is:
# Using sentence_transformers
from sentence_transformers import CrossEncoder
model_name = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'
model = CrossEncoder(model_name)
scores = model.predict([
    ['How many people live in Berlin?', 'How many people live in Berlin?'],
    ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.']
])
scores
[Output]:
array([ 0.36782095, -4.2674575 ], dtype=float32)
Or:
# From transformers.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch
# cross-encoder/ms-marco-MiniLM-L-12-v2
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'],
                     ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'],
                     padding=True, truncation=True, return_tensors="pt")
model.eval()
with torch.no_grad():
    scores = model(**features).logits
print(scores)
[Output]:
tensor([[10.7615],
[-8.1277]])
But if a user tries to use transformers.pipeline with these cross-encoder models:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch
# cross-encoder/ms-marco-MiniLM-L-12-v2
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
pipe = pipeline(model=model, tokenizer=tokenizer)
it throws an error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_108/785368641.py in <module>
----> 1 pipe = pipeline(model=model, tokenizer=tokenizer)
/opt/conda/lib/python3.7/site-packages/transformers/pipelines/__init__.py in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
711 if not isinstance(model, str):
712 raise RuntimeError(
--> 713 "Inferring the task automatically requires to check the hub with a model_id defined as a `str`."
714 f"{model} is not a valid model_id."
715 )
RuntimeError: Inferring the task automatically requires to check the hub with a model_id defined as a `str`.
Q: How can cross-encoders be used with the Huggingface transformers pipeline?
Q: If a model_id is required, can the model_id be added to pipeline as args or kwargs?
There is a similar question, Error: Inferring the task automatically requires to check the hub with a model_id defined as a `str`. AraBERT model, but I am not sure it is the same issue, since that one is about 'aubmindlab/bert-base-arabertv02' rather than the cross-encoder class of sentence_transformers models.
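On the second question: the RuntimeError above is only raised when pipeline has to infer the task from a non-string model object, so it can be avoided either by naming the task explicitly or by passing the model id as a str. A minimal sketch (the second call assumes the Hub entry for this model declares a task tag):

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_id = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Option 1: name the task explicitly, so no Hub lookup is needed.
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Option 2: pass the model id as a `str`, so the task can be inferred from the Hub
# (this assumes the Hub entry for this model declares a task tag).
pipe = pipeline(model=model_id)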
Best Answer
After lots of trial and error and some code digging:
This part of the codebase contains the list of available pre-coded pipeline tasks:
https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py
How to feed text pairs into the model is documented here:
https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py#L121
That is the general usage of the text-classification task, and the usage is described in the docstring of https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py
Now, here goes...
TL;DR
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch
# cross-encoder/ms-marco-MiniLM-L-12-v2
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
pipe([{"text": 'How many people live in Berlin?', "text_pair": 'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'},
{"text": 'How many people live in Berlin?', "text_pair": 'New York City is famous for the Metropolitan Museum of Art.'},
[Output]:
[{'label': 'LABEL_0', 'score': 0.99997878074646},
{'label': 'LABEL_0', 'score': 0.0002951461647171527},
{'label': 'LABEL_0', 'score': 0.027012893930077553}]
But the output is different from using sentence_transformers!
Yes and no: a sigmoid has been applied to the raw logits (this model has num_labels == 1, so the pipeline defaults to ClassificationFunction.SIGMOID, as the code below shows).
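A quick sanity check (nothing beyond torch is assumed): applying a sigmoid to the raw logits from the tokenize + forward run above reproduces the pipeline scores:

import torch

# Raw logits from the manual tokenize + forward run above.
logits = torch.tensor([10.7615, -8.1277])

# The sigmoid maps them onto the pipeline's probability-like scores.
print(torch.sigmoid(logits))
# tensor([9.9998e-01, 2.9515e-04]) -> matches 0.99997878... and 0.00029514... above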
There is a classification function applied after the model inference, defined at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py#L27
class ClassificationFunction(ExplicitEnum):
    SIGMOID = "sigmoid"
    SOFTMAX = "softmax"
    NONE = "none"

def postprocess(self, model_outputs, function_to_apply=None, top_k=1, _legacy=True):
    # `_legacy` is used to determine if we're running the naked pipeline and in backward
    # compatibility mode, or if running the pipeline with `pipeline(..., top_k=1)` we're running
    # the more natural result containing the list.
    # Default value before `set_parameters`
    if function_to_apply is None:
        if self.model.config.problem_type == "multi_label_classification" or self.model.config.num_labels == 1:
            function_to_apply = ClassificationFunction.SIGMOID
        elif self.model.config.problem_type == "single_label_classification" or self.model.config.num_labels > 1:
            function_to_apply = ClassificationFunction.SOFTMAX
        elif hasattr(self.model.config, "function_to_apply") and function_to_apply is None:
            function_to_apply = self.model.config.function_to_apply
        else:
            function_to_apply = ClassificationFunction.NONE
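For this model the first branch is the one taken, which is easy to confirm on the loaded config; a small sketch:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

# The cross-encoder head has a single output (see the [2, 1] logits above),
# so `postprocess` picks ClassificationFunction.SIGMOID unless overridden.
print(model.config.num_labels)  # -> 1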
TL;DR (for real this time)
To replicate the results of rolling your own tokenize + forward function, you have to explicitly set the classification function to override the postprocessing, i.e.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
from transformers.pipelines.text_classification import ClassificationFunction
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, function_to_apply=ClassificationFunction.NONE)
pipe([{"text": 'How many people live in Berlin?', "text_pair": 'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'},
{"text": 'How many people live in Berlin?', "text_pair": 'New York City is famous for the Metropolitan Museum of Art.'},
{"text": 'Hello how are you?', "text_pair": "I'm fine, thank you"},
])
[Output]:
[{'label': 'LABEL_0', 'score': 10.761542320251465},
{'label': 'LABEL_0', 'score': -8.127744674682617},
{'label': 'LABEL_0', 'score': -3.5840566158294678}]
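As a usage note, the raw scores from this pipeline can be used to rerank passages for a query, much like CrossEncoder.predict. A minimal sketch built on the pipe configured above (the rerank helper is hypothetical, not part of any library):

# Hypothetical helper: rerank candidate passages for a query using the `pipe`
# configured above with function_to_apply=ClassificationFunction.NONE.
def rerank(query, passages):
    results = pipe([{"text": query, "text_pair": p} for p in passages])
    # Pair each raw score with its passage and sort from most to least relevant.
    return sorted(zip((r["score"] for r in results), passages), reverse=True)

passages = [
    'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
    'New York City is famous for the Metropolitan Museum of Art.',
]
for score, passage in rerank('How many people live in Berlin?', passages):
    print(f'{score:.4f}\t{passage}')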
A similar question about python - How to use Cross Encoders with Huggingface transformers pipeline? can be found on Stack Overflow: https://stackoverflow.com/questions/76079388/