machine-learning - NLP Transformers: Best way to get a fixed sentence embedding-vector shape?

I'm loading a language model from torch hub (CamemBERT, a French RoBERTa-based model) and using it to embed some French sentences:

import torch
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
camembert.eval()  # disable dropout (or leave in train mode to finetune)


def embed(sentence):
   tokens = camembert.encode(sentence)
   # Extract the features of all layers (layer 0 is the embedding layer)
   all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
   embeddings = all_layers[0]
   return embeddings

# Here we see that the shape of the embedding vector depends on the number of tokens in the sentence

u = embed(sentence="Bonjour, ça va ?")
u.shape # torch.Size([1, 7, 768])
v = embed(sentence="Salut, comment vas-tu ?")
v.shape # torch.Size([1, 9, 768])

Now imagine that, in order to do some semantic search, I want to compute the cosine distance between the vectors (tensors in our case) u and v:

cos = torch.nn.CosineSimilarity(dim=1)
cos(u, v) # will throw an error since the shape of `u` is different from the shape of `v`

What I'm asking is: what is the best method to always get the same embedding shape for a sentence, regardless of its token count?

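=> The first solution I thought of is computing the mean on axis=1 (the embedding of a sentence is the mean of the embeddings of its tokens), since axis=0 and axis=2 always have the same size: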

cos = torch.nn.CosineSimilarity(dim=1)
cos(u.mean(axis=1), v.mean(axis=1)) # works now and gives 0.7269

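However, I'm afraid that I'm hurting the sentence embedding when computing the mean, since it gives the same weight to every token (maybe multiply by TF-IDF?).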
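One way to explore the weighting concern is a weighted mean over the token axis. The sketch below is only an illustration: `weighted_mean_pool` is a hypothetical helper, the random tensors stand in for the output of `embed()`, and the weights are made-up numbers used in place of real per-token TF-IDF scores.

```python
import torch


def weighted_mean_pool(token_embeddings, weights):
    """Weighted average over the token axis.

    token_embeddings: [1, seq_len, hidden]
    weights: [seq_len], non-negative and not all zero (hypothetical per-token scores)
    """
    w = weights / weights.sum()      # normalise so the weights sum to 1
    w = w.view(1, -1, 1)             # broadcast over the batch and hidden dims
    return (token_embeddings * w).sum(dim=1)   # -> [1, hidden]


torch.manual_seed(0)
u = torch.randn(1, 7, 768)   # stand-in for embed("Bonjour, ça va ?")
v = torch.randn(1, 9, 768)   # stand-in for embed("Salut, comment vas-tu ?")

u_weights = torch.tensor([0.1, 2.0, 0.1, 1.5, 1.0, 0.5, 0.1])   # made-up scores
v_weights = torch.ones(9)    # uniform weights reduce to the plain mean

u_vec = weighted_mean_pool(u, u_weights)
v_vec = weighted_mean_pool(v, v_weights)

cos = torch.nn.CosineSimilarity(dim=1)
print(u_vec.shape, v_vec.shape)   # both are [1, 768], whatever the token count
print(cos(u_vec, v_vec))
```

With uniform weights this is the same as `mean(axis=1)`, so the plain mean is just the special case where every token counts equally.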

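=> The second solution is to pad the shorter sentences. That means:

give a list of sentences to embed at once (instead of embedding sentence by sentence)
find the sentence with the most tokens and embed it, getting its shape S
for the rest of the sentences, embed them and then pad with zeros to get the same shape S (the sentence has 0 in the remaining positions)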
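The padding steps above can be sketched as follows. This is a minimal illustration under assumptions: `pad_to_length` is a hypothetical helper, and random tensors stand in for real token embeddings. It also builds a mask so that, if the padded batch is pooled afterwards, the zero rows do not dilute the average.

```python
import torch
import torch.nn.functional as F


def pad_to_length(emb, target_len):
    """Right-pad a [1, seq_len, hidden] tensor with zeros along the token axis."""
    extra = target_len - emb.shape[1]
    # F.pad reads pad sizes from the last dim backwards:
    # (hidden_left, hidden_right, token_left, token_right)
    return F.pad(emb, (0, 0, 0, extra))


torch.manual_seed(0)
u = torch.randn(1, 7, 768)   # stand-in for a 7-token sentence
v = torch.randn(1, 9, 768)   # stand-in for a 9-token sentence

S = max(u.shape[1], v.shape[1])                    # longest token count
batch = torch.cat([pad_to_length(u, S), pad_to_length(v, S)], dim=0)
print(batch.shape)                                 # [2, S, 768]

# 1.0 for real tokens, 0.0 for padding
lengths = torch.tensor([u.shape[1], v.shape[1]])
mask = (torch.arange(S).unsqueeze(0) < lengths.unsqueeze(1)).float()   # [2, S]

# masked mean: sum over real tokens only, divided by the real token count
sentence_vecs = (batch * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
print(sentence_vecs.shape)                         # [2, 768]
```

Padding alone gives equal shapes, but a direct comparison of two padded `[S, 768]` tensors still compares token positions one by one, so a pooling step like the masked mean is usually applied on top.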

What are your thoughts? What other techniques would you use, and why?

Thanks in advance!

Best answer

This is quite a general question, as there is no one specific correct answer.

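As you found out, the shapes differ because you get one output per token (depending on the tokenizer, these can be subword units). In other words, you have encoded every token into its own vector. What you want is a sentence embedding, and there are a number of ways to get one (none of which is the single right answer).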

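Particularly for sentence classification, we'd often use the output of the special classification token when the language model has been trained on it (CamemBERT uses <s>). Note that depending on the model, this can be the first token (mostly BERT and its descendants, including CamemBERT) or the last token (CTRL, GPT2, OpenAI, XLNet). I would suggest using this option when available, because that token is trained exactly for this purpose.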
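To make the first-token versus last-token distinction concrete, here is a small sketch. `classification_vector` is a hypothetical helper, the random tensor stands in for a model's hidden states, and `lengths` is assumed to hold the number of real (non-padding) tokens per sentence.

```python
import torch


def classification_vector(hidden, lengths, position="first"):
    """Pick one token vector per sentence.

    hidden:  [batch, seq_len, hidden_size]
    lengths: [batch], number of real tokens in each sentence
    """
    if position == "first":
        return hidden[:, 0]                        # e.g. <s> / [CLS] at index 0
    batch_idx = torch.arange(hidden.shape[0])
    return hidden[batch_idx, lengths - 1]          # last real token of each sentence


torch.manual_seed(0)
hidden = torch.randn(2, 9, 768)       # stand-in for model output
lengths = torch.tensor([7, 9])        # first sentence has 2 padding positions

first = classification_vector(hidden, lengths, "first")
last = classification_vector(hidden, lengths, "last")
print(first.shape, last.shape)        # both [2, 768]
```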

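If a [CLS] (or <s> or similar) token is not available, there are some other options that fall under the term pooling. Max and mean pooling are often used: you take either the element-wise maximum over all token vectors or their average. As you say, the "danger" is that you then reduce the vector values of the whole sentence to "some average" or "some max" that might not be very representative of the sentence. However, the literature shows that this works quite well too.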
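A minimal sketch of the two pooling options, assuming token outputs shaped `[1, seq_len, 768]` as in the question. The random tensors are stand-ins for real model output, and `mean_pool` and `max_pool` are illustrative names.

```python
import torch


def mean_pool(hidden):
    """Average over the token axis: [1, seq_len, hidden] -> [1, hidden]."""
    return hidden.mean(dim=1)


def max_pool(hidden):
    """Element-wise maximum over the token axis: [1, seq_len, hidden] -> [1, hidden]."""
    return hidden.max(dim=1).values


torch.manual_seed(0)
u = torch.randn(1, 7, 768)   # stand-in for a 7-token sentence
v = torch.randn(1, 9, 768)   # stand-in for a 9-token sentence

cos = torch.nn.CosineSimilarity(dim=1)
sim_mean = cos(mean_pool(u), mean_pool(v))
sim_max = cos(max_pool(u), max_pool(v))
print(mean_pool(u).shape, max_pool(v).shape)   # both [1, 768]
print(sim_mean, sim_max)
```

Either way, both sentences end up as a `[1, 768]` vector, so `CosineSimilarity(dim=1)` no longer fails on mismatched token counts.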

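As another answer suggests, the layer whose output you use can play a role as well. IIRC the Google paper on BERT suggests that they got the best score when concatenating the last four layers. This is more advanced and I will not go into it here unless requested.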
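For illustration only, concatenating the last four layers could look like the sketch below. The list of random tensors is a stand-in for per-layer hidden states (the question obtains such a list from fairseq with `return_all_hiddens=True`), and 13 entries assumes an embedding layer plus 12 transformer layers.

```python
import torch

torch.manual_seed(0)
# stand-in for per-layer outputs: embedding layer + 12 layers, each [1, seq_len, 768]
all_layers = [torch.randn(1, 9, 768) for _ in range(13)]

# concatenate the last four layers along the hidden dimension
last_four = torch.cat(all_layers[-4:], dim=-1)
print(last_four.shape)        # [1, 9, 3072]

# the result is still one vector per token, so a sentence vector
# still needs a reduction, e.g. the first token (<s>) or a mean
first_token_vec = last_four[:, 0]
mean_vec = last_four.mean(dim=1)
print(first_token_vec.shape, mean_vec.shape)   # both [1, 3072]
```

Note that the resulting vectors are four times wider (3072 instead of 768), which matters for storage and for the speed of similarity search.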

I have no experience with fairseq, but using the transformers library, I'd write something like this (CamemBERT is available in the library from v2.2.0):

import torch
from transformers import CamembertModel, CamembertTokenizer

text = "Salut, comment vas-tu ?"

tokenizer = CamembertTokenizer.from_pretrained('camembert-base')

# encode() automatically adds the special tokens <s> (classification) and </s>
token_ids = tokenizer.encode(text)
# use the public API rather than the private _convert_id_to_token
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(tokens)

# unsqueeze token_ids because batch_size=1
token_ids = torch.tensor(token_ids).unsqueeze(0)
print(token_ids)

# load model
model = CamembertModel.from_pretrained('camembert-base')

# the first element of the model output is the last hidden state,
# shape [batch_size, seq_len, hidden_size] (hidden states, not logits)
# squeeze() because batch_size=1
output = model(token_ids)[0].squeeze()
# only grab output of CLS token (<s>), which is the first token
cls_out = output[0]
print(cls_out.size())

The printed output is (in order) the tokens after tokenisation, the token IDs, and the final size.

['<s>', '▁Salut', ',', '▁comment', '▁vas', '-', 'tu', '▁?', '</s>']
tensor([[   5, 5340,    7,  404, 4660,   26,  744,  106,    6]])
torch.Size([768])

Regarding "machine-learning - NLP Transformers: Best way to get a fixed sentence embedding-vector shape?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59030907/
