nlp - Using Google T5 for word embeddings?

Tags: nlp lm huggingface-transformers word-embedding language-model

Is it possible to generate word embeddings with Google's T5?

I assume it is. However, I cannot find code for generating word embeddings on the relevant GitHub ( https://github.com/google-research/text-to-text-transfer-transformer ) or HuggingFace ( https://huggingface.co/docs/transformers/model_doc/t5 ) pages.

Best answer

Yes, it is possible. Just feed the token ids of the words to the model's word embedding layer:

from transformers import T5TokenizerFast, T5EncoderModel

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

# Tokenize without special tokens so that every id belongs to a word.
i = tokenizer(
    "This is a meaningless test sentence to show how you can get word embeddings",
    return_tensors="pt",
    return_attention_mask=False,
    add_special_tokens=False,
)

# Look up the input embedding vector of each token id.
o = model.encoder.embed_tokens(i.input_ids)

The output tensor has the following shape (batch size, number of tokens, embedding size):

print(o.shape)
# torch.Size([1, 19, 512])

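These 19 vectors are the representations of the individual tokens. Depending on your task, you can map them back to the words they belong to with word_ids:

i.word_ids()

Output: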

[0, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 12, 12]
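Since several tokens can share one word id, a common next step is to merge the token vectors of each word into a single vector. The following is a minimal sketch of that idea, assuming mean pooling is acceptable for your task (other choices, such as taking only the first sub-token, are equally valid). The helper name `pool_by_word` and the toy data are hypothetical and not part of the transformers API; plain Python lists are used so that the example runs without downloading the model.

```python
from collections import defaultdict


def pool_by_word(token_vectors, word_ids):
    """Average the token vectors that share a word id (hypothetical helper).

    token_vectors: list of equal-length lists of floats, one per token.
    word_ids: one entry per token; None entries (special tokens) are skipped.
    Returns a dict mapping word id -> mean vector.
    """
    if len(token_vectors) != len(word_ids):
        raise ValueError("token_vectors and word_ids must have the same length")

    # Collect the vectors of all tokens that belong to the same word.
    groups = defaultdict(list)
    for vector, word_id in zip(token_vectors, word_ids):
        if word_id is None:
            continue
        groups[word_id].append(vector)

    # Average each group dimension by dimension.
    pooled = {}
    for word_id, vectors in groups.items():
        count = len(vectors)
        pooled[word_id] = [sum(column) / count for column in zip(*vectors)]
    return pooled


# Toy data: 4 tokens with 2-dimensional vectors; tokens 1 and 2 form word 1.
toy_vectors = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0], [5.0, 5.0]]
toy_word_ids = [0, 1, 1, 2]
word_vectors = pool_by_word(toy_vectors, toy_word_ids)
print(word_vectors)
```

With the tensors from the answer, the equivalent call would be along the lines of `pool_by_word(o[0].detach().tolist(), i.word_ids())`, which yields one 512-dimensional vector per word.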

Original question on Stack Overflow: https://stackoverflow.com/questions/72451171/
