python-3.x - torch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows"

Tags: python-3.x pytorch vectorization word-embedding huggingface-transformers

I am using the sentence_vector() method of the BiobertEmbedding Python module (https://pypi.org/project/biobert-embedding/) to vectorize sentences. For some groups of sentences there is no problem, but for some others I get the following error message:

File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 133, in sentence_vector encoded_layers = self.eval_fwdprop_biobert(tokenized_text) File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 82, in eval_fwdprop_biobert encoded_layers, _ = self.model(tokens_tensor, segments_tensors) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__ result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 730, in forward embedding_output = self.embeddings(input_ids, token_type_ids) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__ result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 268, in forward position_embeddings = self.position_embeddings(position_ids) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__ result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward self.norm_type, self.scale_grad_by_freq, self.sparse) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1467, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237

I found that for some groups of sentences the problem is related to tags such as <tb>, for example. But for others, even with the tags removed, the error message is still there.
(Unfortunately, for confidentiality reasons, I cannot share the code.)

Do you have any idea what could be the problem?

Thanks in advance.

EDIT: You are right cronoik, it will be better with an example.

Example:

sentences = ["This is the first sentence.", "This is the second sentence.", "This is the third sentence."

biobert = BiobertEmbedding(model_path='./biobert_v1.1_pubmed_pytorch_model')

vectors = [biobert.sentence_vector(doc) for doc in sentences]

It seems to me that this last line of code is what causes the error message.
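
To narrow it down, one can wrap the call in a try/except. This is a minimal diagnostic sketch (purely illustrative, using the sentences list above) that reports which sentence makes sentence_vector() fail:

#Diagnostic sketch: find which sentence triggers the RuntimeError.
for i, doc in enumerate(sentences):
    try:
        biobert.sentence_vector(doc)
    except RuntimeError as e:
        print('sentence {} fails: {}'.format(i, e))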

Best Answer

The problem is that the biobert-embedding module is not taking care of the maximum sequence length of 512 (tokens, not words!). This is the relevant source code. Have a look at the example below to force the error you received:

from biobert_embedding.embedding import BiobertEmbedding
#sentence has 385 words
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'

biobert = BiobertEmbedding()
print('sentence has {} tokens'.format(len(biobert.process_text(sentence))))
#works
biobert.sentence_vector(sentence)
print('longersentence has {} tokens'.format(len(biobert.process_text(longersentence))))
#didn't work
biobert.sentence_vector(longersentence)

Output:

sentence has 512 tokens
longersentence has 513 tokens
#your error message....

What you should do is implement a sliding window approach to handle these texts:

import torch
from biobert_embedding.embedding import BiobertEmbedding

maxtokens = 512
docStride = 200

sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'

sentences = [sentence, longersentence, 'small test sentence']
vectors = []
biobert = BiobertEmbedding()

#https://github.com/Overfitter/biobert_embedding/blob/b114e3456de76085a6cf881ff2de48ce868e6f4b/biobert_embedding/embedding.py#L127
def sentence_vector(tokenized_text, biobert):
    encoded_layers = biobert.eval_fwdprop_biobert(tokenized_text)

    # `encoded_layers` has shape [12 x 1 x sequence_length x 768]
    # `token_vecs` is a tensor with shape [sequence_length x 768]
    token_vecs = encoded_layers[11][0]

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    return sentence_embedding


for doc in sentences:
    #tokenize your text
    docTokens = biobert.process_text(doc)

    startOffset = 0
    while startOffset < len(docTokens):
        print(startOffset)
        length = min(len(docTokens) - startOffset, maxtokens)

        #now we calculate the sentence_vector for the document slice
        vectors.append(sentence_vector(
                        docTokens[startOffset:startOffset+length]
                        , biobert)
                      )
        #stop when the whole document is processed (i.e. the document has
        #fewer than 512 tokens, or its last slice has just been handled)
        if startOffset + length == len(docTokens):
            break
        startOffset += min(length, docStride)
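
Note that this appends one vector per window, so a long document contributes several entries to vectors. If a single vector per document is needed, one common option (a sketch, not part of the original module) is to average a document's window vectors:

#Hypothetical helper: pool one document's window vectors into a single
#document vector by averaging. window_vectors is a list of [768] tensors.
def document_vector(window_vectors):
    return torch.mean(torch.stack(window_vectors), dim=0)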

P.S.: Your partial success with removing <tb> is plausible, because removing <tb> would remove 4 tokens ('<', 't', '##b', '>').
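
You can verify that token count directly with the module's tokenizer. A short sketch, assuming the process_text() method used above (the exact subword split depends on the WordPiece vocabulary):

tokens = biobert.process_text('a <tb> tag')
print(tokens)  #the '<tb>' part should show up as '<', 't', '##b', '>'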

Regarding the python-3.x - torch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62598130/
