nlp - Fine-tuning DistilBertForSequenceClassification: is not learning, why isn't the loss changing? Why aren't the weights updating?

Tags: nlp pytorch text-classification loss-function huggingface-transformers

I am relatively new to PyTorch and Hugging Face Transformers, and I have experimented with DistilBertForSequenceClassification on this Kaggle dataset.

from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch.optim as optim
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

n_epochs = 5 # or whatever
batch_size = 32 # or whatever

bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
#bert_distil.classifier = nn.Sequential(nn.Linear(in_features=768, out_features=1), nn.Sigmoid())
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=0.1)

X_train = []
Y_train = []

for row in train_df.iterrows():
    seq = tokenizer.encode(preprocess_text(row[1]['text']),  add_special_tokens=True, pad_to_max_length=True)
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]).unsqueeze(0))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)

running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    j = 0
    for i in range(0,len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices], Y_train[indices]
        batch_x.cuda()
        batch_y.cuda()
        outputs = bert_distil.forward(batch_x.cuda())
        loss = criterion(outputs[0],batch_y.squeeze().cuda())
        loss.requires_grad = True
   
        loss.backward()
        optimizer.step()
   
        running_loss += loss.item()  
        j+=1
        if j == 20:   
            #print(outputs[0])
            print('[%d, %5d] running loss: %.3f loss: %.3f ' %
              (epoch + 1, i*1, running_loss / 20, loss.item()))
            running_loss = 0.0
            j = 0

[1, 608] running loss: 0.689 loss: 0.687
[1, 1248] running loss: 0.693 loss: 0.694
[1, 1888] running loss: 0.693 loss: 0.683
[1, 2528] running loss: 0.689 loss: 0.701
[1, 3168] running loss: 0.690 loss: 0.684
[1, 3808] running loss: 0.689 loss: 0.688
[1, 4448] running loss: 0.689 loss: 0.692
etc...

No matter what I try, the loss never decreases or even increases, and the predictions do not get any better. It seems to me that I am forgetting something, so that the weights are not actually being updated. Does anyone have an idea?

What I have tried

  • Different loss functions
    • BCE
    • CrossEntropy
    • even MSE loss
  • One-hot encoding vs. a single-neuron output (see the sketch after this list)
  • Different learning rates and optimizers
  • I even changed all the targets to a single label, but even then the network did not converge.
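
For reference, here is a minimal sketch of the two output/loss combinations mentioned above: a single-neuron head with BCEWithLogitsLoss versus the default two-logit head with CrossEntropyLoss. The shapes and variable names are illustrative assumptions, not taken from the original post:

import torch
import torch.nn as nn

batch_size, hidden = 4, 768
features = torch.randn(batch_size, hidden)    # stand-in for pooled DistilBERT features
targets = torch.tensor([0, 1, 1, 0])          # binary labels

# Variant 1: single output neuron + BCEWithLogitsLoss
# (targets must be floats with the same shape as the logits)
head_bce = nn.Linear(hidden, 1)
logits_bce = head_bce(features).squeeze(-1)   # shape: (batch_size,)
loss_bce = nn.BCEWithLogitsLoss()(logits_bce, targets.float())

# Variant 2: two output neurons + CrossEntropyLoss
# (targets must be long class indices, not one-hot vectors)
head_ce = nn.Linear(hidden, 2)
logits_ce = head_ce(features)                 # shape: (batch_size, 2)
loss_ce = nn.CrossEntropyLoss()(logits_ce, targets)

Either combination works on its own; mixing them (for example, CrossEntropyLoss with a single output neuron, or BCE with integer targets) raises shape or dtype errors.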

Best answer

Looking at the running loss and the minibatch loss is easily misleading. You should look at the epoch loss, because the inputs are the same for every epoch loss.

In addition, there are a few issues in your code; after fixing all of them, the behavior is as expected: the loss slowly decreases after each epoch, and the model can also overfit a small minibatch. Please look at the code below; the changes include using model(x) instead of model.forward(x), calling cuda() only once, a smaller learning rate, etc.

Tuning and fine-tuning ML models is hard work.

n_epochs = 5
batch_size = 1

bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=1e-3)  # much smaller learning rate than the original 0.1

X_train = []
Y_train = []
for row in train_df.iterrows():
    seq = tokenizer.encode(row[1]['text'],  add_special_tokens=True, pad_to_max_length=True)[:100]
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)

running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    for i in range(0,len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices].cuda(), Y_train[indices].cuda()  # move to GPU once and keep the returned tensors
        outputs = bert_distil(batch_x)  # call the model directly instead of model.forward()
        loss = criterion(outputs[0], batch_y)
        loss.backward()
        optimizer.step()
   
        running_loss += loss.item()  

    print('[%d] epoch loss: %.3f' %
      (epoch + 1, running_loss / len(X_train) * batch_size))
    running_loss = 0.0

Output:

[1] epoch loss: 0.695
[2] epoch loss: 0.690
[3] epoch loss: 0.687
[4] epoch loss: 0.685
[5] epoch loss: 0.684
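
One more note: the question imports get_linear_schedule_with_warmup but never uses it. Below is a minimal sketch of how a warmup scheduler is typically combined with AdamW and a learning rate in the 2e-5 to 5e-5 range for BERT-style fine-tuning; the concrete values are common defaults, not something taken from the answer above.

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

n_epochs = 5
batch_size = 32
num_training_steps = (len(X_train) // batch_size) * n_epochs

optimizer = AdamW(bert_distil.parameters(), lr=2e-5)    # typical fine-tuning learning rate
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),     # warm up over ~10% of the steps
    num_training_steps=num_training_steps,
)

# Inside the training loop, call scheduler.step() right after optimizer.step(),
# so the learning rate ramps up during warmup and then decays linearly to zero.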

Regarding "nlp - Fine-tuning DistilBertForSequenceClassification: is not learning, why isn't the loss changing? Why aren't the weights updating?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63218778/
