machine-learning - Low accuracy with BERT multi-class sentiment analysis?

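I am working with a small dataset:

It contains 1,500 news articles.
All of the articles were rated by humans on a 5-point sentiment/positivity scale.
There are no spelling errors. I checked the spelling with Google Sheets before importing the data for analysis. A few characters are still incorrectly encoded, but not many.
The average length is greater than 512 words.
The dataset is slightly imbalanced.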

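I treat this as a multi-class classification problem, and I want to fine-tune BERT on this dataset. To do that, I used the Ktrain package and basically followed the tutorial. Here is my code: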

import ktrain
from ktrain import text

# Preprocess the raw texts for BERT (tokenize, truncate/pad to maxlen)
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=x_train,
    y_train=y_train,
    x_test=x_test,
    y_test=y_test,
    class_names=categories,
    preprocess_mode='bert',
    maxlen=510,
    max_features=35000)

# Build the BERT classifier and fine-tune it with the one-cycle policy
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=6)
learner.fit_onecycle(2e-5, 4)

However, my validation accuracy is only about 25%, which is far too low.

          precision    recall  f1-score   support

       1       0.33      0.40      0.36        75
       2       0.27      0.36      0.31        84
       3       0.23      0.24      0.23        58
       4       0.18      0.09      0.12        54
       5       0.33      0.04      0.07        24

accuracy                           0.27       295
macro avg      0.27      0.23      0.22       295
weighted avg   0.26      0.27      0.25       295

I also tried a head+tail truncation strategy, because some of the articles are quite long, but the performance stayed the same.
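For readers unfamiliar with the idea, head+tail truncation keeps the beginning and the end of a long document and drops the middle. The following is only a minimal sketch: the function name `head_tail_truncate` is hypothetical, it operates on a plain list of tokens rather than on BERT WordPiece ids, and the 128/128 split in the example is just one possible choice.

```python
def head_tail_truncate(tokens, head, tail):
    """Keep the first `head` and the last `tail` tokens; drop the middle.

    Hypothetical helper for illustration. `tokens` is any list of tokens
    (here plain strings, not WordPiece ids).
    """
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    # Short documents are returned unchanged.
    if len(tokens) <= head + tail:
        return list(tokens)
    tail_part = list(tokens[-tail:]) if tail > 0 else []
    return list(tokens[:head]) + tail_part


# Illustrative input: a 1000-token "article".
words = [f"w{i}" for i in range(1000)]
short = head_tail_truncate(words, head=128, tail=128)
print(len(short))
```

The truncated text would then be joined back into a string (or fed as ids) before being passed to the model, with `maxlen` set to at least `head + tail` plus any special tokens.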

Can anyone give me some suggestions?

Thanks a lot!

Best,

Xu

==================Update 7.21==================

Following Kartickey's suggestion, I tried lr_find. The results are below. It looks like 2e-5 is a reasonable learning rate.

simulating training for different learning rates... this may take a few moments...
Train on 1182 samples
Epoch 1/2
1182/1182 [==============================] - 223s 188ms/sample - loss: 1.6878 - accuracy: 0.2487
Epoch 2/2
432/1182 [=========>....................] - ETA: 2:12 - loss: 3.4780 - accuracy: 0.2639
done.
Visually inspect loss plot and select learning rate associated with falling loss

[Figure: learning rate.jpg (loss plot from the learning-rate finder)]

I also just tried running it with class weights:

{0: 0,
 1: 0.8294736842105264,
 2: 0.6715909090909091,
 3: 1.0844036697247708,
 4: 1.1311004784688996,
 5: 2.0033898305084747}
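The post does not say how these weights were derived. One common heuristic, shown here purely as an assumption, is inverse class frequency: weight = n_samples / (n_classes * count of that class). The helper name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Hypothetical helper; this is one common heuristic, not necessarily
    the method used to produce the weights in the post.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Toy labels for illustration only (not the dataset from the post).
labels = [1] * 4 + [2] * 4 + [3] * 2
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive weights above 1 and frequent classes weights below 1, which matches the general pattern of the dictionary above, where class 5 has the largest weight.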

Here are the results. Not much changed.

          precision    recall  f1-score   support

       1       0.43      0.27      0.33        88
       2       0.22      0.46      0.30        69
       3       0.19      0.09      0.13        64
       4       0.13      0.13      0.13        47
       5       0.16      0.11      0.13        28

accuracy                            0.24       296
macro avg       0.23      0.21      0.20       296
weighted avg    0.26      0.24      0.23       296

array([[24, 41,  9,  8,  6],
       [13, 32,  6, 12,  6],
       [ 9, 33,  6, 14,  2],
       [ 4, 25, 10,  6,  2],
       [ 6, 14,  0,  5,  3]])
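The matrix can be read more easily by recomputing a few quantities from it. This sketch assumes the usual convention that rows are true labels and columns are predictions, which is consistent with the row sums matching the support column in the report. The function names are hypothetical.

```python
# Confusion matrix copied from the post (rows assumed to be true labels,
# columns assumed to be predicted labels).
confusion = [
    [24, 41, 9, 8, 6],
    [13, 32, 6, 12, 6],
    [9, 33, 6, 14, 2],
    [4, 25, 10, 6, 2],
    [6, 14, 0, 5, 3],
]


def accuracy(cm):
    """Fraction of samples on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_recall(cm):
    """Diagonal entry divided by the row sum, per class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def predicted_counts(cm):
    """How often each class was predicted (column sums)."""
    return [sum(col) for col in zip(*cm)]


print(round(accuracy(confusion), 2))
print([round(r, 2) for r in per_class_recall(confusion)])
print(predicted_counts(confusion))
```

The column sums show that class 2 is predicted for 145 of the 296 test samples, so the model leans heavily toward a single class rather than separating the five ratings.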

==============Update 7.22 =============

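To get some baseline results, I collapsed the 5-point classification problem into a binary classification problem that only predicts positive or negative. This time the accuracy rose to about 55%. Below is a detailed description of my strategy: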

training data: 956 samples (excluding those classified as neutral)
truncation strategy: use the first 128 and last 128 tokens
(x_train, y_train), (x_test, y_test), preproc_l1 = text.texts_from_array(
    x_train=x_train,
    y_train=y_train,
    x_test=x_test,
    y_test=y_test,
    class_names=categories_1,
    preprocess_mode='bert',
    maxlen=256,
    max_features=35000)
Results:
          precision    recall  f1-score   support

       1       0.65      0.80      0.72       151
       2       0.45      0.28      0.35        89

accuracy                           0.61       240
macro avg      0.55      0.54      0.53       240
weighted avg   0.58      0.61      0.58       240

array([[121,  30],
       [ 64,  25]])

However, I think 55% is still not a satisfactory accuracy; it is only slightly better than random guessing.
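On an imbalanced test set, a more informative reference point than 50% random guessing is the accuracy of always predicting the most frequent class. The sketch below computes that baseline from the support column of the binary report above and compares it with the accuracy implied by the confusion matrix. The helper name `majority_baseline` is hypothetical.

```python
def majority_baseline(support):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(support.values()) / sum(support.values())


# Support per class, taken from the binary classification report in the post.
binary_support = {1: 151, 2: 89}
baseline = majority_baseline(binary_support)

# Correct predictions are the diagonal of the 2x2 confusion matrix: 121 and 25.
model_accuracy = (121 + 25) / 240

print(f"majority baseline: {baseline:.3f}")
print(f"model accuracy:    {model_accuracy:.3f}")
```

With these numbers the model's accuracy (about 0.61) sits slightly below the majority-class baseline (about 0.63), which suggests the fine-tuned model has not yet learned much beyond the class prior.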

============Update 7.26============

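Following Marcos Lima's suggestion, I added a few extra steps to my procedure:

Before the Ktrain package does its preprocessing, I remove all digits, punctuation, and redundant whitespace. (I thought the Ktrain package would do this for me, but I am not sure.)
I use the first 384 and the last 128 tokens of each text in the sample. This is what I call the "head+tail" strategy.
The task is still binary classification (positive vs. negative).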
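The cleaning step described above could look roughly like the following. This is a sketch under assumptions: the function name `clean_text` is hypothetical, only ASCII punctuation from `string.punctuation` is handled, and the sample sentence is invented for illustration.

```python
import re
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text):
    """Remove digits and ASCII punctuation, then collapse whitespace."""
    text = re.sub(r"\d+", " ", text)          # drop digits
    text = text.translate(_PUNCT_TABLE)       # drop punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse redundant spaces


# Invented sample sentence for illustration.
sample = "Q3 revenue rose 4.5%,   beating   forecasts!"
print(clean_text(sample))  # prints: Q revenue rose beating forecasts
```

Note what the sample loses: the figure "4.5%" disappears entirely. In economic news such numbers may carry part of the sentiment signal, so it may be worth comparing results with and without this step.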

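Here is the learning curve plot. It still looks the same as the one I posted earlier, and it still looks very different from the one Marcos Lima posted: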

[Figure: the updated learning curve]

Below are my results, which are probably the best I have obtained so far.

begin training using onecycle policy with max lr of 1e-05...
Train on 1405 samples
Epoch 1/4
1405/1405 [==============================] - 186s 133ms/sample - loss: 0.7220 - accuracy: 0.5431
Epoch 2/4
1405/1405 [==============================] - 167s 119ms/sample - loss: 0.6866 - accuracy: 0.5843
Epoch 3/4
1405/1405 [==============================] - 166s 118ms/sample - loss: 0.6565 - accuracy: 0.6335
Epoch 4/4
1405/1405 [==============================] - 166s 118ms/sample - loss: 0.5321 - accuracy: 0.7587

          precision    recall  f1-score   support

       1       0.77      0.69      0.73       241
       2       0.46      0.56      0.50       111

accuracy                           0.65       352
macro avg      0.61      0.63      0.62       352
weighted avg   0.67      0.65      0.66       352

array([[167,  74],
       [ 49,  62]])

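Note: I think the reason the package struggles with my task may be that the task is something like a combination of topic classification and sentiment analysis. The classic classification task for news articles is to decide which category an article belongs to, e.g. biology, economics, or sports, and the vocabulary differs a lot between categories. The classic examples of sentiment classification, on the other hand, are Yelp or IMDB reviews. My guess is that those texts express sentiment very directly, whereas the texts in my sample (economic news) are carefully polished and well organized before publication, so the sentiment may always appear in some implicit way that BERT may not be able to detect.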

Best answer

Try hyperparameter optimization.

Before running learner.fit_onecycle(2e-5, 4), try: learner.lr_find(show_plot=True, max_epochs=2)

Does each class make up roughly 20% of the data? Maybe try something along these lines:

MODEL_NAME = 'bert'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=train_b.target_names)

.....
.....

# the one we got most wrong
learner.view_top_losses(n=1, preproc=t)

Increase the weight for the class identified above.

Is the validation set a stratified sample or a random sample?
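To make the distinction concrete: a stratified split samples the same fraction from every class, so the validation set keeps the class proportions of the full data. Below is a minimal standard-library sketch; the function name `stratified_split`, its parameters, and the toy labels are hypothetical and are not part of Ktrain.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved.

    Hypothetical helper for illustration only.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from each class.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, imbalanced labels for illustration only.
labels = [1] * 50 + [2] * 30 + [3] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(Counter(labels[i] for i in test_idx))
```

With a purely random split on a small, imbalanced dataset, the rarest class can end up over- or under-represented in the validation set, which makes the per-class metrics noisy.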

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63005475/
