python - How to mask padding in a BLSTM in Keras?

Tags: python machine-learning keras lstm

I'm running a BLSTM based on the IMDB example, but my version is not classification; it is sequence prediction of labels. For simplicity, you can think of it as a part-of-speech tagging model: the input is a sentence of words, the output is a sequence of tags. The syntax used in that example differs slightly from most other Keras examples in that it doesn't use model.add but instead wires layers together starting from an Input tensor (the functional API). I can't figure out how to add a masking layer with this slightly different syntax.
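(For context, the functional API chains layers by calling each one on a tensor, so a Masking layer slots in like any other layer. A minimal sketch, assuming real-valued inputs; maxlen, feat_dim, and hidden are placeholders here. With integer word IDs, as in the code below, mask_zero=True on the Embedding layer is the usual alternative.)

from keras.layers import Input, LSTM
from keras.layers.core import Masking

# hypothetical real-valued input of shape (timesteps, features)
inp = Input(shape=(maxlen, feat_dim))
# timesteps whose values are all equal to mask_value are skipped downstream
masked = Masking(mask_value=0.)(inp)
out = LSTM(hidden, return_sequences=True)(masked)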

I have the model running and tested, and it works well, except that it predicts and evaluates the accuracy of the 0s, which are my padding. Here is the code:

from __future__ import print_function
import numpy as np
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers.core import Masking
from keras.layers import TimeDistributed, Dense
from keras.layers import Dropout, Embedding, LSTM, Input, merge
from prep_nn import prep_scan
from keras.utils import np_utils, generic_utils


np.random.seed(1337)  # for reproducibility
nb_words = 20000  # max. size of vocab
nb_classes = 10  # number of labels
hidden = 500  # 500 gives best results so far
batch_size = 10  # create and update net after 10 lines
val_split = .1
epochs = 15

# input for X is multi-dimensional numpy array with IDs,
# one line per array. input y is multi-dimensional numpy array with
# binary arrays for each value of each label.
# maxlen is length of longest line
print('Loading data...')
(X_train, y_train), (X_test, y_test), maxlen = prep_scan(
    nb_words=nb_words, test_len=75)

print(len(X_train), 'train sequences')
print(int(len(X_train)*val_split), 'validation sequences')
print(len(X_test), 'heldout sequences')

# this is the placeholder tensor for the input sequences
sequence = Input(shape=(maxlen,), dtype='int32')

# this embedding layer will transform the sequences of integers
# into vectors
embedded = Embedding(nb_words, output_dim=hidden,
                     input_length=maxlen)(sequence)

# apply forwards LSTM
forwards = LSTM(output_dim=hidden, return_sequences=True)(embedded)
# apply backwards LSTM
backwards = LSTM(output_dim=hidden, return_sequences=True,
                 go_backwards=True)(embedded)

# concatenate the outputs of the 2 LSTMs
merged = merge([forwards, backwards], mode='concat', concat_axis=-1)
after_dp = Dropout(0.15)(merged)

# TimeDistributed for sequence
# change activation to sigmoid?
output = TimeDistributed(
    Dense(output_dim=nb_classes,
          activation='softmax'))(after_dp)

model = Model(input=sequence, output=output)

# try using different optimizers and different optimizer configs
# loss=binary_crossentropy, optimizer=rmsprop
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'], optimizer='adam')

print('Train...')
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=epochs,
          shuffle=True,
          validation_split=val_split)

Update:

I merged this PR and got it working with mask_zero=True in the embedding layer. But after seeing the model's poor performance, I now realize I also need masking on the output; others have suggested using sample_weight in the model.fit line instead. How can I do this so that the 0s are ignored?

Update 2:

So I read this and worked out sample_weight as a matrix of 1s and 0s. I thought it might have been working, but my accuracy plateaus around 50%, and I just discovered that the model is still trying to predict the padded parts; it just no longer predicts them as 0, which was the problem before using sample_weight.
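(Concretely, the weight matrix was presumably computed along these lines; a sketch, assuming padding is ID 0 in X_train, since prep_nn is not shown:)

import numpy as np

# per-timestep weights: 1.0 for real tokens, 0.0 for padding (ID 0);
# passed to model.fit as sample_weight with sample_weight_mode='temporal'
weights = (X_train != 0).astype(np.float32)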

Current code:

from __future__ import print_function
import numpy as np
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers.core import Masking
from keras.layers import TimeDistributed, Dense
from keras.layers import Dropout, Embedding, LSTM, Input, merge
from prep_nn import prep_scan
from keras.utils import np_utils, generic_utils
import itertools
from itertools import chain
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pandas as pd


np.random.seed(1337)  # for reproducibility
nb_words = 20000  # max. size of vocab
nb_classes = 10  # number of labels
hidden = 500  # 500 gives best results so far
batch_size = 10  # create and update net after 10 lines
val_split = .1
epochs = 10

# input for X is multi-dimensional numpy array with syll IDs,
# one line per array. input y is multi-dimensional numpy array with
# binary arrays for each value of each label.
# maxlen is length of longest line
print('Loading data...')
(X_train, y_train), (X_test, y_test), maxlen, sylls_ids, tags_ids, weights = \
    prep_scan(nb_words=nb_words, test_len=75)

print(len(X_train), 'train sequences')
print(int(len(X_train) * val_split), 'validation sequences')
print(len(X_test), 'heldout sequences')

# this is the placeholder tensor for the input sequences
sequence = Input(shape=(maxlen,), dtype='int32')

# this embedding layer will transform the sequences of integers
# into vectors of size hidden
embedded = Embedding(nb_words, output_dim=hidden,
                     input_length=maxlen, mask_zero=True)(sequence)

# apply forwards LSTM
forwards = LSTM(output_dim=hidden, return_sequences=True)(embedded)
# apply backwards LSTM
backwards = LSTM(output_dim=hidden, return_sequences=True,
                 go_backwards=True)(embedded)

# concatenate the outputs of the 2 LSTMs
merged = merge([forwards, backwards], mode='concat', concat_axis=-1)
# after_dp = Dropout(0.)(merged)

# TimeDistributed for sequence
# change activation to sigmoid?
output = TimeDistributed(
    Dense(output_dim=nb_classes,
          activation='softmax'))(merged)

model = Model(input=sequence, output=output)

# try using different optimizers and different optimizer configs
# loss=binary_crossentropy, optimizer=rmsprop
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'], optimizer='adam',
              sample_weight_mode='temporal')

print('Train...')
model.fit(X_train, y_train,
          batch_size=batch_size,
          nb_epoch=epochs,
          shuffle=True,
          validation_split=val_split,
          sample_weight=weights)
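(To check whether the padding is what is pinning the metric near 50%, accuracy can be recomputed over real tokens only. A sketch, assuming padding is ID 0 in X_test and y_test is one-hot encoded per timestep:)

import numpy as np

probs = model.predict(X_test, batch_size=batch_size)
pred = probs.argmax(axis=-1)   # (samples, timesteps) predicted label IDs
true = y_test.argmax(axis=-1)  # undo the one-hot encoding
mask = X_test != 0             # keep only non-padded positions

acc = (pred[mask] == true[mask]).mean()
print('accuracy over non-padded timesteps:', acc)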

Best Answer

Did you ever solve this? It's not clear to me how your code handles the padding value versus the word indices. What about letting the word indices start from 1 and defining

embedded = Embedding(nb_words + 1, output_dim=hidden,
                     input_length=maxlen, mask_zero=True)(sequence)

instead of

embedded = Embedding(nb_words, output_dim=hidden,
                     input_length=maxlen, mask_zero=True)(sequence)

according to https://keras.io/layers/embeddings/
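(In other words, mask_zero=True reserves index 0 exclusively for padding, so real word IDs must run from 1 to nb_words and the Embedding input dimension grows by one. A sketch of that convention; vocab, sentences, and sequence_input are hypothetical names:)

from keras.preprocessing import sequence

# reserve 0 for padding: real word IDs run from 1 to nb_words
word_to_id = {w: i + 1 for i, w in enumerate(vocab)}

X = [[word_to_id[w] for w in sent] for sent in sentences]
X = sequence.pad_sequences(X, maxlen=maxlen, value=0)  # pad with the masked ID

embedded = Embedding(nb_words + 1, output_dim=hidden,
                     input_length=maxlen, mask_zero=True)(sequence_input)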

About python - How to mask padding in a BLSTM in Keras?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37817588/
