python - Keras validation loss is computed incorrectly

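My model trains fine on a CPU machine, but I run into a problem when I try to re-run it on our cluster (using a single GPU and the same dataset). When training on the GPU machine, validation loss and accuracy do not improve from one epoch to the next (see below). This is not the case on the CPU machine, where I was able to reach a validation accuracy of ~0.8 after 20 epochs.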

Details:

Keras 2.1.3

TensorFlow backend

70/20/10 train/dev/test split

~7000 images

Model based on ResNet50

Code

    import sys
    import math
    import os
    import glob

    from keras import applications
    from keras.preprocessing.image import ImageDataGenerator
    from keras import optimizers
    from keras.models import Sequential, Model 
    from keras.layers import Flatten, Dense
    from keras import backend as k 
    from keras.callbacks import ModelCheckpoint, CSVLogger, EarlyStopping

    ############ Training parameters ##################
    img_width, img_height = 224, 224
    batch_size = 32
    epochs = 100


    ############ Define the data ##################
    train_data_dir = '/mnt/data/train'
    validation_data_dir = '/mnt/data/validate'

    train_data_dir_class1 = os.path.join(train_data_dir,'class1', '*.jpg')
    train_data_dir_class2 = os.path.join(train_data_dir, 'class2', '*.jpg')

    validation_data_dir_class1 = os.path.join(validation_data_dir, 'class1', '*.jpg')
    validation_data_dir_class2 = os.path.join(validation_data_dir, 'class2', '*.jpg')

    # number of training and validation samples
    nb_train_samples = len(glob.glob(train_data_dir_class1)) + len(glob.glob(train_data_dir_class2))
    nb_validation_samples = len(glob.glob(validation_data_dir_class1)) + len(glob.glob(validation_data_dir_class2))


    ############ Define the model ##################
    model = applications.resnet50.ResNet50(weights = "imagenet",
                                           include_top = False,
                                           input_shape = (img_width, img_height, 3))

    for layer in model.layers:
        layer.trainable = False

    # Adding a FC layer
    x = model.output
    x = Flatten()(x)
    predictions = Dense(1, activation = "sigmoid")(x)

    # creating the final model 
    model_final = Model(inputs = model.input, outputs = predictions)

    # compile the model 
    model_final.compile(loss = "binary_crossentropy",
                        optimizer = optimizers.Adam(lr = 0.001,
                                                    beta_1 = 0.9,
                                                    beta_2 = 0.999,
                                                    epsilon = 1e-10),
                        metrics = ["accuracy"])

    # train and test generators 
    train_datagen = ImageDataGenerator(rescale = 1./255,
                                       horizontal_flip = True,
                                       fill_mode = "nearest",
                                       zoom_range = 0.3,
                                       width_shift_range = 0.3,
                                       height_shift_range = 0.3,
                                       rotation_range = 30)

    test_datagen = ImageDataGenerator(rescale = 1./255)

    train_generator = train_datagen.flow_from_directory(train_data_dir,
                                                        target_size = (img_height, img_width),
                                                        batch_size = batch_size,
                                                        class_mode = "binary",
                                                        seed = 2018)

    validation_generator = test_datagen.flow_from_directory(validation_data_dir,
                                                            target_size = (img_height, img_width),
                                                            class_mode = "binary",
                                                            seed = 2018)

    early = EarlyStopping(monitor = 'val_loss', min_delta = 10e-5, patience = 10, verbose = 1, mode = 'auto')
    performance_log = CSVLogger('/mnt/results/vanilla_model_log.csv', separator = ',', append = False)

    # Train the model
    model_final.fit_generator(generator = train_generator,
                              steps_per_epoch = math.ceil(train_generator.samples / batch_size),
                              epochs = epochs,
                              validation_data = validation_generator,
                              validation_steps = math.ceil(validation_generator.samples / batch_size),
                              callbacks = [early, performance_log])

    # Save the model
    model_final.save('/mnt/results/vanilla_model.h5')

Training log

    Epoch 1/100
    151/151 [==============================] - 237s 2s/step - loss: 0.7234 - acc: 0.5240 - val_loss: 0.9899 - val_acc: 0.5425
    Epoch 2/100
    151/151 [==============================] - 65s 428ms/step - loss: 0.6491 - acc: 0.6228 - val_loss: 1.0248 - val_acc: 0.5425
    Epoch 3/100
    151/151 [==============================] - 65s 429ms/step - loss: 0.6091 - acc: 0.6648 - val_loss: 1.0377 - val_acc: 0.5425
    Epoch 4/100
    151/151 [==============================] - 64s 426ms/step - loss: 0.5829 - acc: 0.6968 - val_loss: 1.0459 - val_acc: 0.5425
    Epoch 5/100
    151/151 [==============================] - 64s 427ms/step - loss: 0.5722 - acc: 0.7070 - val_loss: 1.0472 - val_acc: 0.5425
    Epoch 6/100
    151/151 [==============================] - 64s 427ms/step - loss: 0.5582 - acc: 0.7166 - val_loss: 1.0501 - val_acc: 0.5425
    Epoch 7/100
    151/151 [==============================] - 64s 424ms/step - loss: 0.5535 - acc: 0.7188 - val_loss: 1.0492 - val_acc: 0.5425
    Epoch 8/100
    151/151 [==============================] - 64s 426ms/step - loss: 0.5377 - acc: 0.7287 - val_loss: 1.0209 - val_acc: 0.5425
    Epoch 9/100
    151/151 [==============================] - 64s 425ms/step - loss: 0.5328 - acc: 0.7368 - val_loss: 1.0062 - val_acc: 0.5425
    Epoch 10/100
    151/151 [==============================] - 65s 432ms/step - loss: 0.5296 - acc: 0.7381 - val_loss: 1.0016 - val_acc: 0.5425
    Epoch 11/100
    151/151 [==============================] - 65s 430ms/step - loss: 0.5231 - acc: 0.7419 - val_loss: 1.0021 - val_acc: 0.5425

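Since I was able to get good results on the CPU machine, I assumed that the validation loss/accuracy must be computed incorrectly at the end of each epoch. To test this theory, I used the training set as the validation set: if the validation loss/accuracy were computed correctly, we should see roughly the same values for training and validation loss and accuracy. As you can see below, the validation loss values differ from the training loss values, which leads me to believe that the validation loss is computed incorrectly at the end of each epoch. Why would this happen? What are possible solutions?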

Modified code

    import sys
    import math
    import os
    import glob

    from keras import applications
    from keras.preprocessing.image import ImageDataGenerator
    from keras import optimizers
    from keras.models import Sequential, Model 
    from keras.layers import Flatten, Dense
    from keras import backend as k 
    from keras.callbacks import ModelCheckpoint, CSVLogger, EarlyStopping

    ############ Training parameters ##################
    img_width, img_height = 224, 224
    batch_size = 32
    epochs = 100


    ############ Define the data ##################
    train_data_dir = '/mnt/data/train'
    validation_data_dir = '/mnt/data/train' # redefined validation set to test accuracy of validation loss/accuracy calculations

    train_data_dir_class1 = os.path.join(train_data_dir,'class1', '*.jpg')
    train_data_dir_class2 = os.path.join(train_data_dir, 'class2', '*.jpg')

    validation_data_dir_class1 = os.path.join(validation_data_dir, 'class1', '*.jpg')
    validation_data_dir_class2 = os.path.join(validation_data_dir, 'class2', '*.jpg')

    # number of training and validation samples
    nb_train_samples = len(glob.glob(train_data_dir_class1)) + len(glob.glob(train_data_dir_class2))
    nb_validation_samples = len(glob.glob(validation_data_dir_class1)) + len(glob.glob(validation_data_dir_class2))


    ############ Define the model ##################
    model = applications.resnet50.ResNet50(weights = "imagenet",
                                           include_top = False,
                                           input_shape = (img_width, img_height, 3))

    for layer in model.layers:
        layer.trainable = False

    # Adding a FC layer
    x = model.output
    x = Flatten()(x)
    predictions = Dense(1, activation = "sigmoid")(x)

    # creating the final model 
    model_final = Model(inputs = model.input, outputs = predictions)

    # compile the model 
    model_final.compile(loss = "binary_crossentropy",
                        optimizer = optimizers.Adam(lr = 0.001,
                                                    beta_1 = 0.9,
                                                    beta_2 = 0.999,
                                                    epsilon = 1e-10),
                        metrics = ["accuracy"])

    # train and test generators 
    train_datagen = ImageDataGenerator(rescale = 1./255,
                                       horizontal_flip = True,
                                       fill_mode = "nearest",
                                       zoom_range = 0.3,
                                       width_shift_range = 0.3,
                                       height_shift_range = 0.3,
                                       rotation_range = 30)

    test_datagen = ImageDataGenerator(rescale = 1./255)

    train_generator = train_datagen.flow_from_directory(train_data_dir,
                                                        target_size = (img_height, img_width),
                                                        batch_size = batch_size,
                                                        class_mode = "binary",
                                                        seed = 2018)

    validation_generator = test_datagen.flow_from_directory(validation_data_dir,
                                                            target_size = (img_height, img_width),
                                                            class_mode = "binary",
                                                            seed = 2018)

    early = EarlyStopping(monitor = 'val_loss', min_delta = 10e-5, patience = 10, verbose = 1, mode = 'auto')
    performance_log = CSVLogger('/mnt/results/vanilla_model_log.csv', separator = ',', append = False)

    # Train the model
    model_final.fit_generator(generator = train_generator,
                              steps_per_epoch = math.ceil(train_generator.samples / batch_size),
                              epochs = epochs,
                              validation_data = validation_generator,
                              validation_steps = math.ceil(validation_generator.samples / batch_size),
                              callbacks = [early, performance_log])

    # Save the model
    model_final.save('/mnt/results/vanilla_model.h5')

Training log for the modified code:

    Epoch 1/100
    151/151 [==============================] - 251s 2s/step - loss: 0.6804 - acc: 0.5910 - val_loss: 0.6923 - val_acc: 0.5469
    Epoch 2/100
    151/151 [==============================] - 87s 578ms/step - loss: 0.6258 - acc: 0.6523 - val_loss: 0.6938 - val_acc: 0.5469
    Epoch 3/100
    151/151 [==============================] - 88s 580ms/step - loss: 0.5946 - acc: 0.6874 - val_loss: 0.7001 - val_acc: 0.5469
    Epoch 4/100
    151/151 [==============================] - 88s 580ms/step - loss: 0.5718 - acc: 0.7086 - val_loss: 0.7036 - val_acc: 0.5469
    Epoch 5/100
    151/151 [==============================] - 87s 578ms/step - loss: 0.5634 - acc: 0.7157 - val_loss: 0.7067 - val_acc: 0.5469
    Epoch 6/100
    151/151 [==============================] - 87s 578ms/step - loss: 0.5467 - acc: 0.7243 - val_loss: 0.7099 - val_acc: 0.5469
    Epoch 7/100
    151/151 [==============================] - 87s 578ms/step - loss: 0.5392 - acc: 0.7317 - val_loss: 0.7096 - val_acc: 0.5469
    Epoch 8/100
    151/151 [==============================] - 87s 578ms/step - loss: 0.5287 - acc: 0.7387 - val_loss: 0.7083 - val_acc: 0.5469
    Epoch 9/100
    151/151 [==============================] - 87s 575ms/step - loss: 0.5306 - acc: 0.7385 - val_loss: 0.7088 - val_acc: 0.5469
    Epoch 10/100
    151/151 [==============================] - 87s 577ms/step - loss: 0.5303 - acc: 0.7318 - val_loss: 0.7111 - val_acc: 0.5469
    Epoch 11/100
    151/151 [==============================] - 87s 578ms/step - loss: 0.5157 - acc: 0.7474 - val_loss: 0.7143 - val_acc: 0.5469
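
The sanity check above rests on the idea that loss and accuracy depend only on the labels and the predicted probabilities. The sketch below restates both metrics in plain Python with made-up labels. It is an illustration of the reasoning, not Keras's implementation, and the function names are hypothetical. It also shows one possible reading of the logs, offered as an assumption rather than a diagnosis: a model that assigns every sample to the same class has an accuracy equal to that class's share of the data, so its accuracy would not change from epoch to epoch. Note that some gap between the training and validation columns is expected even on identical images, because the training generator applies augmentation and the training metrics are averaged over batches while the weights are still changing.

```python
import math


def binary_crossentropy(labels, probabilities, epsilon=1e-7):
    """Mean binary cross-entropy over a set of samples."""
    total = 0.0
    for label, probability in zip(labels, probabilities):
        # Clip the probability so that log(0) can never occur.
        p = min(max(probability, epsilon), 1.0 - epsilon)
        total += -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return total / len(labels)


def binary_accuracy(labels, probabilities, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    hits = sum(
        1 for label, probability in zip(labels, probabilities)
        if (probability > threshold) == (label == 1)
    )
    return hits / len(labels)


if __name__ == "__main__":
    # Made-up labels for illustration: 6 of 11 samples belong to class 1.
    labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
    # A hypothetical model that outputs the same probability for every sample.
    constant = [0.6] * len(labels)
    # The accuracy equals the share of class 1, whatever the exact probability is.
    print(binary_accuracy(labels, constant))
    print(binary_crossentropy(labels, constant))
```

If the predictions on the validation images were inspected directly (for example with the model's `predict_generator` method) and turned out to be all on one side of 0.5, that would support this reading; the logs alone do not prove it.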

Accepted answer

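A very quick idea that might help.

I think the image labels are assigned randomly by the two image data generators (training and validation), and the two generators end up with different label distributions. That would explain why the training accuracy goes up while the validation accuracy stays around 50%.

I have not fully checked the documentation for the image data generator. Hope this helps.

The `classes` argument of flow_from_directory() describes a way to set the training labels: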

classes: optional list of class subdirectories (e.g. ['dogs', 'cats']). Default: None. If not provided, the list of classes will be automatically inferred from the subdirectory names/structure under directory, where each subdirectory will be treated as a different class (and the order of the classes, which will map to the label indices, will be alphanumeric). The dictionary containing the mapping from class names to class indices can be obtained via the attribute class_indices.
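
The mapping described in the documentation excerpt can be reproduced by hand, which allows a check of the hypothesis without loading any images. The following is a minimal sketch, assuming the directory layout from the question (one subdirectory per class under each data directory). The helpers `infer_class_indices` and `mappings_agree` are hypothetical names and not part of Keras; they mirror the documented rule (subdirectory names sorted, then numbered) rather than calling Keras itself.

```python
import os
import tempfile


def infer_class_indices(directory):
    """Map each class subdirectory to an index, in sorted name order.

    Mirrors the rule quoted above for flow_from_directory();
    this is a hypothetical helper, not a Keras function.
    """
    class_names = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    return {name: index for index, name in enumerate(class_names)}


def mappings_agree(train_dir, validation_dir):
    """Return True when both directories yield the same class-to-index mapping."""
    return infer_class_indices(train_dir) == infer_class_indices(validation_dir)


if __name__ == "__main__":
    # Build a throwaway layout so the sketch runs without the real dataset.
    with tempfile.TemporaryDirectory() as root:
        for split in ("train", "validate"):
            for class_name in ("class1", "class2"):
                os.makedirs(os.path.join(root, split, class_name))
        train_dir = os.path.join(root, "train")
        validation_dir = os.path.join(root, "validate")
        print(infer_class_indices(train_dir))
        print(mappings_agree(train_dir, validation_dir))
```

With the real generators, the equivalent check is to compare `train_generator.class_indices` with `validation_generator.class_indices`. Passing the same list, for example `classes = ['class1', 'class2']`, to both flow_from_directory() calls fixes the mapping explicitly instead of relying on inference from the subdirectory names.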

Source: a similar question on Stack Overflow, "python - Keras validation loss is computed incorrectly": https://stackoverflow.com/questions/49116174/
