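My model trains fine on a CPU machine, but I ran into a problem when I tried to re-run it on our cluster (using a single GPU and the same dataset). When training on the GPU machine, the validation loss and accuracy do not improve from one epoch to the next (see below). This is not the case on the CPU machine (I was able to reach a validation accuracy of ~0.8 after 20 epochs).
Details:
Keras 2.1.3
TensorFlow backend
70/20/10 train/dev/test split
~7000 images
Model based on ResNet50
Code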
import sys
import math
import os
import glob
from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import Flatten, Dense
from keras import backend as k
from keras.callbacks import ModelCheckpoint, CSVLogger, EarlyStopping
############ Training parameters ##################
img_width, img_height = 224, 224
batch_size = 32
epochs = 100
############ Define the data ##################
train_data_dir = '/mnt/data/train'
validation_data_dir = '/mnt/data/validate'
train_data_dir_class1 = os.path.join(train_data_dir,'class1', '*.jpg')
train_data_dir_class2 = os.path.join(train_data_dir, 'class2', '*.jpg')
validation_data_dir_class1 = os.path.join(validation_data_dir, 'class1', '*.jpg')
validation_data_dir_class2 = os.path.join(validation_data_dir, 'class2', '*.jpg')
# number of training and validation samples
nb_train_samples = len(glob.glob(train_data_dir_class1)) + len(glob.glob(train_data_dir_class2))
nb_validation_samples = len(glob.glob(validation_data_dir_class1)) + len(glob.glob(validation_data_dir_class2))
############ Define the model ##################
model = applications.resnet50.ResNet50(weights = "imagenet",
                                       include_top = False,
                                       input_shape = (img_width, img_height, 3))
for layer in model.layers:
    layer.trainable = False
# Adding a FC layer
x = model.output
x = Flatten()(x)
predictions = Dense(1, activation = "sigmoid")(x)
# creating the final model
model_final = Model(inputs = model.input, outputs = predictions)
# compile the model
model_final.compile(loss = "binary_crossentropy",
                    optimizer = optimizers.Adam(lr = 0.001,
                                                beta_1 = 0.9,
                                                beta_2 = 0.999,
                                                epsilon = 1e-10),
                    metrics = ["accuracy"])
# train and test generators
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   horizontal_flip = True,
                                   fill_mode = "nearest",
                                   zoom_range = 0.3,
                                   width_shift_range = 0.3,
                                   height_shift_range = 0.3,
                                   rotation_range = 30)
test_datagen = ImageDataGenerator(rescale = 1./255)
train_generator = train_datagen.flow_from_directory(train_data_dir,
                                                    target_size = (img_height, img_width),
                                                    batch_size = batch_size,
                                                    class_mode = "binary",
                                                    seed = 2018)
validation_generator = test_datagen.flow_from_directory(validation_data_dir,
                                                        target_size = (img_height, img_width),
                                                        class_mode = "binary",
                                                        seed = 2018)
early = EarlyStopping(monitor = 'val_loss', min_delta = 10e-5, patience = 10, verbose = 1, mode = 'auto')
performance_log = CSVLogger('/mnt/results/vanilla_model_log.csv', separator = ',', append = False)
# Train the model
model_final.fit_generator(generator = train_generator,
                          steps_per_epoch = math.ceil(train_generator.samples / batch_size),
                          epochs = epochs,
                          validation_data = validation_generator,
                          validation_steps = math.ceil(validation_generator.samples / batch_size),
                          callbacks = [early, performance_log])
# Save the model
model_final.save('/mnt/results/vanilla_model.h5')
Training log
Epoch 1/100
151/151 [==============================] - 237s 2s/step - loss: 0.7234 - acc: 0.5240 - val_loss: 0.9899 - val_acc: 0.5425
Epoch 2/100
151/151 [==============================] - 65s 428ms/step - loss: 0.6491 - acc: 0.6228 - val_loss: 1.0248 - val_acc: 0.5425
Epoch 3/100
151/151 [==============================] - 65s 429ms/step - loss: 0.6091 - acc: 0.6648 - val_loss: 1.0377 - val_acc: 0.5425
Epoch 4/100
151/151 [==============================] - 64s 426ms/step - loss: 0.5829 - acc: 0.6968 - val_loss: 1.0459 - val_acc: 0.5425
Epoch 5/100
151/151 [==============================] - 64s 427ms/step - loss: 0.5722 - acc: 0.7070 - val_loss: 1.0472 - val_acc: 0.5425
Epoch 6/100
151/151 [==============================] - 64s 427ms/step - loss: 0.5582 - acc: 0.7166 - val_loss: 1.0501 - val_acc: 0.5425
Epoch 7/100
151/151 [==============================] - 64s 424ms/step - loss: 0.5535 - acc: 0.7188 - val_loss: 1.0492 - val_acc: 0.5425
Epoch 8/100
151/151 [==============================] - 64s 426ms/step - loss: 0.5377 - acc: 0.7287 - val_loss: 1.0209 - val_acc: 0.5425
Epoch 9/100
151/151 [==============================] - 64s 425ms/step - loss: 0.5328 - acc: 0.7368 - val_loss: 1.0062 - val_acc: 0.5425
Epoch 10/100
151/151 [==============================] - 65s 432ms/step - loss: 0.5296 - acc: 0.7381 - val_loss: 1.0016 - val_acc: 0.5425
Epoch 11/100
151/151 [==============================] - 65s 430ms/step - loss: 0.5231 - acc: 0.7419 - val_loss: 1.0021 - val_acc: 0.5425
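Since I was able to get good results on the CPU machine, I assumed that the validation loss/accuracy must be computed incorrectly at the end of each epoch. To test this theory, I used the training set as the validation set: if the validation loss/accuracy were computed correctly, we should see roughly the same training and validation loss and accuracy values. As you can see below, the validation loss values differ from the training loss values, which leads me to believe that the validation loss is computed incorrectly at the end of each epoch. Why might this be? What are possible solutions?
Modified code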
import sys
import math
import os
import glob
from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import Flatten, Dense
from keras import backend as k
from keras.callbacks import ModelCheckpoint, CSVLogger, EarlyStopping
############ Training parameters ##################
img_width, img_height = 224, 224
batch_size = 32
epochs = 100
############ Define the data ##################
train_data_dir = '/mnt/data/train'
validation_data_dir = '/mnt/data/train' # redefined validation set to test accuracy of validation loss/accuracy calculations
train_data_dir_class1 = os.path.join(train_data_dir,'class1', '*.jpg')
train_data_dir_class2 = os.path.join(train_data_dir, 'class2', '*.jpg')
validation_data_dir_class1 = os.path.join(validation_data_dir, 'class1', '*.jpg')
validation_data_dir_class2 = os.path.join(validation_data_dir, 'class2', '*.jpg')
# number of training and validation samples
nb_train_samples = len(glob.glob(train_data_dir_class1)) + len(glob.glob(train_data_dir_class2))
nb_validation_samples = len(glob.glob(validation_data_dir_class1)) + len(glob.glob(validation_data_dir_class2))
############ Define the model ##################
model = applications.resnet50.ResNet50(weights = "imagenet",
                                       include_top = False,
                                       input_shape = (img_width, img_height, 3))
for layer in model.layers:
    layer.trainable = False
# Adding a FC layer
x = model.output
x = Flatten()(x)
predictions = Dense(1, activation = "sigmoid")(x)
# creating the final model
model_final = Model(inputs = model.input, outputs = predictions)
# compile the model
model_final.compile(loss = "binary_crossentropy",
                    optimizer = optimizers.Adam(lr = 0.001,
                                                beta_1 = 0.9,
                                                beta_2 = 0.999,
                                                epsilon = 1e-10),
                    metrics = ["accuracy"])
# train and test generators
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   horizontal_flip = True,
                                   fill_mode = "nearest",
                                   zoom_range = 0.3,
                                   width_shift_range = 0.3,
                                   height_shift_range = 0.3,
                                   rotation_range = 30)
test_datagen = ImageDataGenerator(rescale = 1./255)
train_generator = train_datagen.flow_from_directory(train_data_dir,
                                                    target_size = (img_height, img_width),
                                                    batch_size = batch_size,
                                                    class_mode = "binary",
                                                    seed = 2018)
validation_generator = test_datagen.flow_from_directory(validation_data_dir,
                                                        target_size = (img_height, img_width),
                                                        class_mode = "binary",
                                                        seed = 2018)
early = EarlyStopping(monitor = 'val_loss', min_delta = 10e-5, patience = 10, verbose = 1, mode = 'auto')
performance_log = CSVLogger('/mnt/results/vanilla_model_log.csv', separator = ',', append = False)
# Train the model
model_final.fit_generator(generator = train_generator,
                          steps_per_epoch = math.ceil(train_generator.samples / batch_size),
                          epochs = epochs,
                          validation_data = validation_generator,
                          validation_steps = math.ceil(validation_generator.samples / batch_size),
                          callbacks = [early, performance_log])
# Save the model
model_final.save('/mnt/results/vanilla_model.h5')
Training log for the modified code:
Epoch 1/100
151/151 [==============================] - 251s 2s/step - loss: 0.6804 - acc: 0.5910 - val_loss: 0.6923 - val_acc: 0.5469
Epoch 2/100
151/151 [==============================] - 87s 578ms/step - loss: 0.6258 - acc: 0.6523 - val_loss: 0.6938 - val_acc: 0.5469
Epoch 3/100
151/151 [==============================] - 88s 580ms/step - loss: 0.5946 - acc: 0.6874 - val_loss: 0.7001 - val_acc: 0.5469
Epoch 4/100
151/151 [==============================] - 88s 580ms/step - loss: 0.5718 - acc: 0.7086 - val_loss: 0.7036 - val_acc: 0.5469
Epoch 5/100
151/151 [==============================] - 87s 578ms/step - loss: 0.5634 - acc: 0.7157 - val_loss: 0.7067 - val_acc: 0.5469
Epoch 6/100
151/151 [==============================] - 87s 578ms/step - loss: 0.5467 - acc: 0.7243 - val_loss: 0.7099 - val_acc: 0.5469
Epoch 7/100
151/151 [==============================] - 87s 578ms/step - loss: 0.5392 - acc: 0.7317 - val_loss: 0.7096 - val_acc: 0.5469
Epoch 8/100
151/151 [==============================] - 87s 578ms/step - loss: 0.5287 - acc: 0.7387 - val_loss: 0.7083 - val_acc: 0.5469
Epoch 9/100
151/151 [==============================] - 87s 575ms/step - loss: 0.5306 - acc: 0.7385 - val_loss: 0.7088 - val_acc: 0.5469
Epoch 10/100
151/151 [==============================] - 87s 577ms/step - loss: 0.5303 - acc: 0.7318 - val_loss: 0.7111 - val_acc: 0.5469
Epoch 11/100
151/151 [==============================] - 87s 578ms/step - loss: 0.5157 - acc: 0.7474 - val_loss: 0.7143 - val_acc: 0.5469
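One way to test whether the reported numbers are computed incorrectly is to recompute the loss and accuracy by hand from a set of labels and predicted probabilities and compare them with what the log shows. The sketch below is only an illustration in plain Python: the function names, the clipping constant eps, and the sample values are assumptions made up for this example, not part of the original code or of Keras.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == (y == 1))
    return hits / len(y_true)


# Made-up labels and probabilities, only to exercise the two functions.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]

print(binary_crossentropy(labels, probs))
print(binary_accuracy(labels, probs))

# If every prediction lands on the same side of the threshold, the accuracy
# collapses to the share of that class and cannot change between epochs.
print(binary_accuracy([1, 1, 0, 0, 1], [0.9] * 5))
```

To apply this to the real model, the probabilities would have to come from predictions on the validation images, and the labels must be in the same order as those predictions. With flow_from_directory that presumably means creating the generator with shuffle = False, since shuffling is on by default.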
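Best answer
A very quick idea that might help.
I think the image labels are being assigned randomly by the two image data generators before training, and the two generators end up with different label assignments. That would explain why the training accuracy goes up while the validation accuracy stays around 50%.
I have not fully checked the ImageDataGenerator documentation. I hope this helps.
The classes parameter of flow_from_directory() describes a way to set the training labels: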
classes: optional list of class subdirectories (e.g. ['dogs', 'cats']). Default: None. If not provided, the list of classes will be automatically inferred from the subdirectory names/structure under directory, where each subdirectory will be treated as a different class (and the order of the classes, which will map to the label indices, will be alphanumeric). The dictionary containing the mapping from class names to class indices can be obtained via the attribute class_indices.
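The suspicion above can be checked without training anything, by comparing the class-to-index mapping that each directory would produce. The sketch below is a stand-alone approximation: infer_class_indices and compare_class_indices are hypothetical helpers that follow the documented default (subdirectory names in sorted order), not the Keras implementation, and the stray "backup" folder is invented for the demo.

```python
import os
import tempfile


def infer_class_indices(directory):
    """Map each immediate subdirectory name to an index, in sorted order.

    Assumption: this mirrors the documented default (alphanumeric order of
    subdirectory names); it is not the Keras implementation itself.
    """
    names = sorted(
        entry for entry in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, entry))
    )
    return {name: index for index, name in enumerate(names)}


def compare_class_indices(train_dir, validation_dir):
    """Return (mappings_match, train_mapping, validation_mapping)."""
    train_map = infer_class_indices(train_dir)
    validation_map = infer_class_indices(validation_dir)
    return train_map == validation_map, train_map, validation_map


# Demo on a throwaway directory tree. The stray "backup" folder is hypothetical
# and only shows how one extra subdirectory shifts the label indices.
with tempfile.TemporaryDirectory() as root:
    for split, subdirs in (("train", ["class1", "class2"]),
                           ("validate", ["backup", "class1", "class2"])):
        for subdir in subdirs:
            os.makedirs(os.path.join(root, split, subdir))

    match, train_map, validation_map = compare_class_indices(
        os.path.join(root, "train"), os.path.join(root, "validate"))

print(match)
print(train_map)
print(validation_map)
```

In the demo, class1 gets index 0 in the training tree but index 1 in the validation tree, so the two mappings disagree. With the real generators, the same comparison can be made on train_generator.class_indices and validation_generator.class_indices, and passing the same explicit list, for example classes = ['class1', 'class2'], to both flow_from_directory() calls pins the mapping.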
Regarding "python - Keras validation loss computed incorrectly", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49116174/