I am currently dealing with a large-data problem while training on image data with Keras. I have a directory containing batches of .npy files. Each batch contains 512 images, and each batch has a corresponding label file, also .npy, so the directory looks like: {image_file_1.npy, label_file_1.npy, ..., image_file_37.npy, label_file_37.npy}. Each image file has dimensions (512, 199, 199, 3) and each label file has dimensions (512, 1) (1 or 0). If I load all the images into one ndarray it will be 35+ GB. I have read all the Keras docs so far, but I still cannot figure out how to train with a custom generator. I have read about flow_from_directory and ImageDataGenerator(...).flow(), but they are not ideal in this case, or I do not know how to customize them. Here is what I have done:
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator

val_gen = ImageDataGenerator(rescale=1./255)
x_test = np.load("../data/val_file.npy")
y_test = np.load("../data/val_label.npy")
val_gen.fit(x_test)  # only needed when the generator computes dataset-wide statistics
model = Sequential()
...
model.add(Dense(512, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

sgd = SGD(lr=0.01)  # define the optimizer referenced below
model.compile(loss='binary_crossentropy',  # single sigmoid unit -> binary loss
              optimizer=sgd,
              metrics=['acc'])
model.fit_generator(generate_batch_from_directory(),  # should yield 1 image file and 1 label file per step
                    validation_data=val_gen.flow(x_test,
                                                 y_test,
                                                 batch_size=64),
                    validation_steps=32)
Here generate_batch_from_directory() should take image_file_i.npy together with label_file_i.npy and optimize the weights on one batch after another until no batches are left. Every image array in the .npy files has already been augmented, rotated and scaled, and each .npy file is properly mixed with data from class 1 and class 0 (50/50). If I append all the batches and create one big file, e.g.:

X_train = np.concatenate([image_file_1, ..., image_file_37])
y_train = np.concatenate([label_file_1, ..., label_file_37])

it does not fit in memory. Otherwise I could use .flow() to generate image sets for training the model. Thanks for any advice.
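
For context, a minimal sketch of what such a generator could look like (my own illustration based on the file layout above, not the asker's code; note that fit_generator accepts a plain Python generator only if steps_per_epoch is also given):

import os
import numpy as np

def generate_batch_from_directory(data_dir, n_files=37):
    """Yield one (images, labels) pair per .npy file pair,
    looping forever as fit_generator expects."""
    while True:
        for i in range(1, n_files + 1):
            X = np.load(os.path.join(data_dir, "image_file_{}.npy".format(i)))
            y = np.load(os.path.join(data_dir, "label_file_{}.npy".format(i)))
            yield X / 255.0, y  # rescale to match the validation generator

With 37 file pairs, steps_per_epoch=37 would make one epoch equal one full pass over the data.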
Best Answer
Finally I was able to solve this problem, but I had to read through the source code and documentation of keras.utils.Sequence and build my own generator class. This document helped a lot in understanding how generators work in Keras. You can read more details in my kaggle notebook:
import os
import numpy as np
import keras

all_files_loc = "datapsycho/imglake/population/train/image_files/"
all_files = os.listdir(all_files_loc)

# map each image file to its label file: image_file_i.npy -> label_file_i.npy
image_label_map = {
    "image_file_{}.npy".format(i+1): "label_file_{}.npy".format(i+1)
    for i in range(int(len(all_files)/2))}
partition = [item for item in all_files if "image_file" in item]
class DataGenerator(keras.utils.Sequence):

    def __init__(self, file_list):
        """Constructor can be expanded
        with batch size, dimensions etc.
        """
        self.file_list = file_list
        self.on_epoch_end()

    def __len__(self):
        'Take all batches in each iteration'
        return int(len(self.file_list))

    def __getitem__(self, index):
        'Get next batch'
        # Generate indexes of the batch
        indexes = self.indexes[index:(index+1)]

        # single file
        file_list_temp = [self.file_list[k] for k in indexes]

        # Set of X_train and y_train
        X, y = self.__data_generation(file_list_temp)
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.file_list))

    def __data_generation(self, file_list_temp):
        'Generates data containing batch_size samples'
        data_loc = "datapsycho/imglake/population/train/image_files/"
        # Generate data
        for ID in file_list_temp:
            x_file_path = os.path.join(data_loc, ID)
            y_file_path = os.path.join(data_loc, image_label_map.get(ID))

            # Store sample
            X = np.load(x_file_path)

            # Store class
            y = np.load(y_file_path)

        return X, y
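
A quick sanity check of the generator (my own addition, using the partition list built above; the expected shapes follow the file dimensions from the question):

gen = DataGenerator(partition)
X, y = gen[0]            # loads the first image_file/label_file pair
print(X.shape, y.shape)  # expected: (512, 199, 199, 3) (512, 1)
print(len(gen))          # number of steps per epoch == number of image files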
# ====================
# train set
# ====================
all_files_loc = "datapsycho/imglake/population/train/image_files/"
all_files = os.listdir(all_files_loc)

training_generator = DataGenerator(partition)
validation_generator = ValDataGenerator(val_partition)  # works the same way as the training generator

hst = model.fit_generator(generator=training_generator,
                          epochs=200,
                          validation_data=validation_generator,
                          use_multiprocessing=True,
                          max_queue_size=32)
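
Note that on_epoch_end above rebuilds the index array but never shuffles it, so the files are visited in the same order every epoch. If you want the order randomized between epochs (my addition, not part of the original answer), a one-line change is enough:

    def on_epoch_end(self):
        'Updates and shuffles indexes after each epoch'
        self.indexes = np.arange(len(self.file_list))
        np.random.shuffle(self.indexes)  # new random file order each epoch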
Regarding python - Train a Keras model from batches of .npy files using a generator?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53788434/