Keras ImageDataGenerator sample_weight with data augmentation

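I have a question about using the sample_weight parameter in the context of Keras data augmentation with ImageDataGenerator. Say I have a series of simple images containing only one class of object. For each image I therefore have a corresponding mask, where pixels are 0 for the background and 1 where the object is located.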

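However, this dataset is imbalanced, because a large number of these images are empty, meaning their masks contain only 0s.
If I understand correctly, the 'sample_weight' parameter of ImageDataGenerator's flow method is there to put the focus on the samples of my dataset that I find more interesting, i.e. the ones where my object is present.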

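My question is: what is the concrete influence of this sample_weight parameter on the training of my model? Does it influence the data augmentation? If I use the 'validation_split' parameter, does it influence the way validation sets are generated?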

Here is the part of the code my question refers to:

data_gen_args = dict(rotation_range=90,
                     width_shift_range=0.4,
                     height_shift_range=0.4,
                     zoom_range=0.4,
                     horizontal_flip=True,
                     fill_mode='reflect',
                     rescale=1. / 255,
                     validation_split=0.2,
                     data_format='channels_last'
)

image_datagen = ImageDataGenerator(**data_gen_args)


imf = image_datagen.flow(
    x=stacked_images_channel,
    y=stacked_masks_channel,
    batch_size=batch_size,
    shuffle=False,
    seed=seed,
    subset='training',
    sample_weight = sample_weight,
    save_to_dir = 'traindir',
    save_prefix = 'train_'
)

valf = image_datagen.flow(
    x=stacked_images_channel,
    y=stacked_masks_channel,
    batch_size=batch_size,
    shuffle=False,
    seed=seed,
    subset='validation',
    sample_weight = sample_weight,
    save_to_dir = 'valdir',
    save_prefix = 'val_'
)

STEP_SIZE_TRAIN=imf.n//imf.batch_size
STEP_SIZE_VALID=valf.n//valf.batch_size

model = unet.UNet2(numberOfClasses, imshape, '', learningRate, depth=4)

history = model.fit_generator(generator=imf,
                    steps_per_epoch=STEP_SIZE_TRAIN,
                    epochs=epochs,
                    validation_data=valf,
                    validation_steps=STEP_SIZE_VALID,
                    verbose=2
)

Thanks in advance for your attention.

Best answer

As of Keras 2.2.5 with preprocessing at 1.1.0, sample_weight is passed along with the samples and applied during processing. When calling .fit_generator, the model is trained batch by batch, each batch using sample weights:

model.train_on_batch(x, y,
                     sample_weight=sample_weight,
                     class_weight=class_weight)

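In the source code of .train_on_batch, the documentation states: "sample_weight: Optional array of the same length as x, containing weights to apply to the model's loss for each sample. (...)". The weights are actually applied when the loss for each batch is computed. When the model is compiled, Keras builds a "weighted loss" function from the requested loss function. The weighting computation appears in the code as: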

def weighted(y_true, y_pred, weights, mask=None):
        """Wrapper function.
        # Arguments
            y_true: `y_true` argument of `fn`.
            y_pred: `y_pred` argument of `fn`.
            weights: Weights tensor.
            mask: Mask tensor.
        # Returns
            Scalar tensor.
        """
        # score_array has ndim >= 2
        score_array = fn(y_true, y_pred)
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in Theano
            mask = K.cast(mask, K.floatx())
            # mask should have the same shape as score_array
            score_array *= mask
            #  the loss per batch should be proportional
            #  to the number of unmasked samples.
            score_array /= K.mean(mask) + K.epsilon()

        # apply sample weighting
        if weights is not None:
            # reduce score_array to same ndim as weight array
            ndim = K.ndim(score_array)
            weight_ndim = K.ndim(weights)
            score_array = K.mean(score_array,
                                 axis=list(range(weight_ndim, ndim)))
            score_array *= weights
            score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
        return K.mean(score_array)

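This wrapper shows that it first computes the requested loss (the call to fn(y_true, y_pred)) and then applies the weights if any were passed (via sample_weight or class_weight).
With that context in mind: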

what is the concrete influence of this sample_weight parameter on the training of my model.

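The weights are essentially multiplied into the loss (and normalized). Samples with "heavy" weights (greater than 1) therefore contribute more to the loss and produce larger gradients. "Light" weights reduce a sample's importance and lead to smaller gradients.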
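To make the arithmetic concrete, here is a minimal plain-Python sketch of what the wrapper does in the simplest case (one loss value per sample, one weight per sample, no mask). The function name weighted_mean_loss and the example numbers are hypothetical, chosen only for illustration; this is not Keras code, and it assumes at least one weight is nonzero.

```python
def weighted_mean_loss(per_sample_losses, weights):
    """Mimic the wrapper's arithmetic for 1-D weights (illustrative only).

    Assumes one loss value and one weight per sample, no mask,
    and at least one nonzero weight.
    """
    # score_array *= weights
    weighted = [loss * w for loss, w in zip(per_sample_losses, weights)]
    # score_array /= mean(weights != 0)
    nonzero_fraction = sum(1 for w in weights if w != 0) / len(weights)
    normalized = [value / nonzero_fraction for value in weighted]
    # return mean(score_array)
    return sum(normalized) / len(normalized)


# Four samples with identical raw loss (made-up numbers).
losses = [0.2, 0.2, 0.2, 0.2]

uniform = weighted_mean_loss(losses, [1, 1, 1, 1])
emphasized = weighted_mean_loss(losses, [1, 1, 1, 3])
ignored = weighted_mean_loss(losses, [0, 0, 1, 1])

print(uniform, emphasized, ignored)
```

With these numbers, tripling the weight of the last sample raises the batch loss from about 0.2 to about 0.3, while setting two weights to 0 leaves it at about 0.2, because the normalization averages over the nonzero-weight samples only.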

Does it influence the data augmentation?

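That depends on what you mean. Here is what I can say from experience, where I perform the augmentation before feeding the Keras data generator (I do this because of issues in preprocessing which, as far as I know, still exist in preprocessing 1.1.0):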

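When feeding already-augmented data to the generator, the .flow call expects a list of sample weights as long as the input data. How the weights affect the augmentation therefore depends on how you choose them. A data point augmented N times can be given the same weight for each augmented copy, or 1/N of it, depending on the intent.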
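The two weighting choices for pre-augmented data can be sketched as follows. The helper expand_weights and its parameters n_aug and split are hypothetical names, and the sketch assumes the augmented copies of each original sample are stored next to each other, in the same order as the original data.

```python
def expand_weights(base_weights, n_aug, split=False):
    """Build one weight per augmented copy (illustrative helper).

    base_weights: one weight per original sample.
    n_aug: number of augmented copies made of each original sample.
    split: if True, each copy gets weight / n_aug, so the copies of a
           sample sum to its original weight; otherwise each copy
           keeps the full original weight.
    """
    expanded = []
    for w in base_weights:
        per_copy = w / n_aug if split else w
        expanded.extend([per_copy] * n_aug)
    return expanded


# Two original samples: an empty mask (weight 1.0) and a mask
# containing the object (weight 4.0); the values are made up.
same_weight = expand_weights([1.0, 4.0], n_aug=2)
shared_weight = expand_weights([1.0, 4.0], n_aug=2, split=True)

print(same_weight)
print(shared_weight)
```

The resulting list has the same length as the augmented data, which is what the .flow call expects for sample_weight.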

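The default behavior in Keras appears to be to assign the same weight to every augmentation (transformation) that Keras performs. The code looks clear enough, although I have never relied on it.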

If I use the 'validation_split' parameter, does it influence the way validation sets are generated?

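The sample_weight parameter does not appear to interfere with validation_split. I have not studied the code specifically, but the split essentially takes the input data and holds out a portion for validation, whatever the data is. What changes when sample_weight is added is each data point: without weights, the data is (x, y); with weights, it becomes (x, y, weight).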

A similar question about Keras ImageDataGenerator sample_weight with data augmentation can be found on Stack Overflow: https://stackoverflow.com/questions/55061774/
