python - Why does a neural network predict incorrectly on its own training data?

Tags: python, tensorflow, machine-learning, keras, neural-network

I built an LSTM (RNN) neural network with supervised learning for stock data prediction. The question is: why does it predict incorrectly on its own training data? (Note: reproducible example below.)

I created a simple model to predict the stock price for the next 5 days:

model = Sequential()
model.add(LSTM(32, activation='sigmoid', input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(Dense(y_train.shape[1]))
model.compile(optimizer='adam', loss='mse')

es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(x_train, y_train, batch_size=64, epochs=25, validation_data=(x_test, y_test), callbacks=[es])

The correct results are in y_test (5 values), so the model trains by looking back at the previous 90 days and then restores the weights of the best result (val_loss=0.0030) with patience=3:
Train on 396 samples, validate on 1 samples
Epoch 1/25
396/396 [==============================] - 1s 2ms/step - loss: 0.1322 - val_loss: 0.0299
Epoch 2/25
396/396 [==============================] - 0s 402us/step - loss: 0.0478 - val_loss: 0.0129
Epoch 3/25
396/396 [==============================] - 0s 397us/step - loss: 0.0385 - val_loss: 0.0178
Epoch 4/25
396/396 [==============================] - 0s 399us/step - loss: 0.0398 - val_loss: 0.0078
Epoch 5/25
396/396 [==============================] - 0s 391us/step - loss: 0.0343 - val_loss: 0.0030
Epoch 6/25
396/396 [==============================] - 0s 391us/step - loss: 0.0318 - val_loss: 0.0047
Epoch 7/25
396/396 [==============================] - 0s 389us/step - loss: 0.0308 - val_loss: 0.0043
Epoch 8/25
396/396 [==============================] - 0s 393us/step - loss: 0.0292 - val_loss: 0.0056

The prediction results look great, don't they?

[figure: predicted vs. real prices, April 17-23]

That is because the algorithm restored the best weights from epoch #5. OK, now let's save this model to an .h5 file, move back 10 days and predict the last 5 days (in the first example we built and validated the model on April 17-23, including the weekend days off; now let's test on April 2-8). The result:

[figure: predicted vs. real prices, April 2-8, using the saved model]

It shows a completely wrong direction. As we can see, this is because the model was trained and achieved its best validation on April 17-23, but not on April 2-8. No matter how much more I train or which epoch I pick, there are always many intervals in the past where the prediction is wrong.
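For reference, here is a condensed sketch of this shift-back re-test (my paraphrase; in the full reproducible example below the same thing is done by setting new_s_h5 = False and dropping the last 10 rows of df). It assumes the model has already been saved as s.h5 and reuses scaler, look_back, num_prediction and transform_predicted from that example:

from keras.models import load_model
import numpy as np

# cut off the last 10 trading days and rebuild the 90-day input window
df_past = df.drop(df.tail(10).index)
past_scaled = scaler.transform(df_past.values)   # reusing the fitted scaler for brevity
x_past = np.expand_dims(past_scaled[:-num_prediction][-look_back:], axis=0)

model = load_model('s.h5')                       # model trained/validated on Apr 17-23
predicted_past = transform_predicted(model.predict(x_past))
print('predicted', predicted_past)
print('real     ', df_past.iloc[-num_prediction:, 0].values)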

Why does the model show wrong results on its own training data? I trained it on this data, so it should remember how to predict on this piece of the set, yet it predicts incorrectly. I have also tried:
  • Using a large dataset with 50k+ rows of 20 years of stock prices, adding more or fewer features
  • Creating different types of models, e.g. adding more hidden layers, different batch_sizes, different layer activations, dropout, batchnormalization
  • Creating a custom EarlyStopping callback that takes the average val_loss over many validation datasets and picks the best one

  • Maybe I am missing something? What can I improve?

    Here is a simple reproducible example. yfinance downloads the S&P 500 index data.
    """python 3.7.7
    tensorflow 2.1.0
    keras 2.3.1"""
    
    
    import numpy as np
    import pandas as pd
    from keras.callbacks import EarlyStopping, Callback
    from keras.models import Model, Sequential, load_model
    from keras.layers import Dense, Dropout, LSTM, BatchNormalization
    from sklearn.preprocessing import MinMaxScaler
    import plotly.graph_objects as go
    import yfinance as yf
    np.random.seed(4)
    
    
    num_prediction = 5
    look_back = 90
    new_s_h5 = True # change it to False when you created model and want test on other past dates
    
    
    df = yf.download(tickers="^GSPC", start='2018-05-06', end='2020-04-24', interval="1d")
    # keep only the features actually used below (column 0 = Close)
    df = df.filter(['Close', 'High', 'Low', 'Volume'])
    
    # drop last N days to validate saved model on past
    df.drop(df.tail(0).index, inplace=True)
    print(df)
    
    
    # Custom early-stopping callback: it evaluates every set in validation_sets,
    # averages the losses, and restores the weights of the epoch with the best average.
    class EarlyStoppingCust(Callback):
        def __init__(self, patience=0, verbose=0, validation_sets=None, restore_best_weights=False):
            super(EarlyStoppingCust, self).__init__()
            self.patience = patience
            self.verbose = verbose
            self.wait = 0
            self.stopped_epoch = 0
            self.restore_best_weights = restore_best_weights
            self.best_weights = None
            self.validation_sets = validation_sets
    
        def on_train_begin(self, logs=None):
            self.wait = 0
            self.stopped_epoch = 0
            self.best_avg_loss = (np.Inf, 0)
    
        def on_epoch_end(self, epoch, logs=None):
            loss_ = 0
            for i, validation_set in enumerate(self.validation_sets):
                predicted = self.model.predict(validation_set[0])
                loss = self.model.evaluate(validation_set[0], validation_set[1], verbose = 0)
                loss_ += loss
                if self.verbose > 0:
                    print('val' + str(i + 1) + '_loss: %.5f' % loss)
    
            avg_loss = loss_ / len(self.validation_sets)
            print('avg_loss: %.5f' % avg_loss)
    
            if self.best_avg_loss[0] > avg_loss:
                self.best_avg_loss = (avg_loss, epoch + 1)
                self.wait = 0
                if self.restore_best_weights:
                    print('new best epoch = %d' % (epoch + 1))
                    self.best_weights = self.model.get_weights()
            else:
                self.wait += 1
                if self.wait >= self.patience or self.params['epochs'] == epoch + 1:
                    self.stopped_epoch = epoch
                    self.model.stop_training = True
                    if self.restore_best_weights:
                        if self.verbose > 0:
                            print('Restoring model weights from the end of the best epoch')
                        self.model.set_weights(self.best_weights)
    
        def on_train_end(self, logs=None):
            print('best_avg_loss: %.5f (#%d)' % (self.best_avg_loss[0], self.best_avg_loss[1]))
    
    
    # Build sliding windows: each sample is `history_size` consecutive rows of `dataset`,
    # and its label is the following `target_size` values of `target`.
    def multivariate_data(dataset, target, start_index, end_index, history_size, target_size, step, single_step=False):
        data = []
        labels = []
        start_index = start_index + history_size
        if end_index is None:
            end_index = len(dataset) - target_size
        for i in range(start_index, end_index):
            indices = range(i-history_size, i, step)
            data.append(dataset[indices])
            if single_step:
                labels.append(target[i+target_size])
            else:
                labels.append(target[i:i+target_size])
        return np.array(data), np.array(labels)
    
    
    # Undo the MinMax scaling for the predicted column: pad the prediction with zero
    # columns so that scaler.inverse_transform can be applied, then keep column 0.
    def transform_predicted(pr):
        pr = pr.reshape(pr.shape[1], -1)
        z = np.zeros((pr.shape[0], x_train.shape[2] - 1), dtype=pr.dtype)
        pr = np.append(pr, z, axis=1)
        pr = scaler.inverse_transform(pr)
        pr = pr[:, 0]
        return pr
    
    
    step = 1
    
    # creating datasets with look back
    scaler = MinMaxScaler()
    df_normalized = scaler.fit_transform(df.values)
    dataset = df_normalized[:-num_prediction]
    x_train, y_train = multivariate_data(dataset, dataset[:, 0], 0,len(dataset) - num_prediction + 1, look_back, num_prediction, step)
    indices = range(len(dataset)-look_back, len(dataset), step)
    x_test = np.array(dataset[indices])
    x_test = np.expand_dims(x_test, axis=0)
    y_test = np.expand_dims(df_normalized[-num_prediction:, 0], axis=0)
    
    # creating past datasets to validate with EarlyStoppingCust
    number_validates = 50
    step_past = 5
    validation_sets = [(x_test, y_test)]
    for i in range(1, number_validates * step_past + 1, step_past):
        indices = range(len(dataset)-look_back-i, len(dataset)-i, step)
        x_t = np.array(dataset[indices])
        x_t = np.expand_dims(x_t, axis=0)
        y_t = np.expand_dims(df_normalized[-num_prediction-i:len(df_normalized)-i, 0], axis=0)
        validation_sets.append((x_t, y_t))
    
    
    if new_s_h5:
        model = Sequential()
        model.add(LSTM(32, return_sequences=False, activation = 'sigmoid', input_shape=(x_train.shape[1], x_train.shape[2])))
        # model.add(Dropout(0.2))
        # model.add(BatchNormalization())
        # model.add(LSTM(units = 16))
        model.add(Dense(y_train.shape[1]))
        model.compile(optimizer = 'adam', loss = 'mse')
    
        # EarlyStoppingCust is custom callback to validate each validation_sets and get average
        # it takes epoch with best "best_avg" value
        # es = EarlyStoppingCust(patience = 3, restore_best_weights = True, validation_sets = validation_sets, verbose = 1)
    
        # or there is keras extension with built-in EarlyStopping, but it validates only 1 set that you pass through fit()
        es = EarlyStopping(monitor = 'val_loss', patience = 3, restore_best_weights = True)
    
        model.fit(x_train, y_train, batch_size = 64, epochs = 25, shuffle = True, validation_data = (x_test, y_test), callbacks = [es])
        model.save('s.h5')
    else:
        model = load_model('s.h5')
    
    
    
    predicted = model.predict(x_test)
    predicted = transform_predicted(predicted)
    print('predicted', predicted)
    print('real', df.iloc[-num_prediction:, 0].values)
    print('val_loss: %.5f' % (model.evaluate(x_test, y_test, verbose=0)))
    
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x = df.index[-60:],
        y = df.iloc[-60:,0],
        mode='lines+markers',
        name='real',
        line=dict(color='#ff9800', width=1)
    ))
    fig.add_trace(go.Scatter(
        x = df.index[-num_prediction:],
        y = predicted,
        mode='lines+markers',
        name='predict',
        line=dict(color='#2196f3', width=1)
    ))
    fig.update_layout(template='plotly_dark', hovermode='x', spikedistance=-1, hoverlabel=dict(font_size=16))
    fig.update_xaxes(showspikes=True)
    fig.update_yaxes(showspikes=True)
    fig.show()
    

    Best answer

    The OP postulates an interesting finding. Let me simplify the original question as follows.

    If the model is trained on a particular time series, why can't the model reconstruct the time series data it has already been trained on?

    Well, the answer is embedded in the training progress itself. Since EarlyStopping is used here to avoid overfitting, the best model is saved at epoch=5, where val_loss=0.0030, as the OP mentions. At that point the training loss is 0.0343, i.e., the training RMSE is 0.185. Since the dataset is scaled with MinMaxScaler, we need to undo the scaling of the RMSE to understand what is going on.

    The minimum and maximum values of the time series turn out to be 2290 and 3380. Therefore, a training RMSE of 0.185 means that, even for the training set, the predicted values can differ from the ground-truth values by roughly 0.185 * (3380 - 2290), i.e., about 200 units on average.
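    A quick back-of-the-envelope check of this arithmetic in Python (the numbers are taken from the training log and from the min/max of the close-price series quoted above):

    import numpy as np

    train_mse = 0.0343                            # training loss at the best epoch (epoch 5)
    train_rmse = np.sqrt(train_mse)               # ~0.185 in MinMax-scaled units
    price_min, price_max = 2290, 3380             # range of the close-price series
    print(train_rmse * (price_max - price_min))   # ~202 index points of average error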

    This explains why there is such a big difference when predicting the training data itself at an earlier time step.

    What should I do to imitate the training data perfectly?

    I asked myself this question. The simple answer is: make the training loss approach 0, that is, overfit the model.

    After some training, I realized that a model with only one LSTM layer of 32 cells is not complex enough to reconstruct the training data. Therefore, I added another LSTM layer as follows.

    model = Sequential()
    model.add(LSTM(32, return_sequences=True, activation = 'sigmoid', input_shape=(x_train.shape[1], x_train.shape[2])))
    # model.add(Dropout(0.2))
    # model.add(BatchNormalization())
    model.add(LSTM(units = 64, return_sequences=False))
    model.add(Dense(y_train.shape[1]))
    model.compile(optimizer = 'adam', loss = 'mse')
    

    The model is trained for 1000 epochs without EarlyStopping.
    model.fit(x_train, y_train, batch_size = 64, epochs = 1000, shuffle = True, validation_data = (x_test, y_test))
    

    At the 1000th epoch we have a training loss of 0.00047, which is much lower than the training loss in your case. So we expect the model to reconstruct the training data better. Below is the prediction plot for April 2-8.

    [figure: prediction for April 2-8 with the overfitted model]
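    As an optional sanity check (my addition, not part of the original answer), the overfitted model can also be evaluated directly on the training windows to confirm that it now reproduces them:

    # MSE on the training set itself; it should be close to the 0.00047 reported above
    train_pred = model.predict(x_train)
    print('training MSE: %.5f' % np.mean((train_pred - y_train) ** 2))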

    A final note:

    Training on a particular dataset does not necessarily mean that the model should be able to reconstruct that training data perfectly. Especially when techniques such as early stopping, regularization, and dropout are introduced to avoid overfitting, the model tends to generalize rather than memorize the training data.
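    Conversely, if generalization rather than memorization is the goal, one would keep EarlyStopping and re-enable the regularization layers that are commented out in the original script, for example (a sketch, not a tuned configuration; Dropout and BatchNormalization are imported in the script above):

    model = Sequential()
    model.add(LSTM(32, return_sequences=True, activation='sigmoid', input_shape=(x_train.shape[1], x_train.shape[2])))
    model.add(Dropout(0.2))               # from the commented-out lines in the original script
    model.add(BatchNormalization())
    model.add(LSTM(units=64, return_sequences=False))
    model.add(Dense(y_train.shape[1]))
    model.compile(optimizer='adam', loss='mse')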

    Regarding python - Why does a neural network predict incorrectly on its own training data?, the original question is on Stack Overflow: https://stackoverflow.com/questions/61425296/
