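python - Overfitting and data leakage in a tensorflow/keras neural network

Tags: python neural-network anaconda spyder figure

Good morning, I'm new to machine learning and neural networks. I am trying to build a fully connected neural network to solve a regression problem. The dataset consists of 18 features and 1 label, all of which are physical quantities.

You can find the code below. I have also uploaded a plot of how the loss function evolves over the epochs (you can find it below). I am not sure whether there is overfitting. Can someone explain why there is or is not overfitting?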

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing

from sklearn.model_selection import train_test_split

from matplotlib import pyplot as plt

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import regularizers
from sklearn.metrics import r2_score

# =============================================================================
# Choose the test size
# =============================================================================
test_size = 0.2

dataset = pd.read_csv('DataSet.csv', decimal=',', delimiter = ";")

label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])

y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)

def denormalize(y):
    final_value = y*(y_max_pre_normalize-y_min_pre_normalize)+y_min_pre_normalize
    return final_value

# =============================================================================
# Split
# =============================================================================

X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = test_size, shuffle = True)

y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()

# =============================================================================
# Normalize
# =============================================================================
scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)


scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)



# =============================================================================
# Build the network
# =============================================================================
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model = Sequential()

model.add(Dense(60, input_shape = (X_train.shape[1],), activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dropout(0.2))
model.add(Dense(60, activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dropout(0.2))
model.add(Dense(60, activation = 'relu',kernel_initializer='glorot_uniform'))

model.add(Dense(1,activation = 'linear',kernel_initializer='glorot_uniform'))

model.compile(loss = 'mse', optimizer = optimizer, metrics = ['mse'])

history = model.fit(X_train, y_train, epochs = 100,
                    validation_split = 0.1, shuffle=True, batch_size=250
                    )

history_dict = history.history

loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

y_train_pred = denormalize(y_train_pred)
y_test_pred = denormalize(y_test_pred)


plt.figure()
plt.plot((y_test1),(y_test_pred),'.', color='darkviolet', alpha=1, marker='o', markersize = 2, markeredgecolor = 'black', markeredgewidth = 0.1)
plt.plot((np.array((-0.1,7))),(np.array((-0.1,7))),'-', color='magenta')
plt.xlabel('True')
plt.ylabel('Predicted')
plt.title('Test')

plt.figure()
plt.plot((y_train1),(y_train_pred),'.', color='darkviolet', alpha=1, marker='o', markersize = 2, markeredgecolor = 'black', markeredgewidth = 0.1)
plt.plot((np.array((-0.1,7))),(np.array((-0.1,7))),'-', color='magenta')
plt.xlabel('True')
plt.ylabel('Predicted')
plt.title('Train')

plt.figure()
plt.plot(loss_values,'b',label = 'training loss')
plt.plot(val_loss_values,'r',label = 'val training loss')
plt.xlabel('Epochs')
plt.ylabel('Loss Function')
plt.legend()

# r2_score expects (y_true, y_pred) in that order
print("\n\nThe R2 score on the test set is:\t{:0.3f}".format(r2_score(y_test1, y_test_pred)))

print("The R2 score on the train set is:\t{:0.3f}".format(r2_score(y_train1, y_train_pred)))
from sklearn import metrics

# Measure MSE error.  
score = metrics.mean_squared_error(y_test_pred,y_test1)
print("\n\nFinal score test (MSE): %0.4f" %(score))
score1 = metrics.mean_squared_error(y_train_pred,y_train1)
print("Final score train (MSE): %0.4f" %(score1))
score2 = np.sqrt(metrics.mean_squared_error(y_test_pred,y_test1))
print("Final score test (RMSE): %0.4f" %(score2))
score3 = np.sqrt(metrics.mean_squared_error(y_train_pred,y_train1))
print("Final score train (RMSE): %0.4f" %(score3))

[Figure: training and validation loss versus epoch]

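Edit:

I also tried applying feature importance and increasing n_epochs. The results are as follows:

With feature importance:

[Figure: results with feature importance]

Without feature importance:

[Figure: results without feature importance]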

Best answer

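It looks like you are not overfitting! Your training and validation curves decrease together and converge. The clearest sign of overfitting is a divergence between these two curves, like this: [image: overfitting example]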

Since your two curves are both decreasing and are not diverging, the training of your neural network looks healthy.
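To make the "diverging curves" criterion concrete, here is a minimal sketch of a check that could be run on the `loss` and `val_loss` lists from `history.history`. The helper name `looks_overfit` and the `patience` threshold are hypothetical, chosen only for illustration; this is a rough heuristic, not a definitive test.

```python
def looks_overfit(train_loss, val_loss, patience=3):
    """Heuristic: validation loss stopped improving `patience` or more
    epochs ago while the training loss has kept decreasing since then."""
    # Epoch index with the lowest validation loss
    best = min(range(len(val_loss)), key=lambda i: val_loss[i])
    stalled = (len(val_loss) - 1 - best) >= patience
    still_learning = train_loss[-1] < train_loss[best]
    return stalled and still_learning


# Made-up curves: both losses fall together (healthy training)
healthy_train = [1.0, 0.5, 0.3, 0.2, 0.15, 0.12]
healthy_val = [1.1, 0.6, 0.35, 0.25, 0.2, 0.17]

# Made-up curves: validation loss turns upward while training loss keeps falling
overfit_train = [1.0, 0.5, 0.3, 0.2, 0.1, 0.05, 0.02]
overfit_val = [1.1, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8]

print(looks_overfit(healthy_train, healthy_val))
print(looks_overfit(overfit_train, overfit_val))
```

The same idea underlies the `EarlyStopping` callback that the question already imports: it watches the validation loss and stops once it has not improved for a set number of epochs.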

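But! Your validation curve is suspiciously lower than your training curve. This hints at possible data leakage (training and test data have been mixed somehow). There is more information in a nice short blog post. In general, you should split your data before doing any other preprocessing (normalization, augmentation, shuffling, etc.).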
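As an illustration of "split first, then preprocess", here is a minimal sketch in which the scalers are fitted on the training portion only and then reused for the test portion. The random arrays are a stand-in for `DataSet.csv` (an assumption, since the real data is not shown) and the variable names are illustrative. Note that the code in the question fits separate scalers (`scaler2`, `scaler4`) on the test portion, so training and test data end up on different scales.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 18-feature dataset (assumption, not the real data)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 18))
y = rng.uniform(0.0, 7.0, size=(500, 1))

# 1) Split first, so the test rows never influence any fitted statistic
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

# 2) Fit each scaler on the training portion only
x_scaler = MinMaxScaler().fit(X_train_raw)
y_scaler = MinMaxScaler().fit(y_train_raw)

# 3) Reuse the same fitted scalers for both portions
X_train = x_scaler.transform(X_train_raw)
X_test = x_scaler.transform(X_test_raw)
y_train = y_scaler.transform(y_train_raw)
y_test = y_scaler.transform(y_test_raw)

# Values in scaled units map back to physical units with the same scaler
y_test_restored = y_scaler.inverse_transform(y_test)
```

With this layout, `y_scaler.inverse_transform` can take the place of a hand-written `denormalize` function, which keeps the inverse mapping consistent with whatever was fitted on the training data.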

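Another possible cause is some type of regularization (dropout, BN, etc.) that is active while the training accuracy (or loss) is computed and deactivated while the validation/test accuracy (or loss) is computed.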
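The dropout effect can be reproduced without any training. The NumPy sketch below is a toy linear model, not the network from the question: it evaluates the same weights on the same data twice, once with inverted dropout switched on (as during training) and once with it switched off (as during validation). The function `forward` and all values are illustrative assumptions.

```python
import numpy as np


def forward(x, w, rate=0.0, training=False, rng=None):
    """Linear layer preceded by inverted dropout on its inputs."""
    if training and rate > 0.0:
        # Keep each input with probability 1 - rate, rescale the survivors
        mask = rng.random(x.shape) >= rate
        x = x * mask / (1.0 - rate)
    return x @ w


rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 18))
w = rng.normal(size=(18, 1))
y = x @ w  # targets that these weights reproduce exactly

# Same weights, same data: only the dropout switch differs
train_mode_mse = float(np.mean((forward(x, w, 0.2, True, rng) - y) ** 2))
eval_mode_mse = float(np.mean((forward(x, w) - y) ** 2))

print(f"MSE with dropout active:   {train_mode_mse:.4f}")
print(f"MSE with dropout inactive: {eval_mode_mse:.4f}")
```

The loss measured with dropout active is higher even though nothing else changed. In Keras, a like-for-like comparison can be obtained by calling `model.evaluate(X_train, y_train)` after training, since `evaluate` runs the model in inference mode with dropout disabled.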

Regarding python - Overfitting and data leakage in a tensorflow/keras neural network, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59856614/
