I'm fitting a neural network model (an autoencoder) on a very large dataset of arrays, where each nested array has shape (1, 100, 4).
Train_X.shape
(639936, 1, 100, 4)
Both loss and val_loss are nan from the very first epoch:
Epoch 1/50
511948/511948 [==============================] - 267s 522us/step - loss: nan - acc: 0.5239 - val_loss: nan - val_acc: 0.5235
Epoch 2/50
511948/511948 [==============================] - 272s 530us/step - loss: nan - acc: 0.5234 - val_loss: nan - val_acc: 0.5233
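I changed all the hyperparameter values (optimizer, learning rate, etc.), but the problem persists. On inspecting the dataset further, I found that it contains nan values, which are probably the cause of the nan loss:
if np.isnan(Train_X).any():
    print(Train_X)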
[[[[ 5.66440628e-03 -1.11057350e-02 5.35699731e-03 1.42108547e-14]
[ 4.05186182e-03 -4.71546882e-03 -1.57709147e-03 9.35064891e+01]
[ 3.92575255e-03 -1.45019307e-03 -7.44808370e-04 1.87012978e+02]
...
[ 5.88266444e-03 -7.59219123e-03 2.22257658e-03 8.46522144e-06]
[ 8.78427479e-04 -9.54657321e-04 2.68735736e-04 3.63856117e-06]
[ 4.57741540e-04 0.00000000e+00 2.89454575e-03 4.30687537e-06]]]
[[[ 5.81100709e+00 -6.76592913e-01 -1.31451089e+00 2.66544929e-04]
[ 6.05009120e+00 -6.07611268e-03 -8.90299844e-01 5.74642441e-04]
[ 6.40465738e+00 1.82869833e-01 6.22291158e-02 1.03689017e-03]
...
[ 4.96069986e+00 1.04734007e-01 -2.17030850e-01 7.26117358e-05]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]]
[[[ nan nan nan 0.00000000e+00]
[ nan nan nan 0.00000000e+00]
[ nan nan nan -1.50999068e-05]
...
[ 5.62468522e-03 4.27860671e-03 -2.06719201e-03 0.00000000e+00]
[ 1.11051478e-02 3.74979015e-03 1.34607852e-03 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]]]
I can also confirm this by looking at the first entry of Train_X:
Train_X[0]
array([[[ 5.66440628e-03, -1.11057350e-02, 5.35699731e-03,
1.42108547e-14],
[ 4.05186182e-03, -4.71546882e-03, -1.57709147e-03,
9.35064891e+01],
...
[ 7.10669020e-02, 4.91383899e-03, -1.43700407e-02,
1.52228864e-04],
[ 7.59807410e-02, -9.45620170e-03, nan,
1.35892100e-04],
[ 6.65245393e-02, nan, nan,
8.98521456e-05],
[ nan, nan, nan,
1.41090006e-05],
[ nan, nan, nan,
6.68319391e-06],
[ nan, nan, nan,
-3.27272689e+01],
[ nan, nan, nan,
-1.09090911e+01],
[ nan, nan, nan,
8.25973981e+01],
[ nan, nan, nan,
1.12207785e+02],
[ nan, nan, nan,
1.65194797e+02],
[ nan, nan, nan,
2.25974015e+02],
[ nan, nan, nan,
2.78961026e+02],
[ 3.87926649e-03, 1.81274134e-04, -1.08764481e-03,
3.41298685e+02]]])
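Printing the array only shows a few rows, so it can help to count how widespread the nans are. The following is a minimal sketch, assuming the data has the same (samples, 1, rows, columns) layout as Train_X; the helper name `nan_summary` is hypothetical and not part of the original post.

```python
import numpy as np

def nan_summary(data):
    """Return (number of samples containing any nan, nan count per column).

    Assumes `data` has shape (samples, 1, rows, columns).
    """
    nan_mask = np.isnan(data)
    # True for every sample that holds at least one nan anywhere.
    samples_with_nan = nan_mask.any(axis=(1, 2, 3))
    # Total nans per column index, summed over samples and rows.
    nan_per_column = nan_mask.sum(axis=(0, 1, 2))
    return int(samples_with_nan.sum()), nan_per_column
```

Calling `nan_summary(Train_X)` would then report how many of the samples are affected and which of the 4 columns carry the nans.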
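I want a way to find every nan value and replace it with the mean or median of its column. If an entire column happens to contain only 0s and nans, I want to remove that particular array from Train_X. That way I can feed the network a dataset that contains no nan and see whether the loss changes from its current state. How can I do this?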
Best answer
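You can use np.isnan, np.nanmean, and indexing. The second x[np.isnan(x)] assignment sets columns that are entirely nan to zero, because np.nanmean has nothing to average there and returns nan for them.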
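As a side illustration of why that second assignment is needed, here is a minimal sketch (the variable names are made up for this example) showing that np.nanmean returns nan, along with a RuntimeWarning, when a column has no valid values at all:

```python
import warnings
import numpy as np

col_all_nan = np.array([np.nan, np.nan, np.nan])
col_partial = np.array([1.0, np.nan, 3.0])

with warnings.catch_warnings():
    # np.nanmean warns about an empty slice; silence it for this demo only.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    mean_all_nan = np.nanmean(col_all_nan)  # nan: there is nothing to average

mean_partial = np.nanmean(col_partial)      # 2.0: the nan entry is ignored
```

So after the mean-fill step, any nan still left in the array must come from a column that was entirely nan.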
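import numpy as np

# Small demo array with the same layout as Train_X: (samples, 1, rows, columns)
x = np.random.randint(0, 100, [2, 1, 4, 4]).astype(float)
x[0][0][[0, 1, 3], [1, 2, 2]] = float('nan')  # scatter a few nans in sample 0
x[1][0][[0, 1, 3], [1, 3, 2]] = float('nan')  # and a few in sample 1
x[0, 0, :, 1] = float('nan')                  # make one column of sample 0 entirely nan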
x
array([[[[58., nan, 43., 56.],
[88., nan, nan, 69.],
[ 2., nan, 56., 21.],
[65., nan, nan, 23.]]],
[[[96., nan, 86., 19.],
[33., 69., 83., nan],
[93., 21., 7., 2.],
[49., 21., nan, 84.]]]])
x.shape
(2, 1, 4, 4)
columnMean = np.nanmean(x,axis=2) #get the mean value for each column
idc = np.where(np.isnan(x)) # get the indices of nan values
x[np.isnan(x)] = columnMean[idc[0],idc[1],idc[3]] # set nan values to corresponding mean
x[np.isnan(x)] = 0 # set nan columns to zero
x
array([[[[58. , 0. , 43. , 56. ],
[88. , 0. , 49.5 , 69. ],
[ 2. , 0. , 56. , 21. ],
[65. , 0. , 49.5 , 23. ]]],
[[[96. , 37. , 86. , 19. ],
[33. , 69. , 83. , 35. ],
[93. , 21. , 7. , 2. ],
[49. , 21. , 58.66666667, 84. ]]]])
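The code above zero-fills columns that are entirely nan, whereas the question asked to drop arrays in which a column holds only 0s and nans, and also mentioned the median. One possible way to cover both is sketched below. It is not part of the original answer: the function name `clean_samples` and the `use_median` flag are hypothetical, and the (samples, 1, rows, columns) layout is assumed.

```python
import numpy as np

def clean_samples(data, use_median=False):
    """Drop samples having a column of only 0s/nans, then impute remaining nans.

    Assumes `data` has shape (samples, 1, rows, columns).
    Returns the cleaned copy and the boolean mask of kept samples.
    """
    # A column is "empty" if every entry in it is either nan or exactly 0.
    empty_col = (np.isnan(data) | (data == 0)).all(axis=2)  # (samples, 1, columns)
    keep = ~empty_col.any(axis=(1, 2))                       # (samples,)
    kept = data[keep].copy()

    # Per-sample, per-column statistic that ignores nans.
    stat_fn = np.nanmedian if use_median else np.nanmean
    col_stat = stat_fn(kept, axis=2, keepdims=True)          # (kept, 1, 1, columns)

    # Broadcast the statistic over the rows and fill only the nan positions.
    return np.where(np.isnan(kept), col_stat, kept), keep

demo = np.array([
    [[[1.0, np.nan], [3.0, 0.0]]],   # column 1 holds only nan/0: sample is dropped
    [[[1.0, 2.0], [np.nan, 4.0]]],   # column 0 has one nan: filled with its mean
])
cleaned, keep = clean_samples(demo)
```

Because all-nan columns are removed before the statistic is computed, no nan can remain in the result. The returned `keep` mask can be reused to filter any labels or targets that are aligned with the samples.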
Regarding python - replacing all nan values in a large dataset of arrays, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62994628/