python-3.x - Dimension mismatch after changing the data representation with "LabelBinarizer"

Tags: python-3.x scikit-learn deep-learning keras image-preprocessing

I have 66 character classes. I trained a multilayer perceptron on images, where the class of each image is a character:

Aa..Zz 0-9 ,;:! è à â ~

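I applied LabelBinarizer to convert each character into a vector over the 66 classes (for example [0 0 0 0 ... 1 ... 0]), because the model does not accept non-numeric data. I was surprised to find that y_test has dimension 61 (wrong), while y_train has dimension 66 (correct). What is wrong with my code?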

from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_pixels, classes_dataset, test_size=0.3)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
print(x_test)
print("my y_train")
print(y_train[0:100])
print("my y_test ")
print(y_test[0:100])

(1708, 3072) # x_train shape
(1708,) # y_train shape
(732, 3072) # x_test shape
(732,) # y_test shape

x_test

[[ 1.  1.  1. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

y_train

 ['5' 'O' '9' '0' 'E' 'D' '9' ',' 'R' 'H' 'T' 'i' '\xc3\xa9' 'T' 'o' 'u' 'R'
     '2' '0' 'L' 'o' '0' '6' 'q' 'P' '6' '2' 'T' '2' '0' 'i' '0' 'f' 'A' 'r'
     'n' 'T' 'O' '8' 'B' 'T' 'd' '0' 'V' 'X' '9' '.' '6' 'J' 'S' 'E' 'O' 'T'
     '4' 'E' '0' 'I' '3' 'o' 'E' '6' 'R' 'M' '0' 'E' '1' 'R' 'T' 'E' '.' '0'
     'G' 'R' 'E' 'E' '0' '9' 'd' '1' '7' 'A' 'B' 'L' '4' 'l' 'O' '1' 'v' '3'
     '%' 'd' '0' 'T' 's' 'A' '6' 'w' 'slash' '2' '9']

y_test

['N' '0' 'F' '4' 'U' 'C' 'u' 'e' '0' ',' 'G' 'C' 'T' '%' '-' 'V' '5' 'P'
 'N' 'S' '8' '4' ',' 'm' '5' '3' 'e' 'I' 'i' 'M' 'I' '3' 'C' 'F' 'e' 'a'
 '6' 'R' 'V' '4' '0' 'f' '9' 'E' '2' '0' 'E' 'N' 'I' '5' '0' 'A' '%' '-'
 'G' '0' ',' 'O' 'Y' '\xc3\x89' 'R' 's' ',' 'A' 'I' '3' 'S' '2' 'P' '.' 'I'
 ',' 'r' 'I' 'i' '5' '5' 'R' 'C' 'e' '2' 'q' '.' 'R' 'O' 'n' 'S' '6' 'G'
 '0' 'R' 'i' 't' 'i' '9' 'I' 'D' 'slash' '0' 'A']

When I apply LabelBinarizer, I get a y_test shape of (732, 61) instead of (732, 66), where 66 is the number of classes:

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
y_train = encoder.fit_transform(y_train)
y_test= encoder.fit_transform(y_test)

print(y_train.shape)

print(y_test.shape)

(1708, 66)

(732, 61) # why do I get 61 instead of 66?

Edit:

After making the suggested change

    y_test= encoder.transform(y_test)

I got the following error:
ValueError                                Traceback (most recent call last)
<ipython-input-109-b554d33049ab> in <module>()
      1 y_said=encoder.fit(y_train)
      2 y_said
----> 3 y_test= encoder.transform(y_test)

/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/label.pyc in transform(self, y)
    333                               pos_label=self.pos_label,
    334                               neg_label=self.neg_label,
--> 335                               sparse_output=self.sparse_output)
    336 
    337     def inverse_transform(self, Y, threshold=None):

/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/label.pyc in label_binarize(y, classes, neg_label, pos_label, sparse_output)
    492     if (y_type == "multilabel-indicator" and classes.size != y.shape[1]):
    493         raise ValueError("classes {0} missmatch with the labels {1}"
--> 494                          "found in the data".format(classes, unique_labels(y)))
    495 
    496     if y_type in ("binary", "multiclass"):

ValueError: classes [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65] missmatch with the labels [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60]found in the data

Edit 2:

encoder = LabelBinarizer()
encoder.fit(y_train + y_test)
y_train= encoder.transform(y_train)  
y_test= encoder.transform(y_test)

I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-125-45c4512ecc1f> in <module>()
      1 from sklearn.preprocessing import LabelBinarizer
      2 encoder = LabelBinarizer()
----> 3 encoder.fit(y_train + y_test)
      4 y_train= encoder.transform(y_train)
      5 y_test= encoder.transform(y_test)

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S5') dtype('S5') dtype('S5')

Edit 3: my code from Jupyter:

from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense
import keras
import numpy as np
# fix random seed for reproducibility
np.random.seed(7)
batch_size = 128
num_classes = 66
epochs = 12

data_pixels=np.genfromtxt("pixels_dataset.csv", delimiter=',')

classes_dataset=np.genfromtxt("labels.csv",dtype=np.str , delimiter='\t')

from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_pixels, classes_dataset, test_size=0.3)

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
encoder.fit(y_train + y_test)
y_train= encoder.transform(y_train)  
y_test= encoder.transform(y_test)

Best answer

You need to transform your y_test with the encoder that was fitted on the training data. So you need to change your code to:

 y_test= encoder.transform(y_test)

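Otherwise your encoder is refitted on the test labels, which rebuilds its class list from whatever classes happen to appear in y_test and causes the mismatch.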
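The difference between refitting and reusing the encoder can be seen on toy data. This is a minimal sketch, not the asker's dataset: the labels and the names y_train_raw, y_test_raw and refit_shape are made up for illustration, and it only assumes that the test split happens to contain fewer distinct classes than the training split.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy labels: the training split sees 5 classes, the test split only 3.
y_train_raw = np.array(["A", "b", "7", ",", "é", "A", "7"])
y_test_raw = np.array(["b", "A", "b", "7"])

# Refitting on the test split rebuilds the class list from the test labels alone.
refit_shape = LabelBinarizer().fit_transform(y_test_raw).shape

# Fitting once on the training labels fixes the column layout for both splits.
encoder = LabelBinarizer()
y_train = encoder.fit_transform(y_train_raw)
y_test = encoder.transform(y_test_raw)

print(refit_shape)    # (4, 3)
print(y_train.shape)  # (7, 5)
print(y_test.shape)   # (4, 5)
```

With a single fitted encoder, column k means the same class in both matrices, which is what the output layer of the network relies on.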

Edit:

You can also try fitting on the labels of both splits:

encoder.fit(y_train + y_test) # in case of lists
encoder.fit(numpy.append(y_train, y_test)) # in case of numpy arrays
y_train= encoder.transform(y_train)  
y_test= encoder.transform(y_test)
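On NumPy string arrays the `+` operator means element-wise addition rather than concatenation, which is why the list variant raises the TypeError shown in the question. The following sketch joins the two label arrays explicitly before fitting. The toy labels and the names y_train_raw, y_test_raw and all_labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical splits in which each side is missing a class the other one has.
y_train_raw = np.array(["A", "b", "7", "A"])
y_test_raw = np.array(["b", "slash", "7"])

# Join the arrays explicitly; `y_train_raw + y_test_raw` would not concatenate them.
all_labels = np.concatenate([y_train_raw, y_test_raw])

encoder = LabelBinarizer()
encoder.fit(all_labels)  # learns the union of the classes from both splits

y_train = encoder.transform(y_train_raw)
y_test = encoder.transform(y_test_raw)

print(encoder.classes_)
print(y_train.shape, y_test.shape)
```

Fitting on the union only collects the set of label names, so no pixel data from the test split reaches the model.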

Edit 2:

What finally worked was:

encoder.fit(numpy.append(y_train, y_test)) # in case of numpy arrays
y_train= encoder.transform(y_train)  # transform only: fit_transform would refit on y_train alone
y_test= encoder.transform(y_test)
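The ValueError in the question's first edit reports labels 0 to 60 for a multilabel-indicator input, which suggests that y_test had already been overwritten with a 61-column matrix by the earlier fit_transform call before transform was run on it. One way to avoid that in a notebook, sketched here under assumed names (y_train_raw, y_test_raw, y_train_onehot and y_test_onehot are hypothetical), is to keep the raw labels and the encoded matrices in separate variables and to check the column count before training.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical raw labels, kept under their own names and never reassigned.
y_train_raw = np.array(["A", "b", "7", "A", "slash"])
y_test_raw = np.array(["b", "7", "b"])

encoder = LabelBinarizer().fit(y_train_raw)

# Encoded copies get new names, so re-running the cell always starts from raw labels.
y_train_onehot = encoder.transform(y_train_raw)
y_test_onehot = encoder.transform(y_test_raw)

# Both matrices must have one column per class known to the encoder.
num_classes = len(encoder.classes_)
assert y_train_onehot.shape[1] == num_classes
assert y_test_onehot.shape[1] == num_classes

# inverse_transform maps the one-hot rows back to the original labels.
decoded = encoder.inverse_transform(y_test_onehot)
print(decoded)
```

The same num_classes value can then be used for the size of the network's output layer instead of a hard-coded constant.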

Regarding "python-3.x - Dimension mismatch after changing the data representation with "LabelBinarizer"", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43211848/
