python - Keras datasets: train and test sets have different vector lengths

Tags: python list keras dataset numpy-ndarray

I'm trying to use the Reuters and IMDB datasets from keras.datasets. The standard call is:

(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",

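When I check the dimensions, the training set gives (25000, 10922), which makes sense. But the test set gives (25000,). If you dump a single test element (e.g. x_test[0]), you get a list rather than a numpy.array. The problem is that the list length varies from row to row and is always different from the training vector dimension. How are you supposed to use this as test data?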


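Well, as you mentioned, each element in x_train and x_test is a list. That list contains the word indices of a sentence (or a paragraph, or in this case a review), and since sentences can have different numbers of words, the corresponding representation also has a variable length. Let's decode one of the sentences to see what it looks like and get more familiar with the dataset: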

from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

# a mapping from words to their indices, for example `human`: 403
word_index = imdb.get_word_index()

# create the reverse mapping i.e. from indices to words
rev_word_index = {idx:w for w,idx in word_index.items()}

def decode_sentence(s):
    # index 0 to 2 are reserved for things like padding, unknown word, etc.
    decoded_sent = [rev_word_index.get(idx-3, '[RES]') for idx in s]
    return ' '.join(decoded_sent)

Decoding one of the reviews with decode_sentence prints something like:



[RES] i am a great fan of david lynch and have everything that he's made on dvd except for hotel room the 2 hour twin peaks movie so when i found out about this i immediately grabbed it and and what is this it's a bunch of [RES] drawn black and white cartoons that are loud and foul mouthed and unfunny maybe i don't know what's good but maybe this is just a bunch of crap that was [RES] on the public under the name of david lynch to make a few bucks too let me make it clear that i didn't care about the foul language part but had to keep [RES] the sound because my neighbors might have all in all this is a highly disappointing release and may well have just been left in the [RES] box set as a curiosity i highly recommend you don't spend your money on this 2 out of 10
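
The `idx-3` offset in decode_sentence comes from the defaults of imdb.load_data, which shifts every word index by 3 (index_from=3) and reserves 0 for padding, 1 for the start-of-sequence marker, and 2 for out-of-vocabulary words. Here is a minimal sketch of that offset, using a made-up three-word vocabulary (toy_word_index and decode_toy are hypothetical, not the real IMDB mapping):

```python
# Hypothetical miniature vocabulary standing in for imdb.get_word_index();
# like the real mapping, it ranks words starting at 1.
toy_word_index = {"the": 1, "movie": 2, "great": 3}
rev_toy_index = {idx: w for w, idx in toy_word_index.items()}

def decode_toy(s, index_from=3):
    # Encoded values 0, 1 and 2 match no word after subtracting the
    # offset, so they decode to the placeholder '[RES]'.
    return ' '.join(rev_toy_index.get(idx - index_from, '[RES]') for idx in s)

# 1 = start marker, 2 = out-of-vocabulary word, 4/5/6 = the/movie/great
encoded = [1, 4, 5, 2, 6]
decoded = decode_toy(encoded)
print(decoded)  # [RES] the movie [RES] great
```

This is why the decoded review above starts with [RES] (the start marker) and shows [RES] wherever a word fell outside the 10000 most frequent ones.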

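How you use this data is entirely up to you and the problem you are trying to solve. You can feed the sentences as they are into a network that can handle variable-length sentences. Such a network usually consists of 1D convolution layers, LSTM layers, or a mix of both. Another approach is to give every sentence a fixed-length encoding, so that all sentences have the same length. Here is an example that one-hot encodes each sentence as a vector of 0s and 1s with the same fixed length for all sentences: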

from keras.datasets import imdb
import numpy as np

# you can limit the vocabulary size by passing `num_words` argument 
# to ignore rare words and make data more manageable
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

def encode_sentences(sentences, dim):
    encodings = np.zeros((len(sentences), dim))
    for idx, sentence in enumerate(sentences):
        encodings[idx, sentence] = 1.0
    return encodings

x_train = encode_sentences(x_train, 10000)
x_test = encode_sentences(x_test, 10000)

print(x_train.shape)
print(x_test.shape)



(25000, 10000)
(25000, 10000)

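All sentences are encoded as vectors of length 10000, where the i-th element of the vector indicates whether the word with index i appears in the corresponding sentence.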
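
To make the encoding concrete, here is a small sketch that runs the same encode_sentences function on two made-up sentences over a hypothetical 8-word vocabulary (the toy data is an assumption for illustration only):

```python
import numpy as np

def encode_sentences(sentences, dim):
    # one row per sentence, one column per word index
    encodings = np.zeros((len(sentences), dim))
    for idx, sentence in enumerate(sentences):
        encodings[idx, sentence] = 1.0
    return encodings

# Two made-up "sentences" of different lengths over an 8-word vocabulary
toy_sentences = [[1, 4, 4, 6], [2, 3]]
toy_encoded = encode_sentences(toy_sentences, 8)
print(toy_encoded.shape)  # (2, 8)
print(toy_encoded[0])     # [0. 1. 0. 0. 1. 0. 1. 0.]
```

Note that index 4 occurs twice in the first sentence but is only marked once: this encoding records which words are present and discards both word order and word counts.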


Another approach is to truncate or pad the sentences so that they all have the same length:

from keras.datasets import imdb
from keras import preprocessing

n_feats = 10000   # maximum number of words we want to consider in our vocabulary
max_len = 500     # maximum length of each sentence (i.e. truncate those longer
                  # than 500 words and pad those shorter than 500 words)

# you can limit the vocabulary size by passing `num_words` argument
# to ignore rare words and make it more manageable
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=n_feats)

# preprocess the sequences (i.e. truncate or pad)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=max_len)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=max_len)

print(x_train.shape)
print(x_test.shape)



(25000, 500)
(25000, 500)
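
By default, pad_sequences pads and truncates at the beginning of each sequence (padding='pre', truncating='pre') and fills with 0. The following NumPy-only sketch mimics that default behaviour on toy data. pad_pre is a hypothetical helper written for illustration, not part of Keras:

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    # Hypothetical helper: left-pad short sequences with `value` and keep
    # only the last `maxlen` items of long ones.
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        trunc = seq[-maxlen:]
        out[i, -len(trunc):] = trunc
    return out

toy = [[5, 6], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded)
# [[0 0 5 6]
#  [3 4 5 6]]
```

With these defaults a long review keeps its last words, which in this dataset is often where the final verdict appears.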

Now all 25000 sentences have the same length and are ready to use.
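
As a rough sketch of what "ready to use" could look like, the padded index sequences can be fed to an Embedding layer followed by an LSTM. The layer sizes, optimizer, and dummy batch below are arbitrary assumptions for illustration, not a tuned or recommended architecture:

```python
import numpy as np
from keras import layers, models

n_feats = 10000  # vocabulary size, matching num_words above
max_len = 500    # padded sequence length, matching maxlen above

model = models.Sequential([
    layers.Embedding(input_dim=n_feats, output_dim=32),  # word index -> dense vector
    layers.LSTM(32),                                     # reads the sequence step by step
    layers.Dense(1, activation="sigmoid"),               # probability of a positive review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy batch standing in for two padded reviews
dummy_batch = np.zeros((2, max_len), dtype="int32")
probs = model.predict(dummy_batch, verbose=0)
print(probs.shape)  # (2, 1)
```

With the real data you would call model.fit on the padded x_train and y_train, then model.evaluate on the padded x_test and y_test. Both now share the shape (25000, 500), so the train/test mismatch from the question goes away.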

I highly recommend reading the Keras documentation on this dataset.


