python - My text classifier model does not improve with multiple classes

Tags: python pandas tensorflow machine-learning text-classification

I am trying to train a text classification model that takes a list of up to 300 integers embedded from article text. The model trains without any problems, but the accuracy will not improve.

The target consists of 41 categories, encoded as integers from 0 to 41 and then normalized.

The dataframe looks like this:

Table1

Also, I don't know what my model should look like, because I am referencing the two different examples below:

  • A binary classifier with one input column and one output column (Example 1)
  • A multi-class classifier with multiple columns as input (Example 2)

I tried modifying my model based on these two examples, but the accuracy does not change and even decreases with every epoch.

Should I add more layers to my model, or am I doing something silly that I haven't realized?

Note: if the "df.pickle" download link is broken, use this link instead.

from sklearn.model_selection import train_test_split
from urllib.request import urlopen
from os.path import exists
from os import mkdir
import tensorflow as tf
import pandas as pd
import pickle

# Define dataframe path
df_path = 'df.pickle'

# Check if local dataframe exists
if not exists(df_path):
  # Download binary from dropbox
  content = urlopen('https://ucd92a22d5e0d4d29b8edb608305.dl.dropboxusercontent.com/cd/0/get/Askx_25n3JI-jmnZsWXmMmRgd4O2EH1w9l0U6zCMq7xdSXs_IN_i2zuUviseqa9N7-WrReFbGhQi8CeseV5cNsFTO8dzRmSdxjr-MWEDQNpPaZ8Ik29E_58YAjY57qTc4CA/file#').read()

  # Write to file
  with open(df_path, 'wb') as file: file.write(content)

  # Load the dataframe from bytes
  df = pickle.loads(content)
# If the file exists (aka. downloaded)
else:
  # Load the dataframe from file
  df = pickle.load(open(df_path, 'rb'))

# Normalize the category
df['Category_Code'] = df['Category_Code'].apply(lambda x: x / 41)

train_df, test_df = [pd.DataFrame() for _ in range(2)]

x_train, x_test, y_train, y_test = train_test_split(df['Content_Parsed'], df['Category_Code'], test_size=0.15, random_state=8)
train_df['Content_Parsed'], train_df['Category_Code'] = x_train, y_train
test_df['Content_Parsed'], test_df['Category_Code'] = x_test, y_test

# Variable containing the number of words we want to keep in our vocabulary
NUM_WORDS = 10000
# Input/Token length
SEQ_LEN = 300

# Create tokenizer for our data
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=NUM_WORDS, oov_token='<UNK>')
tokenizer.fit_on_texts(train_df['Content_Parsed'])

# Convert text data to numerical indexes
train_seqs=tokenizer.texts_to_sequences(train_df['Content_Parsed'])
test_seqs=tokenizer.texts_to_sequences(test_df['Content_Parsed'])

# Pad data up to SEQ_LEN (note that we truncate if there are more than SEQ_LEN tokens)
train_seqs=tf.keras.preprocessing.sequence.pad_sequences(train_seqs, maxlen=SEQ_LEN, padding="post")
test_seqs=tf.keras.preprocessing.sequence.pad_sequences(test_seqs, maxlen=SEQ_LEN, padding="post")

# Create Models folder if not exists
if not exists('Models'): mkdir('Models')

# Define local model path
model_path = 'Models/model.pickle'

# Check if model exists/pre-trained
if not exists(model_path):
  # Define word embedding size
  EMBEDDING_SIZE = 16

  # Create new model
  '''
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
    # tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  '''
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
      # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
      tf.keras.layers.GlobalAveragePooling1D(),
      tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
      tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  # Compile the model
  model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
  )

  # Stop training when a monitored quantity has stopped improving.
  es = tf.keras.callbacks.EarlyStopping(monitor='val_acc', mode='max', patience=1)

  # Define batch size (Can be tuned to improve model accuracy)
  BATCH_SIZE = 16
  # Define number or cycle to train
  EPOCHS = 20

  # Using GPU (If error means you don't have GPU. Use CPU instead)
  with tf.device('/GPU:0'):
    # Train/Fit the model
    history = model.fit(
      train_seqs, 
      train_df['Category_Code'].values, 
      batch_size=BATCH_SIZE, 
      epochs=EPOCHS, 
      validation_split=0.2,
      validation_steps=30,
      callbacks=[es]
    )

  # Evaluate the model
  model.evaluate(test_seqs, test_df['Category_Code'].values)

  # Save the model into a file
  with open(model_path, 'wb') as file: file.write(pickle.dumps(model))
else:
  # Load the model
  model = pickle.load(open(model_path, 'rb'))

# Check the model
model.summary()

Best Answer

After 2 days of tweaking and working through more examples, I found this site, which explains multi-class classification very well.

The details of the changes I made are as follows (a minimal sketch of the key changes is shown right after the list):

  1. Since I am building a model for multiple classes, the model should be compiled with categorical_crossentropy as the loss function instead of binary_crossentropy.

  2. The model should produce a number of outputs equal to the total number of classes being classified, 41 in my case (one-hot encoding).

  3. The activation function of the last layer should be 'softmax', since we are picking the label with the highest confidence (the one closest to 1.0).

  4. You need to adjust the layers according to the number of classes you are classifying. See here for how to improve the model.
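
Putting those changes together before the full script, the model definition now looks roughly like this (a minimal sketch reusing the layer sizes from the final code below; tokenization and training are omitted):

import tensorflow as tf
import pandas as pd

NUM_CLASSES = 41  # one output unit per category

# Labels become one-hot vectors instead of a single normalized float
# Y = pd.get_dummies(df['Category']).values  # shape: (num_samples, 41)

model = tf.keras.Sequential([
  tf.keras.layers.Embedding(50000, 256, input_length=600),
  tf.keras.layers.SpatialDropout1D(0.2),
  tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),
  # 41 units with softmax, replacing Dense(1, activation='sigmoid')
  tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])

# categorical_crossentropy replaces binary_crossentropy
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])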

My final code looks like this:

from sklearn.model_selection import train_test_split
from urllib.request import urlopen
from functools import reduce
from os.path import exists
from os import listdir
from sys import exit
import tensorflow as tf
import pandas as pd
import pickle
import re

# Specify dataframe path
df_path = 'df.pickle'
# Check if the file exists
if not exists(df_path):
  # Specify url of the dataframe binary
  url = 'https://www.dropbox.com/s/76hibe24hmpz3bk/df.pickle?dl=1'
  # Read the byte content from url
  content = urlopen(url).read()
  # Write the raw bytes to a file so we don't have to download again
  with open(df_path, 'wb') as file: file.write(content)
  # Unpickle the dataframe
  df = pickle.loads(content)
else:
  # Load the pickle dataframe
  df = pickle.load(open(df_path, 'rb'))

# Useful variables
MAX_NUM_WORDS = 50000                        # Vocabulary size for our tokenizer
MAX_SEQ_LENGTH = 600                         # Maximum length of tokens (for padding later)
EMBEDDING_SIZE = 256                         # Embedding size (Tweak to improve accuracy)
OUTPUT_LENGTH = len(df['Category'].unique()) # Number of class to be classified

# Create our tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_NUM_WORDS, lower=True)
# Fit our tokenizer with words/tokens
tokenizer.fit_on_texts(df['Content_Parsed'].values)
# Get our token vocabulary
word_index = tokenizer.word_index
print('Found {} unique tokens'.format(len(word_index)))

# Parse our text into sequence of numbers using our tokenizer
X = tokenizer.texts_to_sequences(df['Content_Parsed'].values)
# Pad the sequence up to the MAX_SEQ_LENGTH
X = tf.keras.preprocessing.sequence.pad_sequences(X, maxlen=MAX_SEQ_LENGTH)
print('Shape of feature tensor: {}'.format(X.shape))

# Convert our labels into dummy variable (More info on the link provided above)
Y = pd.get_dummies(df['Category']).values
print('Shape of label tensor: {}'.format(Y.shape))

# Split our features and labels into test and train dataset
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=42)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Creating our model
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(MAX_NUM_WORDS, EMBEDDING_SIZE, input_length=MAX_SEQ_LENGTH))
model.add(tf.keras.layers.SpatialDropout1D(0.2))
# The number 64 could be changed based on your model performance
model.add(tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2))
# Our output layer with length similar to the OUTPUT_LENGTH
model.add(tf.keras.layers.Dense(OUTPUT_LENGTH, activation='softmax'))
# Compile our model with "categorical_crossentropy" loss function
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Model variables
EPOCHS = 100                          # Number of cycle to run (The early stopping may stop the training process accordingly)
BATCH_SIZE = 64                       # Batch size (Tweaking this may improve model performance a bit)
checkpoint_path = 'model_checkpoints' # Checkpoint path of our model

# Use GPU if available
with tf.device('/GPU:0'):
  # Fit/Train our model
  history = model.fit(
    x_train, y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.1,
    callbacks=[
      tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.0001),
      tf.keras.callbacks.ModelCheckpoint(
        checkpoint_path, 
        monitor='val_accuracy',  # tf.keras logs the metric as 'val_accuracy', not 'val_acc'
        save_best_only=True, 
        save_weights_only=False
      )
    ],
    verbose=1
  )

Now my model's accuracy is behaving well and increases every epoch, but since the validation accuracy (val_acc around 76~77%) is still underperforming, I may need to tweak the model/layers a bit.

An output snapshot is provided below:

Output snapshot.png
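
As a quick follow-up check (not shown in the original answer; just a sketch using the same variable names), the x_test/y_test split created by train_test_split above can be used to measure accuracy on held-out data, mirroring the evaluate call from the first version of the script:

with tf.device('/GPU:0'):
  # Evaluate on the 10% test split that was set aside earlier
  test_loss, test_acc = model.evaluate(x_test, y_test, batch_size=BATCH_SIZE)
  print('Test loss: {:.4f}, test accuracy: {:.4f}'.format(test_loss, test_acc))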

The original question, "python - My text classifier model does not improve with multiple classes", can be found on Stack Overflow: https://stackoverflow.com/questions/58908050/
