tensorflow - Using Tensorflow's Connectionist Temporal Classification (CTC) implementation

Tags: tensorflow speech-recognition end-to-end ctc

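I am trying to use Tensorflow's CTC implementation under the contrib package (tf.contrib.ctc.ctc_loss), without success.

  • First, does anyone know where I can read a good step-by-step tutorial? Tensorflow's documentation on this topic is very poor.
  • Do I have to provide ctc_loss with labels that have the blank label interleaved?
  • I could not overfit my network, even with a training dataset of length 1 and more than 200 epochs. :(
  • How can I calculate the label error rate using tf.edit_distance?

  • Here is my code: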

    with graph.as_default():
    
      max_length = X_train.shape[1]
      frame_size = X_train.shape[2]
      max_target_length = y_train.shape[1]
    
      # Batch size x time steps x data width
      data = tf.placeholder(tf.float32, [None, max_length, frame_size])
      data_length = tf.placeholder(tf.int32, [None])
    
      #  Batch size x max_target_length
      target_dense = tf.placeholder(tf.int32, [None, max_target_length])
      target_length = tf.placeholder(tf.int32, [None])
    
      #  Generating sparse tensor representation of target
      target = ctc_label_dense_to_sparse(target_dense, target_length)
    
      # Applying LSTM, returning output for each timestep (y_rnn1, 
      # [batch_size, max_time, cell.output_size]) and the final state of shape
      # [batch_size, cell.state_size]
      y_rnn1, h_rnn1 = tf.nn.dynamic_rnn(
        tf.nn.rnn_cell.LSTMCell(num_hidden, state_is_tuple=True, num_proj=num_classes), #  num_proj=num_classes
        data,
        dtype=tf.float32,
        sequence_length=data_length,
      )
    
      #  For sequence labelling, we want a prediction for each timestamp. 
      #  However, we share the weights for the softmax layer across all timesteps. 
      #  How do we do that? By flattening the first two dimensions of the output tensor. 
      #  This way time steps look the same as examples in the batch to the weight matrix. 
      #  Afterwards, we reshape back to the desired shape
    
    
      # Reshaping
      logits = tf.transpose(y_rnn1, perm=(1, 0, 2))
    
      #  Get the loss by calculating ctc_loss, which also calculates the gradient.
      #  ctc_loss performs the softmax operation for you, so inputs should be
      #  e.g. linear projections of outputs by an LSTM.
      loss = tf.reduce_mean(tf.contrib.ctc.ctc_loss(logits, target, data_length))
    
      #  Define our optimizer with learning rate
      optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)
    
      #  Decoding using beam search
      decoded, log_probabilities = tf.contrib.ctc.ctc_beam_search_decoder(logits, data_length, beam_width=10, top_paths=1)
    

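    Thanks!

    Update (06/29/2016)

    Thank you, @jihyeon-seo! So, at the input of the RNN we have something like [num_batch, max_time_step, num_features]. We use dynamic_rnn to perform the recurrent computation on that input, which outputs a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need an affine projection at each timestep with shared weights, so we have to reshape to [num_batch*max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], add the bias, undo the reshape, and transpose (so that we have [max_time_steps, num_batch, num_classes] for the ctc loss input). This result is the input to the ctc_loss function. Did I do everything right?

    Here is the code: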

        cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
    
        h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)
    
        #  Reshaping to share weights across timesteps
        x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])
    
        self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1
    
        #  Reshaping
        self._logits = tf.reshape(self._logits, [max_length, -1, num_classes])
    
        #  Calculating loss
        loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)
    
        self.cost = tf.reduce_mean(loss)
    

    Update (07/11/2016)

    Thank you, @Xiv. Here is the code after fixing the bug:

        cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
    
        h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)
    
        #  Reshaping to share weights across timesteps
        x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])
    
        self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1
    
        #  Reshaping
        self._logits = tf.reshape(self._logits, [-1, max_length, num_classes])
        self._logits = tf.transpose(self._logits, (1,0,2))
    
        #  Calculating loss
        loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)
    
        self.cost = tf.reduce_mean(loss)
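
To see why the reshape alone was not enough, here is a minimal pure-Python sketch (no TensorFlow involved). The helpers reshape3d and transpose_102 are hypothetical stand-ins that mimic a row-major reshape and a transpose with perm (1, 0, 2); the sizes are toy values chosen for illustration.

```python
def reshape3d(flat, d0, d1, d2):
    """Row-major reshape of a flat list into nested lists of shape [d0, d1, d2]."""
    assert len(flat) == d0 * d1 * d2
    return [[[flat[(i * d1 + j) * d2 + k] for k in range(d2)]
             for j in range(d1)]
            for i in range(d0)]


def transpose_102(x):
    """Swap the first two axes: shape [a, b, c] becomes [b, a, c]."""
    return [[x[i][j] for i in range(len(x))] for j in range(len(x[0]))]


num_batch, max_length, num_classes = 2, 3, 1

# Batch-major data; each entry is tagged "b<batch>t<time>" so it can be traced.
batch_major = [[["b%dt%d" % (b, t)] for t in range(max_length)]
               for b in range(num_batch)]

# Flattened to [num_batch * max_length, num_classes], as done before the matmul.
flat = [v for seq in batch_major for step in seq for v in step]

# Reshape straight to [max_length, num_batch, num_classes]: entries get mixed up.
wrong = reshape3d(flat, max_length, num_batch, num_classes)

# Reshape back to batch-major first, then transpose to time-major.
right = transpose_102(reshape3d(flat, num_batch, max_length, num_classes))

print(wrong[1][0])  # ['b0t2'], not the entry for time step 1 of sequence 0
print(right[1][0])  # ['b0t1']
```

A reshape only reinterprets the same row-major memory order, so it cannot move the time axis to the front; only the transpose reorders the entries.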
    

    Update (07/25/16)

    I published part of my code on GitHub, working with a single utterance. Feel free to use it! :)

    Best answer

    I'm trying to do the same thing.
    Here is what I found that you may be interested in.

    It was really hard to find a tutorial for CTC, but this example was helpful.

    As for the blank label, the CTC layer assumes that the blank index is num_classes - 1, so you need to provide an additional class for the blank label.
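
The blank convention above can be illustrated with a small pure-Python sketch. The three-character vocabulary, the encode helper, and the ctc_collapse helper are all hypothetical and are not part of TensorFlow; the sketch assumes the usual CTC rule of merging repeated labels and then dropping blanks. It suggests that targets hold only the real labels, while the blank is an extra class the network may emit between them.

```python
# Hypothetical vocabulary; the blank takes the last index, num_classes - 1.
vocab = ["a", "b", "c"]
num_classes = len(vocab) + 1   # one extra class for the blank
blank_index = num_classes - 1

char_to_index = {ch: i for i, ch in enumerate(vocab)}


def encode(text):
    """Encode a transcript as label indices; no blanks are interleaved."""
    return [char_to_index[ch] for ch in text]


def ctc_collapse(path, blank):
    """Map a frame-level path to labels: merge repeats, then drop blanks."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


target = encode("abba")
# A frame-level path the network could emit; 3 is the blank.
# The blank between the two 1s is what keeps the double "b" from merging.
path = [0, 0, 3, 1, 3, 1, 1, 0, 3]

print(target)                           # [0, 1, 1, 0]
print(ctc_collapse(path, blank_index))  # [0, 1, 1, 0]
```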

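    Also, the CTC network performs the softmax layer. In your code, the RNN layer is connected to the CTC loss layer. The output of the RNN layer is internally activated, so you need to add one more hidden layer (it could be the output layer) without an activation function, then add the CTC loss layer.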
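
As a rough sketch of that extra layer, the pure-Python snippet below applies one affine map, with no activation, to every timestep of a single sequence. The project helper, the sizes, and the weight values are made up for illustration; in a real model W and b would be trainable variables.

```python
def project(h, W, b):
    """Affine map with no activation: logits[k] = sum_j h[j] * W[j][k] + b[k]."""
    return [sum(h[j] * W[j][k] for j in range(len(h))) + b[k]
            for k in range(len(b))]


num_hidden, num_classes = 2, 3  # num_classes includes the blank

# Illustrative weights of shape [num_hidden, num_classes] and a bias.
W = [[1.0, 0.0, -1.0],
     [0.5, 2.0, 0.0]]
b = [0.0, 0.1, 0.2]

# RNN outputs for one sequence, shape [max_time, num_hidden].
rnn_outputs = [[1.0, 2.0],
               [0.0, -1.0]]

# The same W and b are reused at every timestep (weight sharing).
logits = [project(h, W, b) for h in rnn_outputs]

for step in logits:
    print(step)
```

The resulting values are unnormalized scores of shape [max_time, num_classes]; the softmax is left to the CTC loss, as the answer describes.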

    Regarding tensorflow - Using Tensorflow's Connectionist Temporal Classification (CTC) implementation, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38059247/
