python - Understanding how the TF implementation of CTC works

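I'm trying to understand how the CTC implementation works in TensorFlow. I wrote a quick example just to test the CTC function, but for some reason I get inf for some target/input values, and I'm not sure why this happens.

Code: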

import tensorflow as tf
import numpy as np

# https://github.com/philipperemy/tensorflow-ctc-speech-recognition/blob/master/utils.py
def sparse_tuple_from(sequences, dtype=np.int32):
    """Create a sparse representation of x.
    Args:
        sequences: a list of lists of type dtype where each element is a sequence
    Returns:
        A tuple with (indices, values, shape)
    """
    indices = []
    values = []

    for n, seq in enumerate(sequences):
        indices.extend(zip([n] * len(seq), range(len(seq))))
        values.extend(seq)

    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    shape = np.asarray([len(sequences), np.asarray(indices).max(0)[1] + 1], dtype=np.int64)

    return indices, values, shape

batch_size = 1
seq_length = 2
n_labels = 2

seq_len = tf.placeholder(tf.int32, [None])
targets = tf.sparse_placeholder(tf.int32)
logits = tf.constant(np.random.random((batch_size, seq_length, n_labels+1)),dtype=tf.float32) # +1 for the blank label
loss = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, seq_len, time_major = False))


with tf.Session() as sess:
    for it in range(10):
        rand_target = np.random.randint(n_labels, size=(seq_length))
        sample_target = sparse_tuple_from([rand_target])

        logitsval = sess.run(logits)
        lossval = sess.run(loss, feed_dict={seq_len: [seq_length], targets: sample_target})
        print('******* Iter: %d *******'%it)
        print('logits:', logitsval)
        print('rand_target:', rand_target)
        print('rand_sparse_target:', sample_target)
        print('loss:', lossval)
        print()

Sample output:

******* Iter: 0 *******
logits: [[[ 0.10151503  0.88581538  0.56466645]
  [ 0.76043415  0.52718711  0.01166286]]]
rand_target: [0 1]
rand_sparse_target: (array([[0, 0],
       [0, 1]]), array([0, 1], dtype=int32), array([1, 2]))
loss: 2.61521

******* Iter: 1 *******
logits: [[[ 0.10151503  0.88581538  0.56466645]
  [ 0.76043415  0.52718711  0.01166286]]]
rand_target: [1 1]
rand_sparse_target: (array([[0, 0],
       [0, 1]]), array([1, 1], dtype=int32), array([1, 2]))
loss: inf

******* Iter: 2 *******
logits: [[[ 0.10151503  0.88581538  0.56466645]
  [ 0.76043415  0.52718711  0.01166286]]]
rand_target: [0 1]
rand_sparse_target: (array([[0, 0],
       [0, 1]]), array([0, 1], dtype=int32), array([1, 2]))
loss: 2.61521

******* Iter: 3 *******
logits: [[[ 0.10151503  0.88581538  0.56466645]
  [ 0.76043415  0.52718711  0.01166286]]]
rand_target: [1 0]
rand_sparse_target: (array([[0, 0],
       [0, 1]]), array([1, 0], dtype=int32), array([1, 2]))
loss: 1.59766

******* Iter: 4 *******
logits: [[[ 0.10151503  0.88581538  0.56466645]
  [ 0.76043415  0.52718711  0.01166286]]]
rand_target: [0 0]
rand_sparse_target: (array([[0, 0],
       [0, 1]]), array([0, 0], dtype=int32), array([1, 2]))
loss: inf

******* Iter: 5 *******
logits: [[[ 0.10151503  0.88581538  0.56466645]
  [ 0.76043415  0.52718711  0.01166286]]]
rand_target: [0 1]
rand_sparse_target: (array([[0, 0],
       [0, 1]]), array([0, 1], dtype=int32), array([1, 2]))
loss: 2.61521

******* Iter: 6 *******
logits: [[[ 0.10151503  0.88581538  0.56466645]
  [ 0.76043415  0.52718711  0.01166286]]]
rand_target: [1 0]
rand_sparse_target: (array([[0, 0],
       [0, 1]]), array([1, 0], dtype=int32), array([1, 2]))
loss: 1.59766

******* Iter: 7 *******
logits: [[[ 0.10151503  0.88581538  0.56466645]
  [ 0.76043415  0.52718711  0.01166286]]]
rand_target: [1 1]
rand_sparse_target: (array([[0, 0],
       [0, 1]]), array([1, 1], dtype=int32), array([1, 2]))
loss: inf

******* Iter: 8 *******
logits: [[[ 0.10151503  0.88581538  0.56466645]
  [ 0.76043415  0.52718711  0.01166286]]]
rand_target: [0 1]
rand_sparse_target: (array([[0, 0],
       [0, 1]]), array([0, 1], dtype=int32), array([1, 2]))
loss: 2.61521

******* Iter: 9 *******
logits: [[[ 0.10151503  0.88581538  0.56466645]
  [ 0.76043415  0.52718711  0.01166286]]]
rand_target: [0 0]
rand_sparse_target: (array([[0, 0],
       [0, 1]]), array([0, 0], dtype=int32), array([1, 2]))
loss: inf

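Any idea what I'm missing?

Best answer

Look closely at your input texts (rand_target); I'm sure you will spot a simple pattern that correlates with the inf loss values ;-)

A short explanation of what is going on: CTC encodes a text by allowing each character to be repeated, and it also allows a non-character marker (the "CTC blank label") to be inserted between characters. Undoing this encoding (i.e. decoding) simply means collapsing consecutive repeated characters into one and then removing all blanks. Some examples ("..." denotes the text, '...' denotes the encoding, and '-' denotes the blank label):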

"to" -> 'tttooo', or 't-o', or 't-oo', or 'to', and so on...
"too" -> 'to-o', or 'tttoo---oo', or '---t-o-o--', but not 'too' (think about what 'too' would look like after decoding)

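The decoding rule above can be sketched in a few lines of Python. This is a minimal illustration, not TensorFlow's decoder; the function name `ctc_decode` and the use of '-' as the blank symbol are assumptions made for this example.

```python
from itertools import groupby

BLANK = "-"  # illustrative blank symbol, matching the examples above


def ctc_decode(encoding, blank=BLANK):
    """Collapse runs of identical symbols, then drop the blanks."""
    # groupby yields one key per run of consecutive equal symbols
    collapsed = [symbol for symbol, _ in groupby(encoding)]
    return "".join(s for s in collapsed if s != blank)


for enc in ["tttooo", "t-o", "to-o", "---t-o-o--", "too"]:
    print(repr(enc), "->", repr(ctc_decode(enc)))
```

Note that 'too' decodes to "to", not "too", because the two adjacent 'o' symbols are collapsed before any blank could separate them.
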
Now we know enough to see why some of your samples fail:

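Your input text has length 2.
The encoding has length 2.
If a character of the input text is repeated (e.g. '11', or as a Python list: [1, 1]), then the only way to encode it is to place a blank in between (compare what '11' and '1-1' decode to). But that encoding has length 3.
So there is no way to encode a length-2 text with a repeated character into an encoding of length 2, and the TF loss implementation therefore returns inf.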
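
The length argument above can be turned into a quick check to run before feeding a target to the loss. This is a sketch only; `min_encoding_length` and `is_feasible` are hypothetical helper names and not part of TensorFlow.

```python
def min_encoding_length(target):
    """Shortest CTC encoding: one step per label plus one blank
    for every pair of identical neighbouring labels."""
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def is_feasible(target, seq_length):
    """True if the target can be encoded in seq_length time steps."""
    return min_encoding_length(target) <= seq_length


seq_length = 2  # same value as in the question
for target in ([0, 1], [1, 0], [0, 0], [1, 1]):
    print(target, min_encoding_length(target), is_feasible(target, seq_length))
```

With seq_length = 2 as in the question, only the targets without adjacent repeats are feasible, which lines up with the finite and inf losses in the sample output.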

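You can also picture the encoding as a state machine (the original answer illustrates this with a figure). The text "11" is represented by every possible path that begins in a start state (the two leftmost states) and ends in a final state (the two rightmost states). The shortest such path is '1-1'.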
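
The "all possible paths" view can be checked numerically by brute force: enumerate every label sequence of length T, keep those that decode to the target, and sum their probabilities. The sketch below uses plain NumPy, not TensorFlow. It assumes that the logits are passed through a softmax per time step and that the blank is the last class index (consistent with the "+1 for the blank label" comment in the question). `brute_force_ctc_loss` is a hypothetical name, and the approach is only practical for tiny inputs.

```python
import itertools

import numpy as np

# Logits copied from the sample output in the question (T=2, 3 classes).
LOGITS = [[0.10151503, 0.88581538, 0.56466645],
          [0.76043415, 0.52718711, 0.01166286]]


def collapse(path, blank):
    """Decode a path: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out


def brute_force_ctc_loss(logits, target):
    """Negative log of the summed probability of all valid paths."""
    logits = np.asarray(logits, dtype=np.float64)
    num_steps, num_classes = logits.shape
    blank = num_classes - 1  # assumption: blank is the last class
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    target = [int(x) for x in target]
    total = 0.0
    for path in itertools.product(range(num_classes), repeat=num_steps):
        if collapse(path, blank) == target:
            total += float(np.prod([probs[t, s] for t, s in enumerate(path)]))
    # No valid path means zero probability, hence an infinite loss.
    return float(-np.log(total)) if total > 0 else float("inf")


for target in ([0, 1], [1, 0], [1, 1], [0, 0]):
    print(target, brute_force_ctc_loss(LOGITS, target))
```

If the assumptions hold, the printed values should agree with the losses in the sample output (2.61521, 1.59766 and inf): for [1, 1] and [0, 0] no path of length 2 decodes to the target, so the summed probability is zero.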

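To conclude, you have to allow for at least one additional blank for each repeated character in the input text. Maybe this article helps with understanding CTC: https://towardsdatascience.com/3797e43a86c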

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/52543267/
