Using TensorFlow 1.14.


import functools

import tensorflow as tf

model, label_length_ts, pred_length_ts, y_true_input_ts = build_model_v1(config["model_input_w"], config["model_input_h"], config["model_input_ch"], class_size, max_str_len)

ctc_loss_prepare_fn = functools.partial(ctc_loss, input_length=pred_length_ts, label_length=label_length_ts, real_y_true_ts=y_true_input_ts)

model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.0001), loss=ctc_loss_prepare_fn)

def build_model_v1(input_width, input_height, input_channels, class_size, max_str_len):
    """
    :param input_width:
    :param input_height:
    :param input_channels:
    :param class_size: including pseudo blank
    :param max_str_len: longest label length in the dataset
    """
    input = tf.keras.layers.Input((input_height, input_width, input_channels),name="img_input")

    label_length_input = tf.keras.layers.Input((1,),name="label_length_input")

    pred_length_input = tf.keras.layers.Input((1,),name="pred_length_input")

    y_true_input = tf.keras.layers.Input((max_str_len,), name="y_true_input")

    output = conv_bn_actv(input, 8, (5,5), 1, name="down_0")

    output = tf.keras.layers.MaxPooling2D(name="pool_0")(output)

    output = conv_bn_actv(output, 16, (5,5), 1, name="down_1")

    output = tf.keras.layers.MaxPooling2D(name="pool_1")(output)

    output = conv_bn_actv(output, 32, (3,3), 1, name="down_2")

    output = conv_bn_actv(output, 64, (3,1), 1, name="down_3")


    # collapse the height dimension (assumed to be reduced to 1 by the conv/pool stack):
    # (batch, 1, w, c) -> (batch, w, c)
    output = tf.keras.layers.Reshape((int(output.shape[2]), int(output.shape[3])))(output)

    # create rnn

    output = tf.keras.layers.CuDNNLSTM(100, return_sequences=True, name="lstm_0")(output)

    output = tf.keras.layers.CuDNNLSTM(100, return_sequences=True, name="lstm_1")(output)

    output = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(class_size, activation="linear"), input_shape=output.shape, name="timedist_dense")(output)

    y_pred = tf.keras.layers.Softmax()(output)

    model = tf.keras.Model(inputs=[input, pred_length_input, label_length_input, y_true_input],outputs=y_pred)

    return model, label_length_input, pred_length_input, y_true_input
import tensorflow as tf

def ctc_loss(y_true, y_pred, input_length, label_length, real_y_true_ts):
    # The y_true handed in by Keras is a dummy; the real labels arrive
    # through the model input tensor real_y_true_ts.
    return tf.keras.backend.ctc_batch_cost(real_y_true_ts, y_pred, input_length, label_length)

tf.keras.backend.ctc_batch_cost uses the tensorflow.python.ops.ctc_ops.ctc_loss function, which has a preprocess_collapse_repeated parameter. In some threads, it is suggested that this parameter should be set to True when tf.keras.backend.ctc_batch_cost does not seem to work, such as when the loss fails to converge. However, my experience is that although setting this parameter to True may give the user the illusion that the loss is decreasing, the model is not actually training as the user intends. Please check the docs on this parameter. For most cases, the vanilla tf.keras.backend.ctc_batch_cost function is good enough.
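To see why preprocess_collapse_repeated=True trains the wrong target, here is a toy sketch of the collapsing behaviour. The real op works on sparse integer label tensors; this string version is purely illustrative:

```python
def collapse_repeated(label):
    """Collapse consecutive duplicate symbols, e.g. 'good' -> 'god'."""
    out = []
    for ch in label:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

print(collapse_repeated("good"))  # god -- the double 'o' is gone
print(collapse_repeated("hat"))   # hat -- labels without repeats are unaffected
```

With this preprocessing, the model is effectively trained to predict "god" whenever the image says "good", which is why the loss can decrease while the model learns the wrong labels.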

Input Sequence / Label Sequence

These two terms, which keep appearing in the documentation of CTC-related functions and in papers, are very confusing. Even once you get the general idea of why the two need to be separated, there are several confusing moments when determining the shape and type of each tensor while coding. I think what most people, including myself a few days ago, ultimately want to know is: “exactly what shape/type of tensor do I need to pass to tf.keras.backend.ctc_batch_cost?”

If we look at the docs:

  • y_true: tensor (samples, max_string_length) containing the truth labels.
  • y_pred: tensor (samples, time_steps, num_categories) containing the prediction, or output of the softmax.
  • input_length: tensor (samples, 1) containing the sequence length for each batch item in y_pred.
  • label_length: tensor (samples, 1) containing the sequence length for each batch item in y_true.

Here I will explain with an example.

Assume I’m training a CRNN, which is what the code presented above does. I have a dataset of 6 images containing the texts:

[ “hat”, “cat”, “mouse”, “deer”, “tensorflow”, “good” ]

Assume a batch size of 2, and that the output of the convolutional layers gives 25 sequence steps, i.e. 25 time slices that will be fed to the RNN.

Assume the first batch picked [“hat”, “good”].

In this case, the shape of y_true depends on how the user designs the data pipeline. Since the current batch has a max_str_len of 4 (because “good” has four characters), the user can provide y_true with shape (2, 4). Or, since the longest str_len in the whole dataset is 10 (because “tensorflow” has ten characters), the user can provide y_true with shape (2, 10). As long as the max_string_length used in y_true is equal to or larger than the number of characters in the longest word (or label) in the batch, no harm done. This raises the question: “what should fill the ignorable slots in y_true?” Anything. It doesn’t matter. Put in zeros, or -1s, or 73839593 if you like. If you take a closer look at the tf.keras.backend.ctc_batch_cost source code, y_true and label_length are combined to build a sparse tensor, and that process discards the ignorable slots in y_true.
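To make this concrete, here is a small numpy sketch (the a-z to 0..25 index mapping is my own assumption for illustration) showing two equally valid y_true encodings of the batch [“hat”, “good”], one padded with 0 and one with -1:

```python
import numpy as np

def encode(word, max_string_length, fill_value):
    """Map a-z to 0..25 and pad out to max_string_length with fill_value."""
    indices = [ord(c) - ord("a") for c in word]
    padded = np.full((max_string_length,), fill_value, dtype="int32")
    padded[:len(indices)] = indices
    return padded

label_length = np.array([[3], [4]], dtype="int32")

# Both of these are fine: only the first label_length[i] entries of row i
# are ever read when the sparse label tensor is built.
y_true_zeros = np.stack([encode("hat", 10, 0), encode("good", 10, 0)])
y_true_junk = np.stack([encode("hat", 10, -1), encode("good", 10, -1)])

print(y_true_zeros.shape)  # (2, 10)
```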

y_pred should have shape (2, 25, class_size). By the way, class_size is “actual class size + 1”, where the +1 is for the pseudo blank.
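For example, if the labels only use the lowercase letters a-z (an assumption for this sketch), the sizing works out as:

```python
charset = "abcdefghijklmnopqrstuvwxyz"
class_size = len(charset) + 1  # 27: 26 real classes + 1 pseudo blank
blank_index = class_size - 1   # tf.nn.ctc_loss, which ctc_batch_cost wraps,
                               # treats the last index as the blank by default
print(class_size)  # 27
```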

input_length will have shape (2, 1). But what should its values be? This question really gave me a hard time. Should it be equal to label_length? Or should it contain the number of time slices? If so, isn’t that redundant, since the number of time slices is already available from y_pred’s shape? Why does this function require me to specify it? These are the questions that haunted me.

The answer is the latter. Although it does seem odd, the values of input_length would be [ [25], [25] ] in this example: a repetition of the number of time slices (or “sequence length”) coming out of the RNN.

label_length will have shape (2, 1) and, as you might have guessed, it contains the str_len of each label in the batch. For this example batch, the value will be [ [3], [4] ].
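Putting the length tensors together for the [“hat”, “good”] batch, a numpy sketch of the shapes (class_size and time_steps are the example values from above):

```python
import numpy as np

words = ["hat", "good"]
batch_size = len(words)
time_steps = 25  # number of RNN time slices in this example

# input_length repeats the time-slice count for every sample
input_length = np.full((batch_size, 1), time_steps, dtype="int32")

# label_length holds each label's true character count
label_length = np.array([[len(w)] for w in words], dtype="int32")

print(input_length.tolist())  # [[25], [25]]
print(label_length.tolist())  # [[3], [4]]
```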

However, the documentation does not mention one of the most important rules when using CTC loss. It is mentioned in the CTC paper.

The RNN sequence length (or “number of time slices”, which is 25 in this example) should be larger than (2 * max_str_len) + 1. Here max_str_len is the max_str_len across the entire dataset. Since the max_str_len across the entire dataset in this example is 10 (“tensorflow”), and 25 > (2 * 10 + 1) is true, the CTC setup is good to go.
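A tiny sanity check for this rule (the function name is my own):

```python
def ctc_lengths_ok(time_steps, dataset_max_str_len):
    """CTC needs enough time slices to emit every character with a
    blank in between each pair, hence the 2 * U + 1 bound."""
    return time_steps > 2 * dataset_max_str_len + 1

print(ctc_lengths_ok(25, 10))  # True: 25 > 21
print(ctc_lengths_ok(25, 13))  # False: 25 > 27 fails
```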

If this rule is compromised, I’m not sure what side effects will occur, but my guess is that the model will only learn to get part of a long label (or word) correct and will not be able to predict the rest. Or perhaps something worse might happen. I haven’t tested it.

CTC Decoding

When predicting, CTC decoding is required. Although this isn’t the neatest way of doing it, here is roughly how it can be done.


import numpy as np
import tensorflow as tf

pred = model.predict(input_data)

print("pred shape: {}".format(pred.shape))
sequence_length_nparr = np.ones((pred.shape[0],), dtype="int32")
sequence_length_nparr *= 27

print("sequence_length_nparr shape: {}".format(sequence_length_nparr.shape))

# create graph for ctc decoding
batch_size = pred.shape[0]
y_pred_ph = tf.placeholder(tf.float32, shape=pred.shape)
sequence_length_ph = tf.placeholder(tf.int32, shape=(batch_size,))

# ctc_beam_search_decoder expects time-major input: (time_steps, batch, classes)
transposed_pred = tf.transpose(y_pred_ph, perm=[1, 0, 2])
decoded, log_prob = tf.nn.ctc_beam_search_decoder(transposed_pred, sequence_length_ph, merge_repeated=True)

# decoded[0] is a SparseTensor; convert it to a dense tensor, padding with -1
decoded_dense = tf.sparse_tensor_to_dense(decoded[0], default_value=-1)

with tf.Session() as sess:
    decode_output = sess.run(decoded_dense, feed_dict={
        y_pred_ph: pred,
        sequence_length_ph: sequence_length_nparr,
    })

I’m not yet sure I am doing this right, but the code works. This needs to be tested and studied further.
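As a simpler (and TensorFlow-free) alternative to beam search, greedy best-path decoding can be sketched in plain numpy: take the argmax at every time slice, collapse consecutive repeats, then drop blanks. This is my own sketch, not the code from above:

```python
import numpy as np

def greedy_ctc_decode(probs, blank_index):
    """Best-path decode for one sample.
    probs: (time_steps, num_categories) softmax output."""
    best_path = probs.argmax(axis=-1)
    decoded, prev = [], None
    for idx in best_path:
        # keep a symbol only when it changes and is not the blank
        if idx != prev and idx != blank_index:
            decoded.append(int(idx))
        prev = idx
    return decoded

# 5 time slices, 3 classes, class 2 is the blank
probs = np.eye(3)[[0, 0, 2, 1, 1]]
print(greedy_ctc_decode(probs, blank_index=2))  # [0, 1]
```

Greedy decoding is a cruder approximation than ctc_beam_search_decoder, but it is often good enough for a quick check of the predictions.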


Joel · May 10, 2020 at 8:36 pm

Dude thanks so much…. I have run into the exact same problems you have working on speech recognition.

Pedro · July 2, 2020 at 6:37 pm

Thanks! You are just missing the point of input_length, it is not redundant: Rather than the maximum input length (i.e. number of input timesteps, which is 25 in your example), one can provide per-sample input lengths. It is thus not weird, but actually a helpful feature, since it allows for variable length input without masking. In case someone is thinking of using masking instead: Besides the fact that among others Conv2D does not support masking, I doubt that ctc_batch_cost would exploit a mask if one was provided by the softmax layer. I guess it wouldn’t, since it already has that information from input_length.

Ankit Gautam · August 28, 2020 at 11:54 pm

I was stuck on a problem and your article just helped me out. Thanks a lot

Vegas · November 23, 2020 at 5:46 am

Thanks man, this article really helped a lot; I didn’t know where to start implementing the CTC batch loss in training.

Thomas Cook · January 8, 2021 at 9:47 pm

Thanks man! This shall help me get started with my problem

tuan anh nguyen · January 27, 2021 at 12:04 pm

Do you have full test code for speech recognition? on some hundreds wav samples?

    admin · January 28, 2021 at 7:12 pm

    nope. this is a simple demonstration and i have only briefly used ctc loss with text images. i dont have experience working with wav files 🙁
