This article discusses handwritten character recognition (OCR) in images using sequence-to-sequence (seq2seq) mapping performed by a Convolutional Recurrent Neural Network (CRNN) trained with Connectionist Temporal Classification (CTC) loss. This approach is employed in several modern OCR engines for handwritten text (e.g., Google’s Keyboard App, where convolutions are replaced with Bezier interpolations) or typed text (e.g., Tesseract 4’s CRNN-based recognition module).
For the sake of simplicity, the example I’ll be presenting performs only digit recognition but can be easily extended to accommodate more classes of characters.
The overall source code for this project is quite long so I’m providing a Google Colab document that includes a working example.
Previous Inadequacies and Justification
“Why not simply segment characters in the image and recognize them one by one?”
While this approach is indeed more straightforward and has been used in older OCR engines, it has its caveats, especially for handwritten text. Imperfections in the written characters can cause segmentation errors, so the recognizer ends up being fed invalid glyphs or symbols. Consider the following images for clarification:
Although the MNIST problem is considered solved, which implies that reliable classifiers for individual digits can be built, correct segmentation remains a problem in realistic scenarios. Splitting or merging glyphs to form valid digits is a difficult challenge and requires additional knowledge to be embedded into the segmentation module.
In this context, the main advantage of a seq2seq classifier is that it diminishes the impact of erroneous segmentations and takes advantage of the ability of a neural network to generalize. It only requires a valid segmentation of the word or text line in question.
Consider the following simplistic model that has a sliding window or mask (no convolutions) of size (1, img_height). Each set of pixels covered by the sliding window is fed into a neural network made of neurons with memory (e.g., GRU or LSTM); the job of the neural network is to take a sequence of such stripes and output the recognized digits. Take a look at the following figure:
Multiple digits will be included in a single sequence because we’re feeding the network an image that contains more than one digit. It is up to the neural network to determine, during the training phase, how many stripes to take into account when classifying a digit (i.e., how much to memorize). The image below illustrates how an RNN should ‘group’ stripes together in order to recognize each digit in the sequence.
Using this method, it is possible to train a neural network by simply stating that the image above contains the number ‘55207’, without further information (e.g., alignment, delimiters, bounding boxes).
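To make the sliding-window idea concrete, here is a minimal sketch that cuts an image into one-pixel-wide column stripes, which is the sequence the RNN would consume. The image is a toy nested list and `image_to_stripes` is a hypothetical helper name, not part of the project.

```python
def image_to_stripes(image):
    """Split an image (a list of rows) into column stripes of size (1, img_height)."""
    img_height = len(image)
    img_width = len(image[0])
    # Stripe t holds the pixels of column t, read from top to bottom.
    return [[image[row][t] for row in range(img_height)] for t in range(img_width)]


# Toy 3x4 "image": 3 rows (height) and 4 columns (width).
toy_image = [
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
]

stripes = image_to_stripes(toy_image)
print(len(stripes), len(stripes[0]))  # number of stripes, pixels per stripe
print(stripes[1])                     # the second column
```

The sequence length equals the image width, so a wider image simply produces a longer input sequence.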
CTC and Duplicates Removal
CTC loss is most commonly employed to train seq2seq RNNs. It works by summing the probabilities for all possible alignments; the probability of an alignment is determined by multiplying the probabilities of having specific digits in certain slots. An alignment can be seen as a plausible sequence of recognized digits.
Going back to the ‘55207’ example, we can express the probability of the alignment as follows:
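As a sketch in notation assumed here, let $p_t(c)$ be the probability the network assigns to digit $c$ at slot $t$. For an idealized alignment that uses exactly one slot per digit, the product reads:

```latex
P(5,\,5,\,2,\,0,\,7) = p_1(5)\; p_2(5)\; p_3(2)\; p_4(0)\; p_5(7)
```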
To properly remove duplicates and also correctly handle numbers that contain repeating digits, the blank class is introduced, with the following rules:
- 2 (or more) repeating digits are collapsed into a single instance of that digit unless separated by blank - this compensates for the fact that the RNN performs a classification for each stripe that represents a part of a digit (thus producing duplicates)
- multiple consecutive blanks are collapsed into one blank - this compensates for the spacing before, after or between the digits
Given these aspects, there are multiple alignments that, once collapsed, lead to the correct alignment (‘55207’).
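The two rules can be sketched as a small function. `ctc_collapse` is a hypothetical name, and ‘-’ is the symbol chosen here for the blank class; blanks are dropped at the end because they only act as separators.

```python
from itertools import groupby

BLANK = "-"  # symbol chosen here to mark the blank class


def ctc_collapse(alignment):
    """Collapse an alignment: merge repeated symbols, then drop blanks."""
    # groupby yields one key per run of identical consecutive symbols,
    # so "55-55" becomes the runs "5", "-", "5".
    merged = [symbol for symbol, _ in groupby(alignment)]
    # Blanks only served as separators; remove them from the result.
    return "".join(symbol for symbol in merged if symbol != BLANK)


print(ctc_collapse("55-55222--07"))  # prints 55207
print(ctc_collapse("5555222007"))    # prints 5207 (no blank between the fives)
```

The second call shows why the blank is needed: without it, the two fives of ‘55207’ cannot be told apart from a single five spread over several stripes.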
For example, 55-55222--07 collapses to ‘55207’ and therefore suggests the correct sequence, even though it has a different structure because of the additional duplicates and blanks (marked as ‘-’ here). The probability of this alignment is computed as previously shown, but it also includes the probabilities of the blank class:
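Reading the alignment as the twelve slots 5 5 - 5 5 2 2 2 - - 0 7, and writing $p_t(c)$ for the probability of class $c$ at slot $t$ (notation assumed here), the product could be written as:

```latex
P(A) = p_1(5)\, p_2(5)\, p_3(-)\, p_4(5)\, p_5(5)\, p_6(2)\, p_7(2)\, p_8(2)\, p_9(-)\, p_{10}(-)\, p_{11}(0)\, p_{12}(7)
```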
Finally, the CTC probability of a sequence is calculated, as previously mentioned, by summing the probabilities for all different alignments:
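Written out, with $A = (a_1, \dots, a_T)$ an alignment over $T$ slots, $p_t(a_t)$ the probability of class $a_t$ at slot $t$, and $\mathcal{B}$ the collapse operation described above (all symbols are notation assumed here):

```latex
P(Y \mid X) = \sum_{A \,:\, \mathcal{B}(A) = Y} \; \prod_{t=1}^{T} p_t(a_t)
```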
When training, the neural network attempts to maximize this probability for the sequence provided as ground truth.
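For tiny inputs, the sum can be checked by brute force. The sketch below enumerates every possible alignment, keeps those that collapse to the target, and adds up their probabilities. The function names and the toy probabilities are made up for illustration; real CTC implementations use dynamic programming instead, since the number of alignments grows exponentially with the sequence length.

```python
from itertools import groupby, product


def collapse(alignment, blank):
    """Merge repeated classes, then drop blanks."""
    merged = [c for c, _ in groupby(alignment)]
    return tuple(c for c in merged if c != blank)


def ctc_probability(probs, target, blank):
    """Sum the probabilities of every alignment that collapses to target.

    probs[t][c] is the probability of class c at slot t.
    """
    num_classes = len(probs[0])
    total = 0.0
    for alignment in product(range(num_classes), repeat=len(probs)):
        if collapse(alignment, blank) != tuple(target):
            continue
        # Probability of one alignment: product of its per-slot probabilities.
        p = 1.0
        for t, c in enumerate(alignment):
            p *= probs[t][c]
        total += p
    return total


# Toy example: 2 slots; classes 0 and 1 are digits, class 2 is the blank.
probs = [
    [0.6, 0.1, 0.3],
    [0.5, 0.2, 0.3],
]
# Alignments (0,0), (0,blank) and (blank,0) all collapse to [0].
print(ctc_probability(probs, [0], blank=2))  # approximately 0.63
```

Summing `ctc_probability` over every possible target of this toy example gives 1, which is a quick sanity check that no alignment was counted twice or missed.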
A decoding method is used to recover the text from the per-slot digit probabilities; a naive approach is to pick, for each slot in the alignment, the digit with the highest probability and then collapse the result. This approach (best path decoding) is easy to implement and might be enough for this example, although beam search (i.e., keeping the N most probable partial sequences at each step instead of only one) is employed for such tasks in larger projects.
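A minimal sketch of the naive decoding, using plain Python lists instead of tensors; `best_path_decode` is a hypothetical name and the probabilities are made up. Since only the per-slot maximum matters, the same function works on log-probabilities.

```python
from itertools import groupby


def best_path_decode(probs, blank):
    """Pick the most probable class per slot, then collapse the result."""
    # Arg-max over the classes of each slot.
    best = [max(range(len(slot)), key=slot.__getitem__) for slot in probs]
    # Merge repeated classes, then drop blanks.
    merged = [c for c, _ in groupby(best)]
    return [c for c in merged if c != blank]


# 5 slots; classes 0 and 1 are digits, class 2 is the blank.
probs = [
    [0.10, 0.80, 0.10],  # class 1
    [0.20, 0.70, 0.10],  # class 1 again (duplicate, merged)
    [0.10, 0.10, 0.80],  # blank
    [0.10, 0.60, 0.30],  # class 1, kept because a blank separates it
    [0.90, 0.05, 0.05],  # class 0
]
print(best_path_decode(probs, blank=2))  # prints [1, 1, 0]
```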
Including Convolutional Layers
Implementing convolutions in the previously described model simply means that, at the input of the RNN, raw pixel information is replaced with higher-level features. In PyTorch, the output of the convolution layers must be reshaped to the time sequence format (batch_size, sequence_length, gru_input_size), which is what a GRU created with batch_first=True expects.
In the current project, the output of the convolution part has the shape (batch_size, num_channels, convolved_img_height, convolved_img_width). I’m permuting the tensor to (batch_size, convolved_img_width, convolved_img_height, num_channels) and then reshaping the last 2 dimensions into one, which becomes gru_input_size.
To avoid additional steps such as image preprocessing, segmentation and class balancing, I picked a friendlier dataset: EMNIST for digits. The following helper script randomly picks digits from the dataset, applies affine augmentations and concatenates them into sequences of a given length.
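A minimal sketch of the concatenation step only, under simplifying assumptions: images are plain nested lists, the affine augmentations and the EMNIST loading are left out, and `make_sequence` is a hypothetical name rather than the project’s helper.

```python
import random
from itertools import chain


def make_sequence(images, labels, seq_len, rng):
    """Concatenate seq_len randomly chosen digit images side by side.

    images: list of 2D images (lists of rows), all with the same height.
    labels: the digit shown in each image.
    Returns (wide_image, label_sequence).
    """
    picks = [rng.randrange(len(images)) for _ in range(seq_len)]
    height = len(images[0])
    # Build each output row by joining the matching rows of the picked images.
    wide = [list(chain.from_iterable(images[i][r] for i in picks))
            for r in range(height)]
    return wide, [labels[i] for i in picks]


# Toy stand-ins for dataset samples: a 2x2 "zero" and a 2x2 "one".
toy_images = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]
toy_labels = [0, 1]

image, target = make_sequence(toy_images, toy_labels, seq_len=5,
                              rng=random.Random(0))
print(len(image), len(image[0]), target)  # height, width, label sequence
```

The only label stored for the wide image is the digit sequence itself, which is all that CTC training needs.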
A LeNet-5 based convolution model is employed, with the following modifications:
- 5x5 filters are replaced with 2 consecutive 3x3 filters
- max-pooling is replaced with strided convolutions
The resulting higher-level features are fed into a Bi-GRU RNN followed by a linear layer that generates the required number of classes (10 digits + 1 blank). I’ve chosen GRU over LSTM since it had similar results but required fewer resources. A log_softmax activation function is used.
The text is decoded using a simple best path algorithm.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CRNN(nn.Module):

    def __init__(self):
        super(CRNN, self).__init__()

        self.num_classes = 10 + 1  # digits 0-9 plus the CTC blank
        self.image_H = 28

        # Convolutional part: pairs of 3x3 convolutions, with strided
        # convolutions (conv3, conv6) taking the place of max-pooling.
        self.conv1 = nn.Conv2d(1, 32, kernel_size=(3,3))
        self.in1 = nn.InstanceNorm2d(32)

        self.conv2 = nn.Conv2d(32, 32, kernel_size=(3,3))
        self.in2 = nn.InstanceNorm2d(32)

        self.conv3 = nn.Conv2d(32, 32, kernel_size=(3,3), stride=2)
        self.in3 = nn.InstanceNorm2d(32)

        self.conv4 = nn.Conv2d(32, 64, kernel_size=(3,3))
        self.in4 = nn.InstanceNorm2d(64)

        self.conv5 = nn.Conv2d(64, 64, kernel_size=(3,3))
        self.in5 = nn.InstanceNorm2d(64)

        self.conv6 = nn.Conv2d(64, 64, kernel_size=(3,3), stride=2)
        self.in6 = nn.InstanceNorm2d(64)

        # Spatial size of the feature map after conv6.
        self.postconv_height = 3
        self.postconv_width = 31

        # Recurrent part: one input vector per feature-map column.
        self.gru_input_size = self.postconv_height * 64
        self.gru_hidden_size = 128
        self.gru_num_layers = 2
        self.gru_h = None

        self.gru = nn.GRU(self.gru_input_size, self.gru_hidden_size, self.gru_num_layers, batch_first = True, bidirectional = True)

        # Bidirectional GRU outputs 2 * hidden_size features per step.
        self.fc = nn.Linear(self.gru_hidden_size * 2, self.num_classes)

    def forward(self, x):
        batch_size = x.shape[0]

        out = self.conv1(x)
        out = F.leaky_relu(out)
        out = self.in1(out)

        out = self.conv2(out)
        out = F.leaky_relu(out)
        out = self.in2(out)

        out = self.conv3(out)
        out = F.leaky_relu(out)
        out = self.in3(out)

        out = self.conv4(out)
        out = F.leaky_relu(out)
        out = self.in4(out)

        out = self.conv5(out)
        out = F.leaky_relu(out)
        out = self.in5(out)

        out = self.conv6(out)
        out = F.leaky_relu(out)
        out = self.in6(out)

        # (batch, channels, height, width) -> (batch, width, height, channels)
        out = out.permute(0, 3, 2, 1)
        # Merge height and channels: (batch, sequence_length, gru_input_size)
        out = out.reshape(batch_size, -1, self.gru_input_size)

        out, gru_h = self.gru(out, self.gru_h)
        self.gru_h = gru_h.detach()
        # Log-probabilities over the classes for every step of the sequence.
        out = F.log_softmax(self.fc(out), dim=2)

        return out

    def reset_hidden(self, batch_size):
        # One hidden state per layer and direction (hence the factor 2).
        h = torch.zeros(self.gru_num_layers * 2, batch_size, self.gru_hidden_size)
        self.gru_h = h


crnn = CRNN()
# Class index 10 is the blank. nn.CTCLoss expects log-probabilities shaped
# (sequence_length, batch_size, num_classes), so the model output has to be
# permuted accordingly before it is passed to the criterion.
criterion = nn.CTCLoss(blank=10, reduction='mean', zero_infinity=True)
optimizer = torch.optim.Adam(crnn.parameters(), lr=0.001)
```
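The hard-coded postconv_height = 3 and postconv_width = 31 can be checked with the standard output-size formula for an unpadded convolution. The sketch below assumes that each input image is 28 pixels high and 5 × 28 = 140 pixels wide (five concatenated digits); that width is an assumption inferred from the ‘55207’ example, and `conv_out` and `postconv_shape` are hypothetical helper names.

```python
def conv_out(size, kernel=3, stride=1):
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


def postconv_shape(height, width):
    """Feature-map size after the six 3x3 convolutions of the CRNN class."""
    # Strides of conv1..conv6; none of the layers uses padding.
    for stride in (1, 1, 2, 1, 1, 2):
        height = conv_out(height, stride=stride)
        width = conv_out(width, stride=stride)
    return height, width


# Assumption: 28 pixels high, 5 digits * 28 pixels = 140 pixels wide.
post_h, post_w = postconv_shape(28, 140)
print(post_h, post_w)  # prints 3 31
print(post_h * 64)     # gru_input_size: prints 192
```

Under this assumption the GRU receives 31 time steps of 192 features each, so 31 is also the maximum number of slots available to CTC for one image.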
I’ve tested the model using 10,000 generated sequences: 8,000 for training and 2,000 for testing. Below are the plots for training and testing loss and also the evolution of precision, assuming that the dataset is approximately balanced. A true positive (TP) is counted only when the recognized sequence entirely matches the ground truth. The results are not ideal, but I think the current model represents a decent starting point for larger projects.
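The matching criterion can be sketched as follows; `exact_match_rate` is a hypothetical name and the strings are made-up examples. A sequence with a single wrong digit counts as a miss.

```python
def exact_match_rate(predictions, targets):
    """Fraction of sequences that match the ground truth entirely."""
    exact = sum(1 for p, t in zip(predictions, targets) if p == t)
    return exact / len(targets)


predicted = ["55207", "1234", "907"]
expected = ["55207", "1284", "907"]  # the second sequence differs in one digit
rate = exact_match_rate(predicted, expected)
print(rate)  # 2 of the 3 sequences match
```

This is stricter than per-digit accuracy, which is worth keeping in mind when reading the plots.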
The CRNN shows some overfitting, but the results are acceptable considering its purpose.
After 6 epochs, the CRNN successfully recognizes 7567 out of 8000 sequences in the training set (about 94.6%) and 1776 out of 2000 in the testing set (88.8%).