# PyTorch CRNN: Seq2Seq Digits Recognition w/ CTC

This article discusses handwritten character recognition (**OCR**) in images using *sequence-to-sequence* (**seq2seq**) mapping performed by a *Convolutional Recurrent Neural Network* (**CRNN**) trained with *Connectionist Temporal Classification* (**CTC**) loss. The aforementioned approach is employed in multiple modern OCR engines for handwritten text (e.g., Google’s Keyboard App - convolutions are replaced with Bezier interpolations) or typed text (e.g., Tesseract 4’s CRNN Based Recognition Module).

For the sake of simplicity, the example I’ll be presenting performs only digit recognition but can be easily extended to accommodate more classes of characters.

##### The overall source code for this project is quite long so I’m providing a Google Colab document that includes a working example.

## Previous Inadequacies and Justification

“Why not simply segment characters in the image and recognize them one by one?”

While the approach is, indeed, more straightforward and has been incorporated in older OCR engines, it has its caveats, especially when considering handwritten text. These are caused by the imperfections of the written characters which can produce segmentation issues thus attempting to recognize invalid glyphs or symbols. Consider the following images for clarification:

Whereas the MNIST problem is considered solved thus implying that reliable classifiers can be constructed to individually recognize digits, the problem of correct segmentation still remains in realistic scenarios. Splitting or merging glyphs to form valid digits proves to be a difficult challenge and requires additional knowledge to be embedded into the segmentation module.

## Seq2Seq Classifications

In this context, the main advantage brought by a **seq2seq** classifier is that it diminishes the impact of erroneous segmentations and takes advantage of the ability of a neural network to generalize. It only requires a valid segmentation of the word or text line in cause.

Consider the following simplistic model that has a **sliding window** or **mask** (no convolutions), of size `(1, img_height)`

. Each set of pixels covered by the sliding window is fed into a neural network made out of neurons with **memory** (e.g., **GRU** or **LSTM**); the job of the neural network is to take a sequence of such stripes and output recognized digits. Take a look at the following figure:

Multiple digits will be included in a single sequence - because we’re feeding the network an image which contains more than a digit. It is up to the neural network to determine during the training phase how many stripes to take into account when classifying a digit (i.e., how much to memorize). The image below illustrates how a RNN should ‘group’ stripes together in order to recognize each digit in the sequence.

Using this method, it is possible to train a neural network by simply saying that the image above contains the numbers ‘**55207**’, without further information (e.g.: alignment, delimitations, bounding boxes etc.)

## CTC and Duplicates Removal

CTC loss is most commonly employed to train seq2seq RNNs. It works by **summing** the **probabilities for all possible alignments**; the **probability of an alignment** is determined by **multiplying** the probabilities of having specific digits in certain slots. An alignment can be seen as a plausible sequence of recognized digits.

Going back to the ‘**55207**’ example, we can express the probability of the alignment as follows:

To properly remove duplicates and also correctly handle numbers that contain repeating digits, the **blank** class is introduced, with the following rules:

- 2 (or more)
**repeating digits**are**collapsed**into a single instance of that digit unless separated by**blank**- this compensates for the fact that the RNN performs a classification for each stripe that represents a part of a digit (thus producing duplicates) - multiple
**consecutive blanks**are**collapsed**into one blank - this compensates for the spacing before, after or between the digits

Given these aspects, there are multiple alignments that, once collapsed, lead to the correct alignment (‘**55207**’).

For example:
**55-55222–07** once collapsed leads to ‘**55207**’ and suggests the correct sequence even though it has a different structure because of additional duplicates and blanks (marked as ‘**-**’ here). The probability of this alignment () is computed as previously shown but it also includes the probabilities of the blank class:

Finally, the CTC probability of a sequence is calculated, as previously mentioned, by summing the probabilities for all different alignments:

When training, the neural network attempts to maximize this probability for the sequence provided as ground truth.

A **decoding** method is used to recover the text from a set of digits probabilities; a naive approach would be to pick, for **each slot** in the **alignment**, the digits with the **highest probability** and the collapse the result. This approach is easier to implement and might be enough for this example although **beam search** (i.e.: greedy approach that picks first N digits with highest probabilities, instead of only one) is employed for such tasks in larger projects.

## Including Convolutional Layers

Implementing convolutions in the previously described model simply implies that raw pixel information is replaced, in the input of the RNN, with higher level features. In PyTorch, the output of the convolution layers must be reshaped to the time sequence format `(batch_size, sequence_length, gru_input_size)`

.

In the current project, the output of the convolution part has the following shape: `(batch_size, num_channels, convolved_img_height, convolved_img_width)`

. I’m permuting the tensor to `(batch_size, convolved_img_width, convolved_img_height, num_channels)`

and then reshaping the last 2 dimensions into one which becomes `gru_input_size`

).

## Dataset Generation

To avoid additional steps such as image preprocessing, segmentation and class balancing I picked a more friendly dataset: **EMNIST** for digits. The following helper script randomly picks digits from the dataset, applies affine augmentations and concatenates them into sequences of a given length.

## CRNN Model

A LeNet-5 based convolution model is employed, with the following modifications:

- 5x5 filters are replaced with 2 consecutive 3x3 filters
- max-pooling is replaced with strided convolutions

The resulted higher level features are fed into a **Bi-GRU** RNN with a **linear** layer in the end that generated the required number of classes (**9** + 1). I’ve chosen **GRU** over **LSTM** since it had similar results but required fewer resources. A `log_softmax`

activation function is used.

The text is decoded using a simple **best path** algorithm.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82

class CRNN(nn.Module):
def __init__(self):
super(CRNN, self).__init__()
self.num_classes = 10 + 1
self.image_H = 28
self.conv1 = nn.Conv2d(1, 32, kernel_size=(3,3))
self.in1 = nn.InstanceNorm2d(32)
self.conv2 = nn.Conv2d(32, 32, kernel_size=(3,3))
self.in2 = nn.InstanceNorm2d(32)
self.conv3 = nn.Conv2d(32, 32, kernel_size=(3,3), stride=2)
self.in3 = nn.InstanceNorm2d(32)
self.conv4 = nn.Conv2d(32, 64, kernel_size=(3,3))
self.in4 = nn.InstanceNorm2d(64)
self.conv5 = nn.Conv2d(64, 64, kernel_size=(3,3))
self.in5 = nn.InstanceNorm2d(64)
self.conv6 = nn.Conv2d(64, 64, kernel_size=(3,3), stride=2)
self.in6 = nn.InstanceNorm2d(64)
self.postconv_height = 3
self.postconv_width = 31
self.gru_input_size = self.postconv_height * 64
self.gru_hidden_size = 128
self.gru_num_layers = 2
self.gru_h = None
self.gru_cell = None
self.gru = nn.GRU(self.gru_input_size, self.gru_hidden_size, self.gru_num_layers, batch_first = True, bidirectional = True)
self.fc = nn.Linear(self.gru_hidden_size * 2, self.num_classes)
def forward(self, x):
batch_size = x.shape[0]
out = self.conv1(x)
out = F.leaky_relu(out)
out = self.in1(out)
out = self.conv2(out)
out = F.leaky_relu(out)
out = self.in2(out)
out = self.conv3(out)
out = F.leaky_relu(out)
out = self.in3(out)
out = self.conv4(out)
out = F.leaky_relu(out)
out = self.in4(out)
out = self.conv5(out)
out = F.leaky_relu(out)
out = self.in5(out)
out = self.conv6(out)
out = F.leaky_relu(out)
out = self.in6(out)
out = out.permute(0, 3, 2, 1)
out = out.reshape(batch_size, -1, self.gru_input_size)
out, gru_h = self.gru(out, self.gru_h)
self.gru_h = gru_h.detach()
out = torch.stack([F.log_softmax(self.fc(out[i])) for i in range(out.shape[0])])
return out
def reset_hidden(self, batch_size):
h = torch.zeros(self.gru_num_layers * 2, batch_size, self.gru_hidden_size)
self.gru_h = Variable(h)
crnn = CRNN()
criterion = nn.CTCLoss(blank=10, reduction='mean', zero_infinity=True)
optimizer = torch.optim.Adam(crnn.parameters(), lr=0.001)

## Results

I’ve tested the model using 10,000 generated sequences: 8,000 for training and 2,000 for testing. Below are the plots for training and testing loss and also the evolution of **precision** - I’m considering that the dataset is approximately balanced. A *true positive* (**TP**) is counted only when the recognized sequence entirely matches the ground truth. The results are not ideal but I think the current model represents a decent starting point for greater projects.

The CRNN manifests some overfitting behavior but the results are acceptable considering its purpose.

After 6 epochs, the CRNN successfully recognizes **7567** out of **8000** sequences in the training set and **1776** out of **2000** from the testing set.