In this work I took a look at Tesseract 4’s performance at recognizing characters from a challenging dataset and proposed a minimalistic convolution-based approach for input image preprocessing that can boost the character-level accuracy from 13.4% to 61.6% (+359% relative change), and the F1 score from 16.3% to 72.9% (+347% relative change) on the aforementioned dataset. The convolution kernels are determined using reinforcement learning; moreover, to simulate the lack of ground truth in realistic scenarios, the training set consists of only 30 images while the testing set includes 10,000.
The dataset in question is called Brno Mobile and contains color photographs of typed text, taken with handheld devices. Blur, low resolution, and variations in contrast and brightness all make the images challenging for an OCR engine.
During this experiment, the out-of-the-box version of Tesseract 4 was used, which implies:
- no retraining of the OCR engine
- no lexicon / dictionary augmentations
- no hints about the language used in the dataset
- no hints about segmentation methods; default (automatic) segmentation is used
- default settings for the recognition engine (LSTM + Tesseract)
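As a rough illustration of what "out of the box" means in practice, the sketch below builds a Tesseract command line that passes no language (`-l`), page segmentation (`--psm`) or engine mode (`--oem`) flags, so the engine falls back to its built-in defaults. The helper names and the `sample.png` file name are hypothetical; this is not the script used in the experiment.

```python
import subprocess


def default_tesseract_command(image_path):
    """Build a Tesseract CLI call that relies purely on defaults.

    No -l, --psm or --oem flags are passed, so language, page
    segmentation and engine mode are all left to Tesseract itself.
    """
    # Using "stdout" as the output base makes Tesseract print the text.
    return ["tesseract", image_path, "stdout"]


def run_default_ocr(image_path):
    """Run the command above; requires the tesseract binary on PATH."""
    result = subprocess.run(
        default_tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Hypothetical file name, shown only to illustrate the call shape.
command = default_tesseract_command("sample.png")
```

Calling `run_default_ocr` on an image returns whatever text Tesseract recognizes, which may well be an empty string for difficult inputs.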
Tesseract 4 performs well on favorable datasets, achieving a good balance between precision and recall. Presumably, this evaluation was performed on images that resemble scanned documents or book pages (with or without additional preprocessing), in which camera-caused distortions are minimal. Tests on the Brno dataset showed much worse performance, which is discussed later in the article.
In the above figure, high precision indicates a favorable ratio of true positives to false positives, revealing proper differentiation between characters (i.e. a relatively small number of misclassifications). Despite this, almost no improvement in recall can be observed when switching from the base classification method to the Long Short-Term Memory (LSTM) based Convolutional Recurrent Neural Network (CRNN) for sequence-to-sequence mapping.
“Despite being designed over 20 years ago, the current Tesseract classifier is incredibly difficult to beat with so-called modern methods.” - Ray Smith, author of Tesseract
I assume that further training for different fonts would not provide significant improvements, and neither would a different classifier model. Is there a chance that the classifier doesn’t receive the correct input?
It was pointed out in a previous article that Tesseract is not robust to noise; certain salt-and-pepper noise patterns disrupt the character recognition process, leading to large segments of text being completely ignored by the OCR engine - the infamous empty string. From empirical observations, these errors seem to occur either for a whole word or sentence or not at all, which suggests a weakness in the segmentation methodology.
Whether similar behavior occurs for images with more natural distortions is an open question - hence this experiment.
Since analyzing Tesseract’s segmentation methods is a daunting task, I opted for an adaptive external image correction method. To avoid diving into Tesseract 4’s source code, the OCR engine is treated as a black box, so the learning method can rely only on the engine’s outputs. This eases transitions to other OCR engines, since the method doesn’t depend on a concrete implementation but only on outputs - at the cost of processing power and optimality.
The solution consists of preprocessing images directly before they are fed to Tesseract 4. An adaptive preprocessing operation is required in order to properly compensate for any image features that cause problems in the segmentation process. In other words, an input image must be adapted so that it complies with Tesseract 4’s preferences and maximizes the chance of producing the correct output, preferably without down-sampling.
I chose a convolution-based approach for flexibility and speed; other articles tend to perform more rigid image adjustments (such as global changes in brightness, fixed-constant conversion to grayscale, histogram equalization, etc.). I preferred an approach that can learn to highlight or mask regions of the image according to various features. For this, the kernels are optimized through reinforcement learning with an actor-critic model. More specifically, the approach relies on Twin Delayed Deep Deterministic Policy Gradient (TD3 for short) to discover features which minimize the Levenshtein distance between the recognized text and the ground truth. I won’t dive into the implementation details of TD3 here, as that would be somewhat out of scope, but think of it as a method of optimizing the following formula:
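$$\min_{K} \sum_{(x,\, y)} \mathrm{Lev}\big(\mathrm{OCR}(K * x),\; y\big)$$
where $K$ is a kernel, $(x, y)$ is an (image, ground-truth text) tuple from the training set, $K * x$ is the convolved image, and $\mathrm{Lev}$ is the Levenshtein distance.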
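The objective described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the training code: `preprocess` and `ocr` are hypothetical callables standing in for the convolutional preprocessor and the black-box OCR engine, and the toy "images" are plain strings so the example runs without any OCR installed.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def objective(preprocess, ocr, training_set):
    """Total edit distance over (image, ground_truth) pairs; lower is better."""
    return sum(
        levenshtein(ocr(preprocess(image)), truth)
        for image, truth in training_set
    )


# Stand-ins for illustration only: the "OCR" returns its input unchanged.
identity = lambda image: image
toy_training_set = [("kitten", "sitting"), ("", "abc")]
score = objective(identity, identity, toy_training_set)  # 3 + 3 = 6
```

In a reinforcement learning setup, the negative of such a score could serve as the reward signal, although the exact reward shaping used with TD3 is not specified here.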
A short (simpler) proof of concept of the convolutional preprocessor is presented in this Google Colab. It uses a different architecture than the final one and serves to verify that the idea of using convolutions is feasible and offers good results. It compares original and preprocessed images, including the recognized text for each sample.
The final model is illustrated below, with ReLU activations after each convolution to capture nonlinearities and prevent negative pixel values.
To properly compensate for image coloring and reduce the number of channels (R, G, B), 1x1 convolutions are used. This prevents overfitting up to a point while also ensuring grayscale output. Further convolutions are applied only on the grayscale image.
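To make the 1x1 convolution concrete: it looks at one pixel at a time, so it reduces to a weighted sum of that pixel's R, G and B values. The sketch below is a simplified, single-layer illustration in plain Python; the weights are the common luma coefficients, chosen here only as a hypothetical example and not the learned values.

```python
def relu(value):
    """Rectified linear unit: negative values become zero."""
    return value if value > 0.0 else 0.0


def conv1x1_to_gray(image, weights, bias=0.0):
    """Collapse an H x W x 3 image to H x W with one 1x1 convolution + ReLU."""
    return [
        [relu(sum(w * c for w, c in zip(weights, pixel)) + bias)
         for pixel in row]
        for row in image
    ]


# Hypothetical weights (standard luma coefficients), not the trained ones.
weights = (0.299, 0.587, 0.114)
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
gray = conv1x1_to_gray(image, weights)
```

Because only three weights (plus an optional bias) are learned per output channel, such a layer adds very few parameters, which fits the stated goal of limiting overfitting.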
Symmetry constraints are additionally enforced for each 3x3 kernel in order to minimize the number of trainable parameters and avoid overfitting. This means that for a 3x3 kernel only 6 variables out of 9 must be determined while the rest can be generated through mirroring. Below are the values I got for the five kernels (bold to emphasize symmetry):
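The mirroring idea can be sketched as follows. The exact symmetry used is an assumption here: the example mirrors each row left to right, which leaves 6 free values (mirroring across the main diagonal would also leave 6). The zero padding at the borders and the example kernel values are likewise hypothetical choices for illustration.

```python
def mirrored_kernel(params):
    """Expand 6 free parameters into a 3x3 kernel mirrored left to right.

    (a, b, c, d, e, f) -> [[a, b, a], [c, d, c], [e, f, e]]
    """
    if len(params) != 6:
        raise ValueError("expected exactly 6 parameters")
    rows = [params[0:2], params[2:4], params[4:6]]
    return [[left, centre, left] for left, centre in rows]


def pixel_or_zero(image, y, x):
    """Read a pixel, treating everything outside the image as zero."""
    inside = 0 <= y < len(image) and 0 <= x < len(image[0])
    return image[y][x] if inside else 0.0


def conv3x3_relu(image, kernel):
    """Size-preserving 3x3 convolution on a grayscale image, then ReLU.

    Implemented as cross-correlation, as deep learning libraries do.
    """
    output = []
    for y in range(len(image)):
        row = []
        for x in range(len(image[0])):
            total = sum(
                kernel[i][j] * pixel_or_zero(image, y + i - 1, x + j - 1)
                for i in range(3) for j in range(3)
            )
            row.append(max(total, 0.0))
        output.append(row)
    return output


# Hypothetical parameters that happen to produce a sharpening kernel.
kernel = mirrored_kernel((0, -1, -1, 5, 0, -1))
flat = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
result = conv3x3_relu(flat, kernel)
```

On the uniform test image the centre pixel is unchanged while border pixels are amplified, a small reminder that padding choices can introduce artifacts near the image border.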
I extracted the image from each convolution layer and clamped its values to the 0-255 interval to properly view each transformation:
I used 10,000 images from the testing set for the evaluation of the current methodology and compiled the following graphs. The differences between original and preprocessed samples are illustrated with three metrics of interest: Character Error Rate (CER), Word Error Rate (WER) and Longest Common Subsequence Error (LCSE). In this article, LCSE is computed as follows:
Additionally, I plotted everything in histogram format to see the distributions of errors. For CER and WER, it is easy to observe the spikes around 1 (100%), which suggest that the aforementioned segmentation problem (at block-of-text level) produces the most frequent error: empty strings are returned, so all characters are wrong. In certain situations, the WER is larger than 1 because the preprocessing step introduces artifacts near the border of the image, leading to the recognition of non-existent characters. In the LCSE plot, a distribution shift can be seen from the original, approximately Gaussian shape with its peak (mode) near the average number of characters in an image (56.95) to a more favorable shape with overall lower error rates.
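The behaviors described above (errors of exactly 1 for empty outputs, WER above 1 when spurious tokens appear) can be reproduced with a small sketch of the three metrics. CER and WER follow their usual edit-distance definitions. The LCSE definition used here, ground-truth length minus the length of the longest common subsequence, is an assumption chosen because it yields an error measured in characters; it may differ from the exact formula used for the plots.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        current = [i]
        for j, h in enumerate(hypothesis, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (r != h)))
        previous = current
    return previous[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def cer(truth, recognized):
    """Character Error Rate; assumes a non-empty ground truth."""
    return edit_distance(truth, recognized) / len(truth)


def wer(truth, recognized):
    """Word Error Rate; assumes the ground truth has at least one word."""
    words = truth.split()
    return edit_distance(words, recognized.split()) / len(words)


def lcse(truth, recognized):
    """Assumed LCSE: ground-truth characters missing from the LCS."""
    return len(truth) - lcs_length(truth, recognized)


empty_cer = cer("hello world", "")                   # 1.0: nothing recognized
noisy_wer = wer("hello world", "| hello world _ .")  # 1.5: three spurious tokens
empty_lcse = lcse("hello world", "")                 # 11: every character missed
```

Under this assumed definition, an empty output gives an LCSE equal to the text length, which would be consistent with the original distribution peaking near the average number of characters per image.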
A numeric comparison is presented below:
| Metric | Original (Avg.) | Preprocessed (Avg.) |
Significant improvements can be observed through this preprocessing operation. Moreover, the majority of errors probably do not occur in the sequence-to-sequence classifier, since that would mean every recognized character is wrong, which contradicts the earlier performance analysis. A page-segmentation issue in automatic mode seems more plausible. The results show that an array of convolutions is sufficient, in this case, to decrease error rates substantially.
The OCR performance on the preprocessed images is better overall but not good enough to be reliable. A 38% character error rate is still a large setback. I’m pretty sure that better recognition results can be obtained with more fine-tuning, a more complex architecture for the convolutional preprocessor and a more diverse training set. However, the current implementation is already very slow to train, which makes me question whether the entire methodology is feasible from this point of view.