Systems and methods herein generally relate to automated recognition systems and more particularly to optical character recognition (OCR) and similar systems.
Representing word images and text strings in a common space with an associated similarity metric allows the recognition of a word image to be cast as a retrieval problem: given a word image, a system can rank all possible transcriptions (i.e., the lexicon) and use the one most similar to the word image as the predicted transcription. This is very useful if the system has advance knowledge of the set of possible valid transcriptions and this set is small. For example, a system may be interested in finding keywords related to shops in street images, or in recognizing license plates when the correct transcription is known to be in a small validated set.
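The retrieval formulation described above can be illustrated with a minimal sketch. It assumes that word images and text strings have already been embedded in the common space and that the similarity metric is a dot product; the three-dimensional vectors, the lexicon entries, and the function names `dot` and `rank_lexicon` are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors (assumed metric)."""
    return sum(a * b for a, b in zip(u, v))


def rank_lexicon(image_embedding, lexicon_embeddings):
    """Return all lexicon transcriptions, most similar to the image first."""
    scored = [
        (dot(image_embedding, text_embedding), transcription)
        for transcription, text_embedding in lexicon_embeddings.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [transcription for _, transcription in scored]


# Toy, made-up embeddings of a small validated lexicon (illustration only).
lexicon_embeddings = {
    "ABC123": [0.9, 0.1, 0.0],
    "XYZ789": [0.1, 0.8, 0.1],
    "AAA111": [0.5, 0.5, 0.0],
}

# Toy, made-up embedding of a word image in the same space.
image_embedding = [0.85, 0.15, 0.0]

ranking = rank_lexicon(image_embedding, lexicon_embeddings)
predicted = ranking[0]  # top-ranked transcription is used as the prediction
```

The cost of this approach grows linearly with the size of the lexicon, since every transcription must be embedded and scored against the image.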
Unfortunately, in other important domains, such as unconstrained license plate recognition, where the space of possible transcriptions is vast, encoding and ranking all possible transcriptions is not feasible. As an illustrative example, there are 36^8 ≈ 3×10^12 (about 3 trillion) possible license plates of 8 characters using only the 10 digits and the 26 letters of the Latin alphabet. Representing all these license plate transcriptions using text features of 192 dimensions would require almost 2000 terabytes, and computing the similarities between a single image and all the license plate transcriptions would take approximately 32 hours. Therefore, a solution is sought that exploits the benefits of describing word images with global signatures but can still perform recognition with potentially infinite lexicons.
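The orders of magnitude in the example above can be checked with a short calculation. The sketch assumes that each of the 192 feature dimensions is stored as a 4-byte value and that a terabyte is counted as 2^40 bytes; neither assumption is stated in the text. The throughput figure is simply derived from the stated 32 hours and is not a measured value.

```python
ALPHABET_SIZE = 10 + 26      # digits plus Latin letters
PLATE_LENGTH = 8             # characters per license plate
DIMENSIONS = 192             # text feature dimensions
BYTES_PER_VALUE = 4          # assumption: 32-bit values per dimension
RANKING_SECONDS = 32 * 3600  # approximately 32 hours, as stated

# Number of distinct 8-character transcriptions: 36 to the power 8.
num_plates = ALPHABET_SIZE ** PLATE_LENGTH

# Storage needed to hold one feature vector per transcription.
storage_bytes = num_plates * DIMENSIONS * BYTES_PER_VALUE
storage_tib = storage_bytes / 2**40  # assumption: binary terabytes

# Similarity computations per second implied by the 32-hour figure.
implied_rate = num_plates / RANKING_SECONDS

print(f"transcriptions: {num_plates:.3e}")
print(f"storage: {storage_tib:.0f} TiB")
print(f"implied similarities per second: {implied_rate:.3e}")
```

Under these assumptions the storage comes to just under 2000 terabytes, which agrees with the figure given above.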