The exemplary embodiment relates to text recognition and image retrieval based on semantic information and finds particular application in connection with assigning semantic labels to word images and to recognition of word images corresponding to semantic labels.
There has been an increasing interest in text recognition and retrieval in natural images. See, e.g., D. Karatzas, et al., “ICDAR 2013 Robust Reading Competition,” ICDAR, pp. 1484-1493, 2013; K. Wang, et al., “End-to-end Scene Text Recognition,” ICCV, pp. 1457-1464, 2011, hereinafter, Wang 2011. For example, given an image of a word, the word may be recognized, either using a list of possible transcriptions or in an unconstrained manner. See, e.g., J. Almazán, et al., “Word spotting and recognition with embedded attributes,” TPAMI, pp. 2552-2566, 2014, hereinafter, Almazán 2014; A. Gordo, “Supervised mid-level features for word image representation,” CVPR, pp. 2956-2964, 2015, hereinafter, Gordo 2015; Wang 2011; A. Bissacco, et al., “PhotoOCR: Reading Text in Uncontrolled Conditions,” ICCV, pp. 785-792, 2013; M. Jaderberg, et al., “Deep structured output learning for unconstrained text recognition,” ICLR, 2015, hereinafter, Jaderberg 2015. Typically, a candidate word is first localized in an image, then cropped and recognized. Word image retrieval may also be performed, which is essentially the reverse process: given a query, which can be either a text string or another word image, the aim is to retrieve the relevant word images in a dataset (Almazán 2014, Gordo 2015).
In all these cases, the goal has been to retrieve or recognize exactly the same word used as a query, without taking semantics into consideration. For example, a query image that includes the word phoenix would be transcribed as phoenix, without any consideration of its meaning. Similarly, using the text string restaurant as a query would retrieve only images containing that word.
There are several applications, however, where it would be advantageous to consider semantics in word image recognition and retrieval. For example, it could be useful to capture the semantic meanings of the word phoenix as both a city and a state capital, and also its semantic meaning as a mythical being. Semantics plays a role in scene understanding and in interpreting scene text, particularly in urban scenarios. To illustrate, given a street scene, it could be useful to classify building faces into different business classes, such as restaurants, hotels, and banks. While the presence of a sign reading pizzeria is indicative of a restaurant, the mere transcription of the text in the sign is not sufficient in itself to deduce this. Additional reasoning capabilities, enabled by an understanding of the semantics of the word, are required to make the classification decision.
A two-step approach for achieving this objective could include first transcribing the word image to identify a character sequence, and then matching the transcription to a lexical resource which defines a set of semantic concepts. However, this approach has some shortcomings. First, it relies on an accurate transcription of the word images. Although word image recognition has improved significantly in recent years, the results are still not perfect, particularly when word images are not cropped exactly. Second, the approach cannot deal with out-of-vocabulary words. Even if a word is transcribed correctly, if the word does not appear in the lexical resource, it will not be possible to assign concepts to it. Finally, this approach does not lead to a compact representation of word images that encodes semantics.
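By way of illustration only, the second step of such a two-step approach can be sketched as an exact lookup of the transcribed string in a lexical resource. The lexicon contents, the function name concepts_for_transcription, and the sample strings below are hypothetical and are not taken from any particular system; the transcription step itself is assumed to have been performed by some recognizer and is represented here simply by its output string.

```python
from typing import Dict, FrozenSet

# Hypothetical toy lexical resource mapping words to semantic concepts.
LEXICON: Dict[str, FrozenSet[str]] = {
    "phoenix": frozenset({"city", "state capital", "mythical being"}),
    "pizzeria": frozenset({"restaurant"}),
    "restaurant": frozenset({"restaurant"}),
}


def concepts_for_transcription(
    transcription: str, lexicon: Dict[str, FrozenSet[str]] = LEXICON
) -> FrozenSet[str]:
    """Step 2: exact, case-insensitive lookup of a transcription in the lexicon.

    Returns an empty set when the string is not an entry in the lexicon.
    """
    return lexicon.get(transcription.strip().lower(), frozenset())


if __name__ == "__main__":
    # Correct transcription of an in-vocabulary word: concepts are found.
    print(concepts_for_transcription("Pizzeria"))
    # A single-character recognition error: no concepts can be assigned.
    print(concepts_for_transcription("pizzerla"))
    # Correctly transcribed but out-of-vocabulary word: no concepts either.
    print(concepts_for_transcription("trattoria"))
```

In this sketch, both the mis-transcribed string and the out-of-vocabulary word yield an empty set of concepts, which corresponds to the first two shortcomings noted above. The output is also a set of labels rather than a compact representation of the word image itself.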
Therefore, there remains a need for a system and method for recognition and retrieval based on word images that does not suffer from such disadvantages.