This disclosure relates to learning word image representations, i.e., given an image of a word, finding a descriptive and robust, fixed-length representation of the word. Machine learning techniques can then be used on top of these representations to produce models useful for word retrieval or recognition tasks. Although the focus of this disclosure is recognizing text in the wild, e.g., street signs, this disclosure and the exemplary embodiments provided herein have broad applications and can be used for the recognition of standard printed text, handwritten text, license plates, etc.
In recent years, a new trend in word image recognition and retrieval has been to describe word images with global representations built from standard computer vision features, e.g., HOG (Histogram of Oriented Gradients) or SIFT (Scale-Invariant Feature Transform) features aggregated with bag-of-words encodings, see G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual Categorization with Bags of Keypoints”, in ECCV Workshop on Statistical Learning in Computer Vision, 2004, or with Fisher vector encodings, see F. Perronnin and C. R. Dance, “Fisher Kernels on Visual Vocabularies for Image Categorization”, in CVPR, 2007. Different frameworks and machine learning techniques, such as attribute representations, metric learning, or exemplar SVMs (Support Vector Machines), are then applied on top of these global representations to learn models that perform tasks such as recognition, retrieval, or spotting. See F. Perronnin and J. A. Rodriguez-Serrano, “Fisher kernels for handwritten word-spotting”, in ICDAR, 2009; M. Rusiñol, D. Aldavert, R. Toledo, and J. Lladós, “Browsing heterogeneous document collections by a segmentation-free word spotting method”, in ICDAR, 2011; J. Almazán, A. Gordo, A. Fornés, and E. Valveny, “Efficient exemplar word spotting”, in BMVC, 2012; D. Aldavert, M. Rusiñol, R. Toledo and J. Lladós, “Integrating Visual and Textual Cues for Query-by-String Word Spotting”, in ICDAR, 2013; J. Almazán, A. Gordo, A. Fornés, and E. Valveny, “Handwritten word spotting with corrected attributes”, in ICCV, 2013; J. A. Rodriguez-Serrano and F. Perronnin, “Label embedding for text recognition”, in BMVC, 2013; J. Almazán, A. Gordo, A. Fornés, and E. Valveny, “Word spotting and recognition with embedded attributes”, Technical report, 2014.
This disclosure diverges from traditional approaches that mostly focus on detecting and localizing individual characters in a word image and then using that information to infer the contents of the word, using for example Conditional Random Fields and language priors. See A. Mishra, K. Alahari, and C. V. Jawahar, “Scene text recognition using higher order language priors”, in BMVC, 2012; A. Mishra, K. Alahari, and C. V. Jawahar, “Top-down and bottom-up cues for scene text recognition”, in CVPR, 2012; A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, “PhotoOCR: Reading Text in Uncontrolled Conditions”, in ICCV, 2013; and L. Neumann and J. Matas, “Scene Text Localization and Recognition with Oriented Stroke Detection”, in ICCV, 2013.
The global approaches have important advantages: they do not require words annotated with character bounding boxes for training, and they do not require explicit localization of characters inside the words at testing time, which is slow and error prone. Global approaches can also produce compact signatures that are relatively fast to compute, store, index, and compare, while still obtaining very competitive results in many tasks. The use of off-the-shelf computer vision features and machine learning techniques also makes them very attractive, since they are usually very easy to implement. Yet, it may be argued that learning more semantic features that specialize in text, instead of directly using generic features such as SIFT or HOG, can potentially lead to large improvements.
In an orthogonal direction, it has been noted that methods that detect individual characters can yield very impressive results on recognition tasks when trained with huge amounts of words annotated with character bounding boxes. See A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, “PhotoOCR: Reading Text in Uncontrolled Conditions”, in ICCV, 2013. However, such methods suffer from all the aforementioned disadvantages: they need very large amounts of words annotated at the character level, they need to localize the characters at test time, and they do not produce a final signature that can readily be used for other tasks such as word retrieval.
This disclosure, and the exemplary embodiments described herein, describe an approach that leverages character bounding box information at training time to construct global image representations learned in a supervised way, as opposed to standard global representations that only involve an unsupervised learning stage. The methods and systems provided involve learning mid-level local features that are correlated with the characters in which they tend to appear. A small external dataset annotated at the character level is used to learn how to transform small groups of locally aggregated low-level features into mid-level semantic features suitable for text analysis. This is achieved by describing blocks of images using two modalities: a semantic modality, based on the character annotations; and a visual modality, based only on the image. A projection of the visual and the semantic modalities into a common subspace is then learned. Also described is how to apply this transformation efficiently to new images at testing time.
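One way to picture the common-subspace step is regularized canonical correlation analysis (CCA). The text above does not name the technique used, so CCA, the regularization constant `reg`, the function names, the descriptor dimensions, and the synthetic data below are all assumptions made purely for illustration. The sketch learns the two projections from paired block descriptors and then applies only the visual projection to new blocks, as would be needed at testing time when no character annotations are available.

```python
import numpy as np


def fit_common_subspace(X, Y, dim, reg=1e-3):
    """Regularized CCA (an assumed technique, not confirmed by the text).

    X: (n, dx) visual descriptors of image blocks.
    Y: (n, dy) semantic descriptors of the same blocks (from character annotations).
    Returns the means, the projection matrices and the top `dim` canonical correlations.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularized covariance and cross-covariance matrices.
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten both modalities with Cholesky factors: T = Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    # Map the whitened directions back to the original descriptor spaces.
    Wx = np.linalg.solve(Lx.T, U[:, :dim])
    Wy = np.linalg.solve(Ly.T, Vt[:dim].T)
    return mx, Wx, my, Wy, s[:dim]


def project_visual(X_new, mx, Wx):
    """Embed new visual block descriptors; no annotations are required."""
    return (X_new - mx) @ Wx


# Synthetic stand-in data: both modalities share a 2-D latent factor.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
X = Z @ rng.normal(size=(2, 8)) + 0.05 * rng.normal(size=(500, 8))
Y = Z @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(500, 5))

mx, Wx, my, Wy, corr = fit_common_subspace(X, Y, dim=2)
mid_level = project_visual(X[:10], mx, Wx)
print(mid_level.shape, corr)
```

Only `mx` and `Wx` are needed once training is done, which is what makes the transformation cheap to apply to unannotated images.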
The disclosed global image representation method and system have the following potential advantages:
Character bounding boxes are leveraged at training time to provide discriminative information that can significantly improve the final accuracy of the system. Global representations have not leveraged this information in the past.
Prior methods that have leveraged character bounding boxes were focused on recognition and did not produce a global image representation, which makes them less useful for other tasks such as text-to-image or image-to-image retrieval. Mid-level features associated with the global image representation provided herein can be aggregated into a compact global representation, for example, but not limited to, one of as few as 96 dimensions, and directly used for image-to-image, text-to-image, and image-to-text retrieval tasks.
The global representation method and system provided herein can be learned with very little annotated data. The best performing method known previously on scene-text recognition requires millions of annotated training samples with bounding boxes. See A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, “PhotoOCR: Reading Text in Uncontrolled Conditions”, in ICCV, 2013. In comparison, the disclosed approach achieves better results with fewer than 2,000 samples annotated with bounding boxes.
Many previous works focused on recognition learn character models for isolated characters. For recent examples, see A. Mishra, K. Alahari, and C. V. Jawahar, “Scene text recognition using higher order language priors”, in BMVC, 2012; A. Mishra, K. Alahari, and C. V. Jawahar, “Top-down and bottom-up cues for scene text recognition”, in CVPR, 2012; A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, “PhotoOCR: Reading Text in Uncontrolled Conditions”, in ICCV, 2013; L. Neumann and J. Matas, “Scene Text Localization and Recognition with Oriented Stroke Detection”, in ICCV, 2013; and C. Yao, X. Bai, B. Shi, and W. Liu, “Strokelets: A Learned Multi-Scale Representation For Scene-Text Recognition”, in CVPR, 2014. However, these methods focus only on recognition and do not produce image signatures. Therefore, these models cannot be easily used for retrieval with image or text queries, which is a requirement for some applications, such as license plate matching.
Other previous work relates to the use of mid-level features, where “blocks” containing some basic semantic information are discovered, learned, or defined. See Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning mid-level features for recognition”, in CVPR, 2010; M. Juneja, A. Vedaldi, C. Jawahar, and A. Zisserman, “Blocks that shout: Distinctive parts for scene classification”, in CVPR, 2013; and C. Yao, X. Bai, B. Shi, and W. Liu, “Strokelets: A Learned Multi-Scale Representation For Scene-Text Recognition”, in CVPR, 2014. The use of mid-level features has been shown to produce large improvements in different tasks. Of those works, the most closely related to this disclosure is that of C. Yao, X. Bai, B. Shi, and W. Liu, “Strokelets: A Learned Multi-Scale Representation For Scene-Text Recognition”, in CVPR, 2014, which learns Strokelets, a mid-level representation that can be understood as “parts” of characters. These are then used to represent characters in a more semantic way. The main distinctions between this previous use of mid-level features and this disclosure are: i) this disclosure exploits supervised information to learn a more semantic representation, and ii) this disclosure does not explicitly classify character blocks, and instead uses the semantic representation to construct a high-level word image signature.
The embedding approaches of J. A. Rodriguez-Serrano and F. Perronnin, “Label embedding for text recognition”, in BMVC, 2013 and J. Almazán, A. Gordo, A. Fornés, and E. Valveny, “Handwritten word spotting with corrected attributes”, in ICCV, 2013 could also be understood as producing supervised mid-level features, but they do not use character bounding box information to do so. According to this disclosure, to learn the semantic space, supervised dimensionality reduction of local Fisher vectors is performed, and the Fisher vectors are then encoded and aggregated into a global Fisher vector. This could be understood as a deep Fisher network for image recognition. See K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep fisher networks for large-scale image classification”, in NIPS, 2013. A main difference is that the goal of K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep fisher networks for large-scale image classification”, in NIPS, 2013 is to produce an extra layer on the Fisher representation, similar in spirit to deep networks; in that work, the supervised dimensionality reduction step is learned using the image labels, the same labels that will be used for the final classification step. According to the disclosed method and system, the goal is instead to transfer knowledge from information that is only available on an external training dataset, i.e., the character bounding boxes, to produce features that are correlated with characters, and to exploit this information in the target datasets.
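The reduce, encode, and aggregate pipeline mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: the projection matrix `P` is a random stand-in for the learned supervised projection, the diagonal GMM parameters are assumed to be given rather than trained, only the gradients with respect to the Gaussian means are kept, and power plus L2 normalization is applied as in a common Fisher vector formulation. All names and dimensions are hypothetical.

```python
import numpy as np


def fisher_vector_means(X, w, mu, sigma):
    """Simplified Fisher vector: gradients w.r.t. the GMM means only.

    X: (T, D) descriptors; w: (K,) mixture weights;
    mu, sigma: (K, D) means and standard deviations of a diagonal GMM.
    """
    diff = (X[:, None, :] - mu[None]) / sigma[None]            # (T, K, D)
    # Log of weighted Gaussian densities, up to a constant shared by all k.
    log_p = np.log(w) - np.log(sigma).sum(axis=1) - 0.5 * (diff ** 2).sum(axis=2)
    log_p -= log_p.max(axis=1, keepdims=True)                  # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                  # posteriors, (T, K)
    # Aggregate over all descriptors: one D-dimensional gradient per Gaussian.
    G = (gamma[:, :, None] * diff).sum(axis=0) / (X.shape[0] * np.sqrt(w)[:, None])
    fv = G.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                     # power normalization
    return fv / max(np.linalg.norm(fv), 1e-12)                 # L2 normalization


rng = np.random.default_rng(0)
blocks = rng.normal(size=(40, 64))       # stand-in for local Fisher vectors of 40 blocks
P = rng.normal(size=(64, 16)) / 8.0      # hypothetical stand-in for the learned projection
reduced = blocks @ P                     # dimensionality reduction step

K = 4                                    # number of Gaussians (assumed)
w = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, 16))
sigma = np.ones((K, 16))

signature = fisher_vector_means(reduced, w, mu, sigma)
print(signature.shape)
```

The resulting signature has a fixed length of `K * D` regardless of how many blocks the word image contains, which is the property that makes a global representation directly comparable across images.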