Text recognition is often applied in the areas of computer vision and pattern recognition, specifically for applications where conversion of visual images to digital text is required. Optical character recognition (OCR) systems that use flatbed scanners to convert paper documents to digital documents rely on this technology to convert images of text to digital text. Mobile vision applications such as translation services on smart-phone devices can also use this technology to translate foreign-language text from a picture that a user takes. In the field of digital map creation, this technology can be used to create digital content from images sampled periodically, such as from vehicle-mounted devices. From these images, the text on storefronts and road signs can be identified and used to create point of interest (POI) information. However, the current approaches to text detection have not provided as much improvement as initially hoped.
Currently, the most successful application of text recognition systems is the document conversion system, which often has an accuracy of over 90%. Beyond the flatbed scanner arrangement, text recognition systems have not been very successful. One reason for this disparity is that natural scene images are captured under unrestricted lighting and view conditions, which diminish text recognition accuracy. One approach to solve this problem has been to employ a natural scene text detection algorithm, which is typically applied to localize the text before any recognition attempts are made. The localized text would then have a better lighting condition and could be better used in the second stage of text recognition or pattern matching. However, this approach has not provided as much improvement as initially hoped.
In very broad terms, text detection can be primarily divided into two separate categories: 1) region based text detection; and 2) connected component-based text detection. In the region based method, a sliding window is applied over the digital image and a test is applied to classify whether the window contains text or not. See for example Y. Zhong, H. Zhang, and A. K. Jain, “Automatic caption localization in compressed video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 4, pp. 385-392, 2000 (using features in the discrete cosine transform space to classify the region); and also X. Chen and A. L. Yuille, “A time-efficient cascade for real-time object detection: With applications for the visually impaired” in CVPR—Workshops, 2005, p. 28 (using intensity, gradients and features; and training an Adaboost algorithm to perform the classification).
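The region based method described above can be illustrated with a minimal sketch. It is not taken from either cited work: the function names, the fixed square window, and the `high_contrast` test are hypothetical stand-ins, and a real system would substitute a trained classifier (for example one operating on DCT features or an Adaboost cascade) for the placeholder test.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Image = Sequence[Sequence[int]]   # grayscale image as rows of pixel intensities
Box = Tuple[int, int, int, int]   # (x, y, width, height)


def sliding_windows(width: int, height: int, win: int, stride: int) -> Iterator[Box]:
    """Yield every win-by-win box that fits fully inside the image."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, win, win)


def crop(image: Image, box: Box) -> List[List[int]]:
    """Copy the pixels covered by the box."""
    x, y, w, h = box
    return [list(row[x:x + w]) for row in image[y:y + h]]


def detect_text_regions(image: Image, win: int, stride: int,
                        is_text: Callable[[List[List[int]]], bool]) -> List[Box]:
    """Apply the text / non-text test to each window and keep the positives."""
    height = len(image)
    width = len(image[0]) if height else 0
    return [box for box in sliding_windows(width, height, win, stride)
            if is_text(crop(image, box))]


def high_contrast(patch: List[List[int]]) -> bool:
    """Hypothetical placeholder classifier: flags windows with a large
    intensity range. It only stands in for a trained text classifier."""
    values = [v for row in patch for v in row]
    return max(values) - min(values) >= 128


image = [
    [0, 255, 10, 10],
    [255, 0, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
regions = detect_text_regions(image, win=2, stride=2, is_text=high_contrast)
```

Because every window position is classified independently, the cost of this approach grows with the number of window positions and scales examined.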
In the connected component approach, the digital image which is being analyzed for text is first transformed into a binary image. Connected components within the image are considered as character candidates. These character candidates are paired and linked to form text lines. The geometric properties of the text lines are typically used to filter out false positives; see for example A. Clavelli and D. Karatzas, “Text Segmentation in Colour Posters from the Spanish Civil War Era”, Int. Conf. on Document Analysis and Recognition, 2009, pp. 181-185; B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform” in CVPR, 2010, pp. 2963-2970 (see also US Patent Application Publication 2009/0285482 by these same three individuals and similarly titled); and also H. Chen, S. S. Tsai, G. Schroth, D. Chen, R. Grzeszczuk, B. Girod, “Robust text detection in natural images with edge-enhanced maximally stable extremal regions,” in ICIP, 2011.
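The stages of the connected component approach (binarization, component extraction, linking into text lines, and geometric filtering) can be sketched as follows. This is an illustrative simplification and not the method of any cited reference: the fixed threshold, the 4-connectivity, the greedy linking rule, and the parameters `max_gap` and `min_chars` are all assumptions chosen for brevity.

```python
from collections import deque
from typing import List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive pixel bounds


def binarize(image: Sequence[Sequence[int]], threshold: int = 128) -> List[List[int]]:
    """Mark pixels darker than the (assumed) threshold as foreground (1)."""
    return [[1 if v < threshold else 0 for v in row] for row in image]


def connected_components(binary: List[List[int]]) -> List[BBox]:
    """Return the bounding box of each 4-connected foreground region."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for sy in range(height):
        for sx in range(width):
            if not binary[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:  # breadth-first flood fill from the seed pixel
                x, y = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def link_into_lines(boxes: List[BBox], max_gap: int = 2) -> List[List[BBox]]:
    """Greedily chain character candidates, left to right, when they overlap
    vertically and are separated by at most max_gap empty columns."""
    lines: List[List[BBox]] = []
    for box in sorted(boxes):
        for line in lines:
            last = line[-1]
            overlaps = box[1] <= last[3] and last[1] <= box[3]
            gap = box[0] - last[2] - 1
            if overlaps and gap <= max_gap:
                line.append(box)
                break
        else:
            lines.append([box])
    return lines


def filter_lines(lines: List[List[BBox]], min_chars: int = 2) -> List[List[BBox]]:
    """Geometric filter: discard candidate lines with too few characters."""
    return [line for line in lines if len(line) >= min_chars]


W = 255  # background intensity
image = [
    [0, 0, W, 0, 0, W, W, W, W],
    [0, W, W, W, 0, W, W, W, W],
    [W, W, W, W, W, W, W, W, W],
    [W, W, W, W, W, W, W, W, W],
    [W, W, W, W, W, W, W, W, 0],
]
candidates = connected_components(binarize(image))
text_lines = filter_lines(link_into_lines(candidates))
```

In this toy image the two neighbouring blobs in the top rows are linked into one candidate line, while the isolated pixel in the bottom corner is rejected as a false positive by the minimum-length filter.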
The work by Epshtein et al. considers a text detection scheme based on the Stroke Width Transform (SWT). Specifically, the Epshtein et al. technique uses a Canny edge detector [see Canny, J., “A Computational Approach To Edge Detection” IEEE Trans. Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986] to find the edges in the image, and then tries to find the two sides of a character stroke by shooting a ray in the gradient direction of each detected edge, forming the character candidates based on the corresponding edges. The technique by H. Chen et al. uses MSERs [see for example J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide baseline stereo from maximally stable extremal regions” in British Machine Vision Conference, 2002, vol. 1, pp. 384-393] as character candidates which are enhanced using Canny edges. This technique by H. Chen et al. also uses a distance transform based method to calculate the stroke width. And finally, Lukas Neumann, Jiri Matas, “Text localization in real-world images using efficiently pruned exhaustive search”, Int. Conf. on Document Analysis and Recognition, 2011, uses an extended MSER region to extend beyond a bi-level processing. The inventors herein consider the technique in Lukas Neumann et al., which localizes the text by an exhaustive search throughout all possible regions, to be too time consuming.
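The ray-shooting step of a stroke width computation can be illustrated with a minimal sketch. It is not the implementation of Epshtein et al.: the edge map is assumed to be already available as a dictionary from pixel coordinates to unit gradient vectors (in practice it would come from an edge detector such as Canny), and the names `stroke_width_along_ray`, `max_steps` and `tolerance` are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Pixel = Tuple[int, int]
Gradient = Tuple[float, float]  # unit-length gradient direction (gx, gy)


def stroke_width_along_ray(edges: Dict[Pixel, Gradient], start: Pixel,
                           max_steps: int = 50,
                           tolerance: float = math.pi / 6) -> Optional[float]:
    """Walk from an edge pixel along its gradient direction until another
    edge pixel is met. If that pixel's gradient is roughly opposite (within
    the assumed angular tolerance), return the distance between the two
    pixels as the stroke width; otherwise return None."""
    gx, gy = edges[start]
    x, y = start
    for step in range(1, max_steps + 1):
        px = int(round(x + gx * step))
        py = int(round(y + gy * step))
        if (px, py) == start:
            continue  # skip the starting pixel itself
        other = edges.get((px, py))
        if other is None:
            continue  # not an edge pixel, keep walking
        # A dot product near -1 means the two gradients face each other.
        dot = gx * other[0] + gy * other[1]
        if dot <= -math.cos(tolerance):
            return math.hypot(px - x, py - y)
        return None  # hit an edge that is not the opposite side of a stroke
    return None  # no edge found within max_steps


# Toy vertical stroke: left edge at x=2, right edge at x=6, for rows 0..2.
edges: Dict[Pixel, Gradient] = {}
for row in range(3):
    edges[(2, row)] = (1.0, 0.0)    # gradient points into the stroke
    edges[(6, row)] = (-1.0, 0.0)   # opposite side points back
width = stroke_width_along_ray(edges, (2, 1))
```

Pixels along a successful ray would then be assigned the measured width, so that regions of near-constant stroke width can be grouped into character candidates.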
What is needed in the art is an improvement for recognizing text in natural scenes captured via digital imaging, and particularly one suitable for use with dynamic applications such as those noted above, including gathering point of interest information (in a smart-phone for example) and creating digital maps (using a vehicle-mounted camera for example).