Identifying text regions in papers that are optically scanned (e.g. by the flatbed scanner of a photocopier) is significantly easier (e.g. due to upright orientation, large size and slow speed) than detecting regions that may contain text in real world scenes captured in images (also called “natural images”) or in video frames, in real time, by a handheld device (such as a smartphone) having a built-in digital camera. Specifically, optical character recognition (OCR) methods of the prior art originate in the field of document processing, wherein the document image contains a series of lines of text (e.g. 30 lines of text) of an optically scanned page in a document.
Document processing techniques, although successfully used on scanned documents created by optical scanners, generate too many false positives and/or negatives to be practical when used on natural images containing text. Hence, detection of text regions in a real world image generated by a handheld camera is performed using different techniques. For additional information on techniques used in the prior art to identify text regions in natural images, see the following articles that are incorporated by reference herein in their entirety as background:
    (a) LI, et al. “Automatic Text Detection and Tracking in a Digital Video,” IEEE Transactions on Image Processing, January 2000, pp. 147-156, vol. 9, no. 1;
    (b) CHEN, et al. “Detecting and reading text in natural scenes,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), 2004, pp. 1-8;
    (c) LEE, et al. “A new methodology for gray-scale character segmentation and recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, October 1996, pp. 1045-1050, vol. 18, no. 10;
    (d) EPSHTEIN, et al. “Detecting text in natural scenes with stroke width transform,” Computer Vision and Pattern Recognition (CVPR) 2010, pp. 2963-2970 (as downloaded from “http://research.microsoft.com/pubs/149305/1509.pdf”); and
    (e) JAIN, et al. “Automatic text location in images and video frames,” Pattern Recognition, 1998, pp. 2055-2076, vol. 31, no. 12.
Image processing techniques of the prior art described above appear to have been developed primarily to identify regions in images that contain text written in English. Use of such techniques to identify, in natural images, regions of text in other languages that use different scripts for the letters of their alphabets can result in false positives and/or negatives so as to render the techniques impractical.
FIG. 1 illustrates a newspaper in a real world scene 100 in India. A user 110 (see FIG. 1) may use a camera-equipped mobile device (such as a cellular phone) 108 to capture an image 107 (also called “natural image” or “real world image”) of scene 100. Camera captured image 107 may be displayed on a screen 106 (FIG. 1) of mobile device 108. Such an image 107 (FIG. 1), if processed directly using prior art image processing techniques, may result in failure to recognize one or more words in a region 103 (see FIG. 1). Specifically, prior art methods can cause problems when used with words that have modifiers, such as a dot located on top of a word, e.g. DOT maatra, expressed in a language such as Hindi that uses the Devanagari script.
Depending on variations in lighting, color, tilt, focus, font, etc., pixels that constitute a dot may or may not be included in a rectangular portion of the image that is being processed by OCR. The dot is just one of over ten (10) accent marks that may be used in the language Hindi. Moreover, the presence of different fonts, in addition to the large number of letters (including conjunct consonants) of the alphabet in Devanagari, requires an OCR decoder to recognize a very large number of characters, resulting in a very complex system with poor recall accuracy.
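The exclusion of a dot from the rectangular portion can be illustrated with a minimal sketch. It is an illustration only, under stated assumptions, and is not a method taken from the cited prior art or from this disclosure. The toy bitmap `TOY_WORD` (not real Devanagari) and the helper names `connected_components`, `bounding_box` and `inside` are hypothetical. The sketch assumes that the rectangle is derived from the largest connected group of ink pixels, in which case a detached dot above the word falls outside the rectangle.

```python
from collections import deque

# Hypothetical toy bitmap: '#' is ink, '.' is background.
TOY_WORD = [
    "....#.....",  # detached dot above the word
    "..........",
    "##########",  # horizontal stroke joining the letters
    ".#..#...#.",
    ".#..#...#.",
    ".##.##..#.",
]

def connected_components(grid):
    """Group ink pixels into components using 8-connectivity."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "#" or (r, c) in seen:
                continue
            queue = deque([(r, c)])
            seen.add((r, c))
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def bounding_box(pixels):
    """Return (top, left, bottom, right), inclusive, for a pixel list."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)

def inside(box, pixel):
    """True if the pixel lies within the inclusive rectangle."""
    top, left, bottom, right = box
    y, x = pixel
    return top <= y <= bottom and left <= x <= right

components = connected_components(TOY_WORD)
body = max(components, key=len)    # largest component: the word body
box = bounding_box(body)           # rectangle that would be passed to OCR
dot_pixels = [p for comp in components if comp is not body for p in comp]
missed = [p for p in dot_pixels if not inside(box, p)]
print("rectangle:", box, "dot pixels outside rectangle:", missed)
```

In this toy case the dot forms its own component, so a rectangle fitted to the word body alone omits it. Whether a real dot is omitted depends on how the rectangle is actually computed and on the capture conditions listed above.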
Accordingly, there is a need to improve identification of words formed by Devanagari (also spelled Devanagiri) characters in a natural image or video frame, as described below.