Identifying text regions in scanned documents (e.g. documents scanned by the optical scanner of a printer or copier) is significantly easier than detecting text regions in images of real-world scenes (also called “natural images”) captured by a handheld camera. Specifically, optical character recognition (OCR) methods of the prior art originate in the field of document processing, in which the document image contains a series of lines of text (e.g. 20 lines of text) from a scanned page of a document. Although document processing techniques are used successfully on documents scanned by optical scanners, they generate too many false positives and/or false negatives to be practical when used on natural images. Hence, text regions in a real-world image captured by a handheld camera are detected using different techniques. For additional information on prior art techniques for identifying text regions in natural images, see the following articles, which are incorporated by reference herein in their entirety as background:
(a) H. Li et al., “Automatic text detection and tracking in digital video,” IEEE Transactions on Image Processing, vol. 9, no. 1, pp. 147-156, 2000;
(b) X. Chen and A. Yuille, “Detecting and reading text in natural scenes,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), pp. 1-8, 2004;
(c) S. W. Lee et al., “A new methodology for gray-scale character segmentation and recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 10, pp. 1045-1050, October 1996;
(d) B. Epshtein et al., “Detecting text in natural scenes with stroke width transform,” Computer Vision and Pattern Recognition (CVPR), pp. 2963-2970, 2010; and
(e) A. Jain and B. Yu, “Automatic text location in images and video frames,” Pattern Recognition, vol. 31, no. 12, pp. 2055-2076, 1998.
The image processing techniques of the prior art described above appear to have been developed primarily to identify regions of images that contain text written in English. Using such techniques to identify, in natural images, regions of text in other languages whose alphabets use different scripts can produce enough false positives and/or false negatives to render the techniques impractical.
FIG. 1A illustrates a real-world newspaper 100 in India. A user 110 (see FIG. 1B) may use a camera-equipped mobile device 108 (such as a cellular phone) to capture an image 107 of newspaper 100. The camera-captured image 107 may be displayed on a screen 106 of mobile device 108. If such an image 107 (FIG. 1C) is processed directly using prior art image processing techniques, one or more regions 103 (see FIG. 1A) may fail to be classified as text. Specifically, text-containing regions of a camera-captured image may be classified as non-text, and vice versa, e.g. due to variations in lighting, color, tilt, focus, etc.
Additionally, the presence in natural images of text written in non-English languages, such as Hindi, can result in false positives/negatives when techniques developed primarily for identifying English text are used to classify regions as text or non-text. For example, although blocks in regions that contain English text may be correctly classified as text (e.g. by a neural network), one or more blocks 103A, 103B, 103C and 103D (FIG. 1C) in a region 103, which contain Hindi text, may be misclassified as non-text (e.g. even when the neural network has been trained on Hindi text).
One or more prior art criteria used by a classifier to identify text in natural images can be relaxed so that blocks 103A-103D are classified as text. However, doing so may cause one or more portions of another region 105 (FIG. 1C) to coincidentally satisfy the relaxed criteria, so that blocks in region 105 are then misclassified as text even though they contain graphics (e.g. pictures of cars in FIG. 1B).
Moreover, when a natural image 107 (FIG. 1C) is processed by a prior art method to form rectangular blocks, certain portions of text may be omitted from a rectangular block that is classified as text. For example, pixels in such text portions may be separated from (i.e. not contiguous with) the pixels that form the rest of the text in the rectangular block, because pixels at a boundary of the rectangular block do not satisfy the prior art test used to form the block. FIG. 1C illustrates at least two instances of text pixels being omitted from an adjacent rectangular block: see the text pixels to the left of block 103B, and the text pixels to the left of block 103C. Also, when the skew becomes large (e.g. 30 degrees), as illustrated in FIG. 1D, several prior art classifiers fail to classify the block correctly. Even if the skew is corrected, omitting text portions from rectangular blocks of a natural image can cause errors when such incomplete blocks are further processed after classification, e.g. by an optical character recognition (OCR) system.
Accordingly, there is a need to improve the identification of regions of text in a natural image or video frame, as described below.