Handheld devices and mobile devices such as a cell phone 108 (FIG. 1A) include a digital camera for use by a person 110 with their hands to capture an image of a real world scene 100, such as image 107, shown displayed on a screen 106 in FIG. 1A. Image 107 is also referred to as a handheld camera captured image, or a natural image or a real world image, to distinguish it from an image formed by an optical scanner from a document that is printed on paper (e.g. scanned by a flatbed scanner of a photocopier).
Recognition of text in image 107 (FIG. 1A) may be based on identification of regions (also called “blobs”) that differ from surrounding pixels in one or more properties, such as intensity and/or color. Several such regions are identified in the prior art as maximally stable extremal regions or MSERs. MSERs may be used as connected components, which may be subjected to one or more geometric tests, to identify a rectangular portion 103 of image 107 (FIG. 1A) which includes such a region, as a candidate to be recognized as a character of text. The rectangular portion 103 may be sliced or segmented into one or more blocks, such as block 121 (FIG. 1B), that is a candidate for recognition as a character of text.
Block 121, which is to be subjected to recognition, may be formed to fit tightly around an MSER (e.g. so that each of four sides of the block touches a boundary of the region). In some examples a rectangular portion 103 (FIG. 1J) in an image is first divided into a top strip 191, a header line (also called “shiro-rekha”) 192, and a bottom strip 193, to extract therefrom a core strip 194. A region in core strip 194 may then be divided into one or more blocks in contact with one another, such as block 121, based on one or more tests that may indicate presence of multiple characters that form a word of text. The tests to obtain a block 121 for recognition from image 107 may be based on use of one or more properties of a predetermined script in which text to be recognized is printed, e.g. as described in an article entitled “Indian script character recognition: a survey” by U. Pal and B. B. Choudhuri, Pattern Recognition 37(9): 1887-1899 (2004), or as described in another article entitled “Offline Recognition of Devanagari Script: A Survey” by R. Jayadevan, Satish R. Kolhe, Pradeep M. Patil, and Umapada Pal, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, November 2010, each of which is incorporated by reference herein in its entirety.
A block 121 that is a candidate for recognition may be further divided up, by use of a predetermined grid (FIG. 1B), into unitary sub-blocks 121A-121Z (wherein A≦I≦Z, Z being the number of sub-blocks, e.g. 20), each sub-block 121I (FIG. 1C) containing N pixels with one of two binary values, namely value 0 or value 1.
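The grid-based division described above may be sketched as follows. This is a minimal illustration only, not the method of any cited reference: the function name `split_into_sub_blocks`, the 5×4 grid layout, the 15×12 block size, and the checkerboard filler pixels are all assumptions chosen so that Z=20 sub-blocks of N=9 pixels each result.

```python
def split_into_sub_blocks(block, grid_rows, grid_cols):
    """Split a binary block (a list of equal-length rows of 0/1 pixels)
    into grid_rows x grid_cols sub-blocks, returned in row-major order."""
    height, width = len(block), len(block[0])
    if height % grid_rows or width % grid_cols:
        raise ValueError("block size must be a multiple of the grid size")
    sub_h, sub_w = height // grid_rows, width // grid_cols
    sub_blocks = []
    for gr in range(grid_rows):
        for gc in range(grid_cols):
            rows = block[gr * sub_h:(gr + 1) * sub_h]
            sub_blocks.append([row[gc * sub_w:(gc + 1) * sub_w] for row in rows])
    return sub_blocks


# Hypothetical 15 x 12 binary block (checkerboard filler, not real image data).
block = [[(r + c) % 2 for c in range(12)] for r in range(15)]

# Assumed 5 x 4 grid, giving Z = 20 sub-blocks of N = 3 * 3 = 9 pixels each.
sub_blocks = split_into_sub_blocks(block, grid_rows=5, grid_cols=4)
```

In this sketch each sub-block is itself a small list of rows, so per-sub-block processing can be applied to each entry of `sub_blocks` in turn.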
Optical character recognition (OCR) methods of the prior art originate in the field of document processing, wherein a document's image obtained by use of a flatbed scanner contains a series of lines of text (e.g. 20 lines of text). Such prior art OCR methods may extract a vector (called “feature vector”) from binary values of pixels in each sub-block 121I. For a block 121 that is subdivided into Z sub-blocks, Z feature vectors are sometimes obtained, and these Z vectors may be stacked to form a block-level vector that represents the entirety of block 121. It is this block-level vector that is then compared with a library of reference vectors generated ahead of time (based on training images of letters of an alphabet to be recognized). Next, a letter of an alphabet which is represented by a reference vector in the library that most closely matches the vector of block 121 is identified as recognized, so as to conclude the OCR (“document” OCR) of a character in block 121 in portion 103 of a document's image.
One feature vector of such prior art has four dimensions, each dimension representing a gradient, based on a count of transitions in intensity between the two binary values along a row or a column in a sub-block. Specifically, two dimensions in the feature vector keep count of black-to-white and white-to-black transitions in the horizontal direction (e.g. left to right) along a row of pixels in the sub-block, and two additional dimensions in the feature vector keep count of black-to-white and white-to-black transitions in the vertical direction (e.g. bottom to top) along a column of the sub-block. Exactly four counts are formed. In forming the four counts, block 121 is assumed to be surrounded by a white boundary, and any transition at the boundary is counted as a half transition. These four counts are divided by the total number of pixels N in each sub-block, even though the four counts do not add up to N.
In the example shown in FIG. 1B, block 121 is subdivided into twenty (Z=20) sub-blocks, and each sub-block has its own vector of four dimensions. For example, traversing pixels in a horizontal direction from left to right (see sub-block 121I of FIGS. 1C and 1D) yields two values: zero (0) zero-to-one transitions, and three (3) one-to-zero transitions (assuming a column 121Z of zero intensity pixels at the right boundary of sub-block 121I). Traversing pixels in a vertical direction from bottom to top (see sub-block 121I in FIG. 1E) yields the following values: one (1) zero-to-one transition and zero (0) one-to-zero transitions (assuming a row 121J of (1, 1, 0) intensity pixels at the bottom boundary of sub-block 121I).
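The counting of transitions described above may be illustrated by the following sketch. The pixel values of sub-block 121I appear only in the figures, so the 3×3 array below is a hypothetical arrangement chosen to be consistent with the stated counts and boundary assumptions; the function names are likewise illustrative. The half-transition convention at the block boundary is not modeled here.

```python
def count_transitions(sequence):
    """Return (zero_to_one, one_to_zero) counts along a sequence of binary pixels."""
    zero_to_one = one_to_zero = 0
    for prev, cur in zip(sequence, sequence[1:]):
        if prev == 0 and cur == 1:
            zero_to_one += 1
        elif prev == 1 and cur == 0:
            one_to_zero += 1
    return zero_to_one, one_to_zero


def transition_histogram(sub_block, right_boundary, bottom_boundary):
    """Return (h01, h10, v01, v10) for a sub-block given as rows, top row first."""
    h01 = h10 = v01 = v10 = 0
    # Horizontal traversal: each row left to right, ending at the right boundary pixel.
    for row, boundary_pixel in zip(sub_block, right_boundary):
        up, down = count_transitions(list(row) + [boundary_pixel])
        h01 += up
        h10 += down
    # Vertical traversal: each column bottom to top, starting at the bottom boundary pixel.
    for col, boundary_pixel in enumerate(bottom_boundary):
        column = [boundary_pixel] + [row[col] for row in reversed(sub_block)]
        up, down = count_transitions(column)
        v01 += up
        v10 += down
    return (h01, h10, v01, v10)


# Hypothetical pixels (top row first); the actual values are shown only in FIG. 1C.
sub_block = [[1, 1, 1],
             [1, 1, 1],
             [1, 1, 0]]
right_boundary = [0, 0, 0]    # column of zero intensity pixels at the right
bottom_boundary = [1, 1, 0]   # row of (1, 1, 0) intensity pixels at the bottom

counts = transition_histogram(sub_block, right_boundary, bottom_boundary)
# counts == (0, 3, 1, 0)
```

With these assumed pixels, each of the three rows contributes one one-to-zero transition at the right boundary, and only the rightmost column contributes a zero-to-one transition when traversed from bottom to top.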
Hence, a histogram of the above-described intensity transitions in sub-block 121I has the following four values (0, 3, 1, 0), as shown in FIG. 1F, wherein the first two values are generated by horizontal traversal and the last two values are generated by vertical traversal. As there are N=9 pixels in sub-block 121I of this example, a vector 121V is formed (see FIG. 1F) by dividing the counts by this number N, as follows: (0/9, 3/9, 1/9, 0/9). Formation of a similar four element vector for sub-block 121K is illustrated in FIGS. 1G, 1H and 1I. Similar four element vectors are formed for all remaining sub-blocks, and then the vectors for all sub-blocks are stacked (or concatenated) to form a block-level vector for the entirety of block 121, which therefore has a total of 4×Z elements, e.g. 80 elements (also called “dimensions”). This 80 element vector for block 121 may then be used, in comparison with reference vectors in a library, to identify a letter of text therein.
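The division by N and the stacking of per-sub-block vectors may be sketched as follows. Only the counts (0, 3, 1, 0) for sub-block 121I come from the example above; the counts for the other nineteen sub-blocks are placeholder zeros, and the function names are illustrative assumptions.

```python
def normalize(counts, n_pixels):
    """Divide each of the four transition counts by the pixel count N."""
    return [c / n_pixels for c in counts]


def block_level_vector(per_sub_block_counts, n_pixels):
    """Concatenate the normalized four-element vectors of all sub-blocks."""
    vector = []
    for counts in per_sub_block_counts:
        vector.extend(normalize(counts, n_pixels))
    return vector


counts_121I = (0, 3, 1, 0)                          # from the example above
all_counts = [counts_121I] + [(0, 0, 0, 0)] * 19    # placeholders for the other sub-blocks

vector = block_level_vector(all_counts, n_pixels=9)
# len(vector) == 80, and vector[:4] == [0/9, 3/9, 1/9, 0/9]
```

Note that the four counts for sub-block 121I sum to 4 rather than to N=9, so the normalized four-element vector does not sum to one.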
In some prior art methods, an 80 dimension vector of the type described above is compared with reference vectors (each of which also has 80 dimensions) in the library, by use of a Euclidean distance metric (square root of the sum of squares of the differences in each dimension), or a simplified version thereof (e.g. sum of absolute values of the differences in each dimension). One issue that the current inventors find in use of such distance metrics to identify characters is that the above-described division by N, which is used to generate a four dimensional vector 121V as described above, affects accuracy because the sum of the four elements prior to division by N does not add up to N (and, in the example shown in FIGS. 1C-1F, the sum 0+3+1+0 is 4, which is not the same as 9).
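The two distance metrics mentioned above, and their use to select the closest reference vector, may be sketched as follows. For readability the vectors here have four dimensions rather than 80; the library contents, the labels "A" and "B", and the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared per-dimension differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def absolute_distance(a, b):
    """Simplified metric: sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def closest_reference(vector, library, distance=euclidean_distance):
    """Return the label whose reference vector is nearest to the given vector."""
    return min(library, key=lambda label: distance(vector, library[label]))


# Hypothetical reference library (labels and values are illustrative only).
library = {
    "A": [0.0, 0.0, 0.0, 0.0],
    "B": [0.0, 3 / 9, 1 / 9, 0.0],
}
query = [0.0, 3 / 9, 0.0, 0.0]

best_euclidean = closest_reference(query, library)
best_absolute = closest_reference(query, library, distance=absolute_distance)
```

In this toy case both metrics select the same label, since the query differs from reference "B" in a single dimension by 1/9 and from reference "A" by 3/9.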
Moreover, the current inventors note that ambiguity can arise in use of four counts to represent nine pixels, which can increase the difficulty in recognizing (from a handheld camera captured image) letters of an alphabet whose rules permit ambiguity, such as Devanagari wherein, for example, a left half portion of a letter can be combined with another letter, and/or a letter may or may not have an accent mark at the bottom or the top of that letter, etc. Furthermore, the current inventors note that use of just four counts may be insufficient to represent details necessary to uniquely characterize regions of text in certain scripts, such as Devanagari, that have a large number of characters in their alphabet. Therefore, the current inventors believe that use of an 80 element feature vector (obtained by cascading groups of 4 counts for 20 sub-blocks) can result in false positives and/or false negatives that render prior art techniques impractical.
Hence, the current inventors believe there is a need for a new vector that is more representative of pixels in the image, and use of the new vector with a new comparison measure that provides a better match to a reference vector in a library, as described below.