1. Field of the Invention
The present invention relates to an optical character recognition system, and more particularly to methods and apparatuses for scanning and storing images of documents in a computer, for segmenting images of the document into text and non-text blocks, and for determining the identity of characters in the text blocks.
2. Description of the Related Art
In recent years, it has become possible to scan in paper copies of documents so as to form computerized images of such documents, and to analyze images in text areas of the document so as to recognize individual characters in the text data and to form a computer readable file of character codes corresponding to the recognized characters. Such files can then be manipulated in word-processing, data-compression, or other information processing programs, and can also be used to retrieve the images of the documents in response to a query-based search of the text data. Such systems, which are hereinafter referred to as "character recognition systems", are advantageous because they eliminate the need to re-type or otherwise re-enter text data from the paper copies of the documents. For example, it is possible to recognition-process a document which has been transmitted by facsimile or reproduced from microfilm or by a photocopier so as to form computer text files that contain character codes (for example, ASCII character codes) of the characters and the numerals in the document.
Conventional character recognition systems scan the paper copy of the document to form a binary image of the document. "Binary image" means that each pixel in the image is either a binary zero, representing a white area of the document, or a binary one, representing a black area. The binary image (or "black-and-white image") is thereafter subjected to recognition processing so as to determine the identity of characters in text areas of the document.
It has recently been discovered that recognition accuracy can be improved dramatically if the paper document is scanned to form a gray-scale image of the document. "Gray-scale" means that each pixel of the document is not represented by either a binary one or a binary zero, but rather is represented by any one of more than two intensity levels, such as any one of four intensity levels or 16 intensity levels or 256 intensity levels. Such a system is described in commonly-assigned application Serial No. 08/430,109, now pending, which is a continuation of Ser. No. 08/112,133, filed Aug. 26, 1993, now abandoned, "OCR Classification Based On Transition Ground Data", the contents of which are incorporated herein by reference as if set forth in full. In some cases, using gray-scale images of documents rather than binary images improves recognition accuracy from one error per document page to less than one error per 500 document pages.
FIG. 1 illustrates the difference between binary images and gray-scale images, and assists in understanding how the improvement in recognition accuracy, mentioned above, is obtained. FIG. 1(a) illustrates a character "a" over which is superimposed a grid 1 representing the pixel resolution with which the character "a" is scanned by a photosensitive device such as a CCD array. For example, grid 1 may represent a 400 dot-per-inch (dpi) resolution. A binary image of character "a" is formed, as shown in FIG. 1(b), by assigning to each pixel a binary one or a binary zero in dependence on whether the character "a" darkens the photosensitive device for the pixel sufficiently to activate that pixel. Thus, pixel 2a in FIG. 1(a) is completely within a black portion of character "a" and results in black pixel 2b in FIG. 1(b). On the other hand, pixel 3a is completely uncovered and results in white pixel 3b. Pixel 4a is partially covered but insufficiently covered to activate that pixel and therefore results in white pixel 4b. On the other hand, pixel 5a is covered sufficiently so as to activate it and results in black pixel 5b.
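By way of illustration only, the binarization described above can be sketched as a threshold applied to the fraction of each pixel that the character covers. The function name `binarize`, the 0.5 activation threshold, and the sample coverage values are assumptions made for this sketch and are not taken from the description; an actual scanner's activation level may differ.

```python
def binarize(coverage, threshold=0.5):
    """Map per-pixel coverage fractions (0.0 = uncovered, 1.0 = fully
    covered) to binary pixels: 1 for black, 0 for white.

    The threshold is an assumed activation level, not a value from the text.
    """
    return [[1 if c >= threshold else 0 for c in row] for row in coverage]


# Hypothetical coverage values standing in for pixels 2a, 3a (top row)
# and 4a, 5a (bottom row) of FIG. 1(a).
coverage = [[1.0, 0.0],
            [0.3, 0.7]]

binary = binarize(coverage)
print(binary)
```

With these assumed values, the partially covered pixel standing in for 4a falls below the threshold and becomes white, while the one standing in for 5a exceeds it and becomes black, mirroring pixels 4b and 5b.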
FIG. 1(c) shows a gray-scale image of the same character "a". As shown in FIG. 1(c), pixels which are completely covered (2a) or uncovered (3a) result in completely black or white gray-scale levels, the same as in FIG. 1(b). On the other hand, pixels which are partially covered are assigned a gray level representing the amount of coverage. Thus, in FIG. 1(c) which shows a four-level gray-scale image, pixel 4c receives a low gray-scale value and pixel 5c receives a higher gray-scale value due to the relative coverage of pixels 4a and 5a, respectively. Thus, because of an artifact of the scanning process, an original black and white document, as shown in FIG. 1(a), can be scanned into a gray-scale image as shown in FIG. 1(c) with gray-scale values being assigned primarily at character edges and being dependent on coverage of the pixels.
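As a rough sketch of the coverage-dependent gray levels just described, the following quantizes a per-pixel coverage fraction into one of four levels. The function name `quantize`, the linear mapping from coverage to level, and the sample coverage values are assumptions for illustration; the description does not state how a particular scanner maps coverage to intensity.

```python
def quantize(coverage, levels=4):
    """Map per-pixel coverage fractions (0.0 to 1.0) to integer gray
    levels 0 .. levels-1, where 0 is white and levels-1 is black.

    A linear mapping with round-half-up is assumed for this sketch.
    """
    max_level = levels - 1
    return [[int(c * max_level + 0.5) for c in row] for row in coverage]


# Hypothetical coverage values standing in for pixels 2a, 3a (top row)
# and 4a, 5a (bottom row) of FIG. 1(a).
coverage = [[1.0, 0.0],
            [0.3, 0.7]]

gray = quantize(coverage, levels=4)
print(gray)
```

Under these assumptions the fully covered and uncovered pixels take the extreme levels, while the two partially covered pixels receive distinct intermediate levels, which is the edge detail that a binary image discards.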
A comparison of FIGS. 1(b) and 1(c) shows that there are additional details in FIG. 1(c), especially at character edges. This additional detail is primarily responsible for improved recognition accuracy.
A problem still remains, however, in how to extract individual gray-scale images of characters from a gray-scale image of a document so as to send the individual gray-scale character image for recognition processing. More particularly, recognition accuracy depends greatly on the ability to determine where one character begins and another ends so that only a single character, rather than a group of characters, is subjected to recognition processing.
FIG. 2 illustrates this situation and shows a page of a representative document. In FIG. 2, a document 10 is arranged in two-column format. The document includes title blocks 12 which include text information of large font size suitable for titles, a picture block 13 which includes a color or halftone picture, text blocks 14 which include lines of individual characters of text information, a graphic block 15 which includes graphic images which are non-text, a table block 16 which includes tables of text or numerical information surrounded by non-text borders or frames, and caption blocks 17 which include text information of small font size suitable for captions and which are normally associated with blocks of graphic or tabular information.
When document 10 is scanned to form a gray-scale image of the document, prior to recognition processing, it is necessary to determine which areas of the gray-scale image are text areas and which are non-text areas, and also to determine, for the text areas, where individual characters are located. This processing is hereinafter referred to as "segmentation processing". Only after segmentation processing has located individual characters can the images of those characters be subjected to recognition processing so as to identify the characters and to form a text file of the characters.
Conventional segmentation processing techniques for binary images are generally unsatisfactory in that they do not accurately separate text from non-text areas and they do not accurately identify the location of individual characters in the text areas. Moreover, for gray-scale images, no segmentation processing techniques are currently known. Furthermore, as compared to binary images, gray-scale images geometrically increase the amount of data needed to store a single document page, which places tremendous demands on the storage capacity of any gray-scale document system.
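To give a sense of scale for the storage demand, the following sketch estimates the uncompressed size of one scanned page at different numbers of intensity levels. The 8.5 by 11 inch page size, the absence of compression, and the function name `uncompressed_bytes` are assumptions made for this example; the 400 dpi resolution is the example resolution mentioned for grid 1.

```python
import math


def uncompressed_bytes(width_in, height_in, dpi, levels):
    """Estimate raw storage for one page, packing each pixel into the
    minimum whole number of bits needed to encode `levels` values."""
    bits_per_pixel = (levels - 1).bit_length()
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return math.ceil(pixels * bits_per_pixel / 8)


# Assumed page: 8.5 x 11 inches scanned at 400 dpi.
binary_bytes = uncompressed_bytes(8.5, 11, 400, levels=2)
gray_bytes = uncompressed_bytes(8.5, 11, 400, levels=256)

print(binary_bytes, gray_bytes, gray_bytes // binary_bytes)
```

Under these assumptions a binary page occupies roughly 1.9 megabytes and a 256-level gray-scale page roughly 15 megabytes before any compression is applied.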