This application is being filed with a microfiche appendix of computer program listings consisting of four (4) fiche having 215 frames.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
The present invention relates to a method and apparatus for character recognition, and particularly to such a method and apparatus in which, prior to recognition, blocks of image data are classified and selected based on the characteristics of the image data. For example, blocks of image data may be selected and classified based on whether the image data is text image data or non-text image data such as halftone (or grey-scale) images, line drawings, frames or the like.
The present invention further relates to a method and apparatus in which lines of text are identified and segmented from text blocks and in which individual characters within lines of text are identified and cut from other characters in the lines in preparation for recognition processing.
2. Description of the Related Art
In recent years, it has become possible to analyze images of text data so as to recognize individual characters in the text data and form a computer readable file of character codes corresponding to the recognized characters. Such files can then be manipulated in word-processing or data-processing programs. Such systems, which are hereinafter referred to as "character recognition systems", are advantageous because they eliminate the need to re-type or otherwise re-enter text data. For example, it is possible to character-recognize a document which has been transmitted by facsimile or reproduced from microfilm or by a photocopier so as to form computer text files that contain character codes (e.g., ASCII) of the characters and numerals in the document thereby to permit further word-processing or data-processing of the document without the need to re-type or re-enter the document.
Documents to be character-recognized often contain many different types of image data, not all of which can be recognized. For example, while it is possible currently to recognize text image data, it is not now possible to recognize non-text image data. Typically, documents to be character-recognized include blocks of text image data, and blocks of non-text image data such as halftone images, line drawings, lines and the like. In addition, the documents may include tables or tabularly arranged data which may or may not be framed. Accordingly, before character recognition processing, it is necessary for individual blocks in the document to be classified in accordance with the type of image data in the blocks and for text-type blocks to be selected from the image data.
FIG. 32 shows a page of a representative document. In FIG. 32, a document page 401 is arranged in a two-column format. The page includes title blocks 402 which include text information of large font size suitable for titles, text blocks 404, which include lines of text data, graphics block 405 which includes graphic images which are not text, table block 406 which includes a table of text or numerical information, and caption blocks 407 which include small sized text data and which are captions associated with blocks of graphic or tabular information. Each block of information is to be classified in accordance with the type of information contained therein and the blocks are then segmented based on that classification.
Previously, to detect text-type blocks of image data, it has been considered to smear the pixel image data horizontally and vertically by extending blackened pixels in the image data both horizontally and vertically into one or more adjacent white pixels. Smearing techniques like these are unsatisfactory because they rely on foreknowledge of characteristics of the text-type image data (for example, font size) so as to be able to choose smearing parameters properly. Moreover, small changes in smearing parameters can produce large changes in selection results. Smearing techniques are also not always able to preserve the internal structure of the original document. For example, smearing can cause a two-column original to be smeared into a single column. Such a situation is unsatisfactory because it jumbles the order in which text data is stored making it impossible to reconstruct the original text accurately. Moreover, it has been found that smearing techniques sometimes smear text-type data into non-text-type data and cause the entire region to be erroneously interpreted as text-type data.
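The sensitivity of smearing to its parameters can be illustrated with a minimal sketch. The representation (rows of 0/1 values with 1 as a blackened pixel), the function names `smear_horizontal` and `count_runs`, and the parameter `reach` are all assumptions made for illustration; only horizontal smearing is shown, and the sketch is not taken from any particular prior system.

```python
def smear_horizontal(image, reach):
    """Extend every black pixel (1) left and right into up to `reach` adjacent pixels."""
    out = [row[:] for row in image]
    for y, row in enumerate(image):
        width = len(row)
        for x, pixel in enumerate(row):
            if pixel:
                for dx in range(-reach, reach + 1):
                    if 0 <= x + dx < width:
                        out[y][x + dx] = 1
    return out


def count_runs(row):
    """Count maximal runs of black pixels in a single row."""
    runs = 0
    previous = 0
    for pixel in row:
        if pixel and not previous:
            runs += 1
        previous = pixel
    return runs


# Hypothetical one-row "page": two columns of text separated by a 3-pixel gutter.
page = [[1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1]]

runs_reach_1 = count_runs(smear_horizontal(page, 1)[0])  # columns stay separate
runs_reach_2 = count_runs(smear_horizontal(page, 2)[0])  # columns merge into one
```

In this toy case, raising `reach` from 1 to 2 is enough to merge the two columns into a single run, which mirrors the two-column-into-one-column failure described above.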
After block selection, character recognition processing proceeds character-by-character through the document whereby each individual character in the document is subjected to recognition processing so as to obtain a computer code corresponding to the character. Obtaining individual characters from character blocks proceeds in two general steps.
In the first step, individual lines in each text block, such as title block 402, text blocks 404 and caption blocks 407, are segmented from other lines in the text block. Typically, line segmentation is performed by obtaining horizontal projections of pixel density in each block and inspecting the density projections to identify gaps between lines. Thus, as shown in FIG. 33(a), text block 404 includes text lines 411 between which are located gaps 412. A horizontal projection of pixel density 414 is obtained by summing the number of black pixels located in each row of block 404. Text lines correspond to non-zero areas in density projection 414 while gaps between text lines correspond to zero-valued areas in projection 414. Text lines 411 are segmented from each other in accordance with the density projection.
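The first step can be sketched as follows, assuming a text block is held as rows of 0/1 values with 1 denoting a black pixel. The names `horizontal_projection` and `find_runs` and the sample block are hypothetical and serve only to illustrate the projection idea.

```python
def horizontal_projection(block):
    """Sum the black pixels in each row of the block."""
    return [sum(row) for row in block]


def find_runs(projection):
    """Return (start, end) pairs, end exclusive, for each non-zero run."""
    runs = []
    start = None
    for index, value in enumerate(projection):
        if value and start is None:
            start = index                 # a text line begins
        elif not value and start is not None:
            runs.append((start, index))   # a zero-valued gap ends the line
            start = None
    if start is not None:
        runs.append((start, len(projection)))
    return runs


# Hypothetical block: two text lines separated by one blank row.
block = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
]

projection = horizontal_projection(block)
text_lines = find_runs(projection)
```

Each pair in `text_lines` gives the first row and one past the last row of a segmented text line.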
In the second step, individual characters in segmented text lines are cut from other characters in the text line. Thus, as shown in FIG. 34(a), text line 411 includes individual characters 415. To cut each character from other characters in the text line, a vertical projection of pixel density 416 is obtained by summing black pixels vertically in each column of line segment 411. Characters 415 correspond to non-zero areas of density projection 416 while gaps between characters correspond to zero areas of density projection 416. Individual characters are cut from other characters in the line segment accordingly.
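The second step is the same idea applied column by column. The sketch below assumes a segmented text line stored as rows of 0/1 values; `vertical_projection`, `cut_characters` and the sample line are illustrative names and data, not part of the source.

```python
from itertools import groupby


def vertical_projection(line):
    """Sum the black pixels in each column of the text line."""
    return [sum(column) for column in zip(*line)]


def cut_characters(line):
    """Return (start, end) column pairs, end exclusive, for each non-zero run."""
    cuts = []
    x = 0
    for has_ink, group in groupby(vertical_projection(line), key=bool):
        width = len(list(group))
        if has_ink:
            cuts.append((x, x + width))
        x += width
    return cuts


# Hypothetical text line: three characters separated by blank columns.
line = [
    [1, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 0],
]

character_spans = cut_characters(line)
```

Here each span in `character_spans` marks the columns occupied by one character, with the zero-valued columns between spans serving as the cut positions.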
Difficulties have been encountered in the foregoing process. For example, it is commonplace for a document to be fed obliquely past an image scanner so that it is stored in pixel memory at a slant angle θs as shown in FIG. 33(b). In this case, it is not always possible to segment lines because the text from a first line 418 overlaps text from a second line 419 as shown at 420. Accordingly, a horizontal projection of pixel density 421 includes only non-zero values and it is not possible to locate gaps between lines because there are no zero values.
To overcome this difficulty, it has been considered to divide a text block 404 into plural columns 422 and 424 in FIG. 33(c) and to obtain independent horizontal projections for each such column. Thus, as shown in FIG. 33(c), a horizontal projection 422a corresponds to column 422 and a horizontal projection 424a corresponds to column 424. As long as text lines in each column do not overlap, as depicted in FIG. 33(c), it is possible to identify text lines in each column.
Although only two columns are shown in FIG. 33(c), typically five to ten columns are employed so as to guarantee that individual lines can be segmented from other lines in the block even if the text is slanted up to some maximum slant angle θs max. However, since horizontal pixel projections must be obtained for each column, and since each horizontal pixel projection so obtained must be processed separately, line segmentation processing can be quite time consuming. In addition, time is often wasted because, in an effort to accommodate the maximum slant angle θs max, all columns must be processed for all documents even though the slant angle for most documents is small and only one or a few columns would be needed.
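The column-wise approach, and its cost, can be illustrated with a small sketch. The slanted block below is contrived, and `strip_projections` with its parameter `n_strips` is a hypothetical helper; the point is only that a full-width projection of slanted text may contain no zero values while the projections of narrower columns do.

```python
def strip_projections(block, n_strips):
    """Split the block into n_strips vertical columns and return one
    horizontal projection (black pixels per row) for each column."""
    width = len(block[0])
    bounds = [width * i // n_strips for i in range(n_strips + 1)]
    return [
        [sum(row[bounds[i]:bounds[i + 1]]) for row in block]
        for i in range(n_strips)
    ]


INK = [1, 1, 1, 1]
BLANK = [0, 0, 0, 0]

# Hypothetical slanted block: each text line sits one row lower in the
# right half than in the left half, so adjacent lines overlap in rows.
slanted = [
    INK + BLANK,
    INK + INK,
    BLANK + INK,
    INK + BLANK,
    INK + INK,
    BLANK + INK,
]

full = strip_projections(slanted, 1)[0]        # one projection, no gaps visible
left, right = strip_projections(slanted, 2)    # two projections, gaps visible
```

With a single column every row of `full` is non-zero, whereas `left` and `right` each contain zero-valued rows between lines. The price is one projection per column, all of which must be computed and inspected regardless of how small the actual slant is.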
Another difficulty encountered with the two step process described above occurs in the second step where individual characters are cut from other characters in line segments. While the processing described above with respect to FIG. 34(a) is satisfactory when there are vertical spaces between characters, the processing is unsatisfactory when the characters overlap vertically or when two or more characters are touching. Such a situation is commonplace for italic fonts or when image quality is degraded through repeated photocopying or through facsimile transmission. Thus, as shown in FIG. 34(b), for italic text the characters "f" and "y" in the word "Satisfy" overlap vertically and the vertical projection of pixel density 425 does not have a zero value between those characters. Accordingly, it is not possible to cut the characters "f" and "y". In addition, the characters "t" and "i" touch, and it is not possible to cut between these two characters either.
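The overlap problem can be reproduced with two slanted strokes that never touch each other yet share columns. The stroke geometry below is an assumption chosen for illustration and does not model any actual font.

```python
HEIGHT, WIDTH = 5, 8

# Two hypothetical italic-style strokes, each leaning to the right.
# The first occupies columns 0-4 and the second columns 3-7, so they
# share columns 3 and 4 although no pixel of one is adjacent to the other.
line = [[0] * WIDTH for _ in range(HEIGHT)]
for r in range(HEIGHT):
    line[r][4 - r] = 1   # first stroke
    line[r][7 - r] = 1   # second stroke

# Vertical projection: black pixels summed in each column.
projection = [sum(column) for column in zip(*line)]
has_gap = 0 in projection
```

Because `projection` contains no zero value, a cut based solely on zero-valued columns treats the two strokes as a single character.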