This application is being filed with a microfiche appendix of computer program listings consisting of three (3) fiche having 269 frames.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
The present invention relates to a method and apparatus for processing character blocks prior to, e.g., character recognition, and particularly to such a method and apparatus in which, prior to recognition, blocks of image data are classified and selected based on the characteristics of the image data. For example, blocks of image data may be selected and classified based on whether the image data is text image data (horizontal and/or vertical) or non-text image data such as halftone (or grey-scale) images, line drawings, tables, vertical or horizontal lines, vertical or horizontal slanting lines, frames or the like.
2. Description of the Related Art
In recent years, it has become possible to analyze images of text data so as to recognize individual characters in the text data and to form a computer-readable file of character codes corresponding to the recognized characters. Such files can then be manipulated in word-processing, data-compression, or data-processing programs. Such systems, which are hereinafter referred to as "character recognition systems," are advantageous because they eliminate the need to re-type or otherwise re-enter text data. For example, it is possible to character-recognize a document which has been transmitted by facsimile or reproduced from microfilm or by a photocopier so as to form computer text files that contain character codes (e.g., ASCII) of the characters and numerals in the document, thereby permitting further word-processing or data-processing of the document without the need to re-type or re-enter the document.
Documents to be character-recognized often contain many different types of image data, not all of which can be recognized. For example, while it is currently possible to recognize text image data, it is very difficult to recognize non-text image data. Typically, documents to be character-recognized include blocks of text image data, and blocks of non-text image data such as halftone images, line drawings, lines and the like. In addition, the documents may include tables or tabularly arranged data which may or may not be framed. Accordingly, before character recognition processing, it is necessary for individual blocks in the document to be classified in accordance with the type of image data in the blocks and for text-type blocks to be selected from the image data.
FIG. 1 shows a page of a representative document. In FIG. 1, a document page 101 is arranged in a two-column format. The page includes title blocks 102 which include text information of large font size suitable for titles, text blocks 104, which include lines of text data, graphics block 105 which includes graphic images which are not text, table block 106 which includes a table of text or numerical information, and caption blocks 107 which include small sized text data and which are captions associated with blocks of graphic or tabular information. Each block of information is to be classified in accordance with the type of information contained therein and the blocks are then segmented based on that classification.
Previously, to detect text-type blocks of image data, it has been proposed to smear the pixel image data horizontally and vertically by extending blackened pixels in the image data both horizontally and vertically into one or more adjacent white pixels. Smearing techniques like these are unsatisfactory because they rely on foreknowledge of characteristics of the text-type image data (for example, font size) so as to be able to choose smearing parameters properly. Moreover, small changes in smearing parameters can produce large changes in selection results. Smearing techniques are also not always able to preserve the internal structure of the original document. For example, smearing can cause a two-column original to be smeared into a single column. Such a situation is unsatisfactory because it jumbles the order in which text data is stored, making it impossible to reconstruct the original text accurately. Moreover, it has been found that smearing techniques sometimes smear text-type data into non-text-type data and cause the entire region to be erroneously interpreted as text-type data.
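By way of illustration only, the sensitivity of smearing to its parameters can be sketched on a single row of pixels. The sketch below is not taken from any particular prior-art system; the function names `smear_horizontal` and `count_runs`, the parameter `reach`, and the sample row are hypothetical and chosen solely for this example. It assumes black pixels are represented by 1 and white pixels by 0.

```python
def smear_horizontal(row, reach):
    """Extend every black pixel (1) into up to `reach` neighbouring pixels
    on each side. `reach` is a hypothetical smearing parameter."""
    out = list(row)
    for i, pixel in enumerate(row):
        if pixel == 1:
            lo = max(0, i - reach)
            hi = min(len(row), i + reach + 1)
            for j in range(lo, hi):
                out[j] = 1
    return out


def count_runs(row):
    """Count the maximal runs of black pixels in a row."""
    runs, previous = 0, 0
    for pixel in row:
        if pixel == 1 and previous == 0:
            runs += 1
        previous = pixel
    return runs


# One pixel row crossing two columns of text: words are separated by
# 2-pixel gaps, and the two columns by a 4-pixel gutter.
row = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1]

runs_unsmeared = count_runs(row)                   # four separate words
runs_reach_1 = count_runs(smear_horizontal(row, 1))  # words merge, columns survive
runs_reach_2 = count_runs(smear_horizontal(row, 2))  # gutter is bridged as well

print(runs_unsmeared, runs_reach_1, runs_reach_2)
```

In this example, changing the parameter from 1 to 2 is enough to merge the two columns into a single run, which is the kind of failure described above.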
U.S. patent application Ser. No. 07/873,012, filed Apr. 24, 1992, and commonly assigned, proposes another technique for selecting character blocks in a stored document. Therein, the stored document is first searched to find so-called "connected components", which may comprise two or more pixels connected together in any of the eight directions surrounding each pixel. Next, the text connected components are separated from the non-text connected components, and the non-text components are classified as, e.g., tables, halftone images, line drawings, etc. Next, the direction of any skew in the document is detected, and if the skew is vertical, the image is rotated ninety degrees and the connected components are again searched. After correction of the skew, invisible white lines along the edge of non-text components are searched for so that appropriate blocks of text, e.g., columns, can be identified. Thereafter, the horizontal text lines and title lines are formed, and the horizontal text lines are grouped into rectangularly-arrayed text blocks. Thereafter, post processing is performed to prepare the identified text blocks for further character recognition processing. Ser. No. 07/873,012 is incorporated herein by reference.
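The notion of a connected component in the eight directions surrounding each pixel can be illustrated with a minimal sketch. The code below is not the implementation of Ser. No. 07/873,012; it is a generic breadth-first labeling, and the function name `connected_components` and the sample image are assumptions made for this example. Black pixels are again assumed to be 1 and white pixels 0.

```python
from collections import deque


def connected_components(image):
    """Group black pixels (1) that touch in any of the eight directions.
    Returns a list of components, each a list of (row, column) pairs."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new component and flood outward from this pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        if dy == 0 and dx == 0:
                            continue
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


image = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
components = connected_components(image)
print(len(components))  # the diagonal pair and the vertical pair
```

Because diagonal neighbours count as connected, the two pixels at the upper left form one component; under four-direction connectivity they would be separate.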
While the above-described block selection technique may be appropriate for horizontal documents (e.g., English-language documents), it is possible for a page to contain both horizontal and vertical text blocks (i.e., to be bi-directional). For example, a Japanese document may contain vertical Kanji characters in combination with horizontal characters, such as those in tables and figure legends. Also, certain English documents include vertically-extending characters in order to highlight certain information or to provide some desired effect.
Furthermore, the scanned page itself is often skewed, and the above-described block selection technique handles this problem by first identifying the skew and then rotating the image before the formation of the text block. Speed and accuracy become two practical problems when utilizing this technique. In more detail, in the block-selection technique described above, the block area is represented by a rectangle, and the boundaries of the non-text blocks are also recorded. However, in the case of a skewed document, the rectangles may obscure the separation between the text blocks and may actually overlap. This leads to misclassification of data in the blocks and may lead to errors in character recognition.
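The overlap problem can be illustrated geometrically. The following sketch is hypothetical and is not drawn from the above-described technique: the helper names `rotate`, `bounding_box`, and `boxes_overlap`, the block coordinates, and the 5-degree skew are all assumptions chosen for this example. It shows two text blocks that are cleanly separated on an unskewed page but whose upright enclosing rectangles overlap once the page is skewed.

```python
import math


def rotate(points, degrees):
    """Rotate (x, y) points about the origin to simulate page skew."""
    t = math.radians(degrees)
    return [(x * math.cos(t) - y * math.sin(t),
             x * math.sin(t) + y * math.cos(t)) for x, y in points]


def bounding_box(points):
    """Upright rectangle (left, top, right, bottom) enclosing the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_overlap(a, b):
    """True when two upright rectangles share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# Two text blocks stacked vertically, separated by a 10-unit white gap.
upper = [(0, 0), (200, 0), (200, 40), (0, 40)]
lower = [(0, 50), (200, 50), (200, 90), (0, 90)]

overlap_unskewed = boxes_overlap(bounding_box(upper), bounding_box(lower))

skew = 5  # degrees; an assumed skew angle for illustration
overlap_skewed = boxes_overlap(bounding_box(rotate(upper, skew)),
                               bounding_box(rotate(lower, skew)))

print(overlap_unskewed, overlap_skewed)
```

The rotation is rigid, so the blocks themselves remain 10 units apart; only their rectangular representations collide.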
Therefore, what is needed is a method and apparatus for effectively and efficiently selecting text and non-text blocks in a stored document in which both vertical and horizontal text blocks may be recognized, and in which a skewed document is not required to be rotated prior to the formation of the text blocks. Such a method and apparatus would provide a much more flexible block selection technique while saving processing time and increasing recognition accuracy.