1. Field of the Invention
The present invention relates to a page analysis system for analyzing image data of a document page by utilizing a block selection technique, and particularly to such a system in which blocks of image data are classified based on characteristics of the image data. For example, blocks of image data may be classified as text data, titles, half-tone image data, line drawings, tables, vertical lines or horizontal lines.
2. Incorporation by Reference
U.S. patent applications Ser. No. 07/873,012, “Method And Apparatus For Character Recognition”, Ser. No. 08/171,720, “Method And Apparatus For Selecting Text And/Or Non-Text Blocks In A Stored Document”, Ser. No. 08/596,716, “Feature Extraction System For Skewed And Multi-Orientation Documents”, and Ser. No. 08/338,781, “Page Analysis System”, which are commonly owned by the assignee of the present invention, are incorporated herein by reference.
3. Description of the Related Art
Recently developed block selection techniques, such as the techniques described in the aforementioned U.S. patent application Ser. Nos. 07/873,012 and 08/171,720, are used in page analysis systems to provide automatic analysis of image data within a document page. In particular, these techniques are used to distinguish between different types of image data within the page. The results of such techniques are then used to choose a type of processing to be subsequently performed on the image data, such as optical character recognition (OCR), data compression, data routing, etc. For example, image data which a block selection technique has designated as text data is subjected to OCR processing, whereas image data which is designated as picture data is subjected to data compression. As a result, various types of image data can be input and automatically processed without requiring user intervention.
Block selection techniques are most beneficial when applied to composite documents. FIG. 1 shows an image of composite document page 1 as it appears after being subjected to a block selection technique. Document page 1 includes a logo within block 2, a large font title within blocks 3 to 6, large font decorative text within block 7, text-sized decorative font within blocks 8 to 13, various text-sized symbols within blocks 14 to 27 and a small symbol pattern within blocks 28 to 35.
Block selection techniques use a “blocked” document image such as that shown in FIG. 1 to create a hierarchical tree structure representing the document. FIG. 2 shows a hierarchical tree which represents document page 1. The tree consists of root node 101, which represents document page 1, and various descendent nodes. Descendent nodes 102, 103, 104 to 106, 107, 108 to 113, 114 to 127 and 128 to 135 represent blocked areas 2, 3, 4 to 6, 7, 8 to 13, 14 to 27 and 28 to 35, respectively.
In order to construct such a tree, block selection techniques such as those described in U.S. patent application Ser. Nos. 07/873,012 and 08/171,720 search each area of document page 1 to find “connected components”. As described therein, connected components comprise two or more pixels connected together in any of eight directions surrounding each subject pixel. The dimensions of the connected components are rectangularized to create corresponding “blocked” areas. Next, text connected components are separated from non-text connected components. The separated non-text components are thereafter classified as, e.g., tables, half-tone images, line drawings, etc. In addition, block selection techniques may combine blocks of image data which appear to be related in order to more efficiently process the related data.
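By way of illustration only, the search for connected components and the rectangularization step may be sketched as follows. This is a minimal sketch and not the method of the referenced applications: the function name find_blocks, the representation of the page as a list of rows of 0/1 pixel values, and the (top, left, bottom, right) box format are all assumptions made for the example.

```python
from collections import deque


def find_blocks(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected
    components of nonzero pixels.

    Hypothetical helper: isolated single pixels are skipped, because a
    connected component is described as two or more connected pixels.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    blocks = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            # Breadth-first search over the eight neighbours of each pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            count = 0
            while queue:
                y, x = queue.popleft()
                count += 1
                # Grow the rectangle so that it encloses every pixel found.
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if ((dy or dx)
                                and 0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if count >= 2:
                blocks.append((top, left, bottom, right))
    return blocks
```

In this sketch two pixels that touch only diagonally belong to the same component, which is the practical consequence of searching in eight directions rather than four.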
The separation and classification steps are performed by analyzing characteristics of the connected components such as component size, component dimension, average size of each connected component, average size of internal connected components and classification of adjacent connected components. However, despite using complex algorithms in conjunction with the foregoing factors in order to classify blocks of image data, block selection techniques often mis-identify or are unable to identify blocks of data within a document page.
For example, as shown in FIG. 2, a conventional block selection technique may not be able to distinguish the content of blocks 2, 3 and 7 of page 1. Accordingly, corresponding nodes 102, 103 and 107 are designated “unknown”.
These problems occur because the classification algorithms applied by conventional block selection techniques are premised on many assumptions relating to data size, e.g., any data which falls within a given size threshold is classified as text data. Accordingly, any text data outside of that threshold will most likely not be characterized as text data. Also, text and non-text connected components are separated based on an assumption that text connected components are usually smaller than picture connected components. In addition, the algorithms also assume that text connected components comprise the majority of the connected components in a document page.
Accordingly, conventional block selection techniques are inherently inaccurate because they rely on assumptions regarding size-related characteristics of document image data and do not attempt to actually recognize the content of the image data.
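The effect of a size-based assumption may be illustrated by the following minimal sketch. The function classify_by_size and the pixel thresholds of 8 and 40 are hypothetical values chosen for the example and do not appear in any of the referenced applications.

```python
def classify_by_size(height, min_text=8, max_text=40):
    """Size-only rule: a block is called text only if its height in
    pixels falls inside an assumed text-size window."""
    if min_text <= height <= max_text:
        return "text"
    return "unknown"


# Body text of ordinary size is accepted, but a large-font title of the
# same alphabet falls outside the window and is not recognized as text.
body_label = classify_by_size(12)
title_label = classify_by_size(96)
```

Because the rule never examines the content of the block, the large-font title receives the same designation as a logo of similar size.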
Mis-identification of document image data due to these inherent inaccuracies results in significant problems when combining related blocks of image data. For example, the combining algorithm used in the present example requires that blocks which a block selection technique has designated as “unknown” be combined with any adjacent text blocks. Accordingly, because “unknown” blocks 2 and 3 of document page 1 are adjacent to “text” blocks 4 to 6, these blocks are grouped together to form “text” block 36, shown in FIG. 3. Therefore, the logo within original block 2 will be mistakenly processed as text. As also shown in FIG. 3, blocks 7 to 13, 14 to 27 and 28 to 35 are combined into single “text” blocks 38, 39 and 40, respectively.
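The combining rule of the present example, in which an “unknown” block is absorbed into an adjacent text block, may be sketched as follows. This is an illustrative sketch only: the names adjacent, union and combine, the (top, left, bottom, right) box format, and the definition of adjacency as lying within a gap of one pixel are assumptions, and the sketch does not merge text blocks with one another.

```python
def adjacent(a, b, gap=1):
    """True if boxes (top, left, bottom, right) overlap or lie within
    `gap` pixels of each other (assumed definition of adjacency)."""
    return (a[0] <= b[2] + gap and b[0] <= a[2] + gap
            and a[1] <= b[3] + gap and b[1] <= a[3] + gap)


def union(a, b):
    """Smallest rectangle enclosing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def combine(blocks, gap=1):
    """blocks is a list of (label, box) pairs. Each 'unknown' block that
    is adjacent to a 'text' block is absorbed into that text block."""
    texts = [box for label, box in blocks if label == "text"]
    others = [(label, box) for label, box in blocks if label != "text"]
    changed = True
    while changed:
        changed = False
        for i, (label, box) in enumerate(others):
            if label != "unknown":
                continue
            for j, text_box in enumerate(texts):
                if adjacent(box, text_box, gap):
                    texts[j] = union(box, text_box)
                    del others[i]
                    changed = True
                    break
            if changed:
                break  # restart the scan after modifying the lists
    return [("text", box) for box in texts] + others
```

Under this rule a logo that was left “unknown” and happens to sit next to a text block loses its own rectangle and is thereafter handled as text, which is the error described above.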
Techniques have been developed to address the tendency of existing block selection techniques to mis-identify and/or erroneously combine image data. For example, U.S. patent application Ser. No. 08/361,240 describes a method for reviewing the data classifications resulting from a block selection technique and for editing the classifications in the case that any image data was misidentified by the block selection technique. However, such techniques require operator intervention and are therefore not adequate in cases where automation of the block selection technique is required.
The present invention relates to a method for classifying blocks of image data within a document page which utilizes optical character recognition processing to address shortcomings in existing block selection techniques.
Thus, according to one aspect of the invention, the present invention is a method for increasing the accuracy of image data classification in a page analysis system for analyzing image data of a document page. The method includes inputting image data of a document page as pixel data, analyzing the pixel data in order to locate all connected pixels, rectangularizing connected pixel data into blocks, analyzing each of the blocks of pixel data in order to determine the type of image data contained in the block, outputting an attribute corresponding to the type of image data determined in the analyzing step, and performing optical character recognition so as to recognize the type of image data in the block of image data in the case that the analyzing step cannot determine the type of image data contained in the block.
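The final steps of this aspect, in which optical character recognition is applied only when the analyzing step cannot determine the type of a block, may be sketched as follows. The sketch is illustrative and is not the claimed method: classify_block, analyze and ocr are hypothetical names, the two callables are supplied by the caller, and the rule that a block is treated as text when recognition returns any characters is an assumption made for the example.

```python
def classify_block(block, analyze, ocr):
    """Return an attribute for `block`.

    analyze(block) is assumed to return a type name, or "unknown" when
    the conventional analysis cannot decide. ocr(block) is assumed to
    return the recognized characters, or an empty string if none.
    """
    kind = analyze(block)
    if kind == "unknown":
        # Fall back to recognition only for undetermined blocks.
        recognized = ocr(block)
        kind = "text" if recognized.strip() else "non-text"
    return kind
```

Passing the analysis and recognition steps in as callables keeps the sketch independent of any particular block selection technique or recognition engine.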
In another aspect, the present invention is a method for accurately classifying image data in a page analysis system for analyzing image data of a document page. The method includes inputting image data of a document page as pixel data, combining and rectangularizing connected pixel data into blocks of image data, and analyzing and classifying the data as a type of data. In the case that the type of data is indicated as text data and a size of the text data is outside a predetermined size threshold, the method further comprises performing optical character recognition on the text data.
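The condition stated in this aspect may be expressed as a simple predicate. This is a minimal sketch under stated assumptions: the name needs_ocr, the use of block height in pixels as the measure of size, and the threshold values of 8 and 40 are hypothetical and are not taken from the description.

```python
def needs_ocr(kind, height, min_text=8, max_text=40):
    """True when a block was classified as text but its size lies
    outside the assumed text-size thresholds, so that the
    classification is to be checked by optical character recognition."""
    return kind == "text" and not (min_text <= height <= max_text)
```

Blocks classified as text and falling within the thresholds, and blocks of any other type, are not selected by this predicate.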
This brief summary has been provided so that the nature of the invention may be understood quickly. A more complete understanding of the invention can be obtained by reference to the following detailed description of the preferred embodiments in connection with the attached drawings.