1. Field of the Invention
The present invention relates to page segmentation systems for classifying data within specific regions of a document image. In particular, the present invention relates to a block selection system for identifying table images in a document image and for identifying features within the table images.
2. Incorporation by Reference
Commonly-assigned U.S. patent applications Ser. No. 07/873,012, now U.S. Pat. No. 5,680,479, entitled “Method and Apparatus For Character Recognition”, Ser. No. 08/171,720, now U.S. Pat. No. 5,588,072, entitled “Method and Apparatus For Selecting Text And/Or Non-Text Blocks In A Stored Document”, Ser. No. 08/338,781, entitled “Page Analysis System”, Ser. No. 08/514,250, now U.S. Pat. No. 5,774,579, entitled “Block Selection System In Which Overlapping Blocks Are Decomposed”, Ser. No. 08/514,252, now U.S. Pat. No. 5,848,186, entitled “Feature Extraction System”, Ser. No. 08/664,675, entitled “System For Extracting Attached Text”, and Ser. No. 09/002,684, entitled “System For Analyzing Table Images,” are herein incorporated as if set forth in full.
3. Description of the Related Art
A conventional page segmentation system can be applied to a document image in order to identify data types contained within specific regions of the document image. The identified types can then be used to extract data of a particular type from a specific region of the document image and to determine a processing method to be applied to the extracted data.
For example, using conventional systems, data identified as text data is extracted from a specific region of a document and subjected to optical character recognition (OCR) processing. Results of the OCR processing are stored in ASCII code along with information regarding the location of the specific region. Such storage facilitates word processing of the text data as well as subsequent reconstruction of the document. In addition, conventional systems can be used to extract data identified as graphics data, subject the extracted data to image compression, and store the compressed data along with location information. In sum, conventional page segmentation systems allow automatic conversion of bit-mapped image data of a document to an appropriate format, such as ASCII, JPEG, or the like, and also allow substantial reconstruction of the bit-mapped image.
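The type-dependent conversion described above can be illustrated with a minimal sketch. It is not taken from any of the referenced systems: the `Region` record, the `convert_region` function, and the caller-supplied `ocr` callable are hypothetical names, and lossless `zlib` compression stands in for an image format such as JPEG.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Region:
    """Hypothetical record for one segmented region of a document image."""
    kind: str                         # "text" or "graphics"
    bbox: Tuple[int, int, int, int]   # left, top, right, bottom
    pixels: bytes                     # raw bit-mapped data for the region


def convert_region(region: Region, ocr: Callable[[bytes], str]) -> Dict[str, object]:
    """Convert a region according to its identified type, keeping its location."""
    if region.kind == "text":
        # Text regions go through OCR; the result is stored as ASCII.
        data = ocr(region.pixels).encode("ascii", errors="replace")
        fmt = "ascii"
    elif region.kind == "graphics":
        # Graphics regions are compressed (zlib is an assumption, used here
        # only as a stand-in for JPEG or a similar image format).
        data = zlib.compress(region.pixels)
        fmt = "zlib"
    else:
        raise ValueError(f"unsupported region type: {region.kind}")
    # Location information is stored with the data so that the document
    # can later be substantially reconstructed.
    return {"format": fmt, "bbox": region.bbox, "data": data}
```

Storing the bounding box next to the converted data is what allows each region to be placed back at its original position during reconstruction.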
One specialized example of such page segmentation concerns table images within a document. Once a table image is identified, processing such as that described in above-mentioned U.S. Pat. No. 5,848,186 or U.S. patent application Ser. No. 09/002,684 can be used to identify rows and columns within the table, to extract text data within individual table cells defined by the rows and columns, and to subject the extracted text data to OCR processing. As a result, table image data located within a document image can be automatically input to a spreadsheet application in proper row/column format.
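As a rough illustration of the final step, the sketch below writes recognized cell text to comma-separated form using row/column addresses. The `cells_to_csv` function and the dictionary layout keyed by `(row, column)` are assumptions made for this example, not part of the referenced systems.

```python
import csv
import io
from typing import Dict, Tuple


def cells_to_csv(cells: Dict[Tuple[int, int], str]) -> str:
    """Lay out OCR results keyed by (row, column) address as CSV text."""
    if not cells:
        return ""
    n_rows = max(r for r, _ in cells) + 1
    n_cols = max(c for _, c in cells) + 1
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for r in range(n_rows):
        # A cell with no recognized text is written as an empty field.
        writer.writerow([cells.get((r, c), "") for c in range(n_cols)])
    return buf.getvalue()
```

For example, `cells_to_csv({(0, 0): "Item", (0, 1): "Qty", (1, 0): "Bolt", (1, 1): "12"})` yields two lines, `Item,Qty` and `Bolt,12`, which a spreadsheet application can import directly.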
The above-described systems are designed to recognize a standard-format table image having a solid frame and solid horizontal and vertical lines defining rows and columns within the table image. Accordingly, in a case that a table image contains broken or dotted grid lines, or contains no grid lines at all, the above systems are unlikely to identify the image as a table. Rather, the table is likely to be classified as a region of text or as a line drawing. Consequently, row/column information is not determined, nor are individual cells within the table associated with row/column addresses.
The present invention addresses the foregoing by providing identification of a table image in a document in which grid lines of the table image are broken, dotted, or otherwise incomplete. An additional aspect of the present invention provides output of text block coordinates and coordinates of areas roughly corresponding to individual table cells within the identified table. Advantageously, such information can be input to a table feature identification system to identify table columns, rows, or other features.
In one specific aspect, the invention is a system for identifying a table image in a document image which includes identification of a frame image in the document image, identification of white areas within the frame image, identification of broken lines within the frame image, calculation of horizontal and vertical grid lines based on the identified white areas and the identified broken lines, and determination of whether the frame is a table image based on the calculated horizontal and vertical grid lines. Beneficially, the identified table image can then be subjected to table-specific processing.
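The sequence of steps in this aspect can be sketched in simplified form. The sketch below is not the method of the detailed description: it substitutes plain projection profiles for the white-area and broken-line identification, and the names `runs`, `grid_positions`, `find_grid_lines`, `is_table`, and the `min_fill` threshold are all assumptions. The input is taken to be the interior of an already identified frame, given as a binary pixel array.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # interior of a frame; 1 = black pixel, 0 = white pixel


def runs(flags: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end inclusive, for each run of True values."""
    out, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            out.append((start, i - 1))
            start = None
    if start is not None:
        out.append((start, len(flags) - 1))
    return out


def grid_positions(lines: Sequence[Sequence[int]], min_fill: float = 0.5) -> List[int]:
    """Estimate grid-line positions along one axis from scan lines.

    A scan line counts as part of a separator if it is entirely white
    (a white area) or at least min_fill black (a broken or dotted line).
    """
    fill = [sum(line) / len(line) for line in lines]
    is_separator = [f == 0 or f >= min_fill for f in fill]
    last = len(lines) - 1
    # Keep only separators strictly inside the frame and place one grid
    # line at the middle of each separator band.
    return [(s + e) // 2 for s, e in runs(is_separator) if s > 0 and e < last]


def find_grid_lines(image: Image) -> Tuple[List[int], List[int]]:
    """Return (horizontal grid-line rows, vertical grid-line columns)."""
    columns = list(zip(*image))
    return grid_positions(image), grid_positions(columns)


def is_table(image: Image) -> bool:
    """Treat the frame as a table if it has grid lines in both directions."""
    horizontal, vertical = find_grid_lines(image)
    return bool(horizontal) and bool(vertical)
```

Merging white bands and broken lines into a single separator test means that a dotted rule surrounded by white space produces one grid line rather than three. A fixed `min_fill` threshold is a crude choice and would miss sparsely dotted lines.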
As described above, conventional page segmentation systems often misidentify a table image which does not contain a full set of horizontal and vertical grid lines. The present invention can also be utilized in such cases to properly identify and process the table image. According to this aspect, the present invention relates to a system for processing a region as a table image in a block selection system for identifying regions of a document image. The invention includes acceptance of user input indicating that a region of a document image is a table image, identification of white areas within the region, identification of broken lines within the region, and calculation of horizontal and vertical grid lines based on the identified white areas and the identified broken lines. As a result of the foregoing features, table information is obtained corresponding to the region, and can be used to further analyze the region for table features such as rows, columns or the like.
This brief summary has been provided so that the nature of the invention may be understood quickly. A more complete understanding of the invention can be obtained by reference to the following detailed description of the preferred embodiments thereof in connection with the attached drawings.