1. Field of the Invention
The present invention relates to an apparatus, a method, and a storage medium for extracting a table region from a document image.
2. Description of the Related Art
Hitherto, there have been region segmentation techniques for analyzing a document image and segmenting the document image into regions classified according to attributes, such as “text”, “graphics”, “line-drawing”, and “table”. U.S. Pat. No. 5,680,479 discusses such a region segmentation technique. According to this region segmentation technique, first, a black pixel connected component is extracted from a binarized document image by performing 8-direction contour tracing of black pixels. Then, a white pixel connected component (hereinafter referred to as an internal region) is extracted from the black pixel connected component by performing 4-direction contour tracing of white pixels. Finally, the binarized document image is segmented into regions classified according to the attributes, such as “text”, “graphics”, and “table”.
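By way of illustration only, the two extraction steps described above can be sketched as follows. The sketch is not the implementation of U.S. Pat. No. 5,680,479: it substitutes a breadth-first flood fill for contour tracing, and it keeps only the connectivity rules from the description (8-connectivity for black pixels, 4-connectivity for white pixels). The names `connected_components`, `internal_regions`, and `IMAGE` are hypothetical, and the treatment of an internal region as a white component that does not touch the bounding box of the black component is an assumption made for this example.

```python
from collections import deque

# Neighbour offsets as (row, col). Black pixels are grouped with 8-connectivity,
# white pixels with 4-connectivity, mirroring the tracing directions in the text.
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def connected_components(grid, value, offsets):
    """Return the components of cells equal to `value` as lists of (row, col)."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            cells = []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(cells)
    return components


def internal_regions(black_component):
    """Return white 4-connected components enclosed by one black component.

    Assumption for this sketch: a white component is "internal" when it lies
    inside the bounding box of the black component without touching its border.
    """
    ys = [y for y, _ in black_component]
    xs = [x for _, x in black_component]
    top, left = min(ys), min(xs)
    height, width = max(ys) - top + 1, max(xs) - left + 1
    # Rasterize only this component so that other components are ignored.
    mask = [[0] * width for _ in range(height)]
    for y, x in black_component:
        mask[y - top][x - left] = 1
    holes = []
    for cells in connected_components(mask, 0, N4):
        on_border = any(y in (0, height - 1) or x in (0, width - 1)
                        for y, x in cells)
        if not on_border:
            holes.append([(y + top, x + left) for y, x in cells])
    return holes


# A tiny binarized "table": a frame with one vertical ruled line (1 = black).
IMAGE = [[int(ch) for ch in row] for row in (
    "1111111",
    "1001001",
    "1001001",
    "1111111",
)]

black_components = connected_components(IMAGE, 1, N8)
cells = internal_regions(black_components[0])
print(len(black_components), len(cells))  # prints "1 2"
```

In this toy image the frame and the ruled line form one black pixel connected component, and the two cells of the table appear as two internal regions.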
The attribute of each segmented region can be used to determine the type of processing, such as optical character recognition (OCR), data compression, data routing, or data extraction, to be subsequently performed on the image in that region. For example, OCR processing is performed on an image in a text region, but is not performed on images in a picture region, a figure region, and the like. Thus, an image processing apparatus utilizing such a system can be configured so that, even when a plurality of different types of document images are input thereto, the input images are automatically processed without operator intervention.
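The attribute-dependent selection of processing described above can be pictured as a simple lookup. The following is a minimal sketch under stated assumptions: only the pairing of text regions with OCR, and the absence of OCR for picture and figure regions, is taken from the description. The names `PROCESSING_BY_ATTRIBUTE` and `plan_processing`, the step labels, and the remaining pairings are hypothetical placeholders.

```python
# Hypothetical attribute-to-processing table. Only "text" -> OCR, and no OCR for
# "picture" and "figure", follow from the description; the other entries are
# placeholders chosen for illustration.
PROCESSING_BY_ATTRIBUTE = {
    "text": ["ocr"],
    "picture": ["compress"],
    "figure": ["compress"],
    "table": ["extract_data"],
}


def plan_processing(regions):
    """Map each (region_id, attribute) pair to the processing steps to run.

    Regions with an attribute missing from the table receive no processing.
    """
    return {
        region_id: list(PROCESSING_BY_ATTRIBUTE.get(attribute, []))
        for region_id, attribute in regions
    }


# Example: three regions produced by a hypothetical segmentation step.
plan = plan_processing([(1, "text"), (2, "picture"), (3, "frame")])
```

In such an arrangement, each region is routed to its processing solely on the basis of its attribute, which is why a misidentified attribute leads directly to the wrong processing being applied.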
When the region segmentation technique discussed in U.S. Pat. No. 5,680,479 is applied to the document image illustrated in FIG. 3, the region segmentation result illustrated in FIG. 4 is obtained. A title region, a horizontal line region, a text region, a picture region, a figure region, a frame region, a table region, and the like included in the document image illustrated in FIG. 3 are separated and grouped according to the types (or attributes) of the regions, and each region is extracted, as illustrated in FIG. 4.
However, when region segmentation processing is performed, a region sometimes cannot be extracted accurately. For example, when a table and a title are located close to each other, as illustrated in FIG. 5, black pixels representing the table may be connected to black pixels representing the title in a scanned image, depending on a scanning condition or a printing condition. In such a case, there is a risk that the region indicated by dashed lines 501 is regarded as one black pixel connected region (the area of a single black pixel connected component), and that the title is identified as a ruled line portion of the table instead of being identified as a text region.
In the case of a document image in which a table is connected to another element, e.g., the figure of an arrow, as illustrated in FIG. 8, the black pixel connected component is a region 801 in which a table portion is connected to a figure portion. However, the white pixel connected components in the region 801 are irregularly arranged, and thus the region 801 is not identified as a table region. That is, when black pixels of an element other than a table are connected to the black pixels of the table, as illustrated in FIGS. 5 and 8, an error sometimes occurs in the extraction of the table region, or a region fails to be identified as a table region.
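The merging of a table with an adjacent element can be reproduced on a toy image. The following sketch is illustrative only. It counts 8-connected black pixel components with a flood fill rather than by contour tracing, and the function `count_black_components` and the patterns `SEPARATE` and `TOUCHING` are hypothetical examples, not data taken from FIGS. 5 and 8.

```python
def count_black_components(rows):
    """Count 8-connected components of '#' cells with an iterative flood fill."""
    grid = [list(row) for row in rows]
    height, width = len(grid), len(grid[0])
    count = 0
    for r in range(height):
        for c in range(width):
            if grid[r][c] != "#":
                continue
            count += 1
            grid[r][c] = "."          # mark as visited
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and grid[ny][nx] == "#"):
                            grid[ny][nx] = "."
                            stack.append((ny, nx))
    return count


SEPARATE = [
    "###....",   # stroke of a title
    ".......",   # clean white gap
    "#######",   # table frame
    "#..#..#",
    "#######",
]

TOUCHING = [
    "###....",
    ".#.....",   # one smeared pixel bridges the gap
    "#######",
    "#..#..#",
    "#######",
]

print(count_black_components(SEPARATE))  # prints "2"
print(count_black_components(TOUCHING))  # prints "1"
```

A single bridging pixel is sufficient to merge the title and the table into one black pixel connected component, after which the two can no longer be classified separately on the basis of connected components alone.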