1. Field of the Invention
The present invention relates to a system for analyzing image data of a document page utilizing a block selection technique, and in particular to a block selection system which is capable of identifying and extracting a text component attached to a frame within a document page.
2. Incorporation by Reference
U.S. patent application Ser. No. 08/596,716, "Feature Extraction System For Skewed and Multi-Orientation Document", and U.S. patent application Ser. No. 08/514,252, "Feature Extraction System", are hereby incorporated by reference.
3. Description of the Related Art
Recently developed block selection techniques, such as the techniques described in the aforementioned U.S. patent applications, are used in page analysis systems to identify and analyze different types of image data within a document page. The identification and analysis results are then used to determine a type of processing to be performed on the image data, such as optical character recognition (OCR), data compression, data routing, etc. For example, image data which is designated as text data is subjected to OCR processing, whereas image data which is designated as picture data is not subjected to OCR processing. As a result, different types of image data can be automatically input and properly processed without an operator's intervention.
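The type-dependent processing described above can be illustrated with a minimal sketch. The names `BlockType` and `choose_processing` are hypothetical and do not appear in the referenced applications. Only the rule that text data is subjected to OCR and picture data is not comes from the description; the choice of compression for picture data is an assumption made for illustration.

```python
from enum import Enum


class BlockType(Enum):
    """Hypothetical image-data designations produced by block selection."""
    TEXT = "text"
    PICTURE = "picture"
    OTHER = "other"


def choose_processing(block_type: BlockType) -> list:
    """Return the processing steps to apply to a block of the given type."""
    if block_type is BlockType.TEXT:
        # From the description: text data is subjected to OCR processing.
        return ["ocr"]
    if block_type is BlockType.PICTURE:
        # From the description: picture data is not subjected to OCR.
        # Assumption for illustration: it is compressed instead.
        return ["compress"]
    # Any other designation is left unprocessed in this sketch.
    return []


for kind in BlockType:
    print(kind.value, choose_processing(kind))
```

In this sketch, a text block that is wrongly designated as a picture never reaches the OCR step, which is the failure mode discussed later in this section.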
The operation of a block selection technique will be generally described below with respect to FIGS. 1-3. FIG. 1 shows page 101 of a representative document. Page 101 is arranged in a two column format and includes title 102, horizontal line 104, several text areas 105, 106 and 107, which include lines of text data, half-tone picture data 108, which includes a non-text graphic image, table 110, which includes text information, framed area 116, half-tone picture area 121 accompanied by caption data 126, and picture areas 132 and 135 accompanied by caption data 137. A block selection technique attempts to define each area of page 101 in accordance with the type of image data therein. As the block selection technique defines each area, a hierarchical tree structure is created, as shown in FIG. 2.
Hierarchical tree structure 200 of FIG. 2 contains a plurality of nodes, each of which represents an identified area, or block, of image data. Each node of the tree contains feature data which defines the features of its corresponding block of image data. For example, the feature data may include block location data, attribute data (specifying image type, such as text, picture, table, etc.), sub-attribute data, and child node or parent node pointers. Child, or "descendant," nodes represent image data which exists entirely within a larger block of image data. A child node is depicted in hierarchical tree structure 200 as a node branching from a parent node. For example, the text blocks within frame 116 are depicted in the hierarchical tree structure as nodes 214 and 216, which branch directly from parent node 212, which represents frame 116. In addition to the feature data described above, a node which represents a text block may also contain feature data defining the block's reading orientation and reading order. These data are useful when performing OCR processing on a page's text blocks.
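The node structure described above may be sketched as follows. This is an illustrative model only, not the representation used in the referenced applications. The class name `BlockNode`, its field names, the helper `text_blocks_in_reading_order`, the bounding-box coordinates, and the reading-order values are all assumptions. Only the relationship of nodes 214 and 216 to parent node 212 is taken from the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)  # eq=False avoids recursive comparison via parent/children
class BlockNode:
    """Hypothetical tree node holding the feature data of one block."""
    node_id: int
    attribute: str                        # image type: "text", "picture", ...
    location: Tuple[int, int, int, int]   # placeholder box: left, top, right, bottom
    sub_attribute: Optional[str] = None
    reading_order: Optional[int] = None   # text blocks only
    orientation: Optional[str] = None     # text blocks only
    parent: Optional["BlockNode"] = field(default=None, repr=False)
    children: List["BlockNode"] = field(default_factory=list)

    def add_child(self, child: "BlockNode") -> "BlockNode":
        """Attach a block that lies entirely within this block."""
        child.parent = self
        self.children.append(child)
        return child


def text_blocks_in_reading_order(root: BlockNode) -> List[BlockNode]:
    """Collect every text node under root, sorted by assigned reading order."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.attribute == "text":
            found.append(node)
        stack.extend(node.children)
    # Nodes without a reading order are placed last.
    return sorted(found, key=lambda n: (n.reading_order is None, n.reading_order or 0))


# Node 212 represents frame 116; nodes 214 and 216 are its text blocks.
# Coordinates and reading-order values are made-up placeholders.
frame_node = BlockNode(212, "frame", (0, 0, 200, 100))
frame_node.add_child(BlockNode(216, "text", (10, 55, 190, 90), reading_order=2))
frame_node.add_child(BlockNode(214, "text", (10, 10, 190, 45), reading_order=1))

ordered_ids = [n.node_id for n in text_blocks_in_reading_order(frame_node)]
print(ordered_ids)
```

Because the traversal selects nodes by their attribute, a text block whose node carries a picture or frame attribute is omitted from the result and therefore from the reading order.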
In conventional block selection techniques, text blocks are often mis-identified in cases where text data lies adjacent to or overlaps other data. This problem is often encountered when processing table images contained in a document image. Due to the small size of table-cell frames, text circumscribed by one of these frames often is "attached" to a side of the frame. Accordingly, this text is identified as part of the frame, as a picture image, or as noise which is subsequently ignored by a block selection technique. Because the text is not identified as a text block, it is not subjected to OCR processing, and the text characters within the block are therefore not accessible to a text editor. Furthermore, the reading order of the document's remaining text blocks will be assigned without consideration of the mis-identified text block. Therefore, because the reading order is mis-assigned, even the properly identified text blocks will be improperly processed. There is, therefore, a need to provide a block selection technique which is capable of identifying and extracting text data which is attached to a table-cell frame.