The present invention relates to an apparatus and method for performing document structure analysis and, more particularly, to an apparatus and method for performing document structure analysis based on the layout and textual features of the document.
Document image processing is a crucial process in the field of office automation. Document image processing begins with the optical character recognition (OCR) phase where a computer and optical imaging system are used to optically scan a paper document to acquire optical image data, convert the optical image data into electrical image data, and process the electrical image data to determine the content of the document. Known commercial OCR systems are capable of segmenting the electrical image data into blocks, lines and words upon recognizing these features. Office automation involves automating the tasks of processing, filing and retrieving documents to increase productivity of the work environment. Therefore, document analysis and understanding are crucial activities for integrated office automation systems.
Document processing generally can be divided into two stages, namely, document analysis and document understanding. Documents can normally be viewed as comprising a geometric (i.e., layout) structure and a logical structure. Extraction of the geometric structure from a document is generally referred to as document analysis. Mapping the geometric structure into a logical structure is generally referred to as document understanding.
Despite major advances in computer technology, the degree of automation in acquiring data and understanding the acquired data continues to be very limited. Most existing document analysis systems are restricted to relatively small application domains. Even though some systems may be adaptable to new application domains, the adaptation is time consuming, and may be as time consuming as developing an entirely new system suitable for the new application domain. A need exists for a system that can be easily adapted to meet the needs of many application domains and that provides a high degree of office processing automation capability. Such a system would need to have the ability to maximize the use of document structure, in terms of both geometrical and logical structure, in analyzing documents. Also, such a system would need to have an exchangeable rule base to allow the system to be easily adapted to a new application domain.
To date, logical layout analysis has not received as much attention as geometrical layout analysis, although a few methods for page understanding have been proposed. The proposed techniques can be classified into two primary classes, namely, methods based on tree transformations and methods based on formatting knowledge. Techniques belonging to the first class attempt to modify the geometrical layout tree by moving nodes in the tree and labeling each node with an indicator of the appropriate logic class according to specific sets of rules. An example of such a method is known as the "Multi-Articled Understanding" approach. A different approach that also falls within this class utilizes preliminary knowledge of the page layout in order to optimize, based on document features, the logic rules that are to be applied. In contrast to these tree-transformation approaches, formatting-knowledge methods are based on, for example, the application of syntactic grammar analysis, the characterization of the most relevant blocks in a page, the application of macrotypographical analysis, etc. These approaches to logical layout analysis have proven to be only marginally successful, at best.
Accordingly, a need exists for a method and an apparatus for performing document structure analysis that overcome the disadvantages of prior document structure analysis techniques.
The present invention is directed to a method and an apparatus for performing document analysis. The apparatus of the present invention comprises logic configured to recognize and label both structures that are common to multiple types of documents and structures that are unique to the particular type of document being analyzed. The logic preferably is a computer that receives the output of an optical character recognition (OCR) system and then analyzes the output in accordance with a document structure analysis routine. For structures that are common to multiple types of documents, various types of tests may be performed by the document structure analysis routine to recognize and label the common types of structures. In order to recognize structures that are unique to the document, the document structure analysis routine utilizes a rule base that is adapted to the particular application domain to analyze structures in the document. The rule base comprises a plurality of rules for testing structures in the document in order to recognize unique, or application-domain-dependent, structures.
In accordance with the preferred embodiment of the present invention, each rule of the rule base includes a file attribute portion, a rule unit portion and a rule logic portion. The rule unit portion associated with a particular rule defines self-related attributes and cross-related attributes associated with the particular rule. The self-related attributes correspond to particular features of structures of a document being analyzed. The cross-related attributes correspond to relationships between particular features of structures of the document being analyzed. The rule logic associated with a particular rule comprises a particular logical expression associated with the particular rule. Preferably, each rule is defined by only one rule logic and by at least one rule unit, and each rule unit is associated with no more than one rule logic.
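By way of illustration only, the three-part rule organization described above might be represented as in the following minimal Python sketch. The class names, the feature keys (`font_size`, `top`), the threshold value and the example "title" rule are all hypothetical and are not taken from the specification; the sketch only shows one rule logic combining the results of one or more rule units, each of which tests self-related and cross-related attributes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical feature record for one document structure (not from the source).
Block = Dict[str, float]


@dataclass
class RuleUnit:
    """One rule unit: self-related tests look at a single structure,
    cross-related tests look at a relationship between two structures."""
    name: str
    self_tests: List[Callable[[Block], bool]] = field(default_factory=list)
    cross_tests: List[Callable[[Block, Block], bool]] = field(default_factory=list)

    def holds(self, block: Block, other: Block) -> bool:
        return (all(t(block) for t in self.self_tests)
                and all(t(block, other) for t in self.cross_tests))


@dataclass
class Rule:
    """One rule: file attributes, at least one rule unit, exactly one rule logic."""
    file_attributes: Dict[str, str]          # assumed to be descriptive metadata
    units: List[RuleUnit]
    logic: Callable[[Dict[str, bool]], bool]  # logical expression over unit results

    def matches(self, block: Block, other: Block) -> bool:
        results = {u.name: u.holds(block, other) for u in self.units}
        return self.logic(results)


# Hypothetical rule: a title has a large font and sits above the body text.
title_rule = Rule(
    file_attributes={"domain": "example", "label": "title"},
    units=[
        RuleUnit("large_font", self_tests=[lambda b: b["font_size"] >= 14]),
        RuleUnit("above_body", cross_tests=[lambda b, o: b["top"] < o["top"]]),
    ],
    logic=lambda r: r["large_font"] and r["above_body"],
)

title = {"font_size": 18.0, "top": 40.0}
body = {"font_size": 10.0, "top": 120.0}
title_matches = title_rule.matches(title, body)
body_matches = title_rule.matches(body, title)
```

Keeping the rule logic separate from the rule units means that, under these assumptions, a rule base for a new application domain could be supplied by exchanging the list of `Rule` objects without altering the code that evaluates them.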
Each document to be analyzed comprises at least one physical block. Each physical block is analyzed and is assigned a label and a likelihood indicator. The label indicates that the physical block corresponds to a particular type of physical block, and the likelihood indicator indicates the probability that the label that has been assigned to the block is correct. These labels and their associated likelihood indicators may then be used to identify the application-domain-dependent structures in the document.
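The assignment of a label and a likelihood indicator to a physical block may be sketched, purely as an example, as follows. The specification does not state how the likelihood is computed; this sketch assumes, hypothetically, that the likelihood is the fraction of tests passed for the best-scoring candidate label. The `PhysicalBlock` fields, the candidate labels and the thresholds are likewise illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class PhysicalBlock:
    # Hypothetical features of a block as delivered by an OCR system.
    text: str
    font_size: float
    top: float  # distance from the top of the page


BlockTest = Callable[[PhysicalBlock], bool]


def label_block(block: PhysicalBlock,
                candidates: Dict[str, List[BlockTest]]) -> Tuple[str, float]:
    """Return (label, likelihood) for the best-scoring candidate label.

    Assumption: likelihood = passed tests / total tests for that label.
    """
    best_label, best_score = "unknown", 0.0
    for label, tests in candidates.items():
        passed = sum(1 for t in tests if t(block))
        score = passed / len(tests) if tests else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical tests for two common block types.
candidates = {
    "title": [
        lambda b: b.font_size >= 14,
        lambda b: b.top < 100,
        lambda b: len(b.text.split()) <= 15,
    ],
    "paragraph": [
        lambda b: b.font_size < 14,
        lambda b: len(b.text.split()) > 15,
    ],
}

heading = PhysicalBlock("Document Structure Analysis", 18.0, 40.0)
caption = PhysicalBlock("short caption text", 10.0, 300.0)

heading_result = label_block(heading, candidates)
caption_result = label_block(caption, candidates)
```

In this sketch the first block passes every "title" test and is labeled with a likelihood of 1.0, whereas the second block only partially satisfies either label and receives a lower likelihood, which a later rule could take into account when identifying application-domain-dependent structures.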
These and other features and advantages of the present invention will become apparent from the following description, drawings and claims.