The present invention can be used, for example, to reverse-engineer the process of encoding data in structured documents and to subsequently automate the process of extracting it. We assume a broad category of structured documents for processing that goes far beyond form processing. In fact, the documents may have flexible layouts and may include varying numbers of pages. The data extraction method (DataX™) employs general templates generated by the Inductive Template Generator (InTeGen™). The InTeGen™ method utilizes inductive learning from examples of documents with identified data elements. Both methods achieve high automation with minimal user input.
Historically, document conversion systems have focused on two major types of applications: full-text databases and form processing. Examples of full-text applications include correspondence tracking and litigation support. Major form processing applications include health care forms, tax forms, and the U.S. Census. There is a major class of applications that lies between these two major types. These are called flex-forms. Examples include bills of lading, invoices, insurance notifications and mortgage documentation. Such applications typically are characterized by a large number of document variants. Today, most form processing applications are designed to recognize and capture data from a small number of form types, usually fewer than ten. Even the most versatile operations typically process at most a few dozen forms. The flex-form applications, in contrast, typically involve hundreds to thousands of document variants.
Within a single document domain, there may be many different document types. A document type is a category of documents that reflects a particular activity and shares the same subset of data elements. In each document type, the information may have different layouts, or in other words, each document type may be represented using different document structures. Each document may include one or many pages. Document processing may depend on how single pages are handled. Page factors may affect document assembly after scanning of hardcopy documents, and processing of recurrent (header and footer) information. The page layout of a document may be common for a category of documents, enabling certain types of processing, such as page identification and simple data extraction. Documents consisting of pages with unstable layouts need a different type of processing that depends both on page context and a global document context.
Data extraction from flex-form documents (or even from a broader class of context-based documents defined below) is a challenging problem because of the indefinite structure of such documents. Data fields do not have fixed locations and characteristics on a page, and may flow between pages of documents. Existing data extraction products only partially automate the process and further lack flexibility and robustness.
The complete data extraction process includes two major components: document definition and data element extraction. Document definition involves encoding knowledge about documents and storing it in the form of templates, rules, descriptions, etc. Data element extraction applies that knowledge to retrieve actual data from documents. Existing products only partially automate the document definition process. For example, some products require explicit rule coding for finding data elements. Others support template design through a graphical user interface, but all data descriptions have to be detected and input by a user. Overall, applying form processing products to more complex documents may require manual preparation of hundreds of templates or rule sets, which makes them prohibitively expensive due to labor costs. The system described below enables, for example, complete automation of the document definition process through machine learning.
More specifically, the existing data extraction products, such as FormMapper™ from Teraform™ or InScript™ from Captiva Software Corporation™, only partially automate the process and lack flexibility and robustness. Teraform™'s FormMapper™ requires explicit rule coding for finding data elements: “Rule Sets contain the instructions which FormMapper™ follows to enhance and zone an input document. They are very similar to directions you might give a friend describing how to get to your house—except in the case of documents, these instructions tell FormMapper™ how to find and zone specific data fields.” In a recent patent assigned to Teraform Inc., U.S. Pat. No. 5,852,676 entitled “Method and Apparatus for Locating and Identifying Fields within a Document,” a system that tolerates variations in documents is disclosed. The described system, however, falls short of fully automating template generation. One shortcoming of the system is that it requires substantial user interaction in the process of characterizing fields for extraction (FIG. 4, step 117 of U.S. Pat. No. 5,852,676); the system relies on a human's capability to generalize and predict. Another shortcoming is that the system employs image segmentation processes (FIG. 4, steps 109, 111 of U.S. Pat. No. 5,852,676) that may still require correction through user input (FIG. 7A, step 466 of U.S. Pat. No. 5,852,676). The system relies entirely on image processing and user input and does not utilize optical character recognition (OCR) data in the process of identifying fields.
Captiva Software Corporation™, created through the merger of FormWare™ Corporation and Wheb Systems™, Inc., is a provider of information capture software. Captiva™'s product, InScript™, through a graphical user interface, supports template design, but all data descriptions have to be discovered and input by a user.
Other patents have been assigned in the field of processing machine-readable forms (forms understood as one-page documents with a static layout). Among other things, however, the methods presented there are not applicable to the broad class of documents considered in this invention. For example, U.S. Pat. No. 5,748,809 entitled “Active Area Identification on a Machine Readable Form Using Form Landmarks” uses graphical landmarks to locate marks (e.g. checkmarks) on forms. Automated detection of landmarks eliminates the need for a user to identify them. In one embodiment of the present invention, a similar motivation led to the use of context phrases, which can be regarded as textual landmarks. Context phrases guide the search for data elements on a document.
While context phrases may superficially resemble landmarks, there are many important differences between the landmarks described in U.S. Pat. No. 5,748,809 and, for example, the context phrases of the current invention. First, the landmarks are graphical constructs generated using connected component analysis, whereas context phrases include textual constructs resulting from OCR processing. Second, since very limited variability between form instances is assumed in the '809 patent (only distortions introduced by scanners), just one model page is used to generate landmark descriptions. Preferred embodiments of the present invention, however, assume significant variability between the document pages, which is related to the variety of designs of documents that share similar information. Therefore, context phrases are preferably automatically induced from many instances of similar documents. Third, in the '809 patent, the same set of landmarks is used in identifying all active areas, whereas context phrases of the current invention may be constructed individually to emphasize local characteristics of each data element. A fourth difference is evident in the type and the number of measurements that are taken to address an active area (data element). U.S. Pat. No. 5,748,809 uses absolute position information for both active area and landmark to establish their distance. However, one embodiment of the present invention described below, for example, uses three different distance measures, two of which are not applicable in the context of graphical landmark representation.
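By way of a non-limiting illustration, the role of a context phrase as a textual landmark may be sketched in Python as follows. The sketch assumes that OCR output is available as a list of words with bounding-box coordinates. The names Word, find_phrase, and nearest_candidate, the caller-supplied candidate test, and the single pixel-distance measure are hypothetical and chosen for illustration only; they are not the distance measures of the embodiment described below.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Word:
    text: str
    x: float  # left edge of the OCR bounding box, in pixels (assumed format)
    y: float  # top edge of the OCR bounding box, in pixels (assumed format)


def find_phrase(words, phrase):
    """Return (start, end) word indices of the first occurrence of `phrase`, or None."""
    tokens = phrase.lower().split()
    n = len(tokens)
    for i in range(len(words) - n + 1):
        if [w.text.lower() for w in words[i:i + n]] == tokens:
            return i, i + n - 1
    return None


def nearest_candidate(words, phrase, is_candidate):
    """Return the candidate word closest to the last word of the context phrase.

    `is_candidate` is a caller-supplied test on the word text (hypothetical);
    words that belong to the phrase itself are never returned.
    """
    span = find_phrase(words, phrase)
    if span is None:
        return None
    start, end = span
    anchor = words[end]
    candidates = [
        w for i, w in enumerate(words)
        if not (start <= i <= end) and is_candidate(w.text)
    ]
    if not candidates:
        return None
    # Illustrative measure only: straight-line distance between box corners.
    return min(candidates, key=lambda w: hypot(w.x - anchor.x, w.y - anchor.y))
```

Because the search is anchored on the phrase rather than on fixed field coordinates, the same call can be applied to pages whose layouts differ, provided the phrase survives OCR.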
Another patent, U.S. Pat. No. 5,721,940 entitled “Form Identification and Processing System Using Hierarchical Form Profiles,” describes both form identification and data extraction from forms using form profiles. The form profile is created from a blank form and is limited to storing descriptions of one-page, static layouts. The method is obviously not applicable to flex-forms or to template acquisition from multi-page, filled-out documents.
U.S. Pat. No. 5,416,849 entitled “Data Processing System and Method for Field Extraction of Scanned Images of Document Forms” utilizes a “viewpoint structure” which characterizes data elements and surrounding regions. The viewpoint structure, which provides a graphical context (“line ends, box ends, crossed lines, and blobs”) for a data element, is utilized in filtering out such graphical elements before OCR is done. It is not used in determining the location of a data element. The location of the data element is found in a way typical of form processing, i.e. based on field coordinates and offsets. The only, and distant, relation to some embodiments of the current invention is the fact that some kind of context is stored for use prior to data extraction. Among other things, the method is obviously not applicable to flex-forms.
In a broad sense, preferred embodiments of the present invention are related to the automated discovery of logical structure in text documents. An aim of automated discovery of logical structure may be to create a hierarchy of logical components of a document from given physical instances of the document. Even though this field has been active for over twenty years, none of the existing methods is readily applicable to the subject problem of the current invention. Most of the existing methods target specific types of documents, such as technical papers or office documents. Each approach assumes some prior knowledge of the style of a document. The required information about the document style varies from very specific to general ways of conveying logical information through formatting. Another fundamental distinction in various approaches stems from the relative roles of content and layout in the definition of a logical structure. A first logical structure is more content-oriented than a second logical structure if its definition relies more heavily on internal meaning (wording). A more layout-oriented structure has a definition that relies more heavily on visual presentation. Therefore, in content-oriented methods, OCR data is analyzed and utilized; whereas, in layout-oriented methods, image-processing methods, such as segmentation, projection profiles, and texture analysis, play the major role.
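As a non-limiting illustration of the layout-oriented methods mentioned above, a horizontal projection profile may be sketched in Python as follows. The sketch assumes a binarized page image given as rows of 0 (background) and 1 (ink); the function names and the min_ink threshold are hypothetical. It uses pixel counts only and no OCR data, which is the distinction from content-oriented methods drawn above.

```python
def horizontal_projection(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def text_line_bands(profile, min_ink=1):
    """Return (first_row, last_row) pairs for runs of rows holding at least `min_ink` ink pixels."""
    bands = []
    start = None
    for r, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = r                      # a band of inked rows begins
        elif ink < min_ink and start is not None:
            bands.append((start, r - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom of the image
        bands.append((start, len(profile) - 1))
    return bands
```

Rows whose count falls below the threshold separate one band of text from the next, so the bands approximate text lines on a deskewed page.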