The present inventive subject matter relates generally to the art of automated document processing. Particular but not exclusive relevance is found in connection with parsing and/or interpretation of documents such as sales receipts, invoices, tables, lists, healthcare forms, etc. The present specification accordingly makes specific reference thereto at times. However, it is to be appreciated that aspects of the present inventive subject matter are equally amenable to other like applications.
Documents often consist of multiple sub-structures, referred to herein as “items.” For example, a book may consist of multiple paragraphs; an invoice may consist of a header, an itemized list of purchases, and a footer; a healthcare claims form may consist of a multitude of items specifying various information about the patient, insurance coverage, treatment, care provider, etc. Complete document interpretation generally involves finding all or a subset of the items and assigning interpretations, or functional roles, to them. These roles supply meaning to the items and allow them to be used in higher-level processing, such as data mining. As an example, an item which contains the number “10.00” without a functional role is not particularly useful, except perhaps for text search. The same item assigned the role of “price” is much more useful and can be used, for example, for storing in or matching to a database, or for applying business rules to a purchase, etc.
Currently, many individuals and/or businesses may perform the aforementioned parsing and/or interpretation task manually. This can lead to significant costs of document processing. For example, it has been estimated that the cost of processing a single invoice is not insignificant. In some cases, large businesses may process tens of thousands of invoices per day, which can result in disadvantageously high operating costs. Accordingly, some may find it desirable to reduce the manual effort involved in parsing and/or interpreting documents.
Commonly, documents consist of many individual items. One notable hurdle in interpreting such documents is that these items are usually not independent. For example, in many documents no two items may occupy the same region; therefore, the end of one item may determine the beginning of another. Although this interaction may seem somewhat trivial at first, it can present a relatively daunting challenge when item boundaries are ambiguous and/or cannot be detected reliably. Other complex interactions between items may include alignment and common font, or consistent differences in font size used to indicate subordination relations between items. An example of an even more complex interaction is that in invoices, the prices of individual items generally have to sum up to the total amount due.
When interactions between items are present in a document, local decisions about the items can become brittle and unreliable, and it can be beneficial in this case to formulate a problem in terms of optimizing a global objective function. However, when a document interpretation problem is formulated in this manner, optimizing the objective globally by brute force can be infeasible and/or impractical for long documents with many items, and particularly for multi-page documents.
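The contrast between local decisions and a global objective can be illustrated with a minimal sketch. It is offered only as an illustration under stated assumptions, not as the method disclosed herein: the function name `best_interpretation`, the role names, the local scores, and the consistency bonus of 5.0 are all hypothetical. The sketch scores every joint assignment of roles by brute force, which is exactly the exhaustive search that becomes impractical for long documents.

```python
from itertools import product

def best_interpretation(candidates, roles=("price", "total", "other")):
    """Brute-force search over joint role assignments (illustrative only).

    candidates: list of (amount_in_cents, {role: local_score}) pairs.
    The number of assignments examined is len(roles) ** len(candidates).
    """
    best, best_score = None, float("-inf")
    for labels in product(roles, repeat=len(candidates)):
        # Sum of local (per-item) scores for this joint assignment.
        score = sum(c[1].get(r, 0.0) for c, r in zip(candidates, labels))
        totals = [c[0] for c, r in zip(candidates, labels) if r == "total"]
        prices = [c[0] for c, r in zip(candidates, labels) if r == "price"]
        # Hypothetical interaction term: reward assignments in which the
        # prices sum to exactly one total.
        if len(totals) == 1 and prices and sum(prices) == totals[0]:
            score += 5.0
        if score > best_score:
            best, best_score = labels, score
    return best

# Hypothetical local scores; amounts are in cents to avoid rounding issues.
candidates = [
    (1000, {"price": 1.0, "total": 0.2}),
    (250,  {"price": 1.0, "total": 0.1}),
    (1250, {"price": 0.9, "total": 0.8}),
]

# Deciding each item on its own labels all three amounts as prices.
local = tuple(max(scores, key=scores.get) for _, scores in candidates)

# Scoring assignments jointly recovers 1250 as the total of 1000 + 250.
joint = best_interpretation(candidates)
```

With three candidates and three roles, only 27 assignments are scored; the count grows exponentially with the number of items, which is why exhaustive search of this kind does not scale to multi-page documents.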
Previously, some approaches have been proposed to accomplish complete document interpretation, which involve detecting the individual items in a document one by one, independently from each other. In one example of such an approach, the items of interest in a document are extracted using tags. For example, the total amount due on an invoice may have the word “TOTAL” as a tag; locating the tag in a document provides a cue for the location of the item of interest. However, in this case, the items are matched independently from each other.
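A minimal sketch of such tag-based extraction follows, for illustration only. The tag table `TAGS`, the amount pattern, and the function name `extract_by_tags` are hypothetical and are not taken from any particular prior system. Each role is located on its own, with no reference to what was found for the other roles.

```python
import re

# Hypothetical tag table: functional role -> tag word expected before the item.
TAGS = {"total": "TOTAL", "tax": "TAX"}

# Hypothetical pattern for a monetary amount such as "10.00".
AMOUNT = re.compile(r"\d+\.\d{2}")

def extract_by_tags(lines, tags=TAGS):
    """Locate each tagged item independently of all the others.

    For each role, take the first line containing the tag and return the
    first amount that follows the tag on that line.
    """
    found = {}
    for role, tag in tags.items():
        for line in lines:
            pos = line.upper().find(tag)
            if pos < 0:
                continue
            match = AMOUNT.search(line, pos + len(tag))
            if match:
                found[role] = match.group()
                break
    return found

receipt = [
    "Widget            10.00",
    "SUBTOTAL          10.00",
    "TAX                0.80",
    "TOTAL             10.80",
]

result = extract_by_tags(receipt)
```

In this made-up receipt, the tag “TOTAL” also occurs inside “SUBTOTAL,” so the independent matcher reports 10.00 rather than 10.80 as the total. Nothing in the procedure checks the result against the other items found.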
In other previous works, complete documents, as well as sub-structures of interest, are represented by graphs. These graphs encode relations such as adjacency, alignment, and reading order. Sub-graph matching is used to find items of interest. Again, these items are matched independently from each other. It is assumed that the conditions for matching are specified so that multiple hypotheses for each item do not have to be considered. As a result, it is often difficult and/or impractical to specify these conditions accurately enough automatically; accordingly, a significant amount of expert user input may be demanded.
A general criticism of “independent matching” approaches is that local matches can often be ambiguous. In such cases, determining the best overall interpretation of a document, and the globally optimal locations of each item, generally benefits from accounting for the interactions between items. Independent matching approaches tend to perform poorly in these circumstances.
In yet other prior work, an assumption of class-conditional independence has been used. In this work, each document is classified into one of several predefined styles. Within each style, items are assumed to be independent. One drawback of this approach is that mixed styles, and styles that cannot be decomposed into non-interacting items, are generally not allowed. Many real-life documents therefore cannot be interpreted. Another disadvantage is that the styles, and the extraction procedures for each style, have to be specified in advance; accordingly, new or unexpected styles generally cannot be handled directly.
Another kind of general approach previously proposed involves segmenting the document into individual items first, and then determining the type (or functional role) of each segment. In accordance with such approaches, it is hoped that the segments indeed correspond to items in a one-to-one manner. In one example of a segmentation-based approach, tables are parsed using alignment and whitespace to detect the item boundaries. In other examples, bottom-up segmentation is first used to detect item boundaries; in some cases, the items are then assigned functional roles using constraint satisfaction. Segmentation is often done greedily for efficiency considerations, although finding a globally optimal segmentation is also possible.
One disadvantage of segmentation-based approaches is that segmentation errors are generally non-recoverable: if a given item is not represented by a single segment, correctly labeling that item can be extremely difficult, if not impossible. In contrast, the approach proposed herein avoids local segmentation decisions; as a result, it performs well even when segmentation is ambiguous locally. Some segmentation methods use backtracking to correct certain kinds of segmentation errors, but these decisions are made locally and independently, without reference to any global objective function. As a result, in general, only a limited number of segmentation errors may be corrected. Another disadvantage of segmentation-based approaches is that in some documents, there is not enough information in the layout to perform segmentation reliably. For example, in FIG. 1, there is no clear separation between individual line items (in particular, there are no rule lines, and line spacing between items is the same as line spacing between the text lines within each item).
Accordingly, a new and/or improved method and/or system or apparatus for parsing and/or interpreting documents is disclosed which addresses the above-referenced problem(s) and/or others.