The present exemplary embodiments relate generally to document processing. They find particular application in conjunction with finding repeated structure within document images for data extraction, and will be described with particular reference thereto. However, it is to be appreciated that the present exemplary embodiments are also amenable to other like applications.
Many documents contain repeated structure. Forms, templates, and letterheads are examples of structure repeated across multiple documents. Layouts of headings, body text, and captions are examples of structure repeated across pages within a single document. Bulleted lists and tabular data are examples of structure repeated within a single page of a document. Prominent examples of the latter are invoices, receipts, and many healthcare forms. In these cases, instances of repeated structure of interest are called records. For example, in an invoice each record corresponds to a single product or service purchased. Records are composed of individual fields (such as ‘unit price’ or ‘quantity’). These fields are laid out in a consistent (but unknown in advance) spatial structure.
Repeated structure conveys vital perceptual and semantic cues to a human reader. The relationships among the elements are encoded implicitly via the spatial structure. Identifying and extracting repeated structure is useful in a variety of applications. For example, product names extracted from an invoice can be matched to a database to verify receipt before remitting payment. As another example, in the healthcare domain, blood test results often describe each test performed. The results of these individual tests can be extracted, accumulated, and plotted as a function of time to display trends.
Despite its recognized value in business workflows, data extraction suffers from inadequate or unreliable levels of automation. Consequently, data extraction is still largely done manually. However, the cost of manual data extraction can be quite high. For example, manually processing a single invoice can cost up to 9 euros. For large businesses that process tens of thousands of invoices per day, manual processing can dramatically increase the cost of operations. Accordingly, there is a strong need for a reliable, automatic approach to data extraction.
Automatic extraction of repeated structure from documents is a challenging task for a number of reasons. Variations in the content of individual fields induce significant variability in the structure. This variability includes changes in the field's visual appearance, as well as width and height. These changes, in turn, induce variations in the relative placement of other fields and in the presence and appearance of field separators. Many cues typically used for data extraction become difficult to exploit in these circumstances. For example, while whitespace gaps form a useful cue for field boundaries, they may be absent in some documents due to overlaps between different fields (interlacing). As another example, if different records occupy a different number of text lines, the periodicity structure will be disrupted as well, and the relative positions of different fields will not be consistent across items. When dealing with repeated structure across multiple documents, variations in the generation of these documents present an additional problem. For example, while the layout of a company's invoices may be centrally specified, local branches may deviate from this layout in different ways. Similarly, although standards governing the layout of medical claims forms exist, different hospitals deviate from this layout in unpredictable ways. In addition, variations and problems in the scanning process (such as paper slipping or other distortions) are also unpredictable, vary from document to document, and introduce an additional source of layout variability. All these difficulties make data extraction difficult to approach with shrink-wrapped automated solutions.
One approach to automatic data extraction is called ‘wrapping’. In wrapping, one instance of the structure to be extracted is marked by a user. Subsequently, the approach ‘wraps’ (finds and extracts) additional instances of this structure. This wrapping is based on subgraph matching. The biggest drawback of this approach is that it does not learn from the annotation specified by the user. Instead, the user manually specifies the conditions for subgraphs to match. This requires significant effort and technical expertise. In addition, the specific set of conditions used in the wrapping approach may not be powerful or flexible enough for some problems.
Another approach, which is specific to extracting information from invoices, extracts consecutive lines that share similar token structure. Similarity is measured by token content (numerical/alphabetic/alphanumeric) as well as by token alignment. In this approach, ‘main lines’ are identified as text lines containing a real number (which usually corresponds to the price). Projection profiles on the main lines are then used to find columns. Periodicity structure of the main lines is used to assign the invoice to one of several predefined types.
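By way of illustration only, the main-line and projection-profile steps of this approach might be sketched as follows. The token representation (text, left edge, right edge), the regular expression used to recognize a real number, and all function names are assumptions made for this sketch; they are not taken from the approach itself.

```python
import re

# Hypothetical pattern for a "real number" token such as 4.50 or 1,299.00.
REAL_NUMBER = re.compile(r"^(?:\d{1,3}(?:,\d{3})+|\d+)\.\d+$")


def is_main_line(tokens):
    """A text line counts as a 'main line' if any token looks like a real number."""
    return any(REAL_NUMBER.match(text) for text, _x0, _x1 in tokens)


def projection_profile(lines, width):
    """For each x position, count how many of the given lines have a token there."""
    profile = [0] * width
    for tokens in lines:
        covered = set()
        for _text, x0, x1 in tokens:
            covered.update(range(x0, x1))
        for x in covered:
            profile[x] += 1
    return profile


def find_columns(profile):
    """Return (start, end) runs of non-zero profile; empty gaps separate columns."""
    columns, start = [], None
    for x, count in enumerate(profile):
        if count and start is None:
            start = x
        elif not count and start is not None:
            columns.append((start, x))
            start = None
    if start is not None:
        columns.append((start, len(profile)))
    return columns


# Made-up text lines: each token is (text, left x, right x) in character cells.
lines = [
    [("Widget", 0, 6), ("2", 10, 11), ("4.50", 15, 19)],
    [("blue,", 0, 5), ("large", 6, 11)],  # continuation line, no real number
    [("Gadget", 0, 6), ("10", 10, 12), ("12.00", 14, 19)],
]
main_lines = [tokens for tokens in lines if is_main_line(tokens)]
columns = find_columns(projection_profile(main_lines, width=20))
```

In this sketch the continuation line is excluded before the profile is computed, so its tokens do not fill the whitespace gap between the first two columns. The sketch also relies on the columns being separated by empty gaps in the profile.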
Yet another approach, again specific to extracting information from invoices, builds on ideas similar to those of the previous approach. It uses a database of known invoice structures to extract structure from invoices of familiar types, and uses token alignment as a cue to identify repeated structure in unfamiliar invoices.
Although some of the approaches described above are used in practice, it is desirable to extend their range of applicability. Namely, most of the current approaches assume fields are organized in widely separated columns and that no interlacing is present. Further, most current approaches assume records are periodic (i.e., the heights of line items are equal). As a result, when these assumptions fail, documents cannot be processed.
One reason for these limitations is that the cues used for making the decisions are relatively weak. For example, while the alignment of different fields may indicate a match, it is by no means a guarantee of a match (because non-matching fields are often aligned by accident, while matching fields may occasionally be misaligned). Another reason is that the decision function used to integrate the cues is often ad hoc, relying on a series of thresholds to select a set of matches. Using hard thresholds is particularly problematic since, as mentioned above, any single cue might fail for a particular document.
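The weakness of hard thresholds can be illustrated with a minimal sketch. The two cues, the pixel threshold, and the example values below are all hypothetical and chosen only to show the failure mode; they do not describe any particular existing system.

```python
def hard_threshold_match(align_offset, same_token_type, max_offset=3):
    """Ad hoc rule: a pair matches only if every cue passes its own fixed threshold."""
    return align_offset <= max_offset and same_token_type


# A true match whose fields drifted by 5 pixels (for example, paper slipped while scanning).
true_pair = {"align_offset": 5, "same_token_type": True}
# A non-matching pair that happens to be perfectly aligned by accident.
accidental_pair = {"align_offset": 0, "same_token_type": True}

true_pair_accepted = hard_threshold_match(**true_pair)              # rejected
accidental_pair_accepted = hard_threshold_match(**accidental_pair)  # accepted
```

Here the single failing alignment cue is enough to reject the true match, while the accidentally aligned pair passes every threshold. Loosening `max_offset` to accept the first pair would admit still more accidental alignments.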
The present disclosure contemplates new and improved systems and/or methods for document image analysis, including systems and/or methods that remedy these and other problems.