The present inventive subject matter relates generally to the art of automated document processing. Particular but not exclusive relevance is found in connection with parsing of semi-structured documents, e.g., sales receipts, invoices, tables, lists, etc. The present specification accordingly makes specific reference thereto at times. However, it is to be appreciated that aspects of the present inventive subject matter are equally amenable to other like applications.
It has been found beneficial in some cases to parse a semi-structured document into simple, salient tokens of information, referred to as “fields”. For example, typical fields found on a receipt can include item numbers, item descriptions and prices for a series of purchased items reflected on the receipt. Parsing semi-structured documents into fields can be a challenging problem, e.g., because of high variability in the layout of different documents and because of strong and/or complex interactions among fields. Commonly, fields may interact in pairs, e.g., where the end of one field determines the beginning of another. Fields may also commonly interact in groups. That is to say, a semi-structured document may include an array of records having a similar layout style, where each record consists of a group of fields. For example, in the case of a sales receipt or the like, each sale item on the receipt may correspond to one record which includes a group of fields, e.g., an item number, an item description, a price, etc. Because of these characteristics, an optimization of relevant criteria has traditionally involved searching for a best or optimal parsing solution over an entire space of, or a vast number of, hypothetical parsing solutions. Such an operation can be a prohibitive task for long, multi-page documents.
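By way of illustration only, the notions of a field and of a record made up of a group of fields can be sketched in Python as follows. This is a minimal sketch and not the disclosed method: the `Field` class, the `LINE` pattern, the `parse_record` helper and the sample receipt lines are all hypothetical, and the sketch assumes that each record occupies one line laid out as item number, item description, price. A fixed pattern of this kind fails as soon as the layout varies, which is the difficulty noted above.

```python
import re
from dataclasses import dataclass


@dataclass
class Field:
    kind: str  # hypothetical type label, e.g. "item_number"
    text: str  # the characters the field covers in the document


# Assumed layout (illustrative only): item number, description, price.
LINE = re.compile(
    r"^(?P<item_number>\d+)\s+(?P<description>.+?)\s+(?P<price>\d+\.\d{2})$"
)


def parse_record(line):
    """Return the group of fields for one record, or None if the line
    does not follow the assumed layout."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    return [Field(kind, text) for kind, text in match.groupdict().items()]


# Hypothetical receipt text; the last line is not a record.
receipt = [
    "1001  Coffee beans 1kg   12.50",
    "1002  Paper filters    3.99",
    "Thank you!",
]

# Each record is a group of fields sharing the same layout style.
records = [r for r in map(parse_record, receipt) if r is not None]

for record in records:
    print([(f.kind, f.text) for f in record])
```

In the sketch, the pairwise interaction between fields is visible in the pattern itself: where the description ends is determined by where the price begins.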
Currently, many individuals and/or businesses may perform the aforementioned parsing task manually. This can lead to significant document processing costs. For example, it has been estimated that the cost of processing a single invoice is not insignificant. In some cases, large businesses may process tens of thousands of invoices per day, which can result in disadvantageously high operating costs. Accordingly, some may find it desirable to reduce the manual effort involved in parsing semi-structured documents.
Some automatic approaches have been proposed and/or developed to address semi-structured document parsing. In accordance with some of these approaches, candidate fields are first identified, either one-by-one or by segmentation of an entire document, and a type (or functional role) is subsequently assigned to each of them. Some such approaches make local decisions and often perform poorly in cases where ambiguities exist regarding the type assignment. When such approaches ignore the interactions among fields, they can be susceptible to the significant variations in the layout of different documents. Moreover, some may not recover from errors in candidate field identification. Finally, some previous approaches are restricted to single-page documents and, thus, cannot handle cases, e.g., where a table or invoice is split across more than one page.
Accordingly, a new and/or improved method and/or system or apparatus for parsing semi-structured documents is disclosed which addresses the above-referenced problem(s) and/or others.