The present exemplary embodiments relate to document information extraction. They find particular application in connection with systems and methods which extract information from documents, and will be described with particular reference thereto. It is to be understood, however, that they also find application in other usage scenarios, and are not necessarily limited to the aforementioned exemplary embodiments.
Currently, there are numerous methods for capturing data from documents and transferring this information into a software application. For example, it is common practice for accountants to use a paper-based process of capturing receipt data and manually transferring the receipt data into their accounting software. The process involves collecting receipts over some period of time and then setting aside a concentrated block of time to enter data from the receipts into the accounting software. This process, however, is very time-consuming and requires a large amount of manual interaction to capture and transfer the document data.
To make this process more efficient, document processing systems are utilized to automatically capture and store the information from documents. In typical document processing systems, information is extracted from forms by consulting a model of the form, which specifies which information fields are to be found and where they are located on the document. In this approach, each “expression” of a particular document genre is treated as a separate form, and requires a separate model. Thus, in a workflow with a mixture of different forms, the first step in the document processing would be a classification step to select the correct model to use for a specific document. This would be followed by an extraction step which would use specific information about that form's layout to identify information zones and extract the information from each zone.
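The two-step, model-based workflow described above (classification followed by zone extraction) can be illustrated with a minimal sketch. It is not taken from any particular document processing system. The names `Zone`, `FormModel`, `classify`, `extract`, and `process` are hypothetical, the page is assumed to be already-recognized lines of text, and classification is reduced to matching a fixed signature string, which is an assumption made only to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Hypothetical information zone: a fixed span on a fixed line of the page."""
    field: str
    row: int    # index of the text line holding the field
    start: int  # first column of the zone (inclusive)
    end: int    # last column of the zone (exclusive)


@dataclass(frozen=True)
class FormModel:
    """Hypothetical form model: which fields exist and where they are located."""
    name: str
    signature: str  # text assumed to identify this form (illustrative only)
    zones: tuple


def classify(page, models):
    """Classification step: return the first model whose signature is on the page."""
    for model in models:
        if any(model.signature in line for line in page):
            return model
    return None


def extract(page, model):
    """Extraction step: read each field from the zone the model specifies."""
    fields = {}
    for zone in model.zones:
        line = page[zone.row] if zone.row < len(page) else ""
        fields[zone.field] = line[zone.start:zone.end].strip()
    return fields


def process(page, models):
    """Select the form model for a page, then extract its fields."""
    model = classify(page, models)
    if model is None:
        raise LookupError("no form model matches this document")
    return extract(page, model)


# Illustrative data only: one made-up form layout and one matching page.
INVOICE_MODEL = FormModel(
    name="invoice_a",
    signature="ACME INVOICE",
    zones=(Zone("date", 1, 6, 16), Zone("total", 2, 7, 20)),
)
page = ["ACME INVOICE", "Date: 2020-01-15", "Total: 42.50"]
fields = process(page, [INVOICE_MODEL])
```

In this sketch every distinct layout needs its own `FormModel`, and extraction reads each field from a fixed position on the page, so it depends on the classification step having selected the right model.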
However, this approach does not work well for documents that are not arranged in a known format, such as, but not limited to, receipts. Most receipts are not typical forms with specific areas which get filled in with specific pieces of information. They are better viewed as reports generated by computer programs, which can change constantly to add advertising, coupons, information about special offers, and other such printed material. In addition, there are many different “layouts” for receipts, which would make the task of identifying each “form” and building a model for it quite expensive, and would make the classification task difficult.
The present exemplary embodiment provides a new system and method which overcome the above-referenced problems and others.