Form processing requires extracting data associated with specified fields within a document and storing that data in a machine-readable format. Extracted information must be categorized as belonging to specified fields. Because of the variety of fields, schemas and data types available, form processing is often limited to specialized computing systems that automatically process forms for a particular industry. This automatic processing is subject to errors at two levels. On a first level, the input data itself may be corrupt or contain errors. On a second level, the categorization may be incorrect. Several approaches to minimizing these errors have been attempted.
One approach to minimizing form processing errors includes creating a web form that associates data with an input field before the data is input by a user. However, a user is not always available to input data into the categorized form fields. For industries where different entities generate large numbers of documents automatically, this approach is not scalable.
For example, sales are often processed by two different entities residing in two independent organizations. Rather than having immediate transfers of money with each transaction, many small businesses record each transaction in an accounts receivable ledger and generate an invoice once a month. Under this system, goods and services are transferred separately from the financial transactions, so payment is not made at the time of receiving the goods or services. Instead, payment occurs when an accounts payable entity matches a purchase order and a receiving report with data from an invoice. In this example, automatic form processing is desirable because many small business suppliers do not have the tools or personnel to create an integrated system with each buyer.
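The matching step performed by the accounts payable entity can be illustrated with a minimal sketch. The record types, field names, and the `three_way_match` function below are hypothetical and chosen only for illustration; a real accounts payable system would likely track many more fields and tolerances.

```python
from dataclasses import dataclass
from decimal import Decimal


# Hypothetical record types; the field names are illustrative assumptions.
@dataclass(frozen=True)
class PurchaseOrder:
    po_number: str
    quantity: int
    unit_price: Decimal


@dataclass(frozen=True)
class ReceivingReport:
    po_number: str
    quantity_received: int


@dataclass(frozen=True)
class Invoice:
    po_number: str
    quantity: int
    amount: Decimal


def three_way_match(po, report, invoice):
    """Return a list of discrepancies; an empty list means payment can proceed."""
    problems = []
    # All three documents must refer to the same purchase order.
    if not (po.po_number == report.po_number == invoice.po_number):
        problems.append("purchase order numbers differ")
    # The supplier may only bill for what was actually received.
    if report.quantity_received != invoice.quantity:
        problems.append("invoiced quantity differs from quantity received")
    # The billed amount must agree with the price on the purchase order.
    if invoice.amount != po.unit_price * invoice.quantity:
        problems.append("invoice amount differs from purchase order price")
    return problems


po = PurchaseOrder("48213", 10, Decimal("19.99"))
report = ReceivingReport("48213", 10)
good_invoice = Invoice("48213", 10, Decimal("199.90"))
bad_invoice = Invoice("48213", 10, Decimal("209.90"))

print(three_way_match(po, report, good_invoice))
print(three_way_match(po, report, bad_invoice))
```

Amounts are held as `Decimal` values rather than floats so that currency comparisons are exact. Every field compared here must first be extracted correctly from the invoice, which is why extraction and categorization errors directly delay payment.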
One approach to speeding up this process is to use an invoice management system that performs optical character recognition (OCR) on the invoices and categorizes their content based on the OCRed information. For example, in the invoice management industry, a specialized machine scans an OCRed document for a specific word such as “Price” and associates the next set of numbers with the price field. In the same manner, the machine scans the document for “Purchase Order Number” and associates the next set of numbers with the purchase order number field. This approach is subject to error because a set of numbers following a field name is not necessarily associated with that field. For example, a document containing the phrases “Price” and “Purchase Order Number” may have price content close to the phrase “Purchase Order Number” and vice versa. Additionally, some invoices may contain multiple words that define a field category, or no words that define a specified field. Such conflicts result in erroneous categorizations.
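The keyword-based categorization described above, and the way it fails, can be sketched as follows. This is a simplified illustration only, not the implementation of any particular invoice management system; the function name `extract_after_keyword` and the sample OCR text are assumptions made for the example.

```python
import re


def extract_after_keyword(text, keyword):
    """Return the first number that follows `keyword` in `text`, or None.

    Hypothetical helper mirroring the naive rule: find the field name,
    then take the next set of numbers as that field's value.
    """
    pos = text.find(keyword)
    if pos == -1:
        return None  # the invoice contains no word defining this field
    match = re.search(r"\d+(?:\.\d+)?", text[pos + len(keyword):])
    return match.group(0) if match else None


# Layout 1: each label is immediately followed by its own value.
labelled = "Purchase Order Number: 48213\nPrice: 19.99"

# Layout 2: labels form a header row and the values sit on the next row,
# so the number nearest to "Price" is actually the purchase order number.
tabular = "Purchase Order Number Price\n48213 19.99"

for name, text in (("labelled", labelled), ("tabular", tabular)):
    po_number = extract_after_keyword(text, "Purchase Order Number")
    price = extract_after_keyword(text, "Price")
    print(name, "-> PO number:", po_number, "| price:", price)
```

In the labelled layout both fields are extracted correctly. In the tabular layout the same rule assigns the purchase order number to the price field, which is the kind of erroneous categorization discussed above.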
The current approach for handling conflicts of this nature involves outsourcing all invoice documents to a third party that validates the OCRed data extraction by visual inspection. If anything is wrong with the OCR output or the categorization, the third party is required to fix the problem to the best of its ability. This approach remains susceptible to the automation errors associated with extracting and categorizing data. It is also susceptible to human error, especially because the third party does not have firsthand knowledge of the forms being read.
In addition to errors, the current approach has drawbacks related to repetitive data input and data transfer. Specifically, the data must be entered by a supplier and transferred to a buyer as an invoice. The data is then extracted and transferred to a third party, where additional data entry may need to occur during the manual validation process. Then, the data is transferred back to the buyer, where additional data entry may again need to occur. For example, if the invoice amount erroneously differs from the purchase order price, additional data entry must occur, or the supplier needs to generate an entirely new invoice. Thus, the system breaks down when errors occur. An improvement is desired to eliminate the redundant computations associated with data entry and data transfer.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.