It is often necessary for companies to extract information from various types of documents such as invoices, purchase orders, correspondence, etc., and to enter this information into their own data systems so that they can use the information for various enterprise operations and processes. While the process of extracting document information, converting the extracted document information into a computer-usable form and entering the converted information into a data processing system was historically performed manually, computer-based systems have been developed to automate this process. Computer-based systems are available, for example, to perform optical character recognition (OCR) on images of scanned documents and to thereby generate digital data from the images. The strings of recognized characters can be processed according to a predetermined set of algorithms to identify information that is represented by the character strings. OCR techniques are known to those skilled in the art and thus are not further described herein.
While automated data extraction systems can reduce the burden of having to manually identify and re-enter the information that is found in business documents, these systems still have some drawbacks. For instance, traditionally, it has been necessary to program these systems with an extensive set of rules that are used to determine the type of information that is represented by a particular string of characters. For example, a rule may be needed to determine that a character string “New York” represents a city name, or a rule may be needed to determine that a character string “Smith” represents a person's surname. This programming requires a great deal of time and effort, and any exception to the predefined rules may “break” the algorithm and require special handling. In addition to the high cost of programming these systems, it is usually necessary to perform manual validation of the results generated by the systems, which further increases the cost.
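By way of illustration only, a rule-based classifier of the kind described above might be sketched as follows. The rule tables, the field labels and the function name `classify` are hypothetical and are not taken from any particular system; the sketch is intended only to show how each recognized character string must be anticipated by a hand-written rule, and how a string that falls outside the rules goes unrecognized.

```python
# Hypothetical hand-written rule tables. Every value the system should
# recognize has to be entered by a programmer ahead of time.
CITY_NAMES = {"New York", "Chicago"}
SURNAMES = {"Smith", "Jones"}


def classify(text):
    """Return a field type for a recognized character string, or None.

    This is an assumed, minimal stand-in for a predefined rule set; it
    performs exact matching only.
    """
    if text in CITY_NAMES:
        return "city"
    if text in SURNAMES:
        return "surname"
    # No rule matched: the string is an exception needing special handling.
    return None


if __name__ == "__main__":
    for s in ["New York", "Smith", "New Y0rk"]:
        print(s, "->", classify(s))
```

In this sketch, a single misrecognized character (for example, a zero in place of the letter "o" in "New Y0rk") is enough to fall outside every rule, which is one way an exception can break such an algorithm and call for manual validation.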