This application relates generally to image generation methods, and more particularly to the automatic generation of images of documents such as forms populated with data for testing of automatic document imaging and processing systems and methods.
In spite of the increasing prevalence of electronic data processing and communications systems and their widespread use in business, the goal of a “paperless” business environment has yet to be realized. Many modern businesses are still inundated daily with increasing volumes of paper that must be processed as part of their normal business activities, and from which information must be extracted and utilized. For example, many businesses still rely upon paper forms for documents such as purchase orders, invoices, and the like. Processing such documents is labor-intensive, time-consuming and inefficient. Moreover, the transfer of information from such documents to systems such as accounts payable or electronic order processing systems is subject to error. For businesses such as large retail chains, processing the large number of invoices that originate from the many different vendors who service the business, or from service providers to the business such as electricians, building maintenance providers, etc., represents a significant administrative burden. Data must be extracted from the invoices and validated against purchase orders and vendor data, and line items must be checked for correct units of measure and price, for instance. Moreover, businesses constantly receive documents of other kinds from a broad variety of different sources that must be captured electronically and processed. While current accounts payable, order processing, and other such electronic data processing systems do a good job of reducing the administrative burden of business procedures, a significant problem exists in extracting data from paper documents for input into electronic data processing systems.
The burden of manually extracting data from paper documents for input into electronic data processing systems has led to the development of a number of different products and systems for optically processing paper documents to extract and digitize information from the documents. Optical character recognition (OCR), image character recognition (ICR) and similar approaches can read and extract data from documents. However, while information on a document may be correctly read, ascribing the appropriate meaning to the information is a formidable task. Most types of documents, such as invoices, are not standardized, and relevant information may appear on the document at different locations and in different formats. Different vendors may have invoice forms that have widely different appearances, even for the same kinds of products. Moreover, the paper forms may be creased or skewed in an optical scanner, or may include other types of artifacts such as extraneous markings, handwriting, or inked date or “received” stamps.
Similar problems exist in processing forms other than invoices and purchase orders. For example, insurance claim forms, credit card applications, and the like, all pose similar problems for optical scanning and automatic data processing systems. Accordingly, many different optical and document processing products have been developed or are in development for processing paper forms and similar documents to extract information for entry into back-end data processing systems. For example, the assignee of the present invention has developed computer hardware and software systems for processing forms to automatically identify, extract and perfect data for export into back-end databases or other systems, such as document or content management systems, or data processing systems. These automatic document and form processing systems are continually undergoing improvement and redesign to improve their performance and accuracy, as well as to extend them to different document processing applications. As with all engineering development programs, developing new products and approaches relies upon testing to determine whether they perform as designed and intended, and how their performance may be improved.
In order to test document processing systems and approaches that are undergoing development or improvement, a large number of samples of test documents are required. The test documents are processed by the systems, and the results of the processing are compared to the test documents to determine how well the system processed the documents. Generally, hundreds of different images of forms populated with data need to be generated to adequately test the accuracy, quality of processing, and throughput of the document processing application. Likewise, the ability of new systems and products to process images that include extraneous information and artifacts such as rotations, shifts or other marks in the document needs to be tested, and appropriate test samples are required for this purpose. For certain types of documents, a large number of test documents having the same template but different information are required. To obtain performance statistics adequate to permit accurate predictions of the performance of a document processing system, a large number of different test samples is necessary in order to derive sufficiently representative statistical information as to the system performance. For instance, given a blank template of a medical claim form, hundreds of images of the claim form with different information may need to be generated in order to test the system's ability to accurately extract information from the forms and correctly interpret the extracted information. As with any statistical process, the greater the number of samples processed, the more accurate the performance predictions.
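By way of illustration only, the relationship between the number of test samples and the accuracy of a performance prediction may be sketched as follows. The sketch assumes that each test document is simply scored as correctly or incorrectly processed, and it uses the standard normal-approximation margin of error for an observed proportion. The function name `accuracy_margin`, the 90% observed accuracy, and the sample sizes are hypothetical values chosen for the example and do not come from any particular document processing system.

```python
import math


def accuracy_margin(correct, total, z=1.96):
    """Approximate 95% margin of error for an observed accuracy.

    Uses the normal (Wald) approximation for a proportion:
    margin = z * sqrt(p * (1 - p) / n), where p = correct / total.
    The default z = 1.96 corresponds to roughly 95% confidence.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    return z * math.sqrt(p * (1 - p) / total)


if __name__ == "__main__":
    # Hypothetical test runs, each with 90% of documents processed correctly.
    for total in (50, 200, 800):
        correct = round(0.9 * total)
        margin = accuracy_margin(correct, total)
        print(f"{total:4d} samples: accuracy 0.90 +/- {margin:.3f}")
```

Under these assumptions the margin of error shrinks in proportion to the square root of the number of samples, so quadrupling the number of test documents roughly halves the uncertainty in the measured accuracy.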
Generating the large number of test samples required for adequate testing of form processing systems and applications is itself a significant and time-consuming effort. The test samples should include as many different variations in format and data as can reasonably be anticipated in use. Manually producing the large and varied set of test samples required is burdensome. Accordingly, what is needed is a system and method that affords the efficient, flexible and rapid production of a large number of different test samples of the types a document processing system is intended to process, preferably with wide variations in data format and information content. It is desirable to provide systems and methods that satisfy these objectives and address other problems of testing automatic document processing systems, and it is to these ends that the present invention is directed.