1. Technical Field
The invention disclosed broadly relates to data processing and more particularly relates to forms recognition of document forms.
2. Related Patents and Patent Applications
This patent application is related to the U.S. Pat. No. 5,251,273 entitled "Data Processing System and Method for Sequentially Repairing Character Recognition Errors for Scanned Images of Document Forms," by T. S. Betts, et al., the application being assigned to the IBM Corporation and incorporated herein by reference.
This patent application is also related to the U.S. Pat. No. 5,305,396 entitled "Data Processing System and Method for Selecting Customized Character Recognition Processes and Coded Data Repair Processes for Scanned Images of Document Forms," by T. S. Betts, et al., the application being assigned to the IBM Corporation and incorporated herein by reference.
This patent application is also related to U.S. Pat. No. 5,140,650, entitled "A Computer Implemented Method for Automatic Extraction of Data From Printed Forms," by R. G. Casey, et al., the patent being assigned to the IBM Corporation and incorporated herein by reference.
This patent application is also related to the U.S. Pat. No. 5,455,872 entitled "System and Method for Enhanced Character Recognition Accuracy by Adaptive Probability Weighting," by M. P. T. Bradley, the application being assigned to the IBM Corporation and incorporated herein by reference.
This patent application is also related to copending U.S. patent application by D. W. Billings, et al. entitled "Method for Defining a Plurality of Form Definition Data Sets," Ser. No. 08/100,846, filed Aug. 2, 1993, now pending, the application being assigned to the IBM Corporation and incorporated herein by reference.
3. Background Art
The referenced Billings, et al. patent application describes how forms are created by a forms definition utility program. A forms definition data set, which characterizes the preprinted background of the master form, is prepared at a data center. The forms definition data set is associated with a form ID. Copies of the master form are distributed to persons who will fill out the fields, entering data by hand or by typewriter. The completed forms are returned to the data center and are scanned into the system. To accommodate sessions of high volume scanning of submitted forms, the image of each form is compressed and buffered until there is an opportunity to continue its processing.
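By way of illustration only, the relationship between a forms definition data set, its form ID, and the buffering of compressed form images might be sketched as follows. The names FieldDef, FormDefinition and ScanBuffer are hypothetical, and zlib is merely a stand-in, since the compression scheme is not specified here.

```python
import zlib
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FieldDef:
    """Location of one fill-in field on the master form (hypothetical layout)."""
    name: str
    top: int
    left: int
    height: int
    width: int


@dataclass
class FormDefinition:
    """Forms definition data set: preprinted background plus field locations."""
    form_id: str
    width: int
    height: int
    background: bytes  # assumed encoding: one byte per pixel, 1 = ink, row-major
    fields: list = field(default_factory=list)


class ScanBuffer:
    """FIFO buffer holding compressed page images until processing can resume."""

    def __init__(self):
        self._queue = deque()

    def put(self, raw_image: bytes) -> None:
        # Compress at scan time so high-volume scanning is not held up.
        self._queue.append(zlib.compress(raw_image))

    def get(self) -> bytes:
        # Decompress the oldest buffered image when processing continues.
        return zlib.decompress(self._queue.popleft())

    def __len__(self) -> int:
        return len(self._queue)
```

In this sketch, each scanned page would be passed to put() during a scanning session and retrieved later with get() once the downstream recognition steps are free.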
The compressed form is then decompressed, and the image of the completed form is subjected to a forms recognition program to identify the ID of the form. Once identified, the master form definition data set can be accessed. This enables the system to locate the fields on the form and subtract out the preprinted background of the form. The extracted field images can then be presented to a character recognition program, which analyzes them and outputs alphanumeric strings representing the images of the data in the fields. If there are suspicious characters or errors in the recognition process, the character recognition program will also output error statistics.
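The sequence just described (identify the form ID, access the master form definition, subtract the preprinted background, and extract the field images) can be sketched as follows. This is a minimal illustration under stated assumptions: images are small binary arrays, the pixel-mismatch count used for recognition is a simple stand-in and not any particular conventional forms recognition algorithm, and all names (MASTERS, FILLED, recognize_form, and so on) are hypothetical.

```python
def mismatch(image, background):
    """Count pixels that differ between two equal-sized binary images."""
    return sum(a != b
               for row_i, row_b in zip(image, background)
               for a, b in zip(row_i, row_b))


def recognize_form(image, masters):
    """Return the form ID whose master background best matches the image."""
    return min(masters, key=lambda fid: mismatch(image, masters[fid]["background"]))


def subtract_background(image, background):
    """Keep only the ink that is not part of the preprinted background."""
    return [[1 if a and not b else 0 for a, b in zip(row_i, row_b)]
            for row_i, row_b in zip(image, background)]


def extract_fields(image, definition):
    """Return a field-name -> field-image mapping for one completed form."""
    cleaned = subtract_background(image, definition["background"])
    return {name: [row[left:left + w] for row in cleaned[top:top + h]]
            for name, (top, left, h, w) in definition["fields"].items()}


# Hypothetical master definitions; fields are (top, left, height, width).
MASTERS = {
    "A": {"background": [[1, 1, 1, 1, 1, 1],
                         [1, 0, 0, 0, 0, 1],
                         [1, 0, 0, 0, 0, 1],
                         [1, 1, 1, 1, 1, 1]],
          "fields": {"name": (1, 1, 2, 4)}},
    "B": {"background": [[1, 0, 0, 0, 0, 0],
                         [1, 0, 0, 0, 0, 0],
                         [1, 0, 0, 0, 0, 0],
                         [1, 0, 0, 0, 0, 0]],
          "fields": {"date": (0, 1, 4, 5)}},
}

# A completed copy of form "A" with three pixels of filled-in data.
FILLED = [[1, 1, 1, 1, 1, 1],
          [1, 0, 1, 1, 0, 1],
          [1, 0, 1, 0, 0, 1],
          [1, 1, 1, 1, 1, 1]]

form_id = recognize_form(FILLED, MASTERS)
field_images = extract_fields(FILLED, MASTERS[form_id])
```

The extracted field images in field_images are what would then be presented to the character recognition program. Note that if recognize_form returned the wrong ID, the wrong background would be subtracted and the wrong field locations used.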
Several problems surround conventional forms recognition techniques. First, forms recognition is slow compared to the other steps in forms processing. Many conventional forms recognition algorithms, such as line geography techniques, are slow. Decompressing the compressed image is also slow, since conventional techniques analyze the entire page of the form and therefore require the entire form to be decompressed. In addition, significant problems arise from the condition of the completed forms themselves. Many submitters fold, spindle, staple or otherwise mutilate the form, and each artifact will appear in the image of the form. Such artifacts reduce the accuracy of forms recognition, and occasionally the wrong ID is attributed to a completed form image. Such an error will not be apparent until the character recognition program returns high error statistics for the form.