In many document imaging systems, large numbers of forms are scanned into a computer, which then processes the resultant document images to extract pertinent information. Typically the forms comprise pre-printed templates, containing fields that have been filled in by hand or with machine-printed characters. To extract the information that has been filled in, the computer must first identify the fields of the template and then decipher the characters appearing in the fields. Various methods of image analysis and optical character recognition (OCR) are known in the art for these purposes.
In order to identify the fields of the template and assign the characters to the correct fields, a common technique is for the computer to register each document image with a reference image of the template. Once the document image is registered with the template, the template can be dropped from the image, leaving only the handwritten or printed characters in their appropriate locations on the page. For example, U.S. Pat. Nos. 5,182,656, 5,191,525 and 5,793,887, whose disclosures are incorporated herein by reference, describe methods for registering a document image with a form template so as to extract the filled-in information from the form. Once the form is accurately registered with the known template, it is a simple matter for the computer to assign the fill-in characters to the appropriate fields. Dropping the template from the document image also substantially reduces the volume of memory required to store the image.
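The template drop-out and field assignment described above can be illustrated with a minimal sketch. It is not the method of the cited patents: it assumes the document image has already been registered with the template, that both are binary images of identical size (1 for ink, 0 for background), and that each field is given as a bounding box. The names `drop_template`, `ink_per_field`, and the sample fields `name` and `date` are hypothetical, chosen for illustration only.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]           # binary image: 1 = ink, 0 = background
Box = Tuple[int, int, int, int]   # (top, left, bottom, right); bottom/right exclusive


def drop_template(document: Image, template: Image) -> Image:
    """Clear every document pixel that is ink in the registered template.

    Assumes the two images are already aligned pixel-for-pixel.
    """
    return [
        [d if t == 0 else 0 for d, t in zip(doc_row, tpl_row)]
        for doc_row, tpl_row in zip(document, template)
    ]


def ink_per_field(filled: Image, fields: Dict[str, Box]) -> Dict[str, int]:
    """Count the remaining ink pixels inside each field's bounding box."""
    counts = {}
    for name, (top, left, bottom, right) in fields.items():
        counts[name] = sum(sum(row[left:right]) for row in filled[top:bottom])
    return counts


# Hypothetical 4x6 template: two ruled lines with blank fill-in rows beneath them.
template = [
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
# The same form after fill-in: extra ink appears in rows 1 and 3.
document = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 0],
]
# Field locations known in advance from the template definition (assumed).
fields = {"name": (1, 0, 2, 6), "date": (3, 0, 4, 6)}

filled = drop_template(document, template)
field_ink = ink_per_field(filled, fields)
```

After drop-out, `filled` contains only the fill-in ink, and `field_ink` attributes that ink to the fields by location. A simple pixel-wise drop such as this one also erases any fill-in strokes that overlap the template lines, so a practical system would need additional handling for that case.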
Methods of automatic form processing known in the art, such as those described in the above-mentioned patents, assume as their point of departure that the form template is known in advance, or at least can be selected by the computer from a collection of templates that are known in advance. In other words, the computer must have on hand the appropriate empty template for every form type that it processes, together with a definition of the locations and content of all of the fields in the form. This information is typically input to the computer by an expert operator before starting up processing operations. In large-scale form-processing applications, however, it frequently happens that not all templates or template variations are known at start-up, or that unexpected variations occur. The variant forms are rejected by the computer and must be passed to manual processing, either for manual key-in of the data or to train the computer to deal with the new templates. Needless to say, any involvement by a human operator increases the cost and time required for processing, as well as increasing the likelihood of errors.