The present invention relates generally to document image processing, and specifically to methods for recognition of preprinted form documents and extraction of information that is filled into them.
In many document imaging systems, large numbers of forms are scanned into a computer, which then processes the resultant document images to extract pertinent information. Typically, the forms comprise preprinted templates, containing fields that have been filled in by hand or with machine-printed characters. To extract the information that has been filled in, the computer must first identify the template. Various methods of image analysis are known in the art for these purposes. One such method is described in U.S. Pat. No. 5,434,933, whose disclosure is incorporated herein by reference.
In order to precisely identify the location of fields in the template, a common technique is for the computer to register each document image with a reference image of the template. Once the template is registered, it can be dropped from the document image, leaving only the handwritten or printed characters in their appropriate locations on the page. For example, U.S. Pat. Nos. 5,182,656, 5,191,525, 5,793,887, and 5,631,984, whose disclosures are incorporated herein by reference, describe methods for registering a document image with a form template so as to extract the filled-in information from the form. After drop-out of the template, the characters remaining in the image are typically processed using optical character recognition (OCR) or other techniques known in the art. Dropping the template from the document image is also crucial in compressing the image, substantially reducing the volume of memory required to store the image and the bandwidth required to transmit it. For example, U.S. Pat. No. 6,020,972, whose disclosure is incorporated herein by reference, as well as the above-mentioned U.S. Pat. No. 5,182,656, describe methods for compressing document images based on template identification. The template itself need be stored and/or transmitted only once for an entire group of images of filled-in forms.
Methods of template registration and drop-out that are known in the art generally require the template to be known before compression or other processing can take place. The computer must be informed of the template type or be able to select the template from a collection of templates that are known in advance. In other words, the computer must have on hand the appropriate empty template for every form type that it processes. However, it frequently happens that not all templates or template variations are known at start-up. Furthermore, experience shows that in most systems, there is not a single template for all form types, but rather several, and that unexpected template variations may occur that cannot be distinguished by any combination of the global features that are currently used for form recognition. In the context of the present patent application and in the claims, such template variants are referred to as "mutants."
Thus, in form processing systems known in the art, it is generally not possible to use template drop-out in the presence of such mutants, without costly involvement by a human operator in identifying the template to use for each form.
In preferred embodiments of the present invention, a document image processing system receives images of filled-in forms, at least some of which are based on templates that are not known in advance. The system automatically aligns and sorts these images into groups having similar template features, using any suitable method known in the art. Each such group, however, may contain multiple mutant templates, differing in one or more of their features. The present invention provides novel methods for recognizing these mutants and sorting the images in each group accordingly into precise sub-groups, or classes, each with its own mutant template. Preferably, the mutant template in each class is then extracted and dropped out of the images, thus enabling optimal image compression and other subsequent processing.
In order to distinguish the mutants in a given group one from another, the system preferably generates a gray-scale accumulation image by combining the images in the group. This accumulation image is then analyzed in order to distinguish areas that belong to the template common to all of the images from areas in which variations occur from image to image. These variations are further analyzed to determine, in each area, whether they are due to mutations of the template or to content filled into the individual forms. When it is determined that the variation in a given area is due to template mutation, the images in the group are sorted into mutant sub-groups according to their content in this area, which is referred to herein as a reference area. Typically, a sub-group created by sorting the original group on one reference area may then be subdivided into smaller sub-groups by sorting it on another reference area. This sorting process preferably continues until substantially all of the images have been divided into mutant sub-groups, each having its own template that is common to all of the images in the sub-group.
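By way of illustration only, the generation of a gray-scale accumulation image from a group of mutually aligned images might be sketched as follows. The sketch assumes binary images held as nested lists and a simple per-pixel average; the function name `accumulate` and the data layout are hypothetical and are not taken from the embodiments described herein.

```python
from typing import List

Image = List[List[int]]  # aligned binary image: 1 = ink, 0 = background


def accumulate(images: List[Image]) -> List[List[float]]:
    """Average aligned binary images pixel by pixel.

    Each output value is the fraction of images in which the pixel is ink:
    values near 1.0 suggest a part common to all images, intermediate
    plateaus suggest a part shared by a sub-group only, and low scattered
    values suggest individually filled-in content.
    """
    if not images:
        raise ValueError("at least one image is required")
    height, width = len(images[0]), len(images[0][0])
    totals = [[0] * width for _ in range(height)]
    for image in images:
        for y in range(height):
            for x in range(width):
                totals[y][x] += image[y][x]
    return [[value / len(images) for value in row] for row in totals]
```

In such an average, a pixel that is dark in every image would have the value 1.0, while a pixel that is dark in only half of the images would have the value 0.5, which is the kind of intermediate level that may point to a template mutation rather than to filled-in content.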
Preferably, after the sorting is completed, the respective template for each sub-group is extracted from one of the images and is dropped out of all of the images in the sub-group. The images are then automatically processed by compression, OCR and/or other document processing methods known in the art. Preferably, the extracted template is stored in a library for use in processing subsequent forms. The ability provided by preferred embodiments of the present invention to recognize and sort all mutants allows the images to be processed efficiently, reducing both the required storage volume and the costs of manual processing in dealing with large numbers of forms.
Although the preferred embodiments described herein relate to processing of images of form documents, the principles of the present invention may similarly be applied in extracting information from groups of images of other types, in which the images in a group contain a common, substantially fixed part along with individual, variable parts.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for processing images, including:
receiving a group of the images having similar characteristics, the group including multiple classes, such that each image belongs to one of the classes and includes a fixed portion common to all of the images in the class to which it belongs, and a variable portion, which distinguishes the image from the other images in the class;
finding a reference area in the images, in which the fixed portion of the images in a first one of the classes differs consistently from the fixed portion of the images in a second one of the classes; and
sorting the images into the classes based on the reference area.
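The sorting step recited above may be pictured, in simplified form, as grouping the images according to their content in the reference area. The following sketch assumes that the reference area has already been found, that the images are aligned binary images, and that exact pixel equality is an adequate comparison; a practical system would presumably use a tolerant comparison. The names `crop` and `sort_by_reference_area` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Image = List[List[int]]            # aligned binary image: 1 = ink, 0 = background
Area = Tuple[int, int, int, int]   # (top, left, bottom, right), end-exclusive
Patch = Tuple[Tuple[int, ...], ...]


def crop(image: Image, area: Area) -> Patch:
    """Return the pixels of `image` inside `area` as a hashable tuple of rows."""
    top, left, bottom, right = area
    return tuple(tuple(row[left:right]) for row in image[top:bottom])


def sort_by_reference_area(images: List[Image], area: Area) -> List[List[int]]:
    """Group image indices by their content in the reference area.

    Exact equality is an assumption of this sketch, made for brevity.
    """
    classes: Dict[Patch, List[int]] = defaultdict(list)
    for index, image in enumerate(images):
        classes[crop(image, area)].append(index)
    return list(classes.values())
```

Each returned list holds the indices of the images that share the same content in the reference area, so that each list corresponds to one candidate class.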
Preferably, receiving the group of the images includes processing the images to determine the characteristics thereof, and selecting the images for inclusion in the group by finding a similarity in the characteristics.
Further preferably, the characteristics include image features recognizable by a computer, and receiving the group of the images includes mutually aligning the images in the group responsive to the features. In a preferred embodiment, the images include images of form documents, the fixed portion of the images includes form templates, and the features include features of the templates.
Preferably, finding the reference area includes:
classifying a plurality of areas of the images into areas of a first type, in which substantially all of the images in the group are substantially the same, a second type, in which a sub-group of the images in the group, but not all of the images in the group, are substantially the same, and a third type, in which substantially all of the images in the group are different; and
choosing one of the areas of the second type to use as the reference area.
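The three area types recited above can be illustrated by counting how many images share identical content in a given area. This is a simplified sketch only: it compares the images directly rather than analyzing an accumulation image, it uses exact equality where a practical system would need a tolerance, and the parameter `min_subgroup` is a hypothetical guard against coincidental matches.

```python
from collections import Counter
from typing import List, Tuple

Image = List[List[int]]            # aligned binary image: 1 = ink, 0 = background
Area = Tuple[int, int, int, int]   # (top, left, bottom, right), end-exclusive


def crop(image: Image, area: Area) -> Tuple[Tuple[int, ...], ...]:
    """Return the pixels of `image` inside `area` as a hashable tuple of rows."""
    top, left, bottom, right = area
    return tuple(tuple(row[left:right]) for row in image[top:bottom])


def classify_area(images: List[Image], area: Area, min_subgroup: int = 2) -> int:
    """Return 1, 2, or 3 according to how widely the area's content is shared."""
    counts = Counter(crop(image, area) for image in images)
    largest = max(counts.values())
    if largest == len(images):
        return 1  # first type: the same in all images
    if largest >= min_subgroup:
        return 2  # second type: the same within a sub-group only
    return 3      # third type: different from image to image
```

An area returned as the second type is a candidate reference area; deciding whether its variation is due to a template mutation or to filled-in content would require the further analysis described herein, which this sketch does not attempt.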
Further preferably, classifying the plurality of the areas includes combining the images in the group to generate an accumulation image, and analyzing the accumulation image to find the areas of the second type. Most preferably, analyzing the accumulation image includes calculating, for each of the areas in the accumulation image, an absolute threshold, indicative of a difference between bright and dark parts of the area, and a contrast threshold, indicative of a minimum significant difference between neighboring pixels in the area, and identifying as areas of the second type the areas that have a high ratio of the absolute threshold to the contrast threshold, relative to other areas of the accumulation image. Additionally or alternatively, choosing the area of the second type to use as the reference area includes generating a match score for each of the areas in the accumulation image by comparing the areas to corresponding areas in the images in the group, and selecting the one of the areas having the highest match score.
Further additionally or alternatively, sorting the images includes selecting one of the images in the sub-group as a base image and removing from the sub-group the images in the group that differ in the reference area from the base image, and repeating over the images in the sub-group the steps of classifying the plurality of the areas and choosing one of the areas of the second type so as to find a new reference area, and sorting the images in the sub-group based on the new reference area. Preferably, the steps of classifying the plurality of the areas, choosing one of the areas of the second type, and removing the images from the sub-group are repeated until substantially no remaining areas of the second type are found in the sub-group of the sorted images.
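The repeated splitting described above might be sketched as a loop that picks a base image, separates the images that differ from it in the reference area, and continues on each resulting sub-group until no area of the second type remains. The sketch assumes that the candidate areas passed in have already been screened to exclude filled-in fields, and it uses exact pixel equality; the names `find_reference_area` and `sort_into_classes` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

Image = List[List[int]]            # aligned binary image: 1 = ink, 0 = background
Area = Tuple[int, int, int, int]   # (top, left, bottom, right), end-exclusive


def crop(image: Image, area: Area) -> Tuple[Tuple[int, ...], ...]:
    """Return the pixels of `image` inside `area` as a hashable tuple of rows."""
    top, left, bottom, right = area
    return tuple(tuple(row[left:right]) for row in image[top:bottom])


def find_reference_area(images: List[Image], areas: List[Area]) -> Optional[Area]:
    """Return the first area shared by a sub-group but not by all images."""
    for area in areas:
        largest = max(Counter(crop(image, area) for image in images).values())
        if 1 < largest < len(images):
            return area
    return None


def sort_into_classes(images: List[Image], areas: List[Area]) -> List[List[int]]:
    """Split image indices into sub-groups until no reference area remains."""
    finished: List[List[int]] = []
    pending: List[List[int]] = [list(range(len(images)))]
    while pending:
        group = pending.pop()
        area = find_reference_area([images[i] for i in group], areas)
        if area is None:
            finished.append(group)  # no second-type area left in this sub-group
            continue
        base = crop(images[group[0]], area)  # first image serves as base image
        same = [i for i in group if crop(images[i], area) == base]
        rest = [i for i in group if crop(images[i], area) != base]
        pending.extend([rest, same])  # both parts are examined again
    return finished
```

Because a reference area is chosen only when at least two distinct contents occur in it, each split yields two non-empty, strictly smaller sub-groups, so the loop terminates.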
In a preferred embodiment, the images include images of form documents, and the fixed portion includes a form template, and the areas of the second type include areas in which the template of the images in the sub-group differs from the template of the images that are not in the sub-group.
Preferably, finding the reference area includes finding a first reference area so as to separate a first sub-group of the images, containing the first one of the classes, from a second sub-group of the images, containing the second one of the classes, based on the first reference area, and sorting the images includes finding a further reference area in the images of the first sub-group, and sorting the images in the first sub-group based on the further reference area.
In a preferred embodiment, the images include images of form documents, and the fixed portion includes a form template, and the variable portion includes characters filled into the template, and sorting the images includes grouping the documents such that all of the documents in each of the classes have substantially the same template. Preferably, the method includes extracting the template from the images in one of the classes by finding a substantially invariant portion of the images in the class. Additionally or alternatively, the method includes processing the images so as to remove the template therefrom, while leaving the filled-in characters in the images.
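As a purely illustrative sketch of template extraction and drop-out within a single class, the substantially invariant portion may be approximated by the pixels that are dark in nearly all images of the class. The threshold `min_fraction` and the names `extract_template` and `drop_template` are assumptions of this sketch and are not values or methods stated herein.

```python
from typing import List

Image = List[List[int]]  # aligned binary image: 1 = ink, 0 = background


def extract_template(images: List[Image], min_fraction: float = 0.9) -> Image:
    """Mark pixels that are ink in at least `min_fraction` of the images."""
    if not images:
        raise ValueError("at least one image is required")
    height, width = len(images[0]), len(images[0][0])
    needed = min_fraction * len(images)
    return [
        [1 if sum(image[y][x] for image in images) >= needed else 0
         for x in range(width)]
        for y in range(height)
    ]


def drop_template(image: Image, template: Image) -> Image:
    """Erase template pixels from `image`, keeping the remaining content."""
    return [
        [0 if t else p for p, t in zip(image_row, template_row)]
        for image_row, template_row in zip(image, template)
    ]
```

This naive drop-out also erases filled-in strokes that happen to overlap template pixels; a practical system would likely need to repair such strokes, which is outside the scope of this sketch.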
Further additionally or alternatively, the method includes removing the fixed portion from the images in the first one of the classes after sorting the images, and compressing the variable portion of each of the images that remains after removal of the fixed portion.
There is also provided, in accordance with a preferred embodiment of the present invention, apparatus for processing images, including an image processor, which is arranged to receive a group of the images having similar characteristics, the group including multiple classes, such that each image belongs to one of the classes and includes a fixed portion common to all of the images in the class to which it belongs, and a variable portion, which distinguishes the image from the other images in the class, and to find a reference area in the images, in which the fixed portion of the images in a first one of the classes differs consistently from the fixed portion of the images in a second one of the classes, and to sort the images into the classes based on the reference area.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a computer software product, including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive a group of images having similar characteristics, the group including multiple classes, such that each image belongs to one of the classes and includes a fixed portion common to all of the images in the class to which it belongs, and a variable portion, which distinguishes the image from the other images in the class, and to find a reference area in the images, in which the fixed portion of the images in a first one of the classes differs consistently from the fixed portion of the images in a second one of the classes, and to sort the images into the classes based on the reference area.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which: