In certain areas, such as government, health care, human resources, and insurance, the daily processing of a variety of paper forms is a routine and important activity. The processing of a form often involves: the extraction of the information supplied on the form by its users; specific actions that are governed by the nature of the extracted information; and, possibly, the archiving of the extracted information and/or the form itself in a manner that facilitates subsequent use of the archived information. While all of these steps can be, and often are, performed by a human, the ability to process large numbers of forms on a timely basis by means of digital computing devices would be desirable.
One common step in the automation of forms handling is the digitization of one or more forms by means of an appropriate scanning device. The result of the scanning process is a set of information representing the digitized form. The set of information is normally a rectangular array of pixel elements of dimensions W and H, where the “width”, W, is the number of pixels in each horizontal row of the array and the “height”, H, is the number of pixels in each vertical column of the pixel array. The columns may be identified, for purposes of discussing such a set of information, by an index, I, whose values can range from 1 to W; and the rows can be identified by an index, J, whose values range from 1 to H, where W, H, I and J are integer values. If a pixel array itself is labeled as P, then the value of a pixel in the column with index I and row with index J is labeled for discussion purposes as P(I,J). The ordered pair (I,J) is sometimes called the “address” or “pixel location” of this pixel. This is illustrated in FIG. 1. FIG. 1 includes an exemplary pixel array 100 in which column 1 102, exemplary column I 104, column W 106, row 1 108, exemplary row J 110, row H 112, and exemplary pixel location (I,J) 114 are identified.
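The addressing convention above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name PixelArray and its get and set methods are hypothetical and do not come from the text, and the row-by-row storage is just one possible layout.

```python
class PixelArray:
    """Hypothetical W x H pixel array with 1-based (I, J) addressing."""

    def __init__(self, width, height, fill=0):
        self.W = width   # number of pixels in each horizontal row
        self.H = height  # number of pixels in each vertical column
        # Stored row by row: _rows[J - 1][I - 1] holds P(I, J).
        self._rows = [[fill] * width for _ in range(height)]

    def _check(self, i, j):
        # I ranges over 1..W (columns), J ranges over 1..H (rows).
        if not (1 <= i <= self.W and 1 <= j <= self.H):
            raise IndexError(f"pixel location ({i},{j}) is outside the array")

    def get(self, i, j):
        """Return P(I, J)."""
        self._check(i, j)
        return self._rows[j - 1][i - 1]

    def set(self, i, j, value):
        """Assign a value to P(I, J)."""
        self._check(i, j)
        self._rows[j - 1][i - 1] = value


p = PixelArray(width=4, height=3)
p.set(4, 3, 1)  # the pixel in column W and row H
```

Note that the column index I comes first in P(I,J), which is the reverse of the row-first order used by many array libraries.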
While the particular colors that are used on forms can vary from application to application, most forms have only two distinguishing color features, the background color and the foreground color. It is common practice to set the values of all pixels representing the background color to the number 0, as illustrated with background pixels 116 in FIG. 1, and all pixels representing the foreground color to the value 1, as illustrated with foreground pixels 118 in FIG. 1.
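A minimal sketch of this 0/1 convention follows. The text does not say how the background color is determined, so the sketch assumes, purely for illustration, that the most frequent pixel value is the background; the function name binarize is likewise hypothetical.

```python
from collections import Counter


def binarize(rows):
    """Map the background value to 0 and every other value to 1.

    Assumption (not from the text): the most frequent pixel value in
    the scan is taken to be the background color.
    """
    counts = Counter(value for row in rows for value in row)
    if not counts:
        return []  # empty input, nothing to binarize
    background = counts.most_common(1)[0][0]
    return [[0 if value == background else 1 for value in row] for row in rows]


# A tiny grayscale scan: 255 is paper, 0 is ink.
scan = [
    [255, 255, 0],
    [255, 0, 255],
]
binary = binarize(scan)  # [[0, 0, 1], [0, 1, 0]]
```

A real scan usually contains many intermediate gray levels, in which case a threshold would be needed instead of an exact match on a single background value.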
The automatic determination of the type of a filled-in form is often the most basic step after the initial digitization step. Subsequently, automatic alignment of a blank form with a filled-in version of itself can enable the separation of annotations on the filled-in form from the form itself. This is often a prelude to subsequent processing of the annotations. An automatic alignment process can also be a step in automatic matching of forms.
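Once a blank form and a filled-in form have been aligned, the separation of annotations can be sketched as a pixel-wise comparison. The sketch below assumes the two arrays are already aligned, have identical dimensions, and use 0 for background and 1 for foreground; the alignment step itself is not shown, and the function name extract_annotations is hypothetical.

```python
def extract_annotations(blank, filled):
    """Keep foreground pixels of the filled-in form that are absent from the blank form.

    Both inputs are lists of rows of 0/1 values and are assumed to be
    already aligned with identical dimensions.
    """
    same_shape = len(blank) == len(filled) and all(
        len(b) == len(f) for b, f in zip(blank, filled)
    )
    if not same_shape:
        raise ValueError("arrays must be aligned to the same dimensions first")
    return [
        [1 if f == 1 and b == 0 else 0 for b, f in zip(blank_row, filled_row)]
        for blank_row, filled_row in zip(blank, filled)
    ]


blank = [
    [1, 1, 1],
    [0, 0, 0],
]
filled = [
    [1, 1, 1],
    [0, 1, 1],
]
annotations = extract_annotations(blank, filled)  # [[0, 0, 0], [0, 1, 1]]
```

In practice scanned forms rarely line up pixel for pixel, so a comparison this strict would also report scanning noise as annotation.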
Forms that are purely digital, in the sense that they are generated and completed, or annotated, in the digital domain without being transformed into physical entities, can be recognized and analyzed using software based on template matching methods, text recognition methods, and methods particular to the form and the industry in which it is used. This is possible because form lines and information locations are well defined and not subject to the distortions that may occur from the use of paper forms, copies and/or scanning. Forms that are completed “on-line” or with word processing programs are examples of digital forms of this type. However, many forms are completed on paper and are scanned into digital form at a later time.
One method of identifying the type of a paper form which has been scanned is described by Bergelson, et al. in U.S. Pat. No. 6,697,054. The system described in U.S. Pat. No. 6,697,054 utilizes data derived from one or more identification marks made manually in pre-printed portions of the form. It compares this data with data of similar type residing in a database and on the basis of these comparisons identifies the type of the form. However, many forms in common use do not require such extraneous user input. In addition, such a requirement is subject to being ignored or completed erroneously.
A digitized embodiment of an annotated or filled-in form may differ from a digitized embodiment of the underlying blank, or unannotated, form by subtle local differences or perturbations that are not readily discernible to the human eye. Such perturbations can arise even in unannotated forms through the common processes of printing, faxing, photocopying, and handling of forms. For example, a slight misalignment of the paper form in the scanning process or a slightly warped sheet of paper in the process of printing the blank form can result in a paper form that differs little from the original when viewed by the human eye but whose digital embodiment is not bit-for-bit the same as the original form. As a further example, an insurance form transmitted by a facsimile machine to a patient may be photocopied by the patient who makes annotations to it, given to a physician who makes further annotations, and then faxed back to the insurance company. A digitization of the final form will, to the human eye, appear to have the same underlying form as the original. But a digital computer can have difficulty even in aligning the two forms.
FIGS. 2 and 3 illustrate a common difficulty that arises in form identification. FIG. 2 is an illustration 200 of an example of the digital representation of a blank form; its dimensions comprise a width of 3206 pixels and a height of 2467 pixels. FIG. 3 is an illustration 300 of an example of the result of at least the following operations having been applied to FIG. 2: (1) print it to paper, (2) fill it in with pen or pencil, and (3) scan the filled-in form. The dimensions of FIG. 3 comprise a width of 3237 pixels and a height of 2469 pixels. Notice that the operations have caused the dimensions of these forms to differ and that FIG. 3 contains additional noise of unknown origin, possibly caused by the additional steps of photocopying or faxing.
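The size of the mismatch can be made concrete with a short calculation on the dimensions quoted for FIG. 2 and FIG. 3. The function name dimension_delta is hypothetical and serves only as an illustration.

```python
def dimension_delta(blank_size, filled_size):
    """Return (width difference, height difference) in pixels."""
    blank_w, blank_h = blank_size
    filled_w, filled_h = filled_size
    return (filled_w - blank_w, filled_h - blank_h)


fig2 = (3206, 2467)  # width, height of the blank form
fig3 = (3237, 2469)  # width, height of the scanned, filled-in form

delta = dimension_delta(fig2, fig3)  # (31, 2)
```

Because the two arrays differ by 31 pixels in width and 2 pixels in height, a direct comparison of P(I,J) at each pixel location is not even defined over the whole form.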
Despite previous attempts to automate form recognition, there remains a need for form identification methods that can be applied to the diverse varieties of forms currently in use and that can be used to identify, in an automated manner, which, if any, form in a database of blank forms is the form underlying a given filled-in, or annotated, paper form.