Pre-printed forms are used in many applications. For some applications, such as medical claim forms, the forms are used in such great numbers that computerized "reading" of the forms is not only desirable but essential. Form documents are typically pre-printed sheets of paper that have blanks or open areas on them where information is to be supplied by the person completing the form. This information, referred to herein as "user data," may be entered by hand or by a printing device such as a printer or typewriter. Other examples of commonly used forms include shipping documents, purchase orders, insurance records, and so forth.
To facilitate interpretation of these forms and the retrieval of user data from them, it is necessary to distinguish and separate the user data from the information that was previously printed on the form. For example, the form may have been pre-printed with instructions, boxes to be filled in, and other markings. Removal of the pre-printed markings before attempting to "read" the user data, for example using optical character recognition (OCR) systems, is highly desirable.
If the pre-printed form is well defined, for example in a computer data file, or if a clean, blank pre-printed form is available for scanning, one can take steps to "subtract" the pre-printed indicia from a completed form, leaving only the user data for subsequent interpretation. This approach would work well in theory, but in practice there are often conflicts between the pre-printed markings and the user data. Such conflicts occur wherever the user data overlaps the pre-printed markings. Digital data corresponding to these common areas are referred to later as "shared pixels." Shared pixels can arise when a line or box on the form intersects the data to be extracted, for example as illustrated in FIG. 1 (further described below). Shared pixels also arise where pre-printed text, such as a zone description or other instructions on a form, intersects the data to be extracted. For example, referring to FIG. 2, one portion of a pre-printed form includes the zone description "patient's name." In the figure, the patient's last name "MOSER" was subsequently printed or typed onto the form in a location that overlaps the zone description. This makes it difficult to "read" the user data, especially using automated OCR techniques.
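The conflict described above can be illustrated with a minimal sketch. It assumes binary images that are already aligned and stored as lists of rows (1 = ink, 0 = background). The function name `subtract_form` and the sample data are hypothetical and are not taken from any of the cited patents.

```python
def subtract_form(blank, filled):
    """Clear every pixel of `filled` that is inked on the blank form.

    Both arguments are aligned binary images of equal size
    (lists of rows, 1 = ink, 0 = background). Returns a new image.
    """
    return [[f & (1 - b) for b, f in zip(blank_row, filled_row)]
            for blank_row, filled_row in zip(blank, filled)]


# Hypothetical data: a pre-printed horizontal line (row 1) crossed by a
# vertical user stroke (column 2).
blank = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
filled = [
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]

residue = subtract_form(blank, filled)
for row in residue:
    print(row)
```

In this example the pixel at row 1, column 2 belongs to both the form line and the user stroke. Plain subtraction removes it along with the line, so the recovered stroke is left with a gap at the shared pixel.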
One solution to this problem has been to employ a particular printing color for the form and a corresponding color filter for scanning the form after it has been completed. In such situations, the ink used to complete the form, in other words to enter user data, is a different color from the ink used to print the form. During scanning of the completed form, a matched color filter is used to block the pre-printed markings from being scanned. This technique is functional, but it is severely limited because form suppliers, or businesses that use forms, are forced to employ specially selected types of ink and colors, as well as scanners that are specially adapted to filter selected ink colors. It would be desirable to be able to extract user data from a pre-printed form without restricting the ink colors that are used either on the form or to enter user data.
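The effect of a matched color filter can be approximated in software as a rough sketch. The example assumes an idealized red filter, so that the scanner effectively sees only the red channel of each pixel, and it assumes that user ink printed over red form ink appears dark. The function name, threshold, and pixel values are hypothetical.

```python
def scan_through_red_filter(rgb_image, ink_threshold=128):
    """Return a binary image (1 = ink) as seen through an idealized red filter.

    `rgb_image` is a list of rows of (r, g, b) tuples with values 0..255.
    Red form ink has a high red component, like white paper, so it drops out.
    """
    return [[1 if r < ink_threshold else 0 for (r, g, b) in row]
            for row in rgb_image]


WHITE = (255, 255, 255)
RED = (255, 0, 0)      # assumed color of the pre-printed form
BLUE = (0, 0, 255)     # assumed color of the user's ink
OVERLAP = (0, 0, 0)    # assumption: blue ink over red ink appears dark

# A red form line (row 1) crossed by a blue user stroke (column 1).
page = [
    [WHITE, BLUE, WHITE],
    [RED, OVERLAP, RED],
    [WHITE, BLUE, WHITE],
]

scanned = scan_through_red_filter(page)
for row in scanned:
    print(row)
```

Under these assumptions the form line disappears while the user stroke survives intact, including the pixel where the two overlap. This only works because the ink colors were chosen in advance, which is the restriction noted above.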
U.S. Pat. No. 5,694,494 (Hart, et al.) entitled "electronic retrieval of information from form documents" describes a method for retrieving user data from a scanned version of a completed document. That method includes the step of obtaining a first image of the document having information printed thereon in its blank format, before other information has been added to it by the user. A second image of the document is obtained after information has been added to it by the user. The two images are aligned, and each pixel in the first image that corresponds to information on the second document is deleted from the second image, creating an image that corresponds to subtraction of the first image from the second image. Finally, a step is performed to electronically restore the information added by the user that was deleted during the subtraction operation. See abstract. While the described "subtraction" step is not difficult, restoring missing data is quite challenging. The methods described in the '494 patent for restoring missing user data require extensive analysis and calculations, and therefore can be expected to be relatively slow in operation.
Another method for "restoration of images with undefined pixel values" is described in U.S. Pat. No. 5,623,558 (Billawala, et al.). That patent describes a method for using a threshold value and a "neighborhood configuration" to restore an image. According to the patent abstract, "the neighborhood configuration defines a geometric region, typically a fixed number of pixels, surrounding the target pixel. The threshold value specifies a number of pixels in the neighborhood configuration for which pixel values are known. In our [that] system, for each pixel in one of the unknown regions, an analysis is performed over the entire area defined by the neighborhood configuration. If the threshold number of pixels within that region is known, then the value of the unknown pixel is calculated. If the threshold value is not achieved, then analysis proceeds to the next pixel location. By continuing the process and reducing the threshold value when necessary or desirable, the complete image can be restored." See abstract.
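The threshold-and-neighborhood idea quoted above can be sketched roughly as follows. The quoted abstract does not say how the value of an unknown pixel is calculated, so this sketch assumes a simple majority vote over the known neighbors in a square window. The function name `restore`, the `radius` and `threshold` parameters, and the sample image are all hypothetical, and the code is not the method of the '558 patent.

```python
def restore(image, radius=1, threshold=8):
    """Fill unknown pixels (None) in a binary image given as a list of rows.

    A pixel is filled only when at least `threshold` pixels in its
    (2*radius+1)-square neighborhood are known. When a pass fills nothing,
    the threshold is reduced by one. The fill value is a majority vote of
    the known neighbors (an assumption made for this sketch).
    """
    img = [row[:] for row in image]  # work on a copy
    h, w = len(img), len(img[0])
    while threshold >= 1 and any(v is None for row in img for v in row):
        updates = {}
        for y in range(h):
            for x in range(w):
                if img[y][x] is not None:
                    continue
                known = [img[j][i]
                         for j in range(max(0, y - radius), min(h, y + radius + 1))
                         for i in range(max(0, x - radius), min(w, x + radius + 1))
                         if img[j][i] is not None]
                if len(known) >= threshold:
                    updates[(y, x)] = 1 if 2 * sum(known) > len(known) else 0
        if updates:
            for (y, x), value in updates.items():
                img[y][x] = value
        else:
            threshold -= 1  # not enough known neighbors anywhere: relax
    return img


# Hypothetical data: a two-pixel-wide vertical stroke with an unknown gap.
image = [
    [0, 1, 1, 0],
    [0, None, None, 0],
    [0, 1, 1, 0],
]
restored = restore(image)
for row in restored:
    print(row)
```

With these inputs each unknown pixel has only seven known neighbors, so the first pass at a threshold of eight fills nothing. The threshold is then reduced to seven, and both pixels are filled as ink on the next pass.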
In view of the foregoing background and brief summary of the prior art, the need remains for a method that is reliable, simple to implement, and fast in operation to separate and remove pre-printed markings, such as zone descriptions, from a completed form.
It is also well known to provide defined spaces, called constraint boxes, for users to write in on a pre-printed form. Frequently, one box is provided for each character to be entered by hand and later recognized by a computer. Constraining the character locations is very helpful to the OCR process. For example, the computer can assume that what appears within a single constraint box is indeed a single character, thereby simplifying the recognition problem. Still, the problem arises that users often fail to constrain their handprint within the spaces provided. User markings (data) often encroach into or across the constraint boxes. Since the constraint boxes are part of the pre-printed form, the problem of separating user data from the blank form arises again.
One solution in the prior art is to print the form, and more specifically the constraint boxes, using "dropout ink," that is, ink of a color that is filtered out during scanning as described above. Use of dropout ink enhances readability for handprint and machine-printed data as well. However, it requires special inks to prepare the form and special equipment to drop out the form during scanning. The need remains for a way to separate printed forms, including constraint boxes, from user data in order to recover the user data without requiring special inks or special scanning equipment.