Forms processing is a major commercial activity in which hand-printed or typed data is automatically extracted from electronically scanned documents. The high cost of manual data entry has spurred the development of automatic systems to extract the data. A convenient way to capture data electronically is to let users write their responses on a form and then recognize the data optically using OCR techniques. Examples include Census and IRS tax forms. Typically, the data is keyed in by an operator, but there is a growing effort to reduce costs and increase speed by scanning form documents and recognizing the hand-written text. For an example of such processes, refer to an article by Breuel, T. M., entitled "Recognition of handwritten responses on U.S. census forms," Document Analysis Systems, A. Lawrence Spitz, Andreas Dengel, Editors, World Scientific Publishing, 237-264 (1994). To increase recognition accuracy, forms are often constructed as an arrangement of boxes in which the user enters characters. The boxes force the writer to clearly separate characters, obviating the need for character segmentation. False segmentation is a major contributor to recognition errors. Extraction of the hand-printed data is difficult unless one uses a drop-out color, perhaps blue. This requires the forms to be printed in multiple colors. Another approach is to use black boxes, which have the advantage of being easily and cheaply produced, but one must then perform sophisticated image processing steps to remove the boxes, as shown by Ramanaprasad, V., Shin, Y-C., and Srihari, S. N., in "Reading hand-printed addresses on IRS tax forms," in Document Recognition III, Luc M. Vincent, Jonathan J. Hull, Editors, Proc. SPIE 2660, 243-250 (1996). This may include registering the original blank form with the scanned image and subtracting the form image, leaving only the characters.
Unfortunately, if a character stroke overlaps the box, that character stroke would also be removed. This can severely reduce character recognition accuracy. Central to cost reduction is highly accurate optical character recognition of the data entries because the more errors there are, the more human effort must be expended to find and correct them.
One major source of errors is merged characters, which cause ambiguity in interpretation. Forms are therefore designed to force the writer to separate each character, because isolated characters are recognized more accurately, as described by M. D. Garris and D. L. Dimmick, "Form design for high accuracy optical character recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 18, No. 6, June 1996, pp. 653-656. Thus forms consist of any number of frames, i.e., designated areas for data entry. Another source of errors is the overlap of entries with the frames used to guide the writer, because the frame obscures parts of the characters, as shown by M. D. Garris. There are two solutions to this problem in the prior art: 1) print the form in a dropout color different from the color used to fill in the form, so that the form and the filled-in data can be distinguished optically, and 2) remove the frames algorithmically from a scanned form image and algorithmically repair the damage made to the filled-in characters. See M. D. Garris, "Intelligent form removal with character stroke preservation," Proc. SPIE Vol 2660, 1996, pp. 321-332, and B. Yu and A. K. Jain, "A generic system for form dropout," IEEE Transactions on Pattern Recognition, Vol 18, No. 11, November 1996, pp. 1127-1134, both of which describe this phenomenon.
The first solution has a number of disadvantages. First, one must produce a form in two colors, one color for the instructions or printed text and a lighter color for the frames. This increases the cost of form production by requiring printing on a two-color printer. Second, the forms must be scanned either on a scanner whose bulb matches the dropout color or on a color scanner, in which case the resulting image must be processed to remove the dropout color. In either case, form processing costs are increased. Finally, two-color forms cannot be photocopied on black and white copiers.
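The software variant of the first solution, in which a color scan is processed to remove the dropout color, can be illustrated with a minimal sketch. The rule used here (a pixel is dropped when its blue channel exceeds both red and green by a fixed margin), the margin value, and all function names are assumptions chosen for illustration and are not taken from the cited references.

```python
WHITE = (255, 255, 255)
BLUE = (120, 150, 255)   # assumed light-blue frame ink
BLACK = (20, 20, 20)     # assumed dark handwriting ink


def is_dropout(pixel, min_excess=60):
    """Assumed rule: blue exceeds both red and green by at least min_excess."""
    r, g, b = pixel
    return b - max(r, g) >= min_excess


def drop_color(image, min_excess=60):
    """Replace every dropout-colored pixel with white paper."""
    return [[WHITE if is_dropout(p, min_excess) else p for p in row]
            for row in image]


# A tiny RGB scan: a blue frame with a black stroke that crosses the bottom
# frame line. Dark ink over light ink is assumed to scan as dark.
image = [
    [BLUE, BLUE, BLUE],
    [BLUE, BLACK, BLUE],
    [BLUE, BLACK, BLUE],
]

cleaned = drop_color(image)
for row in cleaned:
    print("".join("#" if p == BLACK else "." for p in row))
```

Because the frame and the handwriting differ in color, the stroke survives even where it crosses the frame; the cost is the two-color printing and color scanning described above.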
The second solution requires that the form be removed algorithmically. For example, a blank form can be rendered in computer memory, registered with the scanned filled-in image, and subtracted from that image, leaving only the entered data. The following references provide more background for such processes: R. Casey, D. Ferguson, K. Mohiuddin, and E. Walach, "Intelligent forms processing system," Machine Vision and Applications, Vol 5, 1992, pp. 143-155; S. Liebowitz Taylor, R. Fritzon, and J. A. Pastor, "Extraction of data from preprinted forms," Machine Vision and Applications, Vol 5, 1992, pp. 211-222; U.S. Pat. No. 5,140,650, "Computer-implemented method for automatic extraction of data from printed forms."; and U.S. Pat. No. 5,542,007, issued to Chevion et al. on Jul. 30, 1996, entitled "Form dropout compression method which handles form white-out and writing in shaded and white-out areas of the form". This approach requires either a scanned blank version of the form or a model of the form in computer memory. One can also remove the form by carefully erasing long horizontal and vertical lines (see Garris and Yu). In either case, character entries are damaged by the line removal, and strokes must be repaired using carefully crafted rules. Where a stroke coincides with a frame line, repair is impossible. Traditionally, if no drop-out color is used, one must remove the form boxes from a scanned image somehow. This can also be done by subtracting (XOR) the original form image from a scanned image. However, one may lose parts of the character images this way. Sophisticated image processing methods are still required to preserve the character images, and these methods perform poorly.
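The loss of character pixels under form subtraction can be shown with a small sketch on binary images (1 = ink). It assumes perfect registration between the blank form and the filled-in scan, which real systems must first establish; the image size, stroke position, and function names are hypothetical and serve only to illustrate the XOR subtraction mentioned above.

```python
SIZE = 7


def make_blank_form(size=SIZE):
    """Binary image of a single square frame drawn on the image border."""
    last = size - 1
    return [[1 if r in (0, last) or c in (0, last) else 0
             for c in range(size)] for r in range(size)]


def make_stroke(size=SIZE, col=3, top=1):
    """A vertical stroke that runs down into the bottom frame line."""
    return [[1 if c == col and r >= top else 0
             for c in range(size)] for r in range(size)]


def overlay(a, b):
    """Simulate writing on the form: ink from either image."""
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def subtract_form(filled, blank):
    """XOR the registered blank form out of the filled-in image."""
    return [[f ^ b for f, b in zip(rf, rb)] for rf, rb in zip(filled, blank)]


def lost_pixels(stroke, recovered):
    """Stroke pixels that did not survive the subtraction."""
    return [(r, c) for r, row in enumerate(stroke)
            for c, v in enumerate(row) if v and not recovered[r][c]]


blank = make_blank_form()
stroke = make_stroke()
filled = overlay(blank, stroke)
recovered = subtract_form(filled, blank)
lost = lost_pixels(stroke, recovered)
print("stroke pixels removed with the frame:", lost)
```

In this example the one stroke pixel that coincides with the bottom frame line is removed together with the frame. Nothing in the subtracted image records that the stroke ever reached that line, which is why repair rules cannot restore it reliably.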
Optical character recognition (OCR) is the process of converting photonic or electronic representations of a character form into a symbolic form. In modern systems, the data are kept in computer memory, whether on a hard disk or in random access memory. The symbolic representation can then be stored and edited. The process consists of three steps: scanning, feature extraction, and classification. In the first step, a light-sensitive device converts a character printed on a substrate into electronic impulses, which are represented as an array in a processor memory. The character might also be printed in a magnetic ink and sensed using a suitable device. Step two consists of extracting features from the character image represented as an array. Choosing a good feature set to distinguish among a set of characters, whether they are printed by machine (as in typewriting or typesetting) or by the human hand, has long been an active area of research and development, as shown in work by S. Mori, C. Y. Suen, and K. Yamamoto, "Historical review of OCR research and development," Proceedings of the IEEE, vol. 80, no. 7, 1992, pp. 1029-1058. Step three applies a decision rule to the extracted features and assigns the character a class, i.e., a character code. In the case of hidden-layer neural network methods for OCR, step two occurs in the first layer and step three in the second layer. For more information on OCR, see U.S. Pat. No. 4,034,343, issued to Michael E. Wilmer on Jul. 5, 1977, entitled "Optical character recognition system", which describes prior art in OCR in the spatial domain. U.S. Pat. No. 3,582,884, dated Jun. 1, 1971, entitled "Multiple-scanner character reading system", describes an OCR system on a communications network in which characters are scanned and represented as signals. Decoding from the signals to the original video scan data is done before recognition.
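The three steps can be illustrated with a deliberately small sketch. The choices here are assumptions made for illustration and are not drawn from the cited references: scanning is simulated by thresholding grayscale samples, the features are row and column ink counts (projection profiles), and the decision rule is nearest prototype under squared Euclidean distance. The three-letter alphabet and all names are hypothetical.

```python
GLYPHS = {  # hypothetical 5x3 prototype characters
    "I": [".#.", ".#.", ".#.", ".#.", ".#."],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def scan(rows, dark=30, light=220):
    """Stand-in for the scanner: text glyph -> grayscale samples (0..255)."""
    return [[dark if ch == "#" else light for ch in row] for row in rows]


def binarize(gray, threshold=128):
    """Step 1: represent the sensed character as a binary array (1 = ink)."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def extract_features(bitmap):
    """Step 2: row ink counts followed by column ink counts."""
    row_sums = [sum(row) for row in bitmap]
    col_sums = [sum(col) for col in zip(*bitmap)]
    return row_sums + col_sums


def classify(features, prototypes):
    """Step 3: assign the class of the nearest prototype feature vector."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, prototypes[label]))
    return min(prototypes, key=distance)


PROTOTYPES = {label: extract_features(binarize(scan(rows)))
              for label, rows in GLYPHS.items()}


def recognize(rows):
    return classify(extract_features(binarize(scan(rows))), PROTOTYPES)


# A "T" with one stem pixel missing, as might happen after frame removal.
noisy_t = ["###", ".#.", ".#.", "...", ".#."]
print(recognize(noisy_t))
```

With only one pixel missing, the damaged character still lies closest to its own prototype. Heavier damage, such as the loss of an entire stroke along a frame line, moves the feature vector toward other classes, which is how frame removal degrades recognition accuracy.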
To overcome the shortcomings of the prior art regarding form rendering and character recognition, it is an objective of this invention to generate forms simply and easily using halftones on inexpensive printers, together with a set of low-complexity image processing steps to extract the characters for recognition.
All of the references cited herein are incorporated by reference for their teachings.