1. Technical Field
The invention disclosed broadly relates to data processing systems and methods and more particularly relates to techniques for the extraction of field images from scanned document images.
2. Related Patent Applications
This patent application is related to the copending U.S. patent application Ser. No. 07/870,129, filed Apr. 15, 1992, entitled "Data Processing System and Method for Sequentially Repairing Character Recognition Errors for Scanned Images of Document Forms," by T. S. Betts, V. M. Carras, L. B. Knecht, T. L. Paulson, and G. R. Anderson, the application being assigned to the IBM Corporation and incorporated herein by reference, now U.S. Pat. No. 5,251,273.
This patent application is also related to the copending U.S. patent application, Ser. No. 07/870,507, filed Apr. 17, 1992, entitled "Data Processing System and Method for Selecting Customized Character Recognition Processes and Coded Data Repair Processes for Scanned Images of Document Forms," by T. S. Betts, V. M. Carras and L. B. Knecht, the application being assigned to the IBM Corporation and incorporated herein by reference.
This patent application is also related to the copending U.S. patent application Ser. No. 07/305,828, filed Feb. 2, 1989, entitled "A Computer Implemented Method for Automatic Extraction of Data From Printed Forms," by R. G. Casey and D. R. Ferguson, the application being assigned to the IBM Corporation and incorporated herein by reference, now U.S. Pat. No. 5,140,650.
3. Background Art
The above-referenced copending patent applications by T. S. Betts, et al. describe the system context within which the invention disclosed herein finds application. The system disclosed by T. S. Betts, et al. defines document forms and then reads filled-in copies of those document forms, which are scanned into a digital imaging system. Each defined document form includes several fields within which handwritten or typed information is to be entered. The T. S. Betts, et al. system examines the digital image of a scanned-in form to identify the form, and then locates the respective fields from which images are extracted for character recognition operations.
The process of field extraction for digital images of document forms is made difficult by the presence of preprinted background information in the form of text, boxes, and other visual prompts which have been included on the form to assist the person filling in the form. This preprinted background information must be identified in the scanned-in copy of the form and deleted from the form. This problem becomes acute because the scanned-in form will be offset and skewed by virtue of the mechanical imprecision of the scanning device employed. Additional problems of an even more severe nature are encountered where the person filling in the form misregisters the handwritten or typed characters. Characters which overlap or go beyond the boundary of the fields typically will not be completely extracted by prior art field extraction processes. In addition, artifacts such as creases, staples or other unintended marks which appear on the scanned-in image of the form create still more severe problems in discriminating between the intended characters and the unintended artifact marks on the form.
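The effect of offset and skew on field location can be illustrated with a minimal sketch. It is not the method of this invention or of the referenced applications. It assumes, purely for illustration, that the scanner's misregistration can be modeled as a rotation about the image origin followed by a translation. The function name map_field_to_scan, the (left, top, right, bottom) rectangle layout, and the sample coordinates are all hypothetical.

```python
import math


def map_field_to_scan(field, skew_degrees, offset):
    """Map a field rectangle from form-definition coordinates to
    scanned-image coordinates.

    Assumption (illustrative only): the scan is the defined form rotated
    by skew_degrees about the origin and then shifted by offset = (dx, dy).
    Returns the axis-aligned bounding box of the transformed rectangle.
    """
    left, top, right, bottom = field
    dx, dy = offset
    theta = math.radians(skew_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    # Transform each corner of the defined field rectangle.
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    moved = [
        (x * cos_t - y * sin_t + dx, x * sin_t + y * cos_t + dy)
        for x, y in corners
    ]

    # The skewed rectangle is no longer axis-aligned, so report its
    # enclosing box in scan coordinates.
    xs = [p[0] for p in moved]
    ys = [p[1] for p in moved]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical field defined on the master form, in pixels.
defined_field = (100, 200, 400, 240)

# A one-degree skew with no offset already moves the field by several pixels.
print(map_field_to_scan(defined_field, skew_degrees=1.0, offset=(0, 0)))
```

Under this model even a small skew displaces the field boundary by several pixels, so an extraction window taken directly from the form definition can clip characters or capture preprinted background.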