1. Technical Field
The invention disclosed broadly relates to data processing systems and methods and more particularly relates to techniques for the forms recognition of scanned document images.
2. Related Patent Applications
This patent application is related to U.S. Pat. No. 5,251,273, issued Oct. 5, 1992, entitled "Data Processing System and Method for Sequentially Repairing Character Recognition Errors for Scanned Images of Document Forms," by T. S. Betts, V. M. Carras, L. B. Knecht, T. L. Paulson, and G. R. Anderson, the patent being assigned to the IBM Corporation and incorporated herein by reference.
This patent application is also related to U.S. Pat. No. 5,305,396, issued Apr. 19, 1994, entitled "Data Processing System and Method for Selecting Customized Character Recognition Processes and Coded Data Repair Processes for Scanned Images of Document Forms," by T. S. Betts, V. M. Carras and L. B. Knecht, the patent being assigned to the IBM Corporation and incorporated herein by reference.
This patent application is also related to U.S. Pat. No. 5,140,650 entitled "A Computer Implemented Method for Automatic Extraction of Data From Printed Forms," by R. G. Casey and D. R. Ferguson, assigned to the IBM Corporation and incorporated herein by reference.
This patent application is also related to U.S. Pat. No. 4,992,650 entitled "Method and Apparatus for Bar Code Recognition in a Digital Image," by P. J. Somerville, assigned to the IBM Corporation and incorporated herein by reference.
3. Background Art
The above referenced patents by T. S. Betts, et al. describe the system context within which the invention disclosed herein finds application. The system disclosed by T. S. Betts, et al. defines document forms and then reads filled-in copies of those document forms which are scanned into a digital imaging system. Each document form which is defined includes several fields within which handwritten or typed information is to be entered. The T. S. Betts, et al. system examines the digital image of a scanned-in form to identify the form, and then locates the respective fields from which images are extracted for character recognition operations.
The processing of document form images includes the stages of defining a master form image, recognizing completed form images of the master, separating field images from the completed form, and recognizing the text characters in the field images. Such processing is described in the above referenced T. S. Betts, et al. patents. Each master form has a unique identity name or number (ID) assigned to it, to distinguish it from other master forms. Each master form has an array of fields generally delineated by preprinted horizontal and vertical lines, within which data can be marked, thereby making a completed form. The shape of the horizontal and vertical lines is the line geography of the form.
The definition of a master form image is stored in a form definition data set which includes the name of the form, the value of any preprinted bar code or OCR code, a characterization of the preprinted line geography, the location of the fields and usually characterizations of the type of text that is expected for each of the fields. The objective of the forms recognition process is to take a completed form of an unknown identity and infer its identity from clues contained in the image. Once the identity is ascertained, the correct form definition data set can be selected to enable location and processing of the data written into the fields of the completed form.
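By way of illustration only, the contents of a form definition data set as listed above can be sketched as a simple record. The class names, attribute names, and the coordinate representation of lines and fields are assumptions made for this sketch and are not taken from the referenced patents.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: a line is (x_start, y_start, x_end, y_end) in pixels.
Line = Tuple[int, int, int, int]


@dataclass
class FieldDefinition:
    """Location of one field and the type of text expected in it."""
    name: str
    x: int
    y: int
    width: int
    height: int
    expected_text_type: Optional[str] = None  # e.g. "numeric" (assumed label)


@dataclass
class FormDefinition:
    """One form definition data set for a master form."""
    form_id: str                      # unique identity name or number (ID)
    code_value: Optional[str] = None  # preprinted bar code or OCR code, if any
    horizontal_lines: List[Line] = field(default_factory=list)
    vertical_lines: List[Line] = field(default_factory=list)
    fields: List[FieldDefinition] = field(default_factory=list)


def select_definition(definitions: Dict[str, FormDefinition],
                      form_id: str) -> Optional[FormDefinition]:
    """Once the identity is ascertained, pick the matching data set."""
    return definitions.get(form_id)


# Example: one stored master form with a single field.
claim_form = FormDefinition(
    form_id="CLAIM-01",
    code_value="0012345",
    horizontal_lines=[(50, 100, 950, 100), (50, 200, 950, 200)],
    vertical_lines=[(50, 100, 50, 200), (950, 100, 950, 200)],
    fields=[FieldDefinition("last_name", 60, 110, 400, 80, "alphabetic")],
)
definitions = {claim_form.form_id: claim_form}
selected = select_definition(definitions, "CLAIM-01")
```

With the identity known, the field locations in the selected record indicate where to extract field images from the completed form.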
Line geography can be used at the time of forms recognition to identify the master form corresponding to the completed form being processed. A simple example would be using the number of horizontal and vertical lines on the master form to recognize corresponding completed forms. Generally, however, much more complex characterizations of line geography must be employed in matching operations to recognize and distinguish forms whose preprinted shapes are similar.
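The simple line-count example above, and its limitation, can be sketched as follows. This is a minimal illustration under assumed names (`match_by_line_counts`, the form IDs, and the counts are all hypothetical); it is not the matching operation of any referenced patent.

```python
def match_by_line_counts(h_count, v_count, master_counts):
    """Return the IDs of all master forms whose (horizontal, vertical)
    line counts equal those measured on the completed form."""
    return [
        form_id
        for form_id, counts in master_counts.items()
        if counts == (h_count, v_count)
    ]


# Hypothetical master forms: ID -> (horizontal line count, vertical line count).
master_counts = {
    "INVOICE-A": (12, 5),
    "INVOICE-B": (12, 5),   # similar preprinted shape, same counts
    "CLAIM-01": (8, 3),
}

unique_match = match_by_line_counts(8, 3, master_counts)
ambiguous_match = match_by_line_counts(12, 5, master_counts)
```

Here `unique_match` holds a single ID, while `ambiguous_match` holds two, which is the situation in which a more complex characterization of line geography is needed to tell the forms apart.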
Usually a master form will also have an identifying mark such as a preprinted bar code or a preprinted optical character recognition (OCR) code that can be used at the time of forms recognition to identify the master form corresponding to the completed form being processed. An example process of bar code location and reading is described in the above referenced P. J. Somerville patent. Although locating and reading a bar code, or locating and reading an OCR code, requires an interval of time to complete, either process is generally faster than using line geography and matching operations to perform forms recognition of a particular completed form being processed.
Another requirement of the forms recognition process is the assessment of the quality of the scanned image produced by the scanning device. If the scanned image of the completed document is misregistered, that information must be passed on to the field separation stage and/or the character recognition stage, to enable a faster and more accurate location of the fields and text on the completed form. If the image of the completed form is rotated slightly from the axis of travel of the document in the scanner, the error is called skew. If the image is displaced vertically in the direction of the path of travel of the document in the scanner, this is called offset. Scanners vary in quality, and the degree of skew and offset of the completed images will vary with the particular scanner in use and also with the technique of the operator feeding documents into the scanner. Skew and offset are usually measured during the forms recognition process, after the identity of the completed form has been ascertained. The line geography of the master form, as represented in the form definition data set, is compared with the line geography of the image of the completed form, resulting in skew and offset correction values that are passed on to the later stages of the process. This process of measuring the skew and offset of completed form images occupies an interval of time.
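As a rough illustration of the comparison described above, skew and offset can be estimated from a single preprinted horizontal line whose endpoints are known on the master form and have been located on the completed form image. The function names, the endpoint representation, and the choice to measure offset at the line midpoint are assumptions of this sketch; a practical system would likely combine many lines.

```python
import math


def line_angle(p0, p1):
    """Angle of the line from p0 to p1, in radians, relative to the x axis."""
    return math.atan2(p1[1] - p0[1], p1[0] - p0[0])


def estimate_skew_and_offset(master_line, scanned_line):
    """Compare one line of the master form with the same line as found
    on the completed form image.

    Each line is a pair of (x, y) endpoints. Returns (skew in degrees,
    vertical offset in pixels).
    """
    (m0, m1), (s0, s1) = master_line, scanned_line
    # Skew: rotation of the scanned line relative to the master line.
    skew_deg = math.degrees(line_angle(s0, s1) - line_angle(m0, m1))
    # Offset: vertical displacement, measured between the line midpoints.
    master_mid_y = (m0[1] + m1[1]) / 2
    scanned_mid_y = (s0[1] + s1[1]) / 2
    return skew_deg, scanned_mid_y - master_mid_y


# Hypothetical measurements, in pixels.
master_line = ((0, 100), (1000, 100))
scanned_line = ((0, 110), (1000, 120))
skew_deg, offset_px = estimate_skew_and_offset(master_line, scanned_line)
```

The resulting pair of values plays the role of the skew and offset correction values that are passed on to the field separation and character recognition stages.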
In business applications using many types of master form documents, many form definition data sets will be stored in the system. In many business applications, completed forms corresponding to many different types of master forms are received and processed on the same day. The volume of completed forms and the diversity of their master form types make it important to minimize the time required to perform the forms recognition process.