1. Technical Field
The invention disclosed broadly relates to data processing systems and methods and more particularly relates to techniques for the repair of character recognition information derived from scanned document images.
2. Background Art
Data processing systems and methods have been devised to capture the image of hard copy documents for display, communication and archiving. The process of capturing the image of a document starts with scanning the hard copy document with an image scanning device, which converts the black and white, gray level, or color object shapes into corresponding picture elements represented by a bit map array. The bit map array can be selectively compressed to remove redundancy through techniques such as run length encoding. The compressed image file can then be efficiently transmitted over data communications links and stored in conventional data storage devices. However, the information content of the bit map for the image is not in a coded data format which can be manipulated by arithmetic and word processing applications. The conversion of the shapes of characters in a document image into coded data must be done by a character recognition step.
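By way of illustration only, run length encoding of a single row of a bi-level bit map can be sketched as follows. The function names rle_encode and rle_decode and the (value, length) pair representation are assumptions made for this example; practical image compression schemes use more elaborate codes.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # (pixel value, number of consecutive pixels)


def rle_encode(row: List[int]) -> List[Run]:
    """Collapse each run of identical pixel values into one (value, length) pair."""
    runs: List[Run] = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            # Same value as the previous pixel: lengthen the current run.
            runs[-1] = (pixel, runs[-1][1] + 1)
        else:
            # Value changed (or first pixel): start a new run.
            runs.append((pixel, 1))
    return runs


def rle_decode(runs: List[Run]) -> List[int]:
    """Expand (value, length) pairs back into the original row of pixels."""
    row: List[int] = []
    for value, length in runs:
        row.extend([value] * length)
    return row
```

A row such as [0, 0, 0, 1, 1, 0] is encoded as three runs, and decoding those runs reproduces the original row, so no image information is lost.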
Character recognition makes use of pattern recognition processes to convert the shapes of images representing characters in the bit mapped image into character codes such as the ASCII alphanumeric character code. Character recognition outputs character strings which can be used to generate addresses for the storage or transmission of the document image, this process being referred to as auto indexing. Character recognition can also be used to provide program applications with character strings derived from the hard copy documents scanned into the system.
An example of a document image archiving system can be found in U.S. Pat. No. 5,058,185 to Morris, et al. entitled "Object Management and Delivery System Having Multiple Object Resolution Capability," which is assigned to the IBM Corporation and incorporated herein by reference.
The process of locating meaningful portions of the document image which contain information useful to auto indexing or to application programs is made easier by the use of hard copy forms. A hard copy form will provide a pre-defined location for the specification of words and phrases representing categories of information meaningful to both auto indexing and application programs. For example, a hard copy form can have the identity of the form in a pre-specified location, to enable the system to quickly access a master form definition to identify the location of other meaningful character images in the document image. Other fields can be pre-specified in the master form definition to locate other meaningful categories containing character images for character recognition. The master form definition can also include a specification of the code page for characters expected to be represented in particular fields on the form.
A problem which occurs in the character recognition of information fields on the image of a document form is the appearance of extraneous marks and misaligned images on the form. A technique to overcome problems of extraneous marks and misregistration or misalignment of images on a form document is described in the co-pending U.S. patent application Ser. No. 07/305,828, filed Feb. 2, 1989, now U.S. Pat. No. 5,140,650, by R. G. Casey and D. R. Ferguson entitled "A Computer Implemented Method for Automatic Extraction of Data From Printed Forms," assigned to the IBM Corporation and incorporated herein by reference.
As described by Casey and Ferguson, a blank master form can be scanned into the system and its digital image stored. Each type of form which is to be recognized must first be defined to the system. That master form definition can include a fingerprint of the master form's image which will be used to confirm the correctness of the form and to verify that the entire incoming form was completely scanned. Also as a part of the master form definition, a bar code may be included which is associated with the form or other numeric or identifying information can be included to identify the form. In addition, the coordinates of all defined fields are provided for the form in the master form definition.
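By way of illustration only, the contents of a master form definition as outlined above might be held in a record such as the following. The class names, attribute names, and the rectangle convention (left, top, right, bottom) are assumptions made for this sketch and are not taken from the referenced Casey and Ferguson disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed order: left, top, right, bottom


@dataclass
class FieldDefinition:
    name: str
    box: Box                         # coordinates of the field on the master form
    code_page: Optional[str] = None  # code page expected for this field, if specified


@dataclass
class MasterFormDefinition:
    form_id: str                     # numeric or other identifying information
    fingerprint: str                 # used to confirm the form and a complete scan
    bar_code: Optional[str] = None   # bar code associated with the form, if any
    fields: Dict[str, FieldDefinition] = field(default_factory=dict)

    def add_field(self, definition: FieldDefinition) -> None:
        """Register one defined field under its name."""
        self.fields[definition.name] = definition

    def field_box(self, name: str) -> Box:
        """Return the master coordinates of a named field."""
        return self.fields[name].box
```

Once a scanned form has been identified, a record of this kind would be looked up by its form identity to obtain the coordinates of every defined field.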
The image of the hard copy document form which has been scanned into the system is examined and its features are compared with the master form definition. The line geography of the scanned image is compared with a definition of the line geography for the master form. This is a test of whether the horizontal or vertical line specified for each interior node in the master form definition, exists on the scanned image of the input document form. Alternately, a bar code may be associated with each form that has been defined in the master form definition data set. If a bar code is to be employed to identify the form, then scanning can proceed from one side of the document image to the other and bar code information identified. When the bar code is found, it will be used to determine the identity of the form. Once the identity of the form has been determined, the master form definition for that particular form type can be accessed to determine the location of all of the fields within the form image. Alternately, the identity of the form can also be input by the user from the keyboard or other input device, enabling access of the intended form definition. Reference can also be made to the publication by R. G. Casey and D. R. Ferguson, "Intelligent Forms Processing," IBM Systems Journal, Vol. 29, No. 3, 1990, pp. 435-450, for additional details on the process of form recognition.
After a form has been recognized for its form type and its corresponding master form definition has been accessed, the coordinates of each of the fields for which character recognition is to be applied are available. A clean image of each character string must then be lifted from the overall document image. Typically, the document fields will have extraneous marks or misregistered or misaligned character strings, and the effects of these defects must be eliminated or reduced. This is accomplished by the step of field extraction. Once the form is identified and verified, data from the form's fields must be extracted. This begins with identifying any image skew and offset. The master coordinates for the fields on the form must be adjusted to compensate for the skew and the offset of the incoming form image. Next, field adjustment must be performed. The boundaries of each field must be checked to determine if data extends beyond the boundaries. If data overlaps the field boundaries, the area of image lift must be extended outside of the field boundary. Next, extraneous line removal must be accomplished. When extraneous lines are identified, those lines must be removed from the field image without damaging the character images within the field. The process of field extraction is described in more detail in the above referenced co-pending U.S. patent application by Casey and Ferguson and is also described in the above referenced technical article by Casey and Ferguson.
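By way of illustration only, the adjustment of master field coordinates for skew and offset can be sketched as a rotation followed by a translation. The sketch assumes that the skew is a rotation about the image origin and that the skew angle and offset have already been measured; the function names and the margin parameter are hypothetical, and the referenced Casey and Ferguson disclosure may model the adjustment differently.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def adjust_point(p: Point, skew_radians: float, offset: Point) -> Point:
    """Rotate a master coordinate by the skew angle, then shift it by the offset."""
    x, y = p
    c, s = math.cos(skew_radians), math.sin(skew_radians)
    return (c * x - s * y + offset[0], s * x + c * y + offset[1])


def adjust_field(corners: List[Point], skew_radians: float, offset: Point) -> List[Point]:
    """Apply the same skew and offset correction to every corner of a field."""
    return [adjust_point(p, skew_radians, offset) for p in corners]


def lift_box(corners: List[Point], margin: int = 0) -> Tuple[int, int, int, int]:
    """Axis-aligned lift area (left, top, right, bottom) enclosing the adjusted
    field, widened by an optional margin for data that overlaps the boundary."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (
        math.floor(min(xs)) - margin,
        math.floor(min(ys)) - margin,
        math.ceil(max(xs)) + margin,
        math.ceil(max(ys)) + margin,
    )
```

With no skew and an offset of (5, -2), a master corner at (10, 20) maps to (15, 18). The margin argument corresponds to extending the area of image lift when data overlaps the field boundary.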
After the field image has been extracted from the overall document image of the form, character recognition must be performed to convert the shapes in the extracted field image into alphanumeric character representations such as ASCII. The master form definition will include information on the code page which characterizes the characters expected to be present in each respective field on the form. For single byte character sets (SBCS), such as those used for Latin languages, the code page will be specified. For double byte character sets (DBCS), such as those used for Kanji, Mandarin, or other oriental characters, the appropriate code page will likewise be specified in the master form definition data set.
The process of character recognition takes bi-level images and performs pattern recognition operations, returning ASCII-coded data representing the recognized characters. Unrecognized characters are flagged and their location in the character string is identified. A character is marked as suspicious if it is recognized with a certainty level below the certainty established for properly recognized characters. Further information on the character recognition process can be found in the above referenced co-pending U.S. patent application by Casey and Ferguson and also in the above referenced technical article by Casey and Ferguson.
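By way of illustration only, the flagging of unrecognized and suspicious characters can be sketched as follows. The RecognizedChar record, the use of None for an unrecognized character, and the threshold value of 0.8 are assumptions made for this example; the source does not state how certainty is represented or what level is established.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RecognizedChar:
    code: Optional[str]  # recognized character, or None if unrecognized
    certainty: float     # recognizer's certainty level (assumed range 0.0 to 1.0)


def flag_characters(
    chars: List[RecognizedChar], threshold: float = 0.8
) -> Tuple[List[int], List[int]]:
    """Return (unrecognized positions, suspicious positions) within the string."""
    unrecognized = [i for i, c in enumerate(chars) if c.code is None]
    suspicious = [
        i
        for i, c in enumerate(chars)
        if c.code is not None and c.certainty < threshold
    ]
    return unrecognized, suspicious
```

The two lists of positions identify where in the character string later repair would need to be applied.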
The recognition of the image of a bar code representing the form identity in the scanned document image can be better understood with reference to U.S. Pat. No. 4,992,650 by Somerville entitled "Method and Apparatus for Bar Code Recognition in a Digital Image," assigned to the IBM Corporation and incorporated herein by reference.
A problem in the prior art of accurate character recognition of character strings in scanned document images is the need to repair misrecognized character strings. Typically, techniques for repairing misrecognized character strings will depend upon the type of information expected for a particular character string and the code page representation expected for that information. For example, if numeric information is expected to be placed in a particular field, then the character recognition operation can be limited to recognizing Arabic numerals and no consideration need be given to Latin character shapes. Thus, if a poorly represented numeral "4" occurred in a field identified as a numeric field, no attempt would be made by the recognition operation to interpret the shape as a "P." Alternately, if a field is identified by the master form definition as being a given name field, for example, then character strings in that field can be verified by comparing them with a lexicon of conventional given names. Similarly, if a particular field is defined in the master form definition as being for the name of a state, then a lexicon of conventional state names can be used to compare and validate poorly recognized character strings in that field. Alternately, if Kanji character information is to be represented in a particular field as defined by the master form definition, then a still different form of comparison and validation should be used for that field.
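By way of illustration only, two field-type-dependent checks of the kind described above can be sketched as follows. The sketch uses the approximate string matching of the Python standard library (difflib.get_close_matches) as a stand-in for lexicon comparison; the function names, the cutoff value, and the abbreviated state lexicon are assumptions made for this example and do not represent any particular prior art repair process.

```python
import difflib
from typing import List, Optional

# Abbreviated lexicon for the example; a real lexicon would list every state.
STATE_LEXICON = ["ALABAMA", "FLORIDA", "GEORGIA", "NEVADA"]


def repair_against_lexicon(
    candidate: str, lexicon: List[str], cutoff: float = 0.6
) -> Optional[str]:
    """Return the lexicon entry closest to the recognized string, or None
    when no entry is similar enough to validate the string."""
    upper = candidate.upper()
    if upper in lexicon:
        return upper  # the string is already a valid entry
    matches = difflib.get_close_matches(upper, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def non_numeric_positions(candidate: str) -> List[int]:
    """Positions in a numeric field whose characters are not Arabic numerals."""
    return [i for i, ch in enumerate(candidate) if ch not in "0123456789"]
```

For a state name field, the misrecognized string "Flor1da" is matched to "FLORIDA"; for a numeric field, the string "4P7" is reported as having a non-numeric character at position 1.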
Since the types of unrecognized character repair processes are diverse and depend upon the field type, among other determinants, a diversity of processes may be required to handle the repair of misrecognized fields occurring on the same document form. Some mechanism is needed to keep track of the history of repair and the requirements for repair of particular fields which have been misrecognized on a document form image.
Another problem is maintaining an audit trail of the repair history for particular fields which have been misrecognized on a document form. For example, if an application makes use of a particular field to index the document image in an image archiving system, and if the character repair for misrecognized characters in the field is defective, the archived image will be misfiled in the system. If this were a medical record, for example, and if the misfiling of this document image resulted in significant liability to the user, such as an insurance company, some means should be available to trace the repair history of that field.
Still further, where an attempt is made to improve the repair processing for misrecognized character strings, the accessibility of the repair histories for previously processed fields would be useful in assessing the effectiveness of new techniques for character repair.
Still further, where sequential stages of character repair require information from a prior stage of character repair in order to perform the subsequent repair stage, some means is needed to track the history for the repair of the misrecognized and suspicious characters in the field.