The problem at hand is to recognize the textual information contained in a scanned image. The scanned images can come from a wide variety of source material, such as written answers to printed questions on forms, or mailing addresses on postal envelopes.
A form is a document that is easily partitioned into a fixed number of small fields, each having certain simple syntactic and/or semantic properties. The data can be machine printed or hand printed. Typically, a machine-printed form has printed legends or instructions for the various fields, and a person filling out the form inserts information, in his/her handwriting, in the fields. Thus, both the machine-printed legends and instructions and the handwriting are present in the image of the form. OCR is performed to obtain the handwritten information.
OCR systems employ various strategies for isolating small portions of the image (such as groups of numeric digits within telephone numbers or ZIP codes) as connected components, segmenting a connected component into one or more character images, and recognizing each such image as representing a specific character.
Conventional OCR engines recognize characters with reasonable accuracy. However, "reasonable accuracy" is generally not good enough: minor errors, such as an erroneous digit in a dollar value, can have devastating impacts on the user.
The OCR technology that is required to perform this job has made substantial progress in the last decade, but raw OCR results are still, and will probably always remain, relatively unreliable. A 98% recognition rate at the character level leaves many fields with an error. At a 90% recognition rate, fewer than half of the words in a form are error-free. Yet even 90% is out of reach when the form is hand-printed or when the characters are not "boxed" (the latter situation often leading to segmentation errors).
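The arithmetic behind these figures can be sketched as follows. This is an illustration only: it assumes that character errors are independent, and the field lengths (5, 7, and 10 characters) are hypothetical values, neither of which is stated above.

```python
def error_free_probability(char_accuracy: float, length: int) -> float:
    """Chance that every one of `length` characters is recognized correctly.

    Assumes each character is recognized independently of the others
    (an assumption made for illustration only).
    """
    return char_accuracy ** length


if __name__ == "__main__":
    for accuracy in (0.98, 0.90):
        for length in (5, 7, 10):  # hypothetical field lengths
            p = error_free_probability(accuracy, length)
            print(f"accuracy={accuracy:.2f} length={length:2d} error-free={p:.3f}")
```

Under these assumptions, a seven-character word at 90% character accuracy is error-free less than half the time, since 0.9 raised to the seventh power is about 0.478.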
In practically all cases, some form of error correction needs to be done after OCR. Therefore a practical system must include extra processing of some kind in order to improve the results, either automatically through the exploitation of context, or by supporting operator intervention to verify and correct the information.
An OCR system may be designed to identify several alternatives for segmenting a connected component, and several character choices for each character inside a segmentation alternative. The results are typically passed from the OCR system to an application program, such as a text processor or a printer driver.
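One way to picture such output is as a small nested data structure. The sketch below is hypothetical: the class names (`CharChoice`, `SegmentationAlternative`, `ComponentResult`) and the numeric confidence field are assumptions introduced for illustration and are not taken from any particular OCR system.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class CharChoice:
    char: str
    confidence: float  # assumed score in [0, 1]


@dataclass
class SegmentationAlternative:
    # One list of character choices per character position.
    positions: List[List[CharChoice]]

    def candidate_strings(self) -> List[str]:
        """Every string obtainable by picking one choice per position."""
        return ["".join(c.char for c in combo) for combo in product(*self.positions)]


@dataclass
class ComponentResult:
    # Several ways of cutting the same connected component into characters.
    alternatives: List[SegmentationAlternative]

    def candidate_strings(self) -> List[str]:
        out: List[str] = []
        for alt in self.alternatives:
            out.extend(alt.candidate_strings())
        return out


if __name__ == "__main__":
    two_chars = SegmentationAlternative([
        [CharChoice("1", 0.9), CharChoice("l", 0.4)],
        [CharChoice("0", 0.8), CharChoice("O", 0.5)],
    ])
    one_char = SegmentationAlternative([[CharChoice("D", 0.3)]])
    print(ComponentResult([two_chars, one_char]).candidate_strings())
```

The number of candidate strings grows multiplicatively with the number of choices per position, which is one reason later filtering matters.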
It is well known that the use of context information in conjunction with OCR helps to improve the level of accuracy realized. For instance, suppose some connected components are identified as making up a ZIP code (which consists only of a sequence of numeric characters). Then any character choice, for a character within the connected component, that is not a numeric character can be dismissed as an incorrect choice. Since a ZIP code has a known number of digits, any character choice that would imply a different number of digits can be dismissed. Finally, since only a subset of the possible digit sequences are actual ZIP codes, valid and in use, any character choice that would imply an invalid ZIP code may be dismissed.
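The three ZIP code checks described above can be sketched as a simple filter over candidate strings. The function name `filter_zip_candidates`, the default length of five digits, and the tiny set of valid codes are assumptions made for this example.

```python
from typing import Iterable, List, Set


def filter_zip_candidates(
    candidates: Iterable[str],
    valid_zips: Set[str],
    length: int = 5,  # assumed ZIP code length for this sketch
) -> List[str]:
    """Keep only candidates that survive all three contextual checks."""
    survivors = []
    for text in candidates:
        # Check 1: every character must be a numeric digit.
        if not (text.isascii() and text.isdigit()):
            continue
        # Check 2: the number of digits must match the known length.
        if len(text) != length:
            continue
        # Check 3: the code must be one that is valid and in use.
        if text not in valid_zips:
            continue
        survivors.append(text)
    return survivors


if __name__ == "__main__":
    candidates = ["95120", "9S120", "951200", "95l20", "00000"]
    valid = {"95120", "10001"}  # hypothetical stand-in for a real ZIP table
    print(filter_zip_candidates(candidates, valid))
```

Each check is cheaper than the next, so ordering them this way discards most wrong candidates before the table lookup.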
In conventional systems, the OCR subsystem simply provides any character choices it recognizes to the application program, and the exploitation of context is performed by the application program. Either the application program performs error correction automatically, or it provides a user interface for operator intervention.
However, there are drawbacks in such an approach: speed and accuracy are not as satisfactory as they ideally could be. Therefore, the challenge facing OCR system designers is how to integrate character verification and correction with context checking, to improve the speed and accuracy of an automatic recognition system beyond that of conventional systems.
The technology of such post-OCR error correction has been limited to methods for performing a linear sequence of operations. For instance, an automatic step invokes the recognizer and then checks whether the constraints on each field are satisfied. If a field does not satisfy the constraints, then the field is shown to an operator, who corrects it.
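A linear sequence of this kind can be sketched as below. The function `linear_correction` and its callable parameters (`recognize`, `constraints`, `ask_operator`) are hypothetical placeholders for a recognizer, per-field constraint checks, and an operator interface; none of them comes from a real system.

```python
from typing import Callable, Dict


def linear_correction(
    field_images: Dict[str, object],
    recognize: Callable[[object], str],
    constraints: Dict[str, Callable[[str], bool]],
    ask_operator: Callable[[str, str], str],
) -> Dict[str, str]:
    """Recognize each field, check its constraint, and fall back to an operator."""
    results = {}
    for name, image in field_images.items():
        text = recognize(image)                      # step 1: automatic recognition
        check = constraints.get(name, lambda _t: True)
        if not check(text):                          # step 2: constraint check
            text = ask_operator(name, text)          # step 3: operator correction
        results[name] = text
    return results


if __name__ == "__main__":
    # Stubs: the "images" are already strings and the operator always types 95120.
    images = {"zip": "9512O", "qty": "12"}
    out = linear_correction(
        images,
        recognize=lambda img: str(img),
        constraints={"zip": str.isdigit, "qty": str.isdigit},
        ask_operator=lambda name, text: "95120",
    )
    print(out)
```

The order of the three steps is fixed in the code, which is exactly the limitation the surrounding text describes.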
Another sequence may be to recognize the characters, and then rely on operator intervention to certify or re-enter individual characters that have been recognized with a confidence less than a certain threshold. Such a system works only at the character level. Field verification is done afterwards.
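Character-level verification against a confidence threshold might look like the following sketch. The function name `verify_characters`, the representation of each character as a (character, confidence) pair, and the threshold value are assumptions for illustration.

```python
from typing import Callable, List, Tuple


def verify_characters(
    chars: List[Tuple[str, float]],
    threshold: float,
    ask_operator: Callable[[int, str], str],
) -> str:
    """Send each low-confidence character to the operator; keep the rest."""
    verified = []
    for index, (char, confidence) in enumerate(chars):
        if confidence < threshold:
            # The operator either certifies the character or re-enters it.
            char = ask_operator(index, char)
        verified.append(char)
    return "".join(verified)


if __name__ == "__main__":
    recognized = [("4", 0.99), ("7", 0.42), ("1", 0.95)]
    # Stub operator that always re-enters the doubtful character as "1".
    print(verify_characters(recognized, 0.8, lambda i, ch: "1"))
```

Note that nothing in this loop looks at the field as a whole, so field-level constraints still have to be checked in a later pass.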
A variation of this technique includes certifying all characters, using the concept of carpets. Documents are recognized in a batch, and recognition is done for several hundred or several thousand characters. Then, all characters recognized as "1" can be shown together on the screen, forming a "carpet". An operator easily recognizes (and clicks on) the misrecognized characters and corrects them. The same is done for all characters recognized as "2", "3", and so on. This method is efficient only if a very high percentage of the characters are recognized correctly (97-98% range). Because of this requirement, the method, when used alone, is of little help for handwritten letters. Of course, checking the constraints can always be done afterwards.
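The grouping step behind a carpet can be sketched as follows. The names `build_carpets` and `apply_corrections`, and the use of integer identifiers for character images, are assumptions made for this example; the on-screen display and clicking are not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_carpets(recognized: List[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Group character-image ids by the label the recognizer assigned."""
    carpets: Dict[str, List[int]] = defaultdict(list)
    for image_id, label in recognized:
        carpets[label].append(image_id)
    return dict(carpets)


def apply_corrections(
    labels: Dict[int, str], corrections: Dict[int, str]
) -> Dict[int, str]:
    """Return a copy of the labels with the operator's corrections applied."""
    updated = dict(labels)
    updated.update(corrections)
    return updated


if __name__ == "__main__":
    batch = [(0, "1"), (1, "7"), (2, "1"), (3, "1"), (4, "7")]
    carpets = build_carpets(batch)
    print(carpets)  # each value is one carpet shown to the operator
    # Suppose the operator clicks image 2 in the "1" carpet and re-enters it as "7".
    print(apply_corrections(dict(batch), {2: "7"}))
```

The operator's effort scales with the number of misrecognized images per carpet, which is why the approach depends on a high recognition rate.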
Another technique is described in co-pending, co-assigned U.S. patent application Ser. No. 08/325,849, Lorie, "Optical Character Recognition System Having Context Analyzer." This system uses both syntactical and semantic rules.
However, these prior art methods have in common the fact that they are limited to performing operations in a predetermined order.
If, after recognition, characters that are doubtful are presented to an operator for character-level verification or re-entry, the system over-emphasizes the importance of correctly identifying individual characters, and fails to make the most effective use of the contextual information that is available.
On the other hand, if the system immediately rushes into context exploitation, it may do so on results that are excessively poor, so that the contextual correction process is misled by the errors and applies incorrect contextual information. The result is that correct characters may be "corrected" to incorrect values because of the inappropriate contextual information.
Because of these factors, the conventional context analyzer will perform poorly and will be slow. This is particularly true of fuzzy searches of dictionaries. Since no character is absolutely certain, it is difficult to use indexing to directly address a subset of the entries in the dictionaries.
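To illustrate why such a search tends to be slow, the sketch below scores every dictionary entry against uncertain character choices instead of looking up a single key. It is a hypothetical example: the function `fuzzy_lookup`, the per-position confidence dictionaries, and the product-of-confidences score are assumptions, not the method of any particular context analyzer.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def fuzzy_lookup(
    positions: List[Dict[str, float]],
    dictionary: Iterable[str],
) -> Tuple[Optional[str], float]:
    """Scan the whole dictionary and return the best-scoring entry.

    `positions[i]` maps each candidate character at position i to an
    assumed confidence. Because no position is certain, there is no single
    key to index on, so every entry has to be examined.
    """
    best_word: Optional[str] = None
    best_score = 0.0
    for word in dictionary:
        if len(word) != len(positions):
            continue
        score = 1.0
        for char, choices in zip(word, positions):
            score *= choices.get(char, 0.0)  # zero if the OCR never proposed it
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


if __name__ == "__main__":
    ocr_positions = [
        {"c": 0.6, "e": 0.3},
        {"a": 0.5, "o": 0.4},
        {"t": 0.9},
    ]
    print(fuzzy_lookup(ocr_positions, ["cat", "cot", "eat", "dog", "cart"]))
```

The cost of this scan is proportional to the size of the dictionary, whereas an exact lookup on a certain string could use a hash or index.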