The ability to accurately recognize characters by scanning hard copy images is extremely important for many forms of automated data processing and has wide applications ranging from automatic text recognition for word processing to banking applications wherein numerical data is scanned, processed and stored. Accuracy is essential in most applications.
A great deal of effort has been devoted to correcting the errors which invariably result from commercially available OCR devices. In some commercially available OCR devices, a confidence level indication is provided with each recognized character in order to permit flagging of characters which are recognized with a low degree of confidence. In such cases, an operator may subsequently check the flagged character and correct it, if necessary, by reference to the original scanned hard copy image.
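By way of illustration only, the confidence flagging described above might be sketched as follows. The function name `flag_low_confidence`, the 0.8 threshold, and the representation of OCR output as (character, confidence) pairs are assumptions made for this sketch and are not taken from any particular OCR device.

```python
from typing import List, Tuple

def flag_low_confidence(recognized: List[Tuple[str, float]],
                        threshold: float = 0.8) -> List[int]:
    """Return the positions of characters whose confidence is below the threshold.

    `recognized` is a hypothetical OCR result: one (character, confidence)
    pair per scanned character, with confidence in the range 0.0 to 1.0.
    """
    return [i for i, (_, conf) in enumerate(recognized) if conf < threshold]

# Hypothetical reading of a three-digit field; the middle digit is uncertain.
result = [("1", 0.99), ("7", 0.42), ("3", 0.95)]
flagged = flag_low_confidence(result)  # positions an operator would review
```

Only the flagged positions are referred to the operator; every other character is accepted as recognized, which is why an incorrect but high-confidence character passes unnoticed.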
An additional method for lowering the error rate of OCR devices is to employ multiple OCR units, each with its associated confidence factor, and to use certain algorithms for combining these confidence factors to obtain the most probable character. See, for example, U.S. Pat. No. 5,257,323; U.S. Pat. No. 5,418,864; and U.S. Pat. No. 5,455,872, incorporated herein by reference.
While the above approaches to improving error rates in OCR devices have some advantages, current OCR recognition techniques are still inadequate for a number of reasons. First, the confidence factors provided by the manufacturer of the OCR equipment may, in fact, be incorrect, and there are numerous instances in which characters are identified incorrectly without an associated low confidence factor, as, for example, in substitution errors. The character "c", for example, may be incorrectly recognized as the character "e" with a 100% degree of confidence, so that the OCR device fails to detect a clear substitution error. Second, in the multiple OCR environment, the typical majority voting or weighting techniques utilized in the prior art may, and often do, yield incorrect results. Further, most systems have no mechanism for dealing with characters which are completely unrecognizable.
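The weakness of confidence-weighted voting can be seen in a minimal sketch. The function `weighted_vote` and the sample readings below are hypothetical and are not drawn from the cited patents; they merely assume that each OCR unit reports one (character, confidence) pair for the same scanned character.

```python
from collections import defaultdict
from typing import List, Tuple

def weighted_vote(readings: List[Tuple[str, float]]) -> str:
    """Pick the character with the largest summed confidence across OCR units."""
    totals = defaultdict(float)
    for char, conf in readings:
        totals[char] += conf
    return max(totals, key=totals.get)

# Hypothetical case: the scanned character is actually "c", but two of the
# three units report "e" with high confidence (a substitution error).
readings = [("e", 1.0), ("e", 0.9), ("c", 0.6)]
winner = weighted_vote(readings)  # the vote settles on the wrong character
```

Because the erroneous readings carry high confidence, neither a simple majority nor a confidence-weighted sum recovers the correct character, and nothing in the result indicates that an error has occurred.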
Of course, manual techniques have also been utilized for checking the accuracy of scanned data. Some approaches utilize both machine scanning/recognition (OCR) and manual keying, so that the data has effectively been input twice, once by machine and once by hand. Subsequent machine comparisons then reject characters on the basis of inconsistencies between the two sets of data. Such manual keying is tedious, time consuming and expensive.
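The machine comparison of the two inputs might, under simple assumptions, look like the following sketch. The function name `mismatched_positions` and the sample field values are hypothetical; the sketch assumes the OCR output and the keyed data are aligned position by position.

```python
from typing import List

def mismatched_positions(ocr_text: str, keyed_text: str) -> List[int]:
    """Return every position at which the OCR output and the keyed data differ.

    Slicing (text[i:i+1]) yields an empty string past the end of the shorter
    input, so a length difference is also reported as a mismatch.
    """
    length = max(len(ocr_text), len(keyed_text))
    return [i for i in range(length)
            if ocr_text[i:i + 1] != keyed_text[i:i + 1]]

# Hypothetical field: the OCR unit read the letter "O" where the operator keyed "0".
rejected = mismatched_positions("1O345", "10345")
```

The comparison itself is inexpensive; the cost of this approach lies in keying every character by hand so that a second data set exists to compare against.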