There is a need for large-scale digitization of machine-printed documents. Depending on the quality of the scanned data, the per-character OCR recognition rate ranges from 70% to 99%. This results in the need for an optimized key-in process that supports fast verification of the recognized data. As discussed in the following paragraphs, there are several known solutions to this problem.
In the Smart Key method, which uses the concept of carpets, symbols with the same OCR classification are grouped together and shown to the operator on a single screen. In particular, U.S. Pat. No. 5,455,875 to Chevion et al. describes a system and method for quality control and correction of computer-generated OCR data by a human operator. The system can be configured to display to a human operator a full screen of images of individual characters from scanned documents that were classified by OCR as being the same character. This type of display is referred to as a “carpet.” Errors in the OCR classification are manifested as character images that do not fit the displayed classification and stand out clearly against the correct images in the carpet. For example, if the OCR erroneously classifies an “L,” “S,” or “6” as an “O,” the operator will see an image of an incorrectly classified character in a screen full of O's, as shown in FIG. 1. This type of discrepancy is very easy for the human operator to spot and mark on the screen. The image of the field that was read erroneously by the OCR is then displayed so that the operator (or another operator) can type in the correct character.
This method is efficient only if a very high percentage of the characters are classified correctly by OCR (e.g., in the 97-98% range). After the operator rejects the characters that have been classified incorrectly by OCR, the rejected characters are routed to the manual data entry process. This method allows the user to key in only the characters incorrectly classified by OCR. That is, at a 99% rate of correct classification by OCR, the operator has to key in only 1% of the data. This method utilizes a human's ability to recognize defects in the context of a large body of similar images. With the Smart Key method, operators have to validate a significant number of characters. Due to the “carpet of symbols,” such validation is relatively fast. However, a disadvantage of the Smart Key method is that, for large texts (e.g., books and machine-printed documents), it is a costly approach.
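The carpet grouping and reject routing described above can be illustrated with a minimal sketch. It is not taken from the cited patent: the `Glyph` record, the function names `build_carpets` and `route_rejects`, and the sample data are all hypothetical, and the operator's marking step is simulated by a set of rejected identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    glyph_id: int   # hypothetical identifier of a character image
    ocr_label: str  # class assigned to the image by OCR


def build_carpets(glyphs):
    """Group character images by OCR classification, one carpet per class."""
    carpets = defaultdict(list)
    for glyph in glyphs:
        carpets[glyph.ocr_label].append(glyph)
    return dict(carpets)


def route_rejects(carpets, rejected_ids):
    """Separate accepted glyphs from those the operator marked as wrong."""
    accepted, key_in_queue = [], []
    for members in carpets.values():
        for glyph in members:
            if glyph.glyph_id in rejected_ids:
                key_in_queue.append(glyph)  # goes to manual data entry
            else:
                accepted.append(glyph)
    return accepted, key_in_queue


# Assumed sample: glyph 2 was labeled "O" by OCR but rejected by the operator.
glyphs = [Glyph(0, "O"), Glyph(1, "O"), Glyph(2, "O"), Glyph(3, "L"), Glyph(4, "O")]
carpets = build_carpets(glyphs)
accepted, key_in_queue = route_rejects(carpets, rejected_ids={2})

# Fraction of characters that still has to be typed in manually.
key_in_fraction = len(key_in_queue) / len(glyphs)
```

In this sketch the manual key-in workload is simply the rejected fraction of all characters, which mirrors the observation that a 99% correct classification rate leaves about 1% of the data to be keyed in.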
In the side-by-side (SBS) verification method, both the document image and its recognition result are shown to the operator for verification. The operator fixes the misrecognized data. However, a disadvantage of this method is that it is slow, since the operator's eyes need to go back and forth between the image and the recognized content.
In the In-Place verification method, the recognized data is overlaid on the image and the user can toggle between the image and the recognized data. Though this method is significantly faster than SBS, a disadvantage of this method is that it is slower than Smart Key. Therefore, there is a need in the art for a fast method for correcting incorrectly classified OCR symbols that is not costly for large texts, such as machine-printed documents.