The field of document imaging is growing rapidly as modern society becomes more and more digital. Documents are stored in digital format in databases, providing instantaneous access, minimal physical storage space, and secure storage. Society now faces questions of how best to transfer its paper documents into the digital medium.
The most popular method of digitizing paper documents uses a system comprising a scanner and a computer. The paper documents are fed into the scanner, which creates a bitmap image of each paper document. This bitmap image is then stored in the computer. The computer can take a variety of forms, including a single personal computer (PC) or a network of computers using a central storage device. The bitmapped images must be retrievable after they are stored. One system for filing and retrieving documents provides a user interface which allows a user to type in a search term to retrieve documents containing the search term. Preferably, the system allows the user to retrieve the desired document by typing in any word that the user remembers is contained within it. However, in order to retrieve documents on this basis, the document must be character recognized. That is, the computer must recognize characters within the bitmapped image created by the scanner.
Another common use of digitization is to convert long paper documents so that they can be text searched by the computer. In this use, a user types in the keyword the user is looking for within the document, and the system must match the search term with words found within the document. For these systems, the document must be character recognized as well.
The most common method of recognizing characters is an optical character recognition (OCR) technique. An optical character recognition technique extracts character information from the bitmapped image. There are many different types of optical character recognition techniques. Each has its own strengths and weaknesses. For example, OCR 1 may recognize handwriting particularly accurately. OCR 2 may recognize the Courier font well. If OCR 1 is used to recognize a document in Courier font, it may still recognize the majority of the characters in the document. However, it may recognize many of the characters inaccurately. A user may not know of an OCR technique's strengths and weaknesses. A user may not know whether the types of documents the user typically generates are of the kind that are accurately recognized by the OCR technique present on the user's system. Current systems do not inform the user of the quality of the recognition of the OCR technique. The user finds out how accurate the recognition was only by using the document for the purpose for which it was stored into the computer system, at which time it may be too late to correct any errors.
An inaccurately recognized document can lead to several problems. First, in a system in which documents are stored and retrieved based on their contents, an inaccurately recognized document may become impossible to retrieve. For example, if a user believes the word "imaging" is in a specific document, the user will type in "imaging" as the search term. However, if the word "imaging" was recognized incorrectly as "emerging," the user's search will not retrieve the desired document. The user may not remember any other words in the document, and thus the document is unretrievable. In a system where documents are digitized to allow text searching of the document, the same problem occurs. Misrecognized words are not found by the use of the correct search terms.
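The retrieval failure described above can be illustrated with a minimal sketch. It assumes a simple word index over the recognized text; the function names `build_index` and `search` and the sample documents are hypothetical and are not taken from any particular filing system.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids whose recognized text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            # Strip surrounding punctuation so "rapidly." is indexed as "rapidly".
            index[word.strip('.,;:"!?')].add(doc_id)
    return index

def search(index, term):
    """Return the sorted ids of documents whose recognized text contains the term."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical OCR output: in doc1 the word "imaging" was misrecognized as "emerging".
recognized = {
    "doc1": "The field of document emerging is growing rapidly.",
    "doc2": "Scanners create a bitmap image of the paper document.",
}

index = build_index(recognized)
print(search(index, "imaging"))   # the correct term finds nothing
print(search(index, "emerging"))  # only the misrecognized word retrieves doc1
```

In this sketch the search operates only on the recognized text, so a user who remembers the word that was actually printed on the page has no way to reach doc1.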
Thus, there is a need to allow the user to determine whether a recognized word is of acceptable quality. By allowing the user to determine whether a word is of acceptable quality, the user can ensure that the document is retrieved by the use of that word as a search term. Also, a user can ensure that words within the document are accurately recognized for internal document searching purposes. Additionally, in a system with multiple optical character recognition techniques, there is a need to be able to compare the accuracy of the different versions of the document to create a version that is the most accurate.
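One way such a comparison might look is sketched below. This is an illustration under stated assumptions, not a method described in this background: it assumes each OCR technique reports a per-word confidence between 0 and 1, that the versions are already aligned word for word, and that a fixed threshold defines acceptable quality. The names `merge_versions`, `low_quality`, and the sample scores are hypothetical.

```python
from typing import List, Tuple

# Assumed format: (recognized text, confidence in the range 0 to 1).
Word = Tuple[str, float]

def merge_versions(versions: List[List[Word]]) -> List[Word]:
    """For each word position, keep the candidate with the highest confidence.

    Assumes every version has the same number of words, in the same order.
    """
    if len({len(v) for v in versions}) != 1:
        raise ValueError("versions must be aligned to the same word count")
    return [max(candidates, key=lambda w: w[1]) for candidates in zip(*versions)]

def low_quality(words: List[Word], threshold: float = 0.8) -> List[str]:
    """Return the words whose confidence falls below the assumed threshold."""
    return [text for text, conf in words if conf < threshold]

# Hypothetical output of two OCR techniques for the same three words.
ocr_1 = [("document", 0.95), ("emerging", 0.40), ("systems", 0.70)]
ocr_2 = [("docurnent", 0.55), ("imaging", 0.90), ("systens", 0.60)]

merged = merge_versions([ocr_1, ocr_2])
print(merged)               # best candidate per word position
print(low_quality(merged))  # words a user might want to review
```

Flagging the words that remain below the threshold after merging gives the user a chance to review them before the document is relied on for searching.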