The present invention is directed to a method and system for correcting misrecognized words in electronic documents that have been produced by an optical character recognition system that scans text appearing on a physical medium, and in particular, to a method and system that relies on a plurality of confusion sets to select a replacement word for each misrecognized word in the document.
Devices that are used in conjunction with optical character recognition ("OCR") techniques have been in use for some time. Examples of such devices are optical scanners and facsimile machines. What is common to both of these types of devices is that they each scan a physical document bearing printed or handwritten characters in order to produce an electronic image of the original document. The output image is then supplied to a computer or other processing device, which performs an OCR algorithm on the scanned image. The purpose of the OCR algorithm is to produce an electronic document comprising a collection of recognized words that are capable of being edited. The electronic document may be formatted in any one of a plurality of well known applications. For example, if the recognized words are to be displayed on a computer monitor, they may be displayed as a Microsoft WORD® document, a WORDPERFECT® document, or any other text-based document. Regardless of how the recognized words of the electronic document are formatted, the recognized words are intended to correspond exactly, in spelling and in arrangement, to the words printed on the original document.
Such exact correspondence, however, does not always occur; as a result, the electronic document may include misrecognized words that never appeared in the original document. For purposes of this discussion, the term "word" covers any set of characters, whether or not the set of characters corresponds to an actual word of a language. Moreover, the term "word" covers sets of characters that include not only letters of the alphabet, but also numbers, punctuation marks, and such typographic symbols as "$", "&", "#", etc. Thus, a misrecognized word may comprise a set of characters that does not comprise an actual word, or a misrecognized word may comprise an actual word that does not have the same spelling as that of the corresponding word in the scanned document. For example, the word "got" may be misrecognized as the non-existent word "qot", or the word "eat" may be misrecognized as the actual word "cat." Such misrecognized words, whether they comprise a real word or a mere aggregation of characters, may be quite close in spelling to the words of the original document that they were intended to match. Such misrecognition errors are largely due to the physical similarities between certain characters. For example, as discussed above, such errors may occur when the letter "g" is confused with the physically similar letter "q". Another common error that OCR algorithms make is confusing the letter "d" with the two-letter combination of "ol." The physical resemblance of certain characters is not the only cause of recognition errors, however. For example, the scanning device may include a faulty optical system or a defective charge-coupled device (CCD); the original document may be printed in a hard-to-scan font; or the original document may include scribbles and marks that obscure the actual text.
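The character confusions described above can be illustrated with a short Python sketch. It is offered only as an illustration under stated assumptions, not as the method of the invention: the table CONFUSIONS and the function candidate_originals are hypothetical names, and the table holds only the three confusions mentioned in the preceding paragraph ("g" read as "q", "e" read as "c", and "d" read as "ol"). Given a recognized word, the sketch lists the words obtained by undoing exactly one such confusion.

```python
# Hypothetical confusion table: maps a string the OCR engine may emit
# to the strings that may actually have been printed. The three pairs
# come from the examples in the text; a practical table would be larger.
CONFUSIONS = {
    "q": ["g"],
    "c": ["e"],
    "ol": ["d"],
}


def candidate_originals(recognized):
    """Return, in sorted order, the words obtained by undoing one confusion."""
    candidates = set()
    for seen, intended_list in CONFUSIONS.items():
        # Visit every position at which the emitted string occurs.
        start = recognized.find(seen)
        while start != -1:
            end = start + len(seen)
            for intended in intended_list:
                candidates.add(recognized[:start] + intended + recognized[end:])
            start = recognized.find(seen, start + 1)
    return sorted(candidates)


print(candidate_originals("qot"))   # ['got']
print(candidate_originals("cat"))   # ['eat']
print(candidate_originals("olog"))  # ['dog']
```

The sketch only enumerates candidates; it makes no attempt to decide which candidate, if any, matches the word in the original document, and it does not model errors caused by faulty optics, difficult fonts, or stray marks.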
Certain techniques have been implemented to detect and correct such misrecognition errors. For example, if the electronic document containing the recognized words is formatted in a word processing application, a user viewing the document may use the spell checking function provided by the word processing application to correct any words that have been misspelled. Some of these word processing applications also provide a grammar checker, which would identify words that, although spelled correctly, do not belong in the particular sentences in which they appear.
A drawback to these techniques is that the user must carry them out manually, because spell checkers and grammar checkers operate by displaying to the user a list of possible words that may include the correct word. By manipulating an appropriate sequence of keys or other data input means, the user must select from this list what he believes to be the correct word and enter the appropriate commands for replacing the misrecognized word with the selected word. Such a correction technique is time-consuming and, moreover, is prone to human error because, in carrying out such operations, the user may inadvertently select an inappropriate word to replace the misrecognized word. What is therefore needed is a correction technique that automatically replaces each misrecognized word with the word most likely to match the corresponding word in the original document. Such a correction technique would not require user intervention.