Optical character recognition (OCR) is the computer-based translation of an image of text into machine-editable text, generally in a standard encoding scheme. This process eliminates the need to manually type the document into the computer system. A number of problems can arise from poor image quality, imperfections introduced by the scanning process, and the like. For example, a conventional OCR engine may be coupled to a flatbed scanner that scans a page of text. Because the page is placed flush against a scanning face of the scanner, an image generated by the scanner typically exhibits even contrast and illumination, reduced skew and distortion, and high resolution. Thus, the OCR engine can easily translate the text in the image into machine-editable text. However, when the image is of lesser quality with regard to contrast, illumination, skew, etc., performance of the OCR engine may be degraded and the processing time may be increased due to processing of all pixels in the image. This may be the case, for instance, when the image is obtained from a book or generated by an imager-based scanner, because in these cases the text or picture is scanned from a distance, from varying orientations, and in varying illumination. Even if the scanning process performs well, the performance of the OCR engine may be degraded when the page of text being scanned is itself of relatively low quality.
One step in the OCR process is word recognition. The recognized words are intended to correspond exactly, in spelling and in arrangement, to the words printed on the original document. Such exact correspondence, however, can be difficult to achieve. As a result, the electronic document may include misrecognized words that never appeared in the original document. For purposes of this discussion, the term “word” covers any set of characters, whether or not the set of characters corresponds to an actual word of a language. Moreover, the term “word” covers sets of characters that include not only letters of the alphabet, but also numbers, punctuation marks, and such typographic symbols as “$”, “&”, “#”, etc. Thus, a misrecognized word may be a set of characters that does not form an actual word, or it may be an actual word whose spelling differs from that of the corresponding word in the scanned document. For example, the word “got” may be misrecognized as the non-existent word “qot”, or the word “eat” may be misrecognized as “cat.” Such misrecognized words, whether they are real words or mere aggregations of characters, may be quite close in spelling to the words of the original document that they were intended to match. The causes of such misrecognition errors include the OCR performance problems discussed above. In addition, misrecognition errors arise from the physical similarities between certain characters. For example, as in the “qot” example above, such errors may occur when the letter “g” is confused with the physically similar letter “q”. Another common error that OCR algorithms make is confusing the letter “d” with the two-letter combination “ol.”
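The character confusions described above can be illustrated with a minimal sketch. It is not a method taken from this discussion, only one plausible way to model such errors: a hypothetical confusion table (`CONFUSIONS`) maps a fragment seen in the OCR output to fragments it may have been misread from, and a small stand-in lexicon filters the resulting candidates. The function names, the table contents beyond the “g”/“q”, “e”/“c”, and “d”/“ol” pairs, and the lexicon are all assumptions made for illustration.

```python
# Hypothetical confusion table: fragment seen in OCR output -> fragments
# it may have been misread from. Only the g/q, e/c and d/ol pairs come
# from the examples in the text; the table layout itself is an assumption.
CONFUSIONS = {
    "q": ["g"],
    "g": ["q"],
    "c": ["e"],
    "e": ["c"],
    "ol": ["d"],
    "d": ["ol"],
}


def candidate_corrections(word, confusions=CONFUSIONS):
    """Return every string obtained by undoing exactly one confusion in `word`."""
    candidates = set()
    for seen, intended_options in confusions.items():
        # Visit each position at which the possibly misread fragment occurs.
        start = word.find(seen)
        while start != -1:
            for intended in intended_options:
                candidates.add(word[:start] + intended + word[start + len(seen):])
            start = word.find(seen, start + 1)
    return candidates


def plausible_originals(word, lexicon, confusions=CONFUSIONS):
    """Keep only the candidates that appear in the lexicon, in sorted order."""
    return sorted(c for c in candidate_corrections(word, confusions) if c in lexicon)


if __name__ == "__main__":
    # Tiny stand-in lexicon; a real system would use a full dictionary.
    lexicon = {"got", "eat", "cat", "dog"}
    print(plausible_originals("qot", lexicon))   # ['got']
    print(plausible_originals("cat", lexicon))   # ['eat']
    print(plausible_originals("olog", lexicon))  # ['dog']
```

Note that “cat” is itself a valid word, so a lexicon lookup alone cannot flag it as an error; the sketch only shows that “eat” is a nearby alternative, which is why real-word misrecognitions are harder to detect than non-words such as “qot”.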