1. Field of Invention
This invention generally relates to processing text passages that are subjected to character recognition processes.
2. Description of Related Art
Digitizing paper documents generally involves creating a bitmap image of the paper document using a scanner or similar device and then storing the bitmap image in a computer system. To retrieve and evaluate the content of bitmap images, the computer must recognize characters within the bitmap image created by the scanner. Character recognition techniques, for example, optical character recognition (OCR) techniques, are generally used to convert images of characters, usually provided to the computer system in some standard format, such as the tagged image file format (TIFF), into a machine-readable coded form of those characters, such as ASCII or Unicode.
In carrying out this conventional conversion process, some fraction of the characters may not be converted correctly. Some basic OCR errors include, for example: substitution, where one character in a text passage is mistaken for another; deletion, where the correct character is missing; and insertion, where a spurious character is introduced. Often, post-OCR correction of the document image must be performed in order to maintain acceptable document content accuracy.
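The three basic error types can be illustrated with a minimal Python sketch that compares OCR output against a known-correct reference string. It assumes such a reference is available, and it uses a standard Levenshtein (edit-distance) alignment chosen here for illustration only; the function name `classify_ocr_errors` and the sample strings are hypothetical and are not taken from any particular OCR system.

```python
def classify_ocr_errors(truth, ocr):
    """Count substitutions, deletions, and insertions in `ocr` relative to `truth`.

    Illustrative sketch: uses a plain Levenshtein alignment with unit costs.
    """
    n, m = len(truth), len(ocr)

    # dist[i][j] = edit distance between truth[:i] and ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (character missing in OCR)
                dist[i][j - 1] + 1,         # insertion (spurious character in OCR)
            )

    # Walk back through the table to label each edit on one optimal alignment.
    counts = {"substitution": 0, "deletion": 0, "insertion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost == 1:
                    counts["substitution"] += 1
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


if __name__ == "__main__":
    print(classify_ocr_errors("hello", "hel1o"))  # 'l' misread as '1'
    print(classify_ocr_errors("hello", "helo"))   # one 'l' missing
    print(classify_ocr_errors("cat", "cart"))     # spurious 'r'
```

When several alignments have the same minimal cost, the labels depend on the tie-breaking order in the backtrace (diagonal moves first here), so the per-type counts should be read as one plausible explanation of the errors rather than a unique one.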