1. Field of Disclosure
The disclosure generally relates to the field of optical character recognition (OCR), and in particular to improving the quality of OCR output.
2. Description of the Related Art
Digitizing printed documents (e.g., books, newspapers) typically involves image scanning, which generates images of the printed documents, and optical character recognition (OCR), which converts the images into editable text. Due to imperfections in the documents, artifacts introduced during the scanning process, and shortcomings of OCR applications (hereinafter called OCR engines), errors often exist in the output text. The word-level accuracy rates of current OCR engines range between 80% and 95% for Latin-script-based languages. These accuracy rates meet the demands of many applications, but they do not meet the demands of other applications, such as text-to-speech and republishing. Therefore, there is a need to efficiently identify and correct OCR errors.