1. Field of the Invention
The present invention relates generally to alignment of two text sequences, e.g., of an optical character recognition (OCR) output with an electronic version of the same document. More specifically, systems and methods are disclosed for character-by-character alignment of two character sequences (such as OCR output from a scanned document and an electronic version of the same document) using a Hidden Markov Model (HMM) in a hierarchical fashion.
2. Description of Related Art
Scanning books, magazines, and other printed material into digital form has become more common with the advent of improved imaging, storage, and distribution techniques. Many institutions, such as libraries, universities, bookstores, and private enterprises, have vast collections of documents. By converting these documents into electronic form, such institutions can reduce the cost of storage, facilitate remote access, enable simultaneous access by multiple users, facilitate search and retrieval of information, and/or protect information in rare or out-of-print works from loss or destruction.
Once the content of a document is scanned, the digitally recorded image can be manipulated or otherwise processed. For example, preprocessing algorithms may be applied to de-warp, reformat, supplement with additional information, and/or compress the digitally recorded image. After preprocessing, the image may be processed with optical character recognition (OCR) software and may be indexed to facilitate electronic search. Thus, scanning and recording of documents facilitates the creation of digital libraries that can be remotely and simultaneously accessed and searched by multiple users.
Various factors may affect the accuracy of the OCR output. For example, each preprocessing algorithm performed on the digitally recorded images, as well as the particular OCR software engine utilized, may affect the accuracy of the OCR output. In addition, the original (hardcopy) document itself and the imaging conditions may also affect the accuracy of the OCR output, depending on, for example, the contents of the document (e.g., language, font, font size, page size, margins, text format such as columns, embedded images, etc.) and the imaging conditions (e.g., operator, positioning of the document, camera zoom, camera focus, camera angle, and the like). Thus, it would be desirable to provide systems and methods to facilitate evaluation of OCR software engines and/or to facilitate evaluation of the effects on OCR output accuracy of preprocessing algorithms performed on the scanned document.