1. Field of Invention
This invention relates to methods and systems for document image decoding.
2. Description of Related Art
Document image decoding (DID) is a method for recognizing text contained within document images that is based on a communications systems view of the document composition, printing, degradation, and scanning processes. Among the advantages of document image decoding are high recognition accuracy in situations where extensive customization is allowable, the ability to recognize some higher-level structure along with the text, and the ability to extend and improve a system within a consistent probabilistic framework.
In the document image decoding framework, document images are regarded as having been produced by transitioning through a Markov source, which is a probabilistic finite-state machine. The source begins in a special start state and terminates in a special stop state. Each transition within the source inputs a portion of the image, e.g., a bitmap, on the page at a current cursor location, outputs a recognized character, then advances that location by a two-dimensional vector displacement in preparation for recognizing the next character. The transitions are selected based on the degree of correspondence between the input image portion and a character template associated with the output character. The set of character templates includes white space of various kinds.
Formally, each transition in the source is assigned a four-tuple consisting of a character template, a two-dimensional displacement by which to advance a cursor, the prior probability of following that transition, and a string label comprising, for example, the recognized character. It should be appreciated that, conventionally, the amount of information encapsulated in the prior probability is quite limited. For instance, the prior probability does not take into account what previous transitions might have occurred on the same path through the Markov source.
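By way of illustration only, the four-tuple described above might be represented as a small record type. The following is a minimal sketch and not the representation of any particular document image decoding system; the class name `Transition`, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical bitmap representation: a tuple of rows, 1 = black pixel.
Bitmap = Tuple[Tuple[int, ...], ...]


@dataclass(frozen=True)
class Transition:
    """One transition of the Markov source, stored as a four-tuple."""
    template: Bitmap               # character template (empty for white space)
    displacement: Tuple[int, int]  # (dx, dy) by which to advance the cursor
    prior: float                   # prior probability of following this transition
    label: str                     # string label, e.g. the recognized character


# Two illustrative transitions leaving the same state (values are made up).
letter_a = Transition(
    template=((0, 1, 0), (1, 1, 1), (1, 0, 1)),
    displacement=(4, 0),
    prior=0.6,
    label="a",
)
space = Transition(template=(), displacement=(3, 0), prior=0.4, label=" ")
```

Note that the `prior` field is a single number attached to the transition itself, which reflects the limitation noted above: it carries no information about earlier transitions on the same path.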
Every complete path through the source defines a document image and an associated transcription. In general, the image is the union of the bitmaps imaged on each transition, and the transcription is the concatenation of the associated string labels. It should be appreciated that more than one complete path through the source may give rise to the same image and/or the same transcription.
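The relationship between a complete path, its image, and its transcription can be sketched as follows. This is an illustrative example under stated assumptions: each transition is reduced to a hypothetical (template, displacement, label) triple, the cursor is assumed to start at the origin, the image is modeled as a set of black-pixel coordinates, and the function name `render_path` is invented for this sketch.

```python
def render_path(path):
    """Return (image, transcription) for a sequence of transitions.

    Each transition is a (template, (dx, dy), label) triple. The image is
    the union of the template pixels placed at the successive cursor
    locations; the transcription is the concatenation of the labels.
    """
    x, y = 0, 0            # assumed starting cursor location
    pixels = set()         # image as a set of black-pixel coordinates
    labels = []
    for template, (dx, dy), label in path:
        for r, row in enumerate(template):
            for c, bit in enumerate(row):
                if bit:
                    pixels.add((x + c, y + r))
        labels.append(label)
        x, y = x + dx, y + dy   # advance the cursor by the displacement
    return pixels, "".join(labels)


BAR = ((1,), (1,))  # a one-pixel-wide, two-pixel-tall stroke standing in for "l"

# Two different paths: one wide space versus an unlabeled thin space
# followed by a narrower space.
path_a = [(BAR, (2, 0), "l"), ((), (2, 0), " "), (BAR, (2, 0), "l")]
path_b = [(BAR, (2, 0), "l"), ((), (1, 0), ""), ((), (1, 0), " "), (BAR, (2, 0), "l")]

image_a, text_a = render_path(path_a)
image_b, text_b = render_path(path_b)
```

Here `path_a` and `path_b` are distinct sequences of transitions, yet both produce the same set of pixels and the same transcription `"l l"`, which illustrates how more than one complete path can yield the same image and transcription.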
After the document image has been formed, the document image is assumed to have been subjected to some form of random corruption process prior to the recognition process, which causes some uncertainty in the recognition process. In general, recognition proceeds by finding a complete path through the hypothesized Markov source that “best” explains the observed image. Specifically, recognizing the document image comprises seeking a complete path through the source that is most probable considering the entire document image as evidence. In general, the probability for each complete path is determined on the basis of the prior probabilities of the transitions, the likelihood of the associated imaged templates, and the random corruption process. Because multiple paths can correspond to the same transcription, choosing the most probable complete path is not the same as choosing the most probable transcription.
The probability of a transcription is properly determined by summing the probabilities of all of the complete paths that are consistent with that transcription. Nevertheless, experience has shown that choosing the transcription associated with the most probable complete path is usually a good approximation to choosing the transcription with the highest posterior probability. This is known as the Viterbi approximation to the maximum a posteriori probability (MAP) decision rule.
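The distinction between the two decision rules can be seen in a small numerical sketch. The path probabilities below are invented, and were deliberately chosen so that the two rules disagree; as noted above, in practice they usually agree. The function names `viterbi_choice` and `map_choice` are hypothetical.

```python
from collections import defaultdict

# Hypothetical complete paths as (transcription, posterior probability) pairs.
# Two distinct paths share the transcription "cot".
paths = [("cat", 0.30), ("cot", 0.25), ("cot", 0.20), ("cut", 0.25)]


def viterbi_choice(paths):
    """Transcription of the single most probable complete path."""
    return max(paths, key=lambda p: p[1])[0]


def map_choice(paths):
    """Transcription with the largest summed probability over its paths."""
    totals = defaultdict(float)
    for transcription, prob in paths:
        totals[transcription] += prob
    return max(totals, key=totals.get)


best_path_text = viterbi_choice(paths)   # "cat": best single path, 0.30
best_map_text = map_choice(paths)        # "cot": 0.25 + 0.20 exceeds 0.30
```

In this contrived case the most probable single path yields "cat", while summing over paths favors "cot". The Viterbi approximation accepts the possibility of such disagreement in exchange for avoiding the sum over all complete paths consistent with each transcription.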