Automatic speech recognition systems based on hidden Markov models (HMMs) and other stochastic grammar frameworks are known. Examples are described in U.S. Pat. No. 5,199,077 and reference [5]. Bracketed numerals identify referenced publications listed in Appendix A. HMMs have also been applied to document image recognition problems. The most comprehensive of these attempts, called document image decoding (DID), is based on an explicit communication theory view of document recognition. See U.S. Pat. No. 5,321,773 and references [6,7,8,9,10]. In the DID model, as shown in FIG. 1 of U.S. Pat. No. 5,321,773, a stochastic message source selects a finite string M from a set of candidate strings according to a prior probability distribution. The imager converts the message into an ideal binary image Q. The channel maps the ideal image into an observed image Z by introducing distortions due to printing and scanning, such as skew, blur and additive noise. Finally, the decoder receives image Z and produces an estimate of the original message M according to a maximum a posteriori (MAP) decision criterion.
The structure of the message source and imager is captured formally by modeling image generation using a Markov source, as depicted in FIG. 1. Decoding an image with respect to a Markov source involves finding the best (MAP) path through a three-dimensional (3D) decoding trellis indexed by the nodes of the model and the coordinates of the image plane. A straightforward approach to MAP decoding is to use a two-dimensional form of the segmental Viterbi algorithm to compute a set of recursively defined likelihood functions at each point of the image plane. The time complexity of Viterbi image decoding is O(‖β‖ × H × W), where ‖β‖ is the number of branches in the source model and H and W are the image height and width, respectively, in pixels. Although the computation grows only linearly with image size, in absolute terms it can be prohibitive. For example, a simple three-state model for a column of text in a single known font takes about 45 minutes to run on an 8.5 in × 11 in image scanned at 300 dpi resolution. Thus, methods for decreasing the required computation are essential if DID is to become a practical approach to document image analysis.
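To put the O(‖β‖ × H × W) bound in concrete terms, the following minimal sketch counts the branch evaluations for the page size quoted above. The function name and the branch count of 10 are illustrative assumptions; the number of branches in the three-state model is not stated here.

```python
def trellis_branch_evaluations(num_branches: int, width_in: float,
                               height_in: float, dpi: int) -> int:
    """Count branch evaluations for a full Viterbi pass: ||beta|| * H * W."""
    height_px = round(height_in * dpi)  # H, image height in pixels
    width_px = round(width_in * dpi)    # W, image width in pixels
    return num_branches * height_px * width_px


# 8.5 in x 11 in page at 300 dpi: 2550 x 3300 pixels.
# num_branches=10 is an assumed value, chosen only for illustration.
evaluations = trellis_branch_evaluations(num_branches=10, width_in=8.5,
                                         height_in=11, dpi=300)
print(f"{evaluations:,} branch evaluations")
```

The count scales linearly in each factor, so halving the number of image rows that must be decoded halves the total.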
In the case of text column transcription, the complexity of image decoding arises because, in effect, each row of the image is evaluated as a possible text line. Thus, for example, line decoding occurs 3300 times for an 11 in image (300 dpi × 11 in). Conventional approaches to document recognition attempt to avoid this problem by performing text recognition only on the actual text lines of a column, which typically number fewer than 50. They do this by using simple segmentation algorithms, such as horizontal pixel projection, to detect and locate text lines prior to recognition[2].
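As an illustration of the kind of segmentation referred to above, the following is a minimal sketch of horizontal pixel projection, not the method of any cited reference. It assumes a binary image given as rows of 0 (white) and 1 (black) pixels; the function names and the `min_black` threshold are hypothetical.

```python
from typing import List, Sequence, Tuple


def horizontal_projection(image: Sequence[Sequence[int]]) -> List[int]:
    """Return the number of black (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def find_text_lines(image: Sequence[Sequence[int]],
                    min_black: int = 1) -> List[Tuple[int, int]]:
    """Locate candidate text lines as runs of rows with enough black pixels.

    Returns (top, bottom) row indices, with bottom exclusive.
    min_black is an assumed threshold, not a value from the source.
    """
    profile = horizontal_projection(image)
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the run currently being tracked, if any
    for y, count in enumerate(profile):
        if count >= min_black:
            if start is None:
                start = y              # a new run of inked rows begins
        elif start is not None:
            lines.append((start, y))   # the run ended on the previous row
            start = None
    if start is not None:              # a run reaches the bottom of the image
        lines.append((start, len(profile)))
    return lines


# Tiny example: two inked bands separated by one blank row.
page = [
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
]
print(find_text_lines(page))
```

With such a step, recognition would run only on the detected bands rather than on every image row. A fixed threshold of this kind is also what makes the approach fragile on noisy images, where stray black pixels can merge or split bands.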
Conventional segmentation algorithms could be used prior to image decoding as well, in a manner analogous to their use with 1D HMM-based approaches to text recognition[4,5]. However, pre-decoding segmentation can be unreliable, particularly for highly degraded (noisy) images. Moreover, since conventional segmentation algorithms are not based on a rigorous probabilistic formulation, their use would negate many of the theoretical advantages of DID.