The present invention relates generally to image decoding and image recognition techniques, and in particular to such techniques using stochastic finite state networks such as Markov sources. In particular, the present invention provides a technique for efficiently integrating a language model into a stochastic finite state network representation of a text line image, for use in text line image decoding.
Stochastic grammars have been applied to document image recognition problems and to text recognition in particular. See, for example, the work of Bose and Kuo, identified in reference [1], which uses hidden Markov models (HMMs) for word or text line recognition. Bracketed numerals identify referenced publications listed in the Appendix of Referenced Documents. See also U.S. Pat. No. 5,020,112, issued to P. A. Chou and entitled "Image Recognition Using Two-Dimensional Stochastic Grammars."
U.S. Pat. No. 5,321,773 (hereafter, the '773 DID patent), issued to Kopec and Chou, discloses a document recognition technique known as Document Image Decoding (hereafter, DID) that is based on classical communication theory. This work is further discussed in references [2], [3] and [4]. The DID model 800, illustrated in FIG. 14, includes a stochastic message source 810, an imager 811, a channel 812 and a decoder 813. The stochastic message source 810 selects a finite string M from a set of candidate strings according to a prior probability distribution. The imager 811 converts the message into an ideal binary image Q. The channel 812 maps the ideal image into an observed image Z by introducing distortions due to printing and scanning, such as skew, blur and additive noise. Finally, the decoder 813 receives observed image Z and produces an estimate M̂ of the original message according to a maximum a posteriori (MAP) decision criterion. Note that in the context of DID, the estimate M̂ of the original message is often referred to as the transcription of observed image Z.
The structure of the message source and imager is captured formally by combining their functions into a single composite image source 815, as shown by the dotted lines in FIG. 14. Image source 815 models image generation using a Markov source. A Markov source is a stochastic finite-state automaton that describes, as a regular grammar, the spatial layout and image components that occur in a particular class of document images, representing them as a finite state network. A general Markov source model 820 is depicted in FIG. 15 and comprises a finite state network made up of a set of nodes and a set of directed transitions into each node. There are two distinguished nodes 822 and 824 that indicate initial and final states, respectively. A directed transition t between any two predecessor (Lt) and successor (Rt) states in the network of FIG. 15 has associated with it a 4-tuple of attributes 826 comprising a character template, Q, a label or message string, m, a transition probability, α, and a two-dimensional integer vector displacement, Δ. The displacement indicates a horizontal distance that is the set width of the template. The set width of a template specifies the horizontal (x-direction) distance on the text line that the template associated with this transition occupies in the image.
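The 4-tuple of transition attributes described above can be sketched as a simple record. The field names below are illustrative only, not identifiers used in the patents:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Transition:
    # Attributes of a directed transition t between states Lt and Rt.
    template: str                  # character template Q
    message: str                   # label or message string m
    probability: float             # transition probability (alpha in the text)
    displacement: Tuple[int, int]  # 2D integer vector (dx, dy); dx is the set width
```

A transition for the glyph "a" with a set width of 12 pixels would then carry `displacement=(12, 0)`, since horizontal models advance only in x.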
Decoding a document image using the DID system involves the search for the path through the finite state network representing the observed image document that is the most likely path that would have produced the observed image. The ""773 DID patent discloses that decoding involves finding the best (MAP) path through a three-dimensional (3D) decoding trellis data structure indexed by the nodes of the model and the coordinates of the image plane, starting with the initial state and proceeding to the final state. Decoding is accomplished by a dynamic programming operation, typically implemented as a Viterbi algorithm. The dynamic programming operation involves computing the probability that the template of a transition corresponds to a region of the image to be decoded in the vicinity of the image point. This template-image probability is represented by a template-image matching score that indicates a measurement of the match between a particular template and the image region at the image point. Branches in the decoding trellis are labeled with the matching scores. A general description of the implementation of the Viterbi algorithm in the context of Document Image Decoding is omitted here and is provided in the discussion of an implementation of the present invention in the Detailed Description below.
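As a rough illustration of this dynamic programming recurrence (a minimal sketch, not the patents' implementation), a one-dimensional Viterbi search over a text line of width W can be written as follows, where `match_score` is a hypothetical template-image scoring function returning log scores:

```python
def viterbi_line(width, transitions, match_score):
    """Decode a text line of `width` pixels.
    transitions: list of (template, set_width, log_prob) tuples.
    match_score(template, x): log template-image match score at position x.
    Returns (best cumulative log score, decoded template sequence)."""
    NEG = float("-inf")
    score = [NEG] * (width + 1)   # best cumulative score ending at position x
    back = [None] * (width + 1)   # backpointer: (previous position, template)
    score[0] = 0.0
    for x in range(width):
        if score[x] == NEG:
            continue                          # position unreachable
        for template, set_width, log_p in transitions:
            nx = x + set_width                # template occupies its set width
            if nx > width:
                continue
            s = score[x] + log_p + match_score(template, x)
            if s > score[nx]:                 # keep the maximum at each position
                score[nx] = s
                back[nx] = (x, template)
    # Trace backpointers from the end of the line to recover the best path.
    out, x = [], width
    while x > 0 and back[x] is not None:
        px, t = back[x]
        out.append(t)
        x = px
    return score[width], list(reversed(out))
```

With uniform match scores, the decoder simply prefers the transition sequence with the highest total transition probability that exactly tiles the line.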
U.S. Pat. No. 5,526,444 (hereafter, the '444 ICP patent), issued to Kopec, Kam and Chou and entitled "Document Image Decoding Using Modified Branch-And-Bound Methods," discloses several techniques for improving the computational efficiency of decoding using the DID system. The '444 ICP patent disclosed the use of a class of Markov source models called separable Markov models. When a 2D page layout is defined as a separable Markov source model, it may be factored into a product of 1D models that represent horizontal and vertical structure, respectively. The '444 ICP patent further discloses that decoding with a separable model involves finding the best path through the 2D decoding trellis defined by the nodes of the top-level model, some of which are position-constrained, and the vertical dimension of the image. The computational effect of a position constraint is to restrict the decoding lattice for a node to a subset of the image plane, providing significant computational savings when used with standard Viterbi decoding.
The '444 ICP patent further discloses the use of a recursive Markov source. A recursive source is a collection of named sub-sources each of which is similar to a constrained Markov source except that it may include an additional type of transition. A recursive transition is labeled with a transition probability and the name of one of the Markov sub-sources. The interpretation of a recursive transition is that it represents a copy of the named sub-source. Thus, some of the transitions of the top-level vertical model are labeled with horizontal models. One aspect of each of the horizontal models is that every complete path through the model starts at a fixed horizontal position and ends at a fixed horizontal position, effectively reducing decoding to a one-dimensional search for the best path. A second aspect is that the vertical displacement of every complete path in the model is a constant that is independent of the vertical starting position of the path. Thus, the horizontal models describe areas of the image plane that are text lines, and the top-level vertical model with its nodes that are constrained by position defines which rows of pixels in the 2D image are to be considered as potential text lines. The match score for each branch is computed by running the horizontal model (i.e., performing the Viterbi procedure) along the appropriate row of the image. The overall decoding time for a separable model is dominated by the time required to run the horizontal models, that is, to decode individual text lines.
In conjunction with the use of separable models, the '444 ICP patent also discloses a heuristic algorithm called the Iterated Complete Path (hereafter, ICP) algorithm that fits into the framework of the Viterbi decoding procedure utilized by DID but improves on that procedure by focusing on a way to reduce the time required to decode each of the horizontal models, or lines of text. The ICP algorithm disclosed in the '444 ICP patent is an informed best-first search algorithm that is similar to heuristic search and optimization techniques such as branch-and-bound and A* algorithms. During decoding, ICP causes the running of a horizontal model (i.e., computes the actual template-image matching scores) for only a reduced set of transitions into each node, the reduced number of transitions being substantially smaller than the number of all possible transitions into the node. ICP reduces the number of times the horizontal models are run by replacing full Viterbi decoding of most of the horizontal rows of pixels with the computation of a simple upper bound on the score for that row. This upper bound score is developed from an upper bound function. ICP includes two types of parameterized upper bound functions. Additional information about the ICP best-first search algorithm may also be found in reference [5].
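The essence of this best-first strategy, replacing most full row decodings with cheap upper bounds, can be sketched schematically. This is a generic lazy best-first loop under the assumption that every upper bound dominates its row's actual score; it is not the pseudo code of the '444 ICP patent:

```python
import heapq

def icp_best_row(upper_bounds, actual_score):
    """Find the row with the highest actual (full Viterbi) score while fully
    decoding as few rows as possible.
    upper_bounds: list of cheap upper-bound scores, one per row; must satisfy
                  actual_score(row) <= upper_bounds[row].
    actual_score(row): expensive full decoding of that row."""
    # Max-heap via negated priorities; rows start with their upper bounds.
    heap = [(-ub, row) for row, ub in enumerate(upper_bounds)]
    heapq.heapify(heap)
    actuals = {}
    while heap:
        _, row = heapq.heappop(heap)
        if row in actuals:
            # This row pops again with its actual score on top: that actual
            # dominates every remaining bound, so it is the true best row.
            return row, actuals[row]
        actuals[row] = actual_score(row)        # run the horizontal model
        heapq.heappush(heap, (-actuals[row], row))
    raise ValueError("no rows to decode")
```

Only rows whose upper bounds rise above the best actual score found so far are ever fully decoded; all others are abandoned on the strength of their bounds alone.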
In the '444 ICP patent, the use of a finite state model defined as a constrained and recursive Markov source combined with the ICP algorithm allows particular transitions to be abandoned as not likely to contain the best path, thereby reducing computation time. Full decoding using the longer computation process of computing the template-image matching scores for a full horizontal line is carried out only over a much smaller number of possible transitions, in regions of the image that are expected to include text lines. The reader is directed to the '444 ICP patent for more details about the heuristic scores disclosed therein. In particular, see the discussion in the '444 ICP patent beginning at col. 16 and accompanying FIG. 7 therein, and refer to FIG. 23 for the pseudo code of the procedure that computes the weighted horizontal pixel projection heuristic.
U.S. Pat. No. 5,883,986 (hereafter, the '986 Error Correction patent), issued to Kopec, Chou and Niles and entitled "Method and System for Automatic Transcription Correction," extended the utility of the DID system to correcting errors in transcriptions. The '986 Error Correction patent discloses a method and system for automatically correcting an errorful transcription produced as the output of a text recognition operation. The method and system make use of the stochastic finite state network model of document images. Error correction is accomplished by first modifying the image model using the errorful transcription, and then performing a second recognition operation on the document image using the modified image model. The second recognition operation provides a second transcription having fewer errors than the original, input transcription. The method and system disclosed in the '986 Error Correction patent may be used as an automatic post-recognition correction operation following an initial OCR operation, eliminating the need for manual error correction.
The '986 Error Correction patent disclosure describes two methods by which to modify the image model. The second of these modifications is particularly relevant to the subject invention, and involves the use of a language model. Language modeling used in OCR and in post-OCR processing operations is well known. See, for example, references [6], [7] and [8]. Language models provide a priori, externally supplied and explicit information about the expected sequence of character images in the image being decoded. The premise for the use of language models in OCR systems is that transcription errors can be avoided by choosing as the correct transcription sequences of characters that actually occur in the language used in the image being decoded instead of other sequences of characters that do not occur. A language model is, in effect, a soft measure of the validity of a certain transcription. A spelling corrector that ensures that each word in the transcription is a correctly spelled word from some dictionary is a simple form of language modeling. Language models may be used during the recognition operation, or as part of a post-processing correction technique. Contextual post-processing error correction techniques make use of language structure extracted from dictionary words and represented as N-grams, or N-character subsets of words. More advanced forms of language modeling include examining the parts of speech, sentence syntax, etc., to ensure that the transcription correctly follows the grammar of the language the document is written in.
In the '986 Error Correction patent, the original errorful transcription is used to construct an N-gram language model that is specific to the language that actually occurs in the document image being decoded. The language model is then incorporated into the stochastic finite network representation of the image. Disclosure related to the language model is found at col. 53-57 in the discussion accompanying FIGS. 23-36. In particular, the construction of a binary N-gram (bigram) model and the incorporation of the bigram model into the Markov image source model are described. The effect of incorporating the language model is to constrain or influence the decoding operation to choose a sequence of characters that is consistent with character sequences allowed by the language model, even when template-image matching scores might produce a different decoding result. Some percentage of the errors in the original errorful transcription should be eliminated using the stochastic finite state network representation of the image as modified by the language model.
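A character bigram model of the kind described, estimated from a transcription, can be sketched as follows. This is a minimal illustration; the '986 patent's actual construction differs in detail, and the floor value used for unseen pairs is an assumption:

```python
from collections import defaultdict

def build_bigram(text, floor=1e-6):
    """Estimate P(c | prev) for character pairs from a (possibly errorful)
    transcription, flooring unseen pairs at `floor`.
    Returns a function prob(prev, c)."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, c in zip(text, text[1:]):     # slide over adjacent pairs
        counts[prev][c] += 1
    totals = {p: sum(cs.values()) for p, cs in counts.items()}
    def prob(prev, c):
        if prev not in totals:
            return floor                    # context never observed
        return max(counts[prev].get(c, 0) / totals[prev], floor)
    return prob
```

During decoding, a branch labeled with character c following character prev would have log prob(prev, c) added to its template-image matching score, steering the search toward character sequences the model has seen.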
The powerful flexibility offered by the DID system is limited in actual use by the time complexity involved in the decoding process. The size and complexity of the image, as defined by the model (i.e., the number of transitions) and the number of templates to be matched, are major factors in computation time. Indeed, the time complexity of decoding using a two-dimensional image source model and a dynamic programming operation is O(‖β‖ × H × W), where ‖β‖ is the number of transitions in the source model and H and W are the image height and width, respectively, in pixels. Incorporating a language model into the decoding operation significantly adds to decoding complexity. More generally, the direct incorporation of an mth order Markov process language model (where m > 0) causes an exponential explosion in the number of states in the image model. An N-gram language model corresponds to an mth order Markov process, where m = N − 1. For example, a bigram model is a first-order Markov process. Incorporating an mth order Markov process having a total of M character templates results in an increase in computation for the dynamic programming decoding operation of a factor of M^m. For example, when the image model contains 100 templates, incorporation of a bigram model into the image model results in an increase in decoding computation of approximately a factor of 100.
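The M^m blow-up factor is simple arithmetic, and stating it as code makes the growth concrete:

```python
def dp_blowup(num_templates, n):
    """Multiplicative increase in dynamic programming computation when an
    N-gram model is incorporated directly into the image model.
    An N-gram model is an m-th order Markov process with m = N - 1, and the
    increase is a factor of M**m for M character templates."""
    m = n - 1
    return num_templates ** m
```

With 100 templates, a bigram model (N = 2) multiplies the computation by 100, and a trigram model (N = 3) by 10,000, which is why direct incorporation quickly becomes impractical.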
The improvements provided by the technical advances disclosed in the '444 ICP patent, while significant, did not address the efficient decoding of an individual text line using a language model within the framework of the DID system. While the '986 Error Correction patent disclosure provides an example for using language models in a post-processing error correction operation, it does not address either the increase in computational complexity caused by the incorporation of a language model into the Markov image source or how to incorporate a language model in the initial image decoding operation.
Use of language models in the DID system provides the significant benefit of improved accuracy in the output transcription produced by decoding. Users of any text recognition system expect the system to produce virtually error-free results in a commercially practical timeframe, with little or no manual post-recognition error correction. It is desirable, therefore, to provide a method for using language models in the decoding operation in a computationally efficient manner.
The technique of the present invention provides for the efficient integration of a stochastic language model such as an N-gram model in the decoding data structure that represents a text line image in a line image decoding operation. The present invention is premised on the observation that the problem with using a stochastic language model is not the efficiency of computing the full conditional probabilities or weights for a given path through the data structure. Rather, the problem is how to effectively and accurately manage the expansion of the nodes in the decoding data structure to accommodate the fully conditional probabilities available for possible best paths in the graph, and the resulting increase in decoding computation required to produce maximum cumulative path scores at every image position.
The dynamic programming operation used for decoding is not capable of taking the prior path histories of characters into account during decoding unless each history is explicitly represented by a set of nodes and branches between nodes where the language model probabilities can be represented along with template-image matching scores. This is because the dynamic programming operation assumes that each branch is evaluated on its own and is not conditioned on the path that preceded that branch. The template-image match scores attached to branches do not depend on previous transitions in the path. When the decoder considers an image position and decides what character is most likely to be there based on the match scores, it does not need to look back at previous transitions in the path to this point, and it does not care what characters occurred up to this point. Each image point evaluation is conditionally independent of previous evaluations. The language model, on the other hand, explicitly provides a component of the branch score that is conditioned on the characters occurring on previous branches. The additional nodes and edges needed to accommodate the paths that represent these previous states are what cause the exponential explosion in states in the graph that represents the image model.
The explosion in states significantly impacts the storage and computational resources needed to use a stochastic language model in conjunction with the image model during decoding. Expansion of the decoding data structure to allow for every possible history requires a prohibitive amount of storage. With respect to computational demands, recall that decoding is accomplished using a dynamic programming operation, such as a Viterbi procedure, to compute a set of recursively-defined likelihood functions at each point of the image plane. The increase in computation of the dynamic programming operation is M^m for an mth order Markov process with M templates. For example, when an image model includes 100 characters, a bigram stochastic language model (N = 2, a first-order process) increases the dynamic programming computation by a factor of 100. Computational requirements, then, typically dictate that an N-gram model use a small N.
The conceptual framework of the present invention begins with the decoding operation using upper bound scores associated with branches in an unexpanded decoding data structure that represents the image network. An upper bound score indicates an upper bound on the language model probabilities or weights that would otherwise be associated with a branch according to its complete character history. The use of upper bounds on the language model probabilities prevents the iterative search that forms the decoding operation from ruling out any path that could possibly turn out to be optimal.
A best path search operation then finds a complete estimated best path through the graph. Once the path is identified, a network expansion operation is performed for nodes on the best path in order to expand the network with new nodes and branches reflecting paths with explicit character histories based on the estimated best path of the just-completed iteration. Newly-added branches have edge scores with language model scores that are based on available character histories. The decoding and expansion operations are then iterated until a stopping condition is met. The present invention expands the states of the image model only on an as-needed basis to represent the fully contextual language model probabilities or weights for a relatively small number of nodes in the image network that fall on each estimated best path, allowing for the manageable and efficient expansion of the states in the image model to accommodate the language model. The expanded decoding data structure is then available to a subsequent iteration of the best path search operation.
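The iterate-search-then-expand control flow described above can be sketched as a simple loop. All four function parameters here are hypothetical stand-ins for the invention's operations, not names drawn from the disclosure:

```python
def decode_with_language_model(graph, find_best_path, expand, scores_are_exact):
    """Iterate a best path search over the (partially expanded) image network,
    expanding context nodes and branches along each estimated best path, until
    the stopping condition is met (schematic control loop only).
    find_best_path(graph): dynamic programming search using current scores.
    scores_are_exact(graph, path): stopping condition, e.g. every branch on
        the path carries a fully contextual language model score.
    expand(graph, path): add context nodes/branches for nodes on the path."""
    while True:
        path = find_best_path(graph)
        if scores_are_exact(graph, path):
            return path
        expand(graph, path)
```

Because upper bound scores never understate any path, each iteration either confirms the current best path with exact scores or refines the network just enough to re-rank it, so expansion stays proportional to the paths actually explored.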
A key constraint necessary to ensure optimal decoding with respect to the language model is that each node in the graph have the proper language model score, either a weight or an upper bound score, attached to the best incoming branch to that node. Failure to observe this constraint may cause the dynamic programming operation to reject a path through the graph that is an actual best path because of an incorrect score attached to a branch.
The language model techniques of the present invention may be used in any text line decoder that uses as input a stochastic finite state network to model the document image layout of the document image being decoded, and where branch scores in the image network change over time, requiring iteration of the dynamic programming operation. Thus, these techniques may be used in simple text line decoders, as well as in the two-dimensional DID method of image recognition disclosed in the patents cited above.
Therefore, in accordance with one aspect of the present invention, a method is provided for operating a processor-controlled machine to decode a text line image using a stochastic language model. The machine includes a processor and a memory device for storing data including instruction data the processor executes to operate the machine. The processor is connected to the memory device for accessing and executing the instruction data stored therein. The method comprises receiving an input text line image including a plurality of image glyphs each indicating a character symbol, and representing the input text line image as an image network data structure indicating a plurality of nodes and branches between nodes. Each node in the image network data structure indicates a location of an image glyph, and each branch leading into a node is associated with a character symbol identifying the image glyph. The plurality of nodes and branches indicate a plurality of possible paths through the image network, and each path indicates a possible transcription of the input text line image. The method further comprises assigning a language model score computed from a language model to each branch in the image network according to the character symbol associated with the branch. The language model score indicates a validity measurement for a character symbol sequence ending with the character symbol associated with the branch.
The method further comprises performing a repeated sequence of a best path search operation followed by a network expansion operation until a stopping condition is met. The best path search operation produces a complete path of branches and nodes through the image network using the language model scores assigned to the branches. The network expansion operation includes adding at least one context node and context branch to the image network. The context node has a character history associated with it. The context branch indicates an updated language model score for the character history ending with the character symbol associated with the context branch. The image network with the added context node and branch is then available to a subsequent execution of the best path search operation. The method further includes, when the stopping condition has been met, producing the transcription of the character symbols represented by the image glyphs of the input text line image using the character symbols associated with the branches of the complete path.
In another aspect of the present invention, the language model score and the updated language model score indicate probabilities of occurrence of a character symbol sequence in a language modeled by the language model. In still another aspect of the present invention the language model score is an upper bound score on the validity measurement for the character symbol sequence ending with the character symbol associated with the branch, and when the language model produces the updated language model score for the character history ending with the character symbol associated with the context branch, the updated language model score replaces the upper bound score on the branches in the image network.
In still another aspect of the present invention, each node in the image network data structure has a node order determined by a history string length of the character history associated with it, and the network expansion operation adds a context node for every node in the complete path having a node order less than a maximum order. The context node has a node order one higher than the node order of the node from which the context node is created, and the context node has a text line image location identical to the text line image position of the node from which the context node is created. In this aspect of the invention, producing the complete path of nodes and branches includes computing maximum cumulative path scores at image positions in the image network using the language model scores for the character symbols assigned by the language model to the branches, with the best path search operation maximizing the cumulative path score at each image position. Computing maximum cumulative path scores by the best path search operation includes, at each image position in the text line image and for each possible character symbol and for each node and context node at each image position, first computing a next image position for the character symbol in the text line image, and then computing a cumulative path score for a path including an incoming branch to a highest order node at the next image position. Then the best path operation compares the cumulative path score to a prior maximum cumulative path score for the highest order node at the next image position to determine an updated maximum cumulative path score for the next image position, and stores the updated maximum cumulative path score with the highest order node at the next image position.
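A single score update of the kind described in this aspect might look as follows. Keying `scores` by (image position, node) and using the character symbol as the node identifier are simplifying assumptions for illustration:

```python
def relax(scores, pos, node, symbol, set_width, branch_score, line_width):
    """One update of the maximum cumulative path score: extend the best path
    ending at (pos, node) by `symbol`, whose template occupies `set_width`
    pixels, and keep the new score at the next image position if it improves
    on the stored maximum (schematic sketch of the recurrence only).
    Returns True when the stored maximum was updated."""
    nxt = pos + set_width                           # next image position
    if nxt > line_width:
        return False                                # symbol would overrun the line
    cand = scores.get((pos, node), float("-inf")) + branch_score
    if cand > scores.get((nxt, symbol), float("-inf")):
        scores[(nxt, symbol)] = cand                # store updated maximum
        return True
    return False
```

Running this update for every image position, character symbol, and node (including context nodes) reproduces the maximization over cumulative path scores that the best path search operation performs.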
The novel features that are considered characteristic of the present invention are particularly and specifically set forth in the appended claims. The invention itself, however, both as to its organization and method of operation, together with its advantages, will best be understood from the following description of an illustrated embodiment when read in connection with the accompanying drawings. In the Figures, the same numbers have been used to denote the same component parts or steps. The description of the invention includes certain terminology that is specifically defined for describing the embodiment of the claimed invention illustrated in the accompanying drawings. These defined terms have the meanings indicated throughout this specification and in the claims, rather than any meanings that may occur in other sources, such as, for example, documents, if any, that are incorporated by reference herein elsewhere in this description.