The present invention relates generally to image decoding and image recognition techniques, and in particular to such techniques using stochastic finite state networks such as Markov sources. More specifically, the present invention provides a technique for improving the efficiency of decoding text line images using a stochastic finite state network.
Automatic speech recognition systems based on stochastic grammar frameworks such as finite state Markov models are known. Examples are described in U.S. Pat. No. 5,199,077 entitled “Wordspotting For Voice Editing And Indexing”, and in reference [2], both of which use hidden Markov models (HMMs). Bracketed numerals identify referenced publications listed in the Appendix of Referenced Documents.
Stochastic grammars have also been applied to document image recognition problems and to text recognition in particular. See, for example, the work of Bose and Kuo, identified in reference [1], and the work of Chen and Wilcox in reference [2], both of which use hidden Markov models (HMMs) for word or text line recognition. See also U.S. Pat. No. 5,020,112, issued to P. A. Chou and entitled “Image Recognition Using Two-Dimensional Stochastic Grammars.”
U.S. Pat. No. 5,321,773 (hereafter the '773 DID patent), issued to Kopec and Chou, discloses a document recognition technique known as Document Image Decoding (hereafter, DID) that is based on classical communication theory. This work is further discussed in references [2], [4] and [5]. The DID model 800, illustrated in FIG. 7, includes a stochastic message source 810, an imager 811, a channel 812 and a decoder 813. The stochastic message source 810 selects a finite string M from a set of candidate strings according to a prior probability distribution. The imager 811 converts the message into an ideal binary image Q. The channel 812 maps the ideal image into an observed image Z by introducing distortions due to printing and scanning, such as skew, blur and additive noise. Finally, the decoder 813 receives observed image Z and produces an estimate M̂ of the original message according to a maximum a posteriori (MAP) decision criterion. Note that in the context of DID, the estimate M̂ of the original message is often referred to as the transcription of observed image Z.
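For orientation, the MAP decision criterion mentioned above can be written in its standard textbook form. The notation Pr(·) and the symbol for the set of candidate strings are assumptions introduced here for illustration and are not taken from the cited patent.

```latex
\hat{M} \;=\; \arg\max_{M \in \mathcal{M}} \Pr(M \mid Z)
        \;=\; \arg\max_{M \in \mathcal{M}} \Pr(Z \mid M)\,\Pr(M)
```

The second form follows from Bayes' rule, since Pr(Z) does not depend on M. Pr(M) corresponds to the prior distribution of the message source, and Pr(Z | M) to the combined effect of the imager and the channel.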
The structure of the message source and imager is captured formally by combining their functions into a single composite image source 815, as shown by the dotted lines in FIG. 7. Image source 815 models image generation using a Markov source. A Markov source is a stochastic finite-state automaton that describes the spatial layout and image components that occur in a particular class of document images as a regular grammar, representing these spatial layout and image components as a finite state network. A general Markov source model 820 is depicted in FIG. 8 and comprises a finite state network made up of a set of nodes and a set of directed transitions into each node. There are two distinguished nodes 822 and 824 that indicate initial and final states, respectively. A directed transition t between any two predecessor (L_t) and successor (R_t) states in the network of FIG. 8 has associated with it a 4-tuple of attributes 826 comprising a character template, Q, a label or message string, m, a transitional probability, α, and a two-dimensional integer vector displacement, Δ. The displacement indicates a horizontal distance that is the set width of the template. The set width of a template specifies the horizontal (x-direction) distance on the text line that the template associated with this transition occupies in the image.
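The transition attributes described above can be pictured as a small record type. The following is a minimal sketch only: the class name `Transition`, its field names, and the helper `follow_path` are hypothetical, and the template is reduced to an opaque placeholder rather than a bitmap.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Transition:
    """One directed transition t of a Markov source (illustrative names)."""
    pred: int                      # predecessor state L_t
    succ: int                      # successor state R_t
    template: Optional[object]     # character template Q (placeholder here)
    message: str                   # label or message string m
    prob: float                    # transition probability (alpha)
    displacement: Tuple[int, int]  # integer vector displacement (dx, dy)


def follow_path(path: Sequence[Transition], start=(0, 0)):
    """Concatenate message strings and accumulate displacements along a path."""
    x, y = start
    parts = []
    for t in path:
        parts.append(t.message)
        x += t.displacement[0]   # dx is the set width of the template
        y += t.displacement[1]
    return "".join(parts), (x, y)
```

For a horizontal text line, each displacement would be of the form (set width, 0), so the x coordinate returned by `follow_path` is the total horizontal extent of the imaged string.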
U.S. Pat. No. 5,689,620 extended the principles of DID and the use of Markov source models to support the automatic supervised training of a set of character templates in the font of a particular collection or class of documents, thereby enabling the decoding of font-specific documents for which templates are not otherwise easily available. The use of a Markov source model to describe the spatial layout of a 2D document page and the arrangement of image components such as lines, words and character symbols on the page provides a great deal of flexibility for describing a wide variety of document layouts. This flexibility, combined with automatic training of character templates in a specific font, provides a powerful technological advantage in the field of automatic document recognition. DID enables the decoding (recognition) of any type of character symbols in virtually any type and size of font and in any type of 2D spatial layout.
The powerful flexibility offered by the DID system is limited in actual use by the time complexity involved in the decoding process. Decoding involves the search for the path through the finite state network representing the observed image document that is the most likely path that would have produced the observed image. The '773 DID patent discloses that decoding involves finding the best (MAP) path through a three-dimensional (3D) decoding trellis data structure indexed by the nodes of the model and the coordinates of the image plane, starting with the initial state and proceeding to the final state. Decoding is accomplished by a dynamic programming operation, typically implemented as a Viterbi algorithm. A general description of the implementation of the Viterbi algorithm in the context of Document Image Decoding is omitted here and is provided in the discussion of an implementation of the present invention in the Detailed Description below.
The dynamic programming operation used to decode an image involves computing the probability that the template of a transition corresponds to a region of the image to be decoded in the vicinity of the image point. This template-image probability is represented by a template-image matching score that indicates a measurement of the match between a particular template and the image region at the image point. Branches in the decoding trellis are labeled with the matching scores. The size and complexity of the image as defined by the model (i.e., the number of transitions) and the number of templates to be matched are major factors in computation time. Indeed, the time complexity of decoding using a two-dimensional image source model and a dynamic programming operation is O(‖β‖ × H × W), where ‖β‖ is the number of transitions in the source model and H and W are the image height and width, respectively, in pixels.
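To make the dynamic programming step concrete, the following is a minimal sketch of Viterbi decoding along a single text line. It is not the decoder of the cited patents: it assumes a drastically simplified one-node horizontal model in which any template may follow any other, and the names `decode_line`, `set_widths` and `score` are hypothetical. The callable `score(t, x)` stands in for the template-image matching score of template `t` placed with its origin at pixel position `x`.

```python
NEG = float("-inf")


def decode_line(width, set_widths, score):
    """Best-path search over x positions 0..width of one text line.

    best[x] is the maximum cumulative score of any template sequence whose
    set widths sum to exactly x; back[x] records the best incoming branch.
    """
    best = [NEG] * (width + 1)
    back = [None] * (width + 1)
    best[0] = 0.0
    for x in range(1, width + 1):
        for t, w in enumerate(set_widths):
            p = x - w                      # branch for template t starts at p
            if p >= 0 and best[p] > NEG:
                s = best[p] + score(t, p)
                if s > best[x]:
                    best[x], back[x] = s, (t, p)
    # Backtrace from the end of the line to recover (template, origin) pairs.
    path, x = [], width
    while x > 0 and back[x] is not None:
        path.append(back[x])
        x = back[x][1]
    path.reverse()
    return best, path
```

The two nested loops visit every template at every x position, which is the per-line counterpart of the O(‖β‖ × H × W) figure: one line costs on the order of the number of transitions times W, and the factor H arises when every pixel row is tried as a baseline.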
There are two factors that influence this complexity. The first is finding the baselines of horizontal text lines. Although decoding computation grows only linearly with image size, in absolute terms it can be prohibitive because, in effect, each row of pixels in the image is evaluated (decoded) as the baseline of a possible horizontal text line. For example, a two-dimensional image of a column of black text represented in a single known font printed on an 8.5×11 inch page of white background and scanned at 300 dpi resolution causes line decoding to occur 3300 times (300 dpi × 11 inches).
A second key bottleneck in the implementation of the dynamic programming decoding procedure is the computation of template-image matching scores. A score is the measurement of the match between a template and a 2D region of the image. Each template is matched at each position of a horizontal row of pixels in the image during text line decoding. If there are 100 templates and 1500-2000 x-pixel positions in a line, then each template has to be matched at each x position on the line, requiring a minimum of 10^5 actual scores for the line. When the position of an actual baseline is not known exactly, each template could be matched at as many as five vertical pixel positions as well. Thus it was estimated that actual template-image matching scoring for each text line in an image required at least 10^6 image-template scores. In early implementations of DID, this computation was found to far outweigh all other parts of the decoding process.
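The counts quoted in the two preceding paragraphs can be reproduced with a few lines of arithmetic. This is only a worked check of the stated figures; the function name `scores_per_line` is hypothetical.

```python
def scores_per_line(num_templates, x_positions, vertical_offsets=1):
    """Number of template-image matching scores needed for one text line."""
    return num_templates * x_positions * vertical_offsets


# Every pixel row of an 11 inch page at 300 dpi is a candidate baseline.
rows_on_page = 300 * 11

# 100 templates at 1500 x positions, baseline known exactly.
low_estimate = scores_per_line(100, 1500)

# 100 templates at 2000 x positions and five vertical positions each.
high_estimate = scores_per_line(100, 2000, vertical_offsets=5)

print(rows_on_page, low_estimate, high_estimate)
```

The low estimate is 1.5 × 10^5 and the high estimate is 10^6, which matches the orders of magnitude given in the text.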
U.S. Pat. No. 5,526,444 (hereafter, the '444 ICP patent) issued to Kopec, Kam and Chou and entitled “Document Image Decoding Using Modified Branch-And-Bound Methods,” discloses several techniques for improving the computational efficiency of decoding using the DID system. The '444 ICP patent disclosed the use of a class of Markov source models called separable Markov models. When a 2D page layout is defined as a separable Markov source model, it may be factored into a product of 1D models that represent horizontal and vertical structure, respectively. The '444 ICP patent further discloses that decoding with a separable model involves finding the best path through the 2D decoding trellis defined by the nodes of the top-level model, some of which are position-constrained, and the vertical dimension of the image. The computational effect of a position constraint is to restrict the decoding lattice for a node to a subset of the image plane, providing significant computational savings when used with standard Viterbi decoding.
The '444 ICP patent further discloses the use of a recursive Markov source. A recursive source is a collection of named sub-sources each of which is similar to a constrained Markov source except that it may include an additional type of transition. A recursive transition is labeled with a transition probability and the name of one of the Markov sub-sources. The interpretation of a recursive transition is that it represents a copy of the named sub-source. Thus, some of the transitions of the top-level vertical model are labeled with horizontal models. One aspect of each of the horizontal models is that every complete path through the model starts at a fixed horizontal position and ends at a fixed horizontal position, effectively reducing decoding to a one-dimensional search for the best path. A second aspect is that the vertical displacement of every complete path in the model is a constant that is independent of the vertical starting position of the path. Thus, the horizontal models describe areas of the image plane that are text lines, and the top-level vertical model with its nodes that are constrained by position defines which rows of pixels in the 2D image are to be considered as potential text lines. The match score for each branch is computed by running the horizontal model (i.e., performing the Viterbi procedure) along the appropriate row of the image. The overall decoding time for a separable model is dominated by the time required to run the horizontal models, that is, to decode individual text lines.
In conjunction with the use of separable models, the '444 ICP patent also discloses a heuristic algorithm called the Iterated Complete Path (hereafter, ICP) algorithm that fits into the framework of the Viterbi decoding procedure utilized by DID but improves on that procedure by focusing on a way to reduce the time required to decode each of the horizontal models, or lines of text. The ICP algorithm disclosed in the '444 ICP patent is an informed best-first search algorithm that is similar to heuristic search and optimization techniques such as branch-and-bound and A* algorithms. During decoding, ICP causes the running of a horizontal model (i.e., computes the actual template-image matching scores) for only a reduced set of transitions into each node, the reduced number of transitions being substantially smaller than the number of all possible transitions into the node. ICP reduces the number of times the horizontal models are run by replacing full Viterbi decoding of most of the horizontal rows of pixels with the computation of a simple upper bound on the score for that row. This upper bound score is developed from a heuristic function. ICP includes two types of parameterized heuristic functions. Additional information about the ICP best-first search algorithm may also be found in reference [6].
In the '444 ICP patent, the use of a finite state model defined as a constrained and recursive Markov source, combined with the ICP algorithm, allows particular transitions to be abandoned as not likely to contain the best path, thereby reducing computation time. Full decoding using the longer computation process of computing the template-image matching scores for a full horizontal line is carried out only over a much smaller number of possible transitions, in regions of the image that are expected to include text lines. The reader is directed to the '444 ICP patent for more details about the heuristic scores disclosed therein. In particular, see the discussion in the '444 ICP patent beginning at col. 16 and accompanying FIG. 7 therein, and refer to FIG. 23 for the pseudo code of the procedure that computes the weighted horizontal pixel projection heuristic.
While the invention disclosed in the '444 ICP patent provided a significant improvement in overall decoding time over full Viterbi decoding, document recognition of a single page of single-column text using the DID method still required a commercially impractical amount of time. Experiments reported in the '444 ICP patent (see Table 2), for example, showed a decoding time of over two minutes for a full page, single column of text. This time is largely taken up by performing full Viterbi decoding on the individual horizontal text lines, when actual template-image matching scores are computed to replace the heuristic scores. Consider that a full page (8½×11 inch) text document image scanned at 300 dpi (dots per inch) results in 3300 horizontal rows of pixels. Even if the ICP method reduced decoding of horizontal lines by two-thirds, as suggested by the reported illustration, that would still result in over 1000 horizontal lines of decoding, requiring upwards of 10^6 actual template-image matching scores per line. The improvements provided by the technical advances disclosed in the '444 ICP patent, while significant, did not address the efficient decoding of an individual text line. Additional reductions in the decoding time of individual text lines, while still maintaining the overall theoretical framework of the DID method, are desirable.
In the concurrently filed Heuristic Scoring disclosure, reductions in the decoding time of individual text lines were achieved by initially computing and using column-based, upper-bound template-image scores, referred to as heuristic scores, on the branches of the decoding trellis. Use of upper bound heuristic scores, in turn, resulted in the need for the dynamic programming operation to iterate the decoding of a text line. After a decoding iteration, the actual template-image matching scores were computed for the incoming branches of the nodes that were found to be on the estimated best path for that iteration. Decoding the text line was repeated until all heuristic scores for incoming branches to the nodes on the best path had been replaced by actual template-image matching scores. Thus, actual template-image matching scores, which are computationally expensive to produce relative to the upper bound heuristic scores, were computed only as needed. However, some of the computational efficiency achieved through the use of the column-based upper bound heuristic scores was offset by the need for additional iterations of decoding the text line. Because the heuristic scores as designed provide very good upper bound scores, the offset in computational efficiency that results from the additional text line decoding iterations is acceptably small in view of the efficiency gained by the use of the simpler computations involved in column-based heuristic scoring. Thus, there is a balance to be maintained between the efficiencies gained from using simpler scoring methods and the efficiency lost by the resulting increase in the number of iterations of the dynamic programming process. It is desirable, therefore, to improve the operation of the dynamic programming process when decoding individual text lines in order to achieve still further reductions in text line image decoding time.
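The iterate-and-re-score loop described above can be sketched as follows. This is a simplified illustration, not the procedure of the Heuristic Scoring disclosure: it assumes a one-node horizontal model, and the names `viterbi_line`, `iterated_decode`, `upper_bound` and `actual` are hypothetical. The only property relied on is that `upper_bound(t, p)` is never smaller than `actual(t, p)`.

```python
NEG = float("-inf")


def viterbi_line(width, set_widths, branch_score):
    """Return the best path as a list of (template, origin) branches."""
    best = [NEG] * (width + 1)
    back = [None] * (width + 1)
    best[0] = 0.0
    for x in range(1, width + 1):
        for t, w in enumerate(set_widths):
            p = x - w
            if p >= 0 and best[p] > NEG:
                s = best[p] + branch_score(t, p)
                if s > best[x]:
                    best[x], back[x] = s, (t, p)
    path, x = [], width
    while x > 0 and back[x] is not None:
        path.append(back[x])
        x = back[x][1]
    path.reverse()
    return path


def iterated_decode(width, set_widths, upper_bound, actual):
    """Decode repeatedly, replacing upper bounds by actual scores as needed."""
    exact = {}                     # (template, origin) -> actual score

    def branch_score(t, p):
        return exact[(t, p)] if (t, p) in exact else upper_bound(t, p)

    iterations = 0
    while True:
        iterations += 1
        path = viterbi_line(width, set_widths, branch_score)
        # Nodes on the estimated best path are the end points of its branches.
        nodes = [p + set_widths[t] for (t, p) in path]
        # Incoming branches of those nodes that still carry an upper bound.
        pending = [(t, x - w) for x in nodes
                   for t, w in enumerate(set_widths)
                   if x - w >= 0 and (t, x - w) not in exact]
        if not pending:
            return path, iterations, len(exact)
        for t, p in pending:
            exact[(t, p)] = actual(t, p)   # the expensive computation
```

The third return value counts how many actual scores were computed, which is the quantity the heuristic approach tries to keep small; the second counts decoding iterations, which is the overhead that the present invention addresses.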
The present invention is motivated by the experimental results and observations of the decoding of a text line image when upper bound column-based template-image matching scores, referred to herein as heuristic scores, are used during decoding. Specifically, the fact that the decoding operation must perform several decoding iterations on a text line before a final transcription is produced leads to several observations. The computationally expensive part of each decoding iteration is the computation of the maximum cumulative path score for a given image position. When a decoding iteration is complete, a post-line-decoding operation computes actual template-image matching scores for the templates that are associated with the branches into the nodes that are on the current estimated best path, and replaces the upper bound scores on the branches in the decoding trellis with these actual scores. The next iteration typically produces an estimated best path containing some nodes and branches that differ from the best path of the iteration before, as a result of using actual template-image matching scores in place of the upper bound scores. The decoding iterations produce an estimated path that converges to a maximum likelihood path as more actual template-image matching scores are used in the decoding trellis.
The modification to the decoding operation that forms the basis of the present invention is premised on the observation that results from a prior iteration of decoding may provide useful information to the decoder during the current decoding iteration. In particular, it is known that the values of the scores data on the decoding trellis that are input to the current decoding iteration change from their values during the last iteration only at nodes where one or more preceding branches on the prior estimated best path have just been re-scored. This means that it would be reasonable to expect that, in the current iteration, the cumulative path scores might change in the area of the text line where the re-scoring occurred, since an actual score is now an input to the cumulative path score computations. These newly computed cumulative path scores may cause new nodes to be selected as part of the estimated best path during the backtracing process of the current iteration. However, it does not necessarily follow that this would result in every case. That is, it may also be reasonable to expect that the effect of a re-scored node on the locations of subsequent nodes on the path may be local to the portion of the line where it occurs. The locations of nodes in other image areas of the best path may not change between first and second iterations if no re-scoring of branches on the trellis occurred in those image areas.
This observation makes use of a concept from graph theory known as a cut set. A cut set is a subset of the nodes that separates two parts of the graph in such a manner that the only connections between the two separate parts must go through the cut set. A sequence of nodes which spans the maximum set width of a character template forms a cut set in the trellis. Once the change value in the cumulative path score between current and prior decoding iterations as a result of the earlier re-scored branch has been found to be constant for a sufficient number of image positions, the constant change value is guaranteed to be constant at least until the next re-scored branch is encountered, and all subsequent image positions until that point are included in the cut set.
The improved decoding operation disclosed herein exploits the possibility of such local changes by skipping the computation of new cumulative path scores when it is determined that the local change conditions are met. That is, when the change in the new cumulative path score that results from the re-scored branch at a given image position is observed to be the same for a sequence of image positions longer than the maximum set width of the character templates, the decoder simply adopts as the current path scores for subsequent image positions the prior cumulative path scores plus the change amount, until another re-scored branch is encountered. This reduces the number of image positions on the text line at which a cumulative path score must be computed, and so reduces the additional computational overhead that results from iterating the decoding process.
This new operating mode is referred to as skip mode. Information about scores and nodes produced by the decoder as a result of at least one complete prior decoding iteration is stored, including a record of the score of the best incoming branch at each node. Skip mode processing tracks the change in a cumulative score produced by a re-scored branch in the trellis that occurred as a result of the just prior iteration. If the change in score is constant for a length of image positions greater than the maximum set width of a character template, then the decoder propagates the score change to the best cumulative scores in the following image positions, until the next incoming re-scored branch is encountered, using best incoming branch scores at these nodes that are maintained from the previous iteration. Skip mode processing achieves significant decoding efficiencies as decoding progresses. After the first iteration of decoding, subsequent decoding iterations begin with the skip mode setting on, and, as the nodes for the first part of the text line remain unchanged from iteration to iteration, more and more of the text line is processed in skip mode.
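Skip mode can be illustrated with a small sketch of one decoding pass that reuses the results of the previous pass. This is an assumption-laden simplification rather than the claimed implementation: it uses a one-node horizontal model, compares score changes for exact equality, and the names `full_step`, `decode_pass`, `prior` and `rescored` are hypothetical. It also assumes that `rescored` lists every branch whose score changed since the previous pass.

```python
NEG = float("-inf")


def full_step(x, best, set_widths, score):
    """Ordinary Viterbi update: choose the best incoming branch at x."""
    top, arg = NEG, None
    for t, w in enumerate(set_widths):
        p = x - w
        if p >= 0 and best[p] > NEG:
            s = best[p] + score(t, p)
            if s > top:
                top, arg = s, (t, p)
    return top, arg


def decode_pass(width, set_widths, score, prior=None, rescored=frozenset()):
    """One pass over the line; prior is (best, back) from the previous pass.

    Returns (best, back, n_full), where n_full counts the positions at
    which a cumulative score was actually computed.
    """
    max_w = max(set_widths)
    # Positions that have at least one re-scored incoming branch.
    dirty = {p + set_widths[t] for (t, p) in rescored}
    best = [NEG] * (width + 1)
    back = [None] * (width + 1)
    best[0] = 0.0
    skip = prior is not None       # later passes begin with skip mode on
    delta, run, n_full = 0.0, 0, 0
    for x in range(1, width + 1):
        if skip and x not in dirty:
            best[x] = prior[0][x] + delta   # prior score plus the change
            back[x] = prior[1][x]           # best incoming branch unchanged
            continue
        if skip:                   # a re-scored branch enters this node
            skip, run = False, 0
        best[x], back[x] = full_step(x, best, set_widths, score)
        n_full += 1
        if prior is None:
            continue
        # Unreachable positions are treated as consistent with any change.
        d = best[x] - prior[0][x] if best[x] > NEG else delta
        if d == delta:
            run += 1
        else:
            delta, run = d, 1
        if run > max_w:            # change constant across a cut set
            skip = True
    return best, back, n_full
```

A caller would run `decode_pass` once with `prior=None`, re-score some branches, and then call it again with the earlier `(best, back)` and the set of re-scored branches. In this sketch the second pass produces the same cumulative scores and back pointers as decoding from scratch, while computing scores at fewer positions.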
Note that there may be some operating environments where it is not strictly necessary to require that the change in cumulative path score be a fixed constant amount for a certain number of image positions in order to turn skip mode processing on. The change in cumulative path score could be substantially constant; that is, the change could be within a small range, say a very small percentage of the total cumulative score. So, for example, where a total cumulative path score is in the 100,000 range, the change in the cumulative path score might deviate from a constant change by 10 or less and still be acceptable for measuring whether skip mode processing should be turned on. The proper range would depend on the computational environment, and those of skill in the art will appreciate that the smaller the acceptable deviation from a constant is, the more likely will be the accuracy and stability of the final decoding outcome.
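The relaxed test described above amounts to replacing an exact comparison of score changes with a tolerance. A minimal sketch, assuming an absolute tolerance and using the hypothetical name `is_substantially_constant`, with the default taken from the example figure of 10 quoted in the text:

```python
def is_substantially_constant(change, reference_change, tolerance=10.0):
    """True if a score change deviates from the reference by at most tolerance."""
    return abs(change - reference_change) <= tolerance


# With cumulative scores near 100,000, changes of -250 and -245 would count
# as the same change, while -250 and -230 would not.
print(is_substantially_constant(-250.0, -245.0))
print(is_substantially_constant(-250.0, -230.0))
```

Setting `tolerance` to zero recovers the strict constant-change test. Whether an absolute or a relative tolerance is more suitable would depend on the computational environment.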
The decoding technique of the present invention may be used in any text line decoder that uses as input a stochastic finite state network that models the document image layout of the document image being decoded, and that requires iteration of the decoding operation because the template-image matching scores labeled on the branches of the decoding trellis are estimated or changing scores. It may be used in simple text line decoders, as well as in the two-dimensional DID method of image recognition disclosed in the patents cited above.
Moreover, the skip mode decoding operation of the present invention has broader use in any of a wide variety of computational environments that have these operational characteristics: (i) the problem being solved involves using a dynamic programming operation and a data structure representation of the problem in the form of a trellis, trellis-like or graph data structure, where the dynamic programming operation produces cumulative scores as an output from which a path through the graph or trellis may be derived; and (ii) the scores labeled on the branches in the decoding data structure are estimated or changing scores that vary over time during decoding. Situations in which the present invention may be used are in contrast to traditional dynamic programming operations where the scores on the branches are fixed scores and do not vary during decoding. In the case of DID, these scores may be the upper-bound template-image matching scores discussed in the concurrently filed Heuristic Scoring disclosure, or they may be estimated scores for some other reason.
Therefore, in accordance with one aspect of the present invention, a method is provided for operating a processor-controlled machine to decode a text line image. The method comprises, while a skip mode switch is off, for each image position in the text line image, a first computing step of computing a maximum cumulative score indicating a measurement of a match between a sequence of character templates and an image region in the text line image from a starting location of the text line image to the image position. The method then comprises a second computing step of computing a score change value between the maximum cumulative score and a prior maximum cumulative score computed at the image position, and then comparing the score change value to a prior score change value and turning the skip mode switch on when the score change value is substantially constant for at least a predetermined number of consecutive image positions in the text line image. The method further comprises, while the skip mode switch is on, for each image position in the text line image, a third computing step of computing the maximum cumulative score by adding the score change value to a prior maximum cumulative score computed at the image position. The method then comprises producing a transcription of the text line image using the maximum cumulative scores.
The novel features that are considered characteristic of the present invention are particularly and specifically set forth in the appended claims. The invention itself, however, both as to its organization and method of operation, together with its advantages, will best be understood from the following description of an illustrated embodiment when read in connection with the accompanying drawings. In the Figures, the same numbers have been used to denote the same component parts or steps. The description of the invention includes certain terminology that is specifically defined for describing the embodiment of the claimed invention illustrated in the accompanying drawings. These defined terms have the meanings indicated throughout this specification and in the claims, rather than any meanings that may occur in other sources, such as, for example, documents, if any, that are incorporated by reference herein elsewhere in this description.