The present invention relates generally to image decoding and image recognition techniques, and specifically to image decoding and recognition techniques using stochastic finite state networks such as Markov sources. In particular, the present invention provides a technique for producing heuristic scores for use by a dynamic programming operation in the decoding of text line images.
Automatic speech recognition systems based on stochastic grammar frameworks such as finite state Markov models are known. Examples are described in U.S. Pat. No. 5,199,077 entitled "Wordspotting For Voice Editing And Indexing", and in reference [2], both of which use hidden Markov models (HMMs). Bracketed numerals identify referenced publications listed in the Appendix of Referenced Documents.
Stochastic grammars have also been applied to document image recognition problems and to text recognition in particular. See, for example, the work of Bose and Kuo, identified in reference [1], and the work of Chen and Wilcox in reference [2] which both use hidden Markov models (HMMs) for word or text line recognition. See also U.S. Pat. No. 5,020,112, issued to P. A. Chou and entitled "Image Recognition Using Two-Dimensional Stochastic Grammars."
U.S. Pat. No. 5,321,773, issued to Kopec and Chou, discloses a document recognition technique known as Document Image Decoding (hereafter, DID) that is based on classical communication theory. This work is further discussed in references [2], [4] and [5]. The DID model 800, illustrated in FIG. 28, includes a stochastic message source 810, an imager 811, a channel 812 and a decoder 813. The stochastic message source 810 selects a finite string M from a set of candidate strings according to a prior probability distribution. The imager 811 converts the message into an ideal binary image Q. The channel model 812 maps the ideal image into an observed image Z by introducing distortions due to printing and scanning, such as skew, blur and additive noise. Finally, the decoder 813 receives observed image Z and produces an estimate M̂ of the original message according to a maximum a posteriori (MAP) decision criterion. Note that in the context of DID, the estimate M̂ of the original message is often referred to as the transcription of observed image Z.
The structure of the message source and imager is captured formally by combining their functions into a single composite image source 815, as shown by the dotted lines in FIG. 28. Image source 815 models image generation using a Markov source. A Markov source is a stochastic finite-state automaton that describes the entire two-dimensional (2D) spatial layout and image components that occur in a particular class of document images as a regular grammar, representing these spatial layout and image components as a finite state network. Prior attempts with stochastic grammar representations of text images confined their representations to single words or single lines of text, without regard to where these words or lines were located on the 2D page. A general Markov source model 820 is depicted in FIG. 29 and comprises a finite state network made up of a set of nodes and a set of directed transitions into each node. There are two distinguished nodes 822 and 824 that indicate initial and final states, respectively. A directed transition t between any two predecessor (Lt) and successor (Rt) states in the network of FIG. 29 has associated with it a 4-tuple of attributes 826 comprising a character template, Q, a label or message string, m, a transitional probability, α, and a two-dimensional integer vector displacement, Δ.
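By way of illustration only, the 4-tuple of attributes on a transition might be represented as in the following minimal Python sketch. The names `Transition` and `trace_path` and the field names are hypothetical and are not taken from the cited patents; the sketch assumes that a complete path is simply an ordered list of transitions whose message strings are concatenated and whose displacements are summed.

```python
from dataclasses import dataclass
from typing import List, Tuple

Bitmap = Tuple[Tuple[int, ...], ...]  # rows of 0/1 pixels


@dataclass(frozen=True)
class Transition:
    """One directed transition t of a Markov source (illustrative names)."""
    pred: str                      # predecessor state, Lt
    succ: str                      # successor state, Rt
    template: Bitmap               # character template Q (may be empty)
    message: str                   # label or message string m
    prob: float                    # transition probability alpha
    displacement: Tuple[int, int]  # integer vector displacement Delta = (dx, dy)


def trace_path(path: List[Transition], start: Tuple[int, int] = (0, 0)):
    """Return the message string and final image position of a path."""
    x, y = start
    labels = []
    for t in path:
        labels.append(t.message)   # messages concatenate along the path
        x += t.displacement[0]     # displacements accumulate in x ...
        y += t.displacement[1]     # ... and in y (y increases downward)
    return "".join(labels), (x, y)
```

In this sketch the transcription of a path is recovered from the transitions alone, which mirrors how the message string is formed from the labels on the best path.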
For example, Markov source model 830 illustrated in FIG. 30 is a simple source model for the class of 2D document images that show a single column of English text in 12 pt. Adobe Times Roman font. In this model, documents consist of a vertical sequence of horizontal text lines, alternating with white (background) space. A horizontal text line is a sequence of typeset upper- and lower-case symbols (i.e., letter characters, numbers and special characters in 12 pt. Adobe Times Roman font) that are included in the alphabet used by the English language. The image coordinate system used with the class of images defined by model 830 is one where horizontal movement, represented by x, increases to the right, vertical movement, represented by y, increases downward, the upper left corner of the image is at x=y=0, and the lower right corner of the image is at x=W, y=H , where W and H respectively indicate the width and height of the image in pixels.
As illustrated in FIG. 28, a Markov source model serves as an input to an image synthesizer in the DID framework. For an ordered sequence of characters in an input message string in the English language and using model 830 of FIG. 30, the image synthesizer generates a page image of a single-text column by placing templates in positions in the page image that are specified by model 830. The operation of text column source model 830 as an image synthesizer may be explained in terms of an imager automaton that moves over the image plane under control of the source model. The movement of the automaton constitutes its path, and, in the case of model 830, follows the assumptions indicated above for the conventional reading order for a single column of text in the English language. From start state node nI at the top left corner of the image, the imager automaton enters and self-transitions through iterations of node n1 vertically downward, creating vertical white space. At some point the imager reaches the top of a text line and enters state n2 which represents the creation of a horizontal text line. The displacement (0,34) of the transition into n2 moves the imager down to the text baseline; 34 is the font height above the baseline. The self-transitions at node n2, indicated by the loop at n2 and symbols 831 and 832, represent the individual characters of the font and horizontal white space such as occurs with spaces between words. The imager transitions horizontally from left to right along the text line through iterations of node n2 until there are no more characters to be printed on the line (which may be indicated in a variety of ways not specifically shown in model 830.) At the end of the text line, the imager drops down vertically by the font depth distance 13 and transitions to node n3. At node n3 one of two things can happen. 
If there are remaining text lines, the imager enters "carriage return" state n4 to return to the left margin of the page and back to n1. Or, if there are no more characters or the imager has reached the bottom right corner of the page, the imager transitions from n3 to the final node nF. Node n2 may be considered the "printing" state, where text lines are produced. Additional description of how an image synthesizer functions in the DID framework with model 830 may be found in U.S. Pat. No. 5,526,444 at cols. 5-7 and the description accompanying FIGS. 15-18 therein, and in U.S. Pat. No. 5,689,620, at col. 36-40 and the description accompanying FIG. 14 at col. 39-40 therein.
The attributes on the transitions in Model 830 of FIG. 30 have been simplified in this illustration. Each directed transition into n2, for example, has the associated 4-tuple of attributes shown in FIG. 29: a transition probability, a message string identifying a symbol or character in the English language, a corresponding character template in the font to be used in the page image, and a vector displacement, shown as (wt,0) in FIG. 30 that indicates the (x,y) position in the image that the path takes next. For node n2, displacement (wt,0) indicates a horizontal distance w that is the set width of the template. The set width of a template specifies the horizontal (x-direction) distance on the text line that the template associated with this transition occupies in the image.
U.S. Pat. No. 5,689,620 extended the principles of DID and the use of Markov source models to support the automatic supervised training of a set of character templates in the font of a particular collection or class of documents, thereby enabling the decoding of font-specific documents for which templates were not otherwise easily available. The use of a Markov source model to describe the spatial layout of a 2D document page and the arrangement of image components such as lines, words and character symbols on the page provides a great deal of flexibility for describing a wide variety of document layouts. This flexibility, combined with automatic training of character templates in a specific font, provides a powerful technological advantage in the field of automatic document recognition. DID enables the decoding (recognition) of any type of character symbols in virtually any type and size of font and in any type of 2D spatial layout.
The powerful flexibility offered by the DID system is limited in actual use by the time complexity involved in the decoding process. Decoding involves the search for the path through the finite state network representing the observed image document that is the most likely path that would have produced the observed image. U.S. Pat. No. 5,321,773 discloses that decoding involves finding the best (MAP) path through a three-dimensional (3D) decoding trellis data structure indexed by the nodes of the model and the coordinates of the image plane, starting with the initial state and proceeding to the final state. Decoding is accomplished by a dynamic programming operation, typically implemented by a Viterbi algorithm. A straightforward approach to MAP decoding is to use a two-dimensional form of a segmental Viterbi algorithm to compute a set of recursively-defined likelihood functions at each point of the image plane. The forward phase of the Viterbi procedure involves identifying, for each pixel position in the image, the most likely path for arriving at that position, from among the paths generated by the printing of each character template and by using the most likely paths for arriving at all previously computed positions. In effect, the recursive Viterbi procedure involves iterating over each image position and each transition into every node and computing the likelihood of the best path that terminates at the node and image position after passing through the transition.
With reference to the DID framework of FIG. 28, there is a set of probabilities in the image model that are derived from channel model 812. Decoder 813 looks for the most likely ideal image Q that could have produced the observed image Z, given channel model 812. Observed image Z is represented by a path through image model 815. Transcription M̂ is formed from the character labels identifying the templates associated with the branches in the path. Based on channel model 812, there is a certain probability distribution over a corrupted image. The probability distribution predicts certain images with certain probabilities. Decoding observed image Z involves computing a set of recursively-defined likelihood functions at each point of the image plane. The likelihood functions indicate the probability distribution evaluated on the specific set of data that is the observed image Z. Each individual node computation computes the probability that the template of a transition corresponds to a region of the image to be decoded in the vicinity of the image point. This template-image probability is represented by a template-image matching score that indicates a measurement of the match between a particular template and the image region at the image point. Producing maximum cumulative path scores at each image position using the template-image matching scores is a way of building up the likelihood in a piece by piece fashion. In terms of the decoding trellis that represents the image model, the template-image matching scores labeling the branches in the trellis are the likelihood terms.
The Viterbi procedure is carried out in the forward direction until the end-point of the best path is unambiguously identified. The backward phase of the Viterbi involves backtracing through the nodes identified as part of the best path to trace out the actual best path. The sequence of character templates associated with the transitions between each node from the start to the final node in the source model on the best path are concatenated to form the message, or transcription, of the decoded image. U.S. Pat. No. 5,526,444 discloses a more detailed description of the decoding process at cols. 7-9 and the description accompanying FIGS. 19-22 therein.
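The forward and backward phases described above can be sketched, for a single text line, as a one-dimensional segmental Viterbi recursion. The following is a simplified illustration and not the procedure of the cited patents: the function name `decode_line` and its parameters are hypothetical, transition probabilities are omitted, the only non-printing transition is a one-pixel white space, and `scores[c][x]` is assumed to already hold the matching score of template `c` placed with its origin at column `x`.

```python
def decode_line(scores, widths, labels, line_width, space_score=0.0):
    """Best-path decoding of one text line (illustrative sketch).

    scores[c][x]: score for printing template c with its origin at column x.
    widths[c]:    set width of template c, in pixels.
    Returns (best cumulative score, transcription).
    """
    best = [0.0] * (line_width + 1)   # best[x]: best score of a path ending at x
    back = [None] * (line_width + 1)  # back[x]: (previous x, label printed)

    # Forward phase: best way of arriving at each horizontal position.
    for x in range(1, line_width + 1):
        # Transition 1: one pixel of horizontal white space, printing nothing.
        best[x] = best[x - 1] + space_score
        back[x] = (x - 1, "")
        # Transition 2: a template whose set width ends exactly at x.
        for c, w in enumerate(widths):
            start = x - w
            if start >= 0:
                candidate = best[start] + scores[c][start]
                if candidate > best[x]:
                    best[x] = candidate
                    back[x] = (start, labels[c])

    # Backward phase: trace the best path from the end of the line.
    printed = []
    x = line_width
    while x > 0:
        x, label = back[x]
        printed.append(label)
    return best[line_width], "".join(reversed(printed))
```

The inner loop visits every template at every column, which is why the cost of filling `scores` with actual template-image matches dominates line decoding.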
The dynamic programming operation used to decode an image involves computing the likelihood that the template of a transition corresponds to a region of the image to be decoded in the vicinity of the image point. This template-image likelihood is represented by a template-image matching score that indicates a measurement of the match between a particular template and the image region at the image point. Thus, the size and complexity of the image as defined by the model (i.e., the number of transitions) and the number of templates to be matched are major factors in computation time. Indeed, the time complexity of decoding using an image source model of the type shown in FIG. 30 and FIG. 31, and using a Viterbi or Viterbi-like algorithm, is O(∥β∥×H×W), where ∥β∥ is the number of transitions in the source model and H and W are the image height and width, respectively, in pixels.
There are two factors that influence this complexity. The first is finding the baselines of horizontal text lines. Although decoding computation grows only linearly with image size, in absolute terms it can be prohibitive because, in effect, each row of pixels in the image is evaluated (decoded) as the baseline of a possible horizontal text line. For example, a two-dimensional image of a column of black text represented in a single known font printed on an 8.5×11 inch page of white background and scanned at 300 dpi resolution causes line decoding to occur 3300 times (300 dpi×11 inches). Initially, when DID was first developed, decoding of such an image, without decoding efficiencies and improvements, took about 45 minutes to run, which made the system commercially impractical.
A second key bottleneck in the implementation of the dynamic programming decoding procedure is the computation of template-image matching scores. The matching operation between template and image includes aligning a template with a position in the image, performing an AND operation, and summing the resulting ON pixels, producing a score indicating the match between that template and that position of the image. Each template is matched at an image position, a score is produced, and the maximum score of all templates for that position is accumulated in a running sum to produce the score for the entire line. Thus, each template is matched at each position of a horizontal row of pixels in the image during text line decoding. If there are 100 templates and 1500-2000 x-pixel positions in a line, then each template has to be matched at each x position on the line, requiring a minimum of 10^5 ANDs and sums to produce actual scores for each position on the line. It was found that decoding accuracy could be improved with the use of multi-level templates (described in more detail below in the Detailed Description) which require separate template-image matching for each of several levels. Moreover, when the position of an actual baseline is not known, scoring must be done for several horizontal rows of pixels around an actual baseline before the node that transitions to the actual baseline is identified. Data showed that each template could be matched at as many as five vertical pixel positions. Thus it was estimated that actual template-image matching scoring for each text line in an image required at least 10^6 image-template ANDs and sums. In early implementations of DID, this computation was found to far outweigh all other parts of the decoding process.
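The AND-and-sum matching operation for a single bi-level template can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `match_score` is hypothetical, bitmaps are lists of rows of 0/1 values, the template is aligned by its top-left corner rather than by its baseline origin, and template pixels falling outside the image are treated as OFF.

```python
def match_score(template, image, x0, y0):
    """Count the ON pixels shared by a template and the image region at (x0, y0).

    template, image: lists of rows, each row a list of 0/1 pixel values.
    (x0, y0): image position of the template's top-left corner (an assumption).
    """
    score = 0
    for dy, row in enumerate(template):
        for dx, t_pixel in enumerate(row):
            y, x = y0 + dy, x0 + dx
            # Pixels of the template that fall outside the image contribute 0.
            if 0 <= y < len(image) and 0 <= x < len(image[y]):
                score += t_pixel & image[y][x]  # AND, then sum the ON pixels
    return score
```

Each call touches every pixel of the template, so repeating it for every template at every horizontal position, and at several candidate baseline rows, multiplies into the operation counts estimated above.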
The need to improve decoding efficiency resulted in a new view of the 2D Markov source model that provides the basic description of a page layout. U.S. Pat. No. 5,526,444 (hereafter, the '444 ICP patent) issued to Kopec, Kam and Chou and entitled "Document Image Decoding Using Modified Branch-And-Bound Methods," discloses the use of a class of Markov source models called separable Markov models. When a 2D page layout is defined as a separable Markov source model, it may be factored into a product of 1D models that represent horizontal and vertical structure, respectively. More formally, a separable model is a collection of named Markov sub-sources that is similar to a recursive transition network. The top-level sub-source is a vertical model whose nodes are all tightly constrained to specific horizontal positions in the image. These position constraints restrict entry to the node to certain regions of the image plane. The '444 ICP patent further discloses that decoding with a separable model involves finding the best path through the 2D decoding trellis defined by the nodes of the top-level model, some of which are position-constrained, and the vertical dimension of the image. The computational effect of a position constraint is to restrict the decoding lattice for a node to a subset of the image plane, providing significant computational savings when used with standard Viterbi decoding.
The '444 ICP patent further discloses the use of a recursive Markov source. A recursive source is a collection of named sub-sources each of which is similar to a constrained Markov source except that it may include an additional type of transition. A recursive transition is labeled with a transition probability and the name of one of the Markov sub-sources. The interpretation of a recursive transition is that it represents a copy of the named sub-source. Thus, some of the transitions of the top-level vertical model are labeled with horizontal models. One aspect of each of the horizontal models is that every complete path through the model starts at a fixed horizontal position and ends at a fixed horizontal position, effectively reducing decoding to a one-dimensional search for the best path. A second aspect is that the vertical displacement of every complete path in the model is a constant that is independent of the vertical starting position of the path. Thus, the horizontal models describe areas of the image plane that are text lines, and the top-level vertical model with its nodes that are constrained by position defines which rows of pixels in the 2D image are to be considered as potential text lines. The match score for each branch is computed by running the horizontal model (i.e., performing the Viterbi procedure) along the appropriate row of the image. The overall decoding time for a separable model is dominated by the time required to run the horizontal models.
In conjunction with the use of separable models, the '444 ICP patent also discloses a heuristic algorithm called the Iterated Complete Path (hereafter, ICP) algorithm that fits into the framework of the Viterbi decoding procedure utilized by DID but improves on that procedure by focusing on a way to reduce the time required to decode each of the horizontal models, or lines of text. The ICP algorithm disclosed in the '444 ICP patent is an informed best-first search algorithm that is similar to heuristic search and optimization techniques such as branch-and-bound and A* algorithms. During decoding, ICP causes the running of a horizontal model (i.e., computes the actual template-image matching scores) for only a reduced set of transitions into each node, the reduced number of transitions being substantially smaller than the number of all possible transitions into the node. Additional information about the ICP best-first search algorithm may also be found in reference [6].
ICP reduces the number of times the horizontal models are run by replacing full Viterbi decoding of most of the horizontal rows of pixels with the computation of a simple upper bound on the score for that row. This upper bound score is developed from a heuristic function. ICP includes two types of parameterized heuristic functions. The '444 ICP patent discloses that the parameters of both of these functions may be automatically inferred from the source model. Thus, for each area of the image where the vertical model indicates that a horizontal text line may occur, in place of performing full Viterbi decoding on each pixel row in that area as if each row could be a text baseline, a heuristic score is first developed for each horizontal row instead.
The '444 ICP patent discloses that the first heuristic function produces a heuristic template-image matching score that is based on weighted horizontal pixel projections. In particular, a heuristic score is developed from a sum of "ON" pixels in a region of horizontal pixel rows in the image that constitute a potential text line. The sum is weighted by a constant to produce the heuristic score. The constant is selected from a source model-dependent vector of non-negative constants that sum to one. This vector is developed from the actual templates that occur in the image and is based on the assumption that the horizontal projection profile has the same shape for every character template from a given source model. For example, for a simple text model such as model 830 in FIG. 30, the vector of constants may be computed as a linear combination of the profiles of the individual character templates, weighted by their relative character frequencies. See FIG. 23 in the '444 ICP patent for the pseudo code of the procedure that computes the weighted horizontal pixel projection heuristic.
Decoding using the weighted pixel projection heuristic proceeds as follows. The inputs to the ICP procedure are the top-level Markov source, a procedure that computes actual template-matching scores, and a procedure that computes the weighted pixel projection heuristic scores. The ICP procedure maintains two data arrays, U and A, indexed by a vertical transition number. The elements of U are initialized with the heuristic scores for each vertical transition prior to iterations of decoding. Boolean array A keeps track of whether U contains a heuristic score or an actual score for each transition. Prior to the first iteration, Viterbi decoding of the top-level model is performed using array U containing all heuristic scores, and an estimated best path is returned. Then a loop is executed for as long as heuristic scores remain as scores in the current estimated best path. This loop includes first computing an actual template matching score for every transition in the estimated best path having a heuristic score, replacing the heuristic score in array U and updating array A to show an actual score for that transition. Then Viterbi decoding of the top-level model is performed using the scores in array U which contain both actual and heuristic scores and a current estimated best path is returned. Transitions in the current estimated best path that contain heuristic scores are identified, and the loop continues with computing an actual template matching score for every transition in the estimated best path having a heuristic score. The ICP procedure ends when the current estimated best path contains actual scores for all transitions in the path. Note that in some implementations, the ICP procedure may also conclude when the best path in a current iteration is the same as the best path in a prior iteration. Decoding concludes with producing the message string associated with the transitions in the best path. 
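The loop just described can be sketched in Python as follows. This is a simplified illustration, not the procedure of the '444 ICP patent: the function `icp` and its parameters are hypothetical names, `actual_score(i)` stands in for running the horizontal model on vertical transition `i`, and `best_path(U)` stands in for Viterbi decoding of the top-level model, returning the indices of the transitions on the estimated best path.

```python
def icp(heuristic_scores, actual_score, best_path):
    """Iterated Complete Path loop over vertical transitions (sketch).

    heuristic_scores: upper-bound score for each vertical transition.
    actual_score(i):  expensive actual score for transition i.
    best_path(U):     indices of the transitions on the best path under U.
    """
    U = list(heuristic_scores)   # current score for each transition
    A = [False] * len(U)         # A[i] is True once U[i] is an actual score
    path = best_path(U)          # initial decoding with heuristic scores only
    while True:
        pending = [i for i in path if not A[i]]
        if not pending:
            # Every transition on the path carries an actual score: done.
            return path, U
        for i in pending:
            U[i] = actual_score(i)   # replace the heuristic with the actual score
            A[i] = True
        path = best_path(U)          # re-decode with mixed scores
```

As a toy usage, suppose the top-level model selects the single best of four candidate rows, with heuristic scores `[10, 12, 11, 7]` and actual scores `[4, 8, 9, 3]`. Using `best_path = lambda U: [max(range(len(U)), key=U.__getitem__)]`, the loop computes actual scores for rows 1, 2 and 0 and returns row 2; row 3 is never scored because its upper bound of 7 is below the actual score of 9 already found.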
More details of the ICP procedure may be found in the '444 ICP patent, beginning at col. 16 and the discussion accompanying FIG. 7 therein. The result of using a separable model with ICP is that actual full Viterbi decoding only takes place in regions of the image that are expected to include text lines, i.e., for those vertical transitions in the top-level vertical model that are recursive transitions.
The second heuristic is called the adjacent row heuristic and is an upper bound on the actual score for the two rows of pixels in the image that are immediately above and below the row for which an actual score has been computed. The adjacent row heuristic formalizes the observation that the actual score for a transition normally doesn't change much from one row to the next. Thus, actual score values completed during an ICP pass may be used to infer new upper bounds on adjacent row heuristic values that are tighter than the initial upper bound heuristics. In practice, the adjacent row heuristic can be used at the end of each pass in ICP to update entries for rows in array U that are adjacent to rows having newly-computed actual score values. A coefficient is computed using parameters in the source model and the coefficient is used to produce new heuristic scores for rows above and below a row having an actual score computed during this iteration, by multiplying this coefficient by the actual score. See the '444 ICP patent beginning at col. 23 for an illustration of how both heuristics work in the decoding of a single line of text. In the example shown, a document image having a height of 10 rows would require 10 iterations of full Viterbi decoding of the text line without the use of ICP and only three iterations of full Viterbi decoding of the text line using both ICP heuristics.
In the '444 ICP patent, the use of a finite state model defined as a constrained and recursive Markov source combined with the ICP algorithm allows particular transitions to be abandoned as not likely to contain the best path, thereby reducing computation time. Full decoding using the longer computation process of computing the template-image matching scores for a full horizontal line is carried out only over a much smaller number of possible transitions. The '444 ICP patent discloses that the replacement of many long transition score computations with shorter heuristic score computations is responsible for a remarkable decrease in overall computation time, reported to be a decrease by a factor of 11 in one example and by a factor of 19 in another example. In an example of the decoding of a single text line image of image height H=10, illustrated in the '444 ICP patent, it was shown that the ICP procedure implemented with both heuristics reduced the number of full-line Viterbi decoding iterations from 10 lines to 3 lines. This means that actual template-image matching scoring needed to be performed for only three of the possible 10 horizontal lines in the image.
While the invention disclosed in the '444 ICP patent provided a significant improvement in overall decoding time over full Viterbi decoding, document recognition of a single page of single-column text using the DID method still required a commercially impractical amount of time. Experiments reported in the '444 ICP patent (see Table 2), for example, showed a decoding time of over two minutes for a full page, single column of text. This time is largely taken up by performing full Viterbi decoding on the individual horizontal text lines, when actual scores are computed to replace the heuristic scores. The improvements provided by the technical advances disclosed in the '444 ICP patent, while significant, did not address the efficient decoding of an individual text line. Additional reductions in the decoding time of individual text lines are desirable.
Investigation into the reasons for decoding inefficiencies using the improved decoding techniques of the '444 ICP patent showed that replacing the heuristic scores with actual template-image matching scores during decoding of individual text lines, as required by the ICP method, was a central factor in the computation time required to decode a page of text. A full page (8½×11 inch) text document image scanned at 300 dpi (dots per inch) results in 3300 horizontal rows of pixels. Even if the ICP method reduced decoding of horizontal lines by two-thirds as suggested by the reported illustration, that would still result in over 1000 horizontal lines of decoding, requiring upwards of 10^6 actual scores per line. Thus, it was hypothesized that the DID method of document recognition could be improved if there were one or more ways to achieve a reduction in the computation time needed to perform full Viterbi decoding of each text line. Because scoring measures the degree of a match between a character template and the observed image, however, it was also imperative that a method for achieving such a reduction in computation time still preserve the remarkable accuracy of the DID method of document recognition.
The technique of the present invention is based on the observation that a heuristic score that was simpler to compute than an actual score, at least as large as the actual score, and sufficiently accurate to represent an actual score in decoding computations could be used to eliminate the need to compute actual template-image matching scores during full Viterbi decoding of a text line. The present invention identifies such a scoring heuristic based on information about corresponding columns of pixels in both the templates and the image region of the text line being decoded. There are two significant advantages to the scoring heuristic of the present invention. Heuristic column-based scoring produces a true upper bound score for intra-line nodes in the stochastic finite state network that represents the document image model, and so the heuristic scores may be used during line decoding to reduce the number of actual template-image matching scores that need to be computed, without sacrificing any accuracy in line decoding.
In addition, using the heuristic scores essentially reduces the two-dimensional computation of the actual template-image scores to a simpler one-dimensional computation for the heuristic scores. The simpler but accurate computation of the column-based heuristic scores provides a significant improvement in computational efficiency because the heuristic scores replace a very large number of computationally expensive actual template-image scores required by prior decoding methods. Computing an actual score involves performing an AND operation between a 2D character template and a 2D observed image region and then summing the resulting ON pixels. A column-based heuristic score is computed using one-dimensional data structures that represent information about the counts of ON pixels in columns of the character templates and in columns of the image. These one-dimensional data structures are referred to herein as analogues, or surrogates, of the templates and image. Tests of DID on a full page document using heuristic column-based scoring in place of actual scoring during line decoding show decoding times of thirty (30) seconds or less, as compared to the decoding time of over two minutes using the ICP method disclosed in the '444 ICP patent.
The heuristic scoring technique of the present invention may be used in any text line decoder that uses as input a stochastic finite state network that models the document image layout of the document image being decoded. Thus, it may be used in simple text line decoders as well as in the two-dimensional DID method of image recognition disclosed in the patents cited above.
Several embodiments of the heuristic scores are illustrated. In a first embodiment, a template analogue data structure for each character template is produced in the form of a one-dimensional array of the pixel counts of ON pixels in character template columns. An image analogue data structure in the form of a one-dimensional array of the pixel counts of ON pixels in observed image columns is also produced. A column-based heuristic template-image score is computed by comparing a template analogue data structure with the image analogue data structure column by column, taking the minimum of the two pixel counts for each column, and summing those minimums across the width of the template. A second embodiment implements column-based heuristic scoring for document image decoding systems using multi-level templates. In this embodiment, the template analogue data structure created for each template essentially functions as a lookup table for a heuristic score for a column count of ON pixels in the image, and the heuristic score is computed for an entire template by summing the individual column scores retrieved from each of the lookup tables. In still another embodiment, the upper bound heuristic score is less rigorously specified and the amount of computation required to produce the score is further reduced. This is achieved by forming the template and image analogue data structures from combined counts of ON pixels in adjacent columns in the image and in the templates, and computing the heuristic scores using those analogue data structures. In effect, some pixel counts for columns are interpolated from counts in adjacent columns.
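The first embodiment, for bi-level templates, can be sketched as follows. The function names `column_counts`, `heuristic_score` and `actual_score` are hypothetical, bitmaps are assumed to be lists of rows of 0/1 values, and the image counts are assumed to be taken over a text line region that contains every row the template can cover. Under those assumptions the number of ON pixels shared within any one column cannot exceed either column's own count, so the sum of column minimums is never smaller than the AND-and-sum score.

```python
def column_counts(bitmap):
    """1D analogue of a bitmap: number of ON pixels in each column."""
    return [sum(column) for column in zip(*bitmap)]


def heuristic_score(template_counts, image_counts, x0):
    """Sum, over the template's columns, of min(template count, image count)."""
    total = 0
    for dx, t_count in enumerate(template_counts):
        x = x0 + dx
        if 0 <= x < len(image_counts):
            total += min(t_count, image_counts[x])
    return total


def actual_score(template, image, x0):
    """AND-and-sum score with the template's top row on the image's top row."""
    total = 0
    for t_row, i_row in zip(template, image):
        for dx, t_pixel in enumerate(t_row):
            x = x0 + dx
            if 0 <= x < len(i_row):
                total += t_pixel & i_row[x]
    return total
```

For example, with `image = [[1, 0], [0, 1]]` and `template = [[0, 1], [1, 0]]`, both analogues are `[1, 1]`, so the heuristic score at `x0 = 0` is 2 while the actual score is 0: the heuristic is an upper bound, not an estimate of equality. The analogues are computed once per template and once per text line, after which each heuristic score costs one pass over the template's width rather than over its full area.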
Therefore, in accordance with one aspect of the present invention, a method is provided for operating a processor-controlled machine to decode a text line image. The machine includes a processor and a memory device for storing data including instruction data the processor executes to operate the machine. The processor is connected to the memory device for accessing and executing the instruction data stored therein. The method comprises receiving an input text line image indicating a bitmapped image region including a plurality of image glyphs each indicating a character symbol. The method further comprises obtaining a plurality of character templates and character labels stored in the memory device of the machine. Each character template indicates a two-dimensional bitmapped image of a character symbol, and each character label identifies the character symbol represented by the corresponding character template.
The method further comprises producing a one-dimensional (1D) image analogue data structure using pixel counts of image foreground pixels in columns of the image portion of the input text line image, and producing a plurality of 1D template analogue data structures using pixel counts of template foreground pixels in columns of the character templates. Then a plurality of template-image heuristic scores are computed using the 1D image analogue data structure and the plurality of 1D template analogue data structures. Each template-image heuristic score indicates an estimated measurement of a match between one of the plurality of character templates and a two-dimensional region of the image portion of the input text line image. Then, the method performs a dynamic programming operation using a decoding trellis data structure. The decoding trellis data structure indicates a stochastic finite state network including nodes and transitions between nodes indicating a model of expected spatial arrangements of character symbols in the input text line image. The dynamic programming operation uses the plurality of template-image heuristic scores to decode the input text line image and produce the character labels of the character symbols represented by the image glyphs included therein.
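By way of illustration only, the following Python sketch shows how column-based heuristic scores might drive a dynamic programming pass over a text line. It is a deliberate simplification and not the decoder of the disclosure: it assumes a network in which every template is a transition that advances by the template width, it models no transition probabilities, and it uses raw sums of minimum column counts as scores. The function name `decode_line`, the template labels, and the column counts are hypothetical.

```python
def decode_line(image_counts, template_counts):
    """Find the label sequence whose templates tile the line with the
    highest total heuristic score (simplified trellis, assumptions above)."""
    width = len(image_counts)
    best = [float("-inf")] * (width + 1)  # best score of a path ending at x
    back = [None] * (width + 1)           # (previous x, label) backpointer
    best[0] = 0
    for x in range(width):
        if best[x] == float("-inf"):
            continue  # no path reaches this column
        for label, counts in template_counts.items():
            end = x + len(counts)
            if end > width:
                continue  # template would run past the end of the line
            score = best[x] + sum(
                min(t, image_counts[x + i]) for i, t in enumerate(counts)
            )
            if score > best[end]:
                best[end] = score
                back[end] = (x, label)
    if width > 0 and back[width] is None:
        return None  # no sequence of templates spans the whole line
    labels = []
    x = width
    while x > 0:
        x, label = back[x]
        labels.append(label)
    return "".join(reversed(labels))


# Hypothetical 1D template analogues, keyed by character label.
templates = {"a": [3, 1], "b": [1, 3], " ": [0]}
print(decode_line([3, 1, 0, 1, 3], templates))
```

The sketch returns the character labels along the best path, which corresponds to the transcription produced by the decoding step described above.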
In another aspect of the invention for decoding a text line image, producing the plurality of 1D template analogue data structures includes producing a 1D template pixel sums data structure for each character template. Each 1D template pixel sums data structure indicates counts of template column foreground pixels in groups of at least two consecutive columns of the character template. Also in this aspect of the invention, producing the 1D image analogue data structure includes producing a 1D image pixel sums data structure including counts of image column foreground pixels in at least every two adjacent columns of pixels in the image portion such that the 1D image pixel sums data structure includes a combined image column count for every column of pixels in the image portion. In this aspect of the invention, computing a template-image heuristic score includes the steps of determining, for each combined template column count in the 1D template pixel sums data structure, a minimum of the combined template column count of the template column foreground pixels and the combined image column count of the image column foreground pixels, and summing the minima to produce the template-image heuristic score, such that one template-image heuristic score is computed for each 1D template pixel sums data structure at each column position in the image portion.
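By way of illustration only, the combined-count aspect described above may be sketched in Python for groups of exactly two columns. The function names, the zero padding at the right edge of the image, and the example counts are assumptions made for this sketch. The template is summed in non-overlapping pairs of columns, while the image receives one combined count at every column position.

```python
def template_pixel_sums(template_counts):
    """Combined counts for non-overlapping groups of two template columns."""
    return [
        sum(template_counts[i:i + 2])
        for i in range(0, len(template_counts), 2)
    ]


def image_pixel_sums(image_counts):
    """Combined count of each column and its right neighbor, for every
    column (the last column is paired with an assumed empty column)."""
    n = len(image_counts)
    return [
        image_counts[x] + (image_counts[x + 1] if x + 1 < n else 0)
        for x in range(n)
    ]


def combined_score(template_sums, image_sums, x0):
    """Sum of minima; template group j aligns with image column x0 + 2*j."""
    return sum(
        min(t, image_sums[x0 + 2 * j])
        for j, t in enumerate(template_sums)
    )


# Illustrative column counts only (hypothetical data).
template_counts = [3, 2, 1, 2]
image_counts = [0, 3, 2, 1, 1, 0]

t_sums = template_pixel_sums(template_counts)
i_sums = image_pixel_sums(image_counts)
positions = len(image_counts) - len(template_counts) + 1
print([combined_score(t_sums, i_sums, x0) for x0 in range(positions)])
```

Each score now needs half as many comparisons as the per-column form. The result can never fall below the per-column score, since the sum of two per-column minima cannot exceed the minimum of the two combined counts, so the bound is looser but remains an upper bound.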
In another aspect of the invention for decoding a text line image, producing the 1D image analogue data structure includes computing a plurality of combined image column counts of image column foreground pixels to produce a 1D image pixel sums data structure. Computing each combined image column count includes the steps of producing a count of foreground pixels in at least every two adjacent columns of the image portion, determining a maximum count of foreground pixels for each pair of consecutive counts of foreground pixels, and storing the maximum count of foreground pixels as a combined image column count in the 1D image pixel sums data structure, such that the 1D image pixel sums data structure includes a combined image column count for every other one of the columns of the image portion. Computing a template-image heuristic score includes first computing a first template-image heuristic score using the 1D template pixel sums data structure indicating a first character template at a first column position in the image portion. This computing step includes determining, for each combined template column count in the 1D template pixel sums data structure, a minimum of the combined template column count of the template column foreground pixels and the combined image column count of the image column foreground pixels, and summing the minima to produce the template-image heuristic score. Then the first template-image heuristic score is assigned as the template-image heuristic score for the first character template at a next adjacent column position in the image portion such that one template-image heuristic score is computed for each 1D template pixel sums data structure at every other column position in the image portion.
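By way of illustration only, the every-other-column aspect described above may be sketched in Python as follows. The function names, the zero padding at the right edge of the image, and the example counts are assumptions made for this sketch. Two-column sums are formed at every column, the larger of each consecutive pair of sums is kept, and the score computed at one column position is reused at the next adjacent position.

```python
def image_max_pixel_sums(image_counts):
    """Two-column sums at every column, then the maximum of each
    consecutive pair of sums, giving one entry per two image columns."""
    n = len(image_counts)
    padded = list(image_counts) + [0, 0]  # assumed empty columns past the edge
    sums = [padded[x] + padded[x + 1] for x in range(n + 1)]
    return [max(sums[x], sums[x + 1]) for x in range(0, n, 2)]


def scores_every_other_column(template_sums, image_max_sums, n_positions):
    """Compute a score at even column positions only and assign the same
    score to the next adjacent (odd) position."""
    scores = []
    for x0 in range(0, n_positions, 2):
        k = x0 // 2
        score = sum(
            min(t, image_max_sums[k + j])
            for j, t in enumerate(template_sums)
        )
        scores.append(score)
        if x0 + 1 < n_positions:
            scores.append(score)  # reused, not recomputed
    return scores


# Illustrative data only: template column counts [3, 2, 1, 2] summed in pairs.
template_sums = [5, 3]
image_counts = [0, 3, 2, 1, 1, 0]

max_sums = image_max_pixel_sums(image_counts)
n_positions = len(image_counts) - 4 + 1  # template width is 4 columns
print(max_sums)
print(scores_every_other_column(template_sums, max_sums, n_positions))
```

Taking the maximum of each consecutive pair of sums is what allows one score to serve two adjacent positions: whichever of the two alignments is in effect, the stored count is at least as large as the two-column sum that alignment would use, so the reused score still does not underestimate.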