Currently, there is a continuing desire for improved handwriting recognition (HWR) systems, in particular for handheld devices such as PDAs and smart phones. Embedded HWR systems that are used in such devices should provide high accuracy and real-time speed with a small memory footprint.
Scanning n-tuple (SNT) classifiers provide accurate, high-speed recognition for offline or online character data. SNTs are maximum-likelihood classifiers that are applied to chain code feature sequences, where the probability of observing the complete code is given by the ensemble probability for observing all of the SNTs derived from the chain code.
SNT recognizers have demonstrated the potential for excellent speed and accuracy for on-line HWR analysis, but regrettably these recognizers consume significant memory resources. The present invention significantly reduces the memory use of the SNT recognizer through the use of mixture models and distributional clustering techniques. It is to be understood that this invention is applicable to any system that uses one or more probability tables.
SNT Recognizer
In regard to the implementation of a mixture model technique for probability table compression, for each character sample of a character class C, the SNT algorithm generates a variable-length sequence of features, $f_1, \ldots, f_L$. We define the i-th n-tuple of a given feature sequence to be:

$$X^{i}_{1,N} = (f_{i+k}, f_{i+2k}, \ldots, f_{i+Nk}) \qquad \text{(Equation 1)}$$

where $i = 1, \ldots, L - Nk$, and k is the sub-sampling distance. The SNT assumes that the n-tuples are all independent; thus the probability of observing a given sequence of n-tuples is given by:
$$P\left(\bigcup_{i} X^{i}_{1,N} \,\middle|\, C\right) = \prod_{i} P\left(X^{i}_{1,N} \mid C\right) \qquad \text{(Equation 2)}$$
The joint probability $P(X_{1,N} \mid C)$ is modeled by a lookup table that holds, for each possible n-tuple, its normalized frequency count over all the data for a given class C.
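The n-tuple scan of Equation 1 and the independence assumption of Equation 2 can be sketched as follows. This is a minimal illustration rather than the recognizer's actual implementation: the function names, the dictionary-based table, and the `floor` probability substituted for n-tuples never seen in training are all assumptions introduced here.

```python
import math
from collections import Counter

def scan_ntuples(features, n, k):
    """Yield the n-tuples of Equation 1; features[0] holds f_1."""
    length = len(features)
    for i in range(1, length - n * k + 1):  # i = 1, ..., L - N*k
        # X^i = (f_{i+k}, f_{i+2k}, ..., f_{i+Nk}), shifted to 0-based indexing
        yield tuple(features[i + j * k - 1] for j in range(1, n + 1))

def train_table(samples, n, k):
    """Normalized frequency counts of n-tuples over all samples of one class."""
    counts = Counter()
    for features in samples:
        counts.update(scan_ntuples(features, n, k))
    total = sum(counts.values())
    return {ntuple: count / total for ntuple, count in counts.items()}

def log_likelihood(features, table, n, k, floor=1e-6):
    """Log of Equation 2: sum of log P(X^i | C) over all scanned n-tuples.

    `floor` is a hypothetical stand-in probability for unseen n-tuples.
    """
    return sum(math.log(table.get(t, floor)) for t in scan_ntuples(features, n, k))
```

Working in the log domain turns the product of Equation 2 into a sum, which avoids numerical underflow on long feature sequences.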
In regard to the implementation of a distributional clustering technique for probability table compression, for each sample of a character class $c_i$, the SNT algorithm generates a variable-length chain code that is sub-sampled into tuples of length n with features $f_1, f_2, \ldots, f_n$, where each code f ranges from 0 to $\sigma - 1$.
In training, we assume a uniform distribution of the class prior probabilities $p(c_i)$ for the set of Q character classes $C = \{c_1, c_2, \ldots, c_Q\}$ and estimate the probability distribution $P(C \mid T_i)$ of the observed n-tuples at each i. In decoding, given a sequence of observed n-tuples $\tau = (t_1, t_2, \ldots, t_M)$, where $t_k \in \{T_1, T_2, \ldots, T_{\sigma^n}\}$, $k = 1, 2, \ldots, M$, the SNT classifier assumes that the n-tuples are mutually independent. Note, in addition, that $X^{i}_{1,N} \in \{T_1, T_2, \ldots, T_{\sigma^n}\}$.
Using Bayes' rule and assuming a uniform distribution of class prior probabilities, it can be shown that the posterior probability of the input belonging to class $c_i$, $p(c_i \mid \tau)$, is determined by the product of the conditional probabilities of class $c_i$ given each individual n-tuple. Thus the classifier selects the character class with the highest posterior probability, as given by:
$$c = \arg\max_{i} \prod_{k=1}^{M} p(c_i \mid t_k) \qquad \text{(Equation 3)}$$

where each $p(c_i \mid t_k)$ is drawn from the $\sigma^n \times Q$ probability look-up table generated in training.
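The decision rule of Equation 3 can be sketched as below. The table layout (a dictionary mapping each n-tuple to a list of Q class posteriors), the choice to skip n-tuples absent from the table, and the `floor` value that guards against taking the logarithm of zero are assumptions made for this example only.

```python
import math

def classify(observed_tuples, posterior_table, num_classes, floor=1e-6):
    """Return the index i that maximizes prod_k p(c_i | t_k) (Equation 3).

    The product is accumulated as a sum of logarithms to avoid underflow.
    """
    scores = [0.0] * num_classes
    for t in observed_tuples:
        row = posterior_table.get(t)
        if row is None:
            continue  # assumption: an n-tuple unseen in training is ignored
        for i, p in enumerate(row):
            scores[i] += math.log(max(p, floor))  # floor is hypothetical
    return max(range(num_classes), key=scores.__getitem__)

# Toy table with Q = 2 classes and two distinct n-tuples.
toy_table = {(0, 1): [0.9, 0.1], (1, 2): [0.4, 0.6]}
print(classify([(0, 1), (1, 2)], toy_table, 2))  # class 0: 0.9*0.4 > 0.1*0.6
```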
Unfortunately, these look-up tables can become very large with commonly used values of $n \geq 5$ and $\sigma = 8$, making them impractical for embedded applications. The present invention comprises a method that compresses such look-up tables, allowing the n-tuple method to perform well with nominal accuracy loss at 20:1 compression, and that can scale to compression ratios of more than 5000:1 with only moderate increases in the error rate.
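To make the scale concrete, the size of an uncompressed $\sigma^n \times Q$ table can be computed directly. In the sketch below, the class count Q = 62 and the four-byte entry size are hypothetical values chosen for illustration; only n = 5 and σ = 8 come from the text above.

```python
def table_entries(sigma, n, num_classes):
    """Number of entries in a sigma^n x Q look-up table."""
    return sigma ** n * num_classes

def table_bytes(sigma, n, num_classes, bytes_per_entry):
    """Uncompressed table size in bytes."""
    return table_entries(sigma, n, num_classes) * bytes_per_entry

# n = 5 and sigma = 8 are from the text; Q = 62 and 4 bytes are assumptions.
rows = 8 ** 5                      # 32768 distinct n-tuples
size = table_bytes(8, 5, 62, 4)    # 8126464 bytes, roughly 7.75 MiB
print(rows, size)
print(size / 20, size / 5000)      # sizes at 20:1 and 5000:1 compression
```

Because the row count grows as $\sigma^n$, raising n by one multiplies the table size by σ.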
Compression of Joint Probability Tables Using Mixture Models
As with the SNT, conditional and joint probability tables are incorporated in many other on-line handwriting recognition systems for representing relationships between discrete random variables. N-gram language models and Bayesian networks are two such examples. One of the practical problems with such tables is that the table size grows exponentially with the number of random variables.
When such joint probability tables must be compressed, three factors should be considered. First, a compression algorithm should have a high compression ratio. Second, it should not severely degrade recognition accuracy. Third, it should not slow the recognizer so as to compromise real-time responsiveness. Many algorithms have been introduced for image and data communications compression (e.g. arithmetic coding, JPEG).
These methods are generally inappropriate for probability tables because the table data must be randomly accessed with minimal computational cost. In the literature of language model compression, quantization and pruning methods are used. Quantization allows probability terms to be represented with only one or two bytes rather than four. With pruning methods, high order conditional probabilities are approximated with low order ones. Those probability elements that can be approximated reliably are pruned away from tables.
Joint probability tables utilized within the present invention are decomposed into lower-dimensional components and their mixtures. Then, model parameters are quantized into integers of a predetermined size. This algorithm satisfies the three criteria for practical application. It has a high compression ratio. It classifies quickly because only linear operations are employed using integer math.
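The quantization step can be illustrated in isolation. The following sketch is not the scheme used by the invention, and it does not show the mixture decomposition: it simply assumes that log-probabilities are clipped at a hypothetical bound `min_log` and mapped linearly onto 8-bit integer codes.

```python
import math

def quantize_log_prob(p, bits=8, min_log=-16.0):
    """Map a probability to an integer code in [0, 2**bits - 1].

    min_log is a hypothetical clipping bound; p == 0 maps to code 0.
    """
    levels = (1 << bits) - 1
    log_p = max(math.log(p), min_log) if p > 0.0 else min_log
    return round((log_p - min_log) / (-min_log) * levels)

def dequantize_log_prob(code, bits=8, min_log=-16.0):
    """Recover the approximate log-probability represented by a code."""
    levels = (1 << bits) - 1
    return min_log + code / levels * (-min_log)

def integer_score(codes):
    """Accumulate a class score using integer additions only."""
    return sum(codes)
```

Because every class accumulates the same number of terms, summing the integer codes ranks the classes in the same order as summing the dequantized log-probabilities, up to rounding error.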
Distributional Clustering of N-Tuples
Consider the random variable over character classes, C, and its distribution given a particular n-tuple $T_i$, denoted $P(C \mid T_i)$. The idea behind distributional clustering of n-tuples is that if two distinct n-tuples, $T_i$ and $T_j$, induce similar class distributions, they can be clustered together and represented by a single distribution that is the weighted average of the individual distributions:
$$P(C \mid T_i \vee T_j) = \frac{P(T_i)\,P(C \mid T_i) + P(T_j)\,P(C \mid T_j)}{P(T_i) + P(T_j)} \qquad \text{(Equation 4)}$$
To be more general, from now on we will use the notion of a class distribution given a particular event $E_i$, denoted $P(C \mid E_i)$. Tuples belonging to the same cluster are treated as identical events and induce the same class distribution. Since we now need to store only one distribution per event, as opposed to one per distinct n-tuple, this paradigm leads to a compression ratio of $\sigma^n : M$, where M is the number of events. The small overhead of a look-up table mapping any n-tuple to an event is in most cases negligible compared to the size of the probability table. Note that, in place of Equation 4, other merging methods known to those skilled in the art may also be used.
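The merge of Equation 4 and the n-tuple-to-event indirection can be sketched as follows. The names `merge_distributions`, `tuple_to_event`, and `event_table` are hypothetical, and the toy values are invented for illustration.

```python
def merge_distributions(prior_i, dist_i, prior_j, dist_j):
    """Weighted average of two class distributions (Equation 4)."""
    total = prior_i + prior_j
    return [(prior_i * a + prior_j * b) / total for a, b in zip(dist_i, dist_j)]

# Two n-tuples with similar class distributions are merged into one event.
merged = merge_distributions(0.25, [0.9, 0.1], 0.75, [0.8, 0.2])

# After clustering, only one distribution per event is stored, plus a small
# map from each n-tuple to its event index (both structures are assumptions).
event_table = [merged, [0.1, 0.9]]
tuple_to_event = {(0, 1): 0, (0, 2): 0, (5, 6): 1}

def lookup(ntuple):
    """Fetch the class distribution for an n-tuple via its event index."""
    return event_table[tuple_to_event[ntuple]]
```

Here three n-tuples share two stored distributions; with $\sigma^n$ n-tuples mapped onto M events, the probability table shrinks by the ratio $\sigma^n : M$ described above.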
Measuring the Effect of Merging Two Distributions
Given two distributions $P(C \mid E_i)$ and $P(C \mid E_j)$, an information-theoretic measure of the difference between them is the Kullback-Leibler (KL) divergence, defined as:
$$D\big(P(C \mid E_i) \,\|\, P(C \mid E_j)\big) = \sum_{k=1}^{Q} p(c_k \mid E_i)\,\log\!\left(\frac{p(c_k \mid E_i)}{p(c_k \mid E_j)}\right) \qquad \text{(Equation 5)}$$

Unfortunately, this measure has two undesirable properties: it is not symmetric, and it is infinite when a class has nonzero probability in the first distribution and zero probability in the second. A related measure called "KL divergence to the mean" is defined as:
$$\frac{P(E_i)}{P(E_i \vee E_j)} \cdot D\big(P(C \mid E_i) \,\|\, P(C \mid E_i \vee E_j)\big) + \frac{P(E_j)}{P(E_i \vee E_j)} \cdot D\big(P(C \mid E_j) \,\|\, P(C \mid E_i \vee E_j)\big) \qquad \text{(Equation 6)}$$
In information-theoretic terms, this measure can be understood as the expected amount of inefficiency incurred if, instead of compressing two distributions optimally with their own codes, we use the code that would be optimal for their mean. This measure not only avoids the two undesirable properties of the classic KL measure, but is also more suitable for clustering, as it directly measures the effect of merging two distributions into one. For the purpose of n-tuple clustering in the context of character recognition, we desire to further modify this measure to take into account the cumulative effect of merging two distributions on the final classification. As shown in Equation 3, each n-tuple encountered in the input character is treated as an independent event, and the class likelihoods of all the events are accumulated to produce the final score.
Thus, the true cost of merging two distributions should be further weighted by the prior probability of the joint event: the less frequently two events are likely to occur, the smaller the impact of merging their distributions. We call this new measure the "weighted mean KL divergence"; it is defined as:

$$D_f(E_i, E_j) = P(E_i) \cdot D\big(P(C \mid E_i) \,\|\, P(C \mid E_i \vee E_j)\big) + P(E_j) \cdot D\big(P(C \mid E_j) \,\|\, P(C \mid E_i \vee E_j)\big) \qquad \text{(Equation 7)}$$

This is the distance measure we will use to cluster the n-tuple distributions. It is understood that there are many different methods for calculating the difference or similarity between two distributions. Any method for measuring distance between distributions known to those skilled in the art can be used.
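The three measures of Equations 5 to 7 can be compared side by side in a short sketch. The greedy pairwise merging loop at the end is an assumption added for illustration, since the text above specifies only the distance measure and not a particular clustering procedure; all function names are hypothetical.

```python
import math

def kl(p, q):
    """Equation 5: sum_k p_k * log(p_k / q_k); zero-probability terms add 0."""
    total = 0.0
    for pk, qk in zip(p, q):
        if pk > 0.0:
            if qk == 0.0:
                return math.inf
            total += pk * math.log(pk / qk)
    return total

def merge(w_i, p_i, w_j, p_j):
    """Equation 4: prior-weighted average of two class distributions."""
    w = w_i + w_j
    return [(w_i * a + w_j * b) / w for a, b in zip(p_i, p_j)]

def kl_to_mean(w_i, p_i, w_j, p_j):
    """Equation 6: KL divergence to the mean, weights normalized by P(Ei v Ej)."""
    m, w = merge(w_i, p_i, w_j, p_j), w_i + w_j
    return (w_i / w) * kl(p_i, m) + (w_j / w) * kl(p_j, m)

def weighted_mean_kl(w_i, p_i, w_j, p_j):
    """Equation 7: the same terms weighted by the unnormalized priors."""
    m = merge(w_i, p_i, w_j, p_j)
    return w_i * kl(p_i, m) + w_j * kl(p_j, m)

def cluster(priors, dists, num_events):
    """Assumed greedy scheme: repeatedly merge the closest pair of events."""
    events = [(w, list(d)) for w, d in zip(priors, dists)]
    while len(events) > num_events:
        pairs = ((a, b) for a in range(len(events))
                 for b in range(a + 1, len(events)))
        a, b = min(pairs, key=lambda ab: weighted_mean_kl(*events[ab[0]],
                                                          *events[ab[1]]))
        (w_a, p_a), (w_b, p_b) = events[a], events[b]
        merged = (w_a + w_b, merge(w_a, p_a, w_b, p_b))
        events = [e for idx, e in enumerate(events) if idx not in (a, b)]
        events.append(merged)
    return events
```

Unlike Equation 5, the two mean-based measures stay finite even when the distributions have disjoint support, because the merged distribution is nonzero wherever either input is. The exhaustive pair search shown here costs quadratic time per merge, so a table with $\sigma^n$ n-tuples would need a more economical search strategy in practice.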