The invention relates to pattern recognition methods, such as character recognition. In particular, the invention relates to a method that optimizes a recognition dictionary used in pattern recognition so that the recognition dictionary can better distinguish between patterns that are difficult to tell apart.
Character recognition is typically implemented in three stages: preprocessing, feature extraction, and discrimination. During the preprocessing stage, size normalization of the input pattern and noise removal are performed. During the feature extraction stage, feature values that represent the shape of the character are extracted from each character pattern in the input pattern, and a feature vector representing the feature values is generated. Each feature represents a portion of the structure of the character pattern. Typical features include the length of stroke, the angle of stroke, and the number of loops. For example, when the feature is the number of loops, the feature may have one of the following values:
0: when the character pattern belongs to numeral “1”, “2” or “3,”
1: when the character pattern belongs to numeral “0”, “6” or “9,” and
2: when the character pattern belongs to numeral “8.”
Typically, many hundreds of feature values are extracted for each character pattern in the input pattern. The feature values are represented by a feature vector whose elements each represent the feature value of one of the features of the character pattern. A feature vector has a large number of dimensions, with 500 dimensions being typical.
During the discrimination stage, the feature vector of each character pattern in the input pattern is compared with a reference vector for each category. The character pattern is determined to belong to the category whose reference vector is closest to the feature vector of the character pattern. In character recognition, each category represents one character. For example, in numeral recognition, a category exists for each of the characters “0,” “1,” . . . , “9.”
The reference vectors are stored in a recognition dictionary. The recognition dictionary is statistically created from character patterns obtained from the handwriting of many people. Such character patterns are called “training patterns.” Before the character recognition system can be used for handwriting recognition, the recognition dictionary is created from handwriting samples provided by a number of unspecified writers, where each sample includes a predetermined set of character patterns. The category to which each of the character patterns in the set belongs is known. The feature vectors extracted from the character patterns in each category are averaged and each average vector is stored in the recognition dictionary as the reference vector for the category.
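The dictionary construction and nearest-reference discrimination described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names `build_dictionary` and `classify`, the two-dimensional feature vectors, and the use of the plain squared Euclidean distance are all assumptions made for the example.

```python
def build_dictionary(training):
    """Average the feature vectors of each category.

    training: list of (category, feature_vector) pairs with known categories.
    Returns {category: reference_vector}.
    """
    sums, counts = {}, {}
    for category, x in training:
        if category not in sums:
            sums[category] = [0.0] * len(x)
            counts[category] = 0
        sums[category] = [s + v for s, v in zip(sums[category], x)]
        counts[category] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def classify(x, dictionary):
    """Return the category whose reference vector is closest to x."""
    def squared_distance(ref):
        return sum((xi - ri) ** 2 for xi, ri in zip(x, ref))
    return min(dictionary, key=lambda c: squared_distance(dictionary[c]))


# Hypothetical toy training patterns with two features each.
training = [("0", [1.0, 0.9]), ("0", [1.0, 1.1]),
            ("1", [0.0, 3.0]), ("1", [0.2, 3.2])]
dictionary = build_dictionary(training)
label = classify([0.9, 1.2], dictionary)
```

Here `label` is the category assigned to a new pattern; a real system would use feature vectors with hundreds of dimensions.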
The effectiveness of a character recognition system is characterized by its recognition ratio. When character recognition is performed, one of the following results is obtained for each character pattern in the input pattern: (1) the category to which the character pattern belongs is correctly recognized; (2) the character pattern is successfully recognized as belonging to a category, but the character pattern is mis-read so that the category is incorrect; or (3) the character pattern is not recognized as belonging to any category. For example, when the character pattern is the numeral “1,” result (1) occurs when the character pattern is recognized as belonging to the category “1,” result (2) occurs when the character pattern is incorrectly recognized as belonging to the category “7,” for example, and result (3) occurs when the category to which the character pattern belongs cannot be recognized. The recognition ratio is the number of character recognition events that generate result (1) divided by the total number of character patterns in the input pattern. A successful character recognition system is one that has a recognition ratio close to unity (or 100%).
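As a small worked example of the recognition ratio, the sketch below counts only result (1) as a success. Representing a rejection (result (3)) by `None` is an assumption of the example, as is the function name.

```python
def recognition_ratio(truth, predicted):
    """Fraction of patterns whose category is correctly recognized.

    predicted holds a category label, or None when the pattern is rejected.
    """
    correct = sum(1 for t, p in zip(truth, predicted) if p == t)
    return correct / len(truth)


truth = ["1", "1", "7", "8"]
predicted = ["1", "7", None, "8"]   # one mis-read (result 2), one rejection (result 3)
ratio = recognition_ratio(truth, predicted)
```

Two of the four patterns are recognized correctly, so `ratio` is 0.5.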
Two basic approaches may be used to increase the recognition ratio of a character recognition system. These approaches are:
(1) to describe the distribution of the features of each category as precisely as possible; and
(2) to emphasize the distribution differences between the categories.
Many known approaches to increasing the recognition ratio of character recognition systems concentrate on the first approach. These approaches have been successful, but only to a limited extent.
In Handprinted Numerals Recognition by Learning Distance Function, IEICE Trans. D-II, vol. J76-D-II, no. 9, pp. 1851-59 (September 1993), the inventor described Learning by Discriminant Analysis (LDA), a way of increasing the recognition ratio of character recognition systems based on the second approach. In particular, LDA increases the recognition ratio by reducing the number of incorrectly-recognized character patterns (result (2) above). In the LDA character recognition method, a discriminant function obtained by applying Fisher's linear discriminant analysis is superposed onto the original distance function between the feature vector of each character pattern in the input pattern and the reference vector of each category. The original distance function may be the weighted Euclidean distance or the quadratic discriminant function between the feature vector and the reference vectors.
Fisher's linear discriminant analysis is applied between the reference vector of each category and the feature vector of a rival pattern for the category. A rival pattern for category A, for example, is defined as a character pattern that belongs to a different category, e.g., category B, but is incorrectly recognized as belonging to category A. In this example, the rival pattern is incorrectly recognized as belonging to category A when the Euclidean distance between the feature vector of the rival pattern and the reference vector of category A is less than that between the feature vector of the rival pattern and the reference vector of category B, the category to which the rival pattern actually belongs.
In LDA, the linear discriminant analysis uses both the linear terms and the quadratic terms of the feature vector as linear terms. By applying the LDA pattern recognition method, multiple parameters in the distance function, such as the reference vector, the weighting vector and the constant term, can be determined at the same time. The use of the weighted Euclidean distance as the original distance function will be described in greater detail below.
The weighted Euclidean distance D(x) between the feature vector of a character pattern and the reference vector of a category can be described as follows:
$$D(x) = \sum_{m=1}^{M} \omega_m (x_m - \mu_m)^2 \qquad (1)$$
where $x = (x_1, \ldots, x_M)^t$ represents the feature vector of the character pattern,
$\mu = (\mu_1, \ldots, \mu_M)^t$ represents the reference vector of the category,
$\omega = (\omega_1, \ldots, \omega_M)^t$ represents the weighting vector, and
t denotes transposition.
Subscripts denoting the index of the category have been omitted from equation (1) to simplify it.
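Equation (1) translates directly into code. The following is a minimal sketch with hypothetical two-dimensional values; the function name is an assumption.

```python
def weighted_distance(x, mu, omega):
    """Weighted Euclidean distance of equation (1) for one category."""
    return sum(w * (xi - mi) ** 2 for xi, mi, w in zip(x, mu, omega))


# Hypothetical feature vector, reference vector and weighting vector.
d = weighted_distance([1.0, 2.0], [0.0, 0.0], [2.0, 0.5])
```

With these values the two terms are 2.0 x 1 and 0.5 x 4, so `d` is 4.0.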
To obtain the discriminant function F(x), LDA first performs a character recognition operation on an input pattern composed of a large set of training patterns. Each training pattern is determined to belong to the category for which the value of D(x) is lowest. The results of the character recognition operation are analyzed to identify, for each category, the training patterns that are incorrectly recognized as belonging to the category. The training patterns that are incorrectly recognized as belonging to a category constitute the rival pattern set for the category. The training patterns that are defined as belonging to each category constitute the in-category pattern set for the category. For example, training pattern x is defined as belonging to category A because the writer who wrote training pattern x did so in response to a request to write a pattern that belongs to category A. Membership of a training pattern in an in-category pattern set is independent of the recognition result of the character recognition operation.
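The formation of the in-category and rival pattern sets can be sketched as below. The names `rival_sets`, `refs` and `weights`, and the one-dimensional toy data, are assumptions made for illustration only.

```python
def weighted_distance(x, mu, omega):
    return sum(w * (xi - mi) ** 2 for xi, mi, w in zip(x, mu, omega))


def rival_sets(training, refs, weights):
    """Split training patterns into in-category sets and rival sets.

    training: list of (true_category, feature_vector).
    A pattern joins the rival set of the category it is wrongly assigned to;
    it always stays in the in-category set of its true category.
    """
    in_category = {c: [] for c in refs}
    rivals = {c: [] for c in refs}
    for true_cat, x in training:
        in_category[true_cat].append(x)
        guess = min(refs, key=lambda c: weighted_distance(x, refs[c], weights[c]))
        if guess != true_cat:
            rivals[guess].append(x)
    return in_category, rivals


refs = {"A": [0.0], "B": [4.0]}
weights = {"A": [1.0], "B": [1.0]}
training = [("A", [0.5]), ("A", [1.0]), ("B", [1.5]), ("B", [4.2])]
in_category, rivals = rival_sets(training, refs, weights)
```

In this toy data the category B pattern at 1.5 lies closer to the reference vector of A, so it becomes a rival pattern for A while remaining in the in-category set of B.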
LDA next applies linear discriminant analysis between the in-category pattern set of each category and its corresponding rival pattern set determined as just described. The linear discriminant analysis uses a 2M-dimensional vector y whose components are defined as:
$y_m = (x_m - \mu_m)^2$
$y_{M+m} = (x_m - \mu_m)$
The discriminant function F(x) can be written as follows:
$$F(x) = \sum_{m=1}^{M} a_m y_m + \sum_{m=1}^{M} b_m y_{M+m} + c = \sum_{m=1}^{M} a_m (x_m - \mu_m)^2 + \sum_{m=1}^{M} b_m (x_m - \mu_m) + c \qquad (2)$$
It can be seen that the second form of F(x) is a quadratic equation of the form $ax^2 + bx + c$.
The constant c always has a negative value. The way in which the coefficients $a_m$ and $b_m$ are determined will be described below.
The discriminant function satisfies F(x) < 0 when the character pattern belongs to the category, and F(x) > 0 when the character pattern belongs to the rival pattern set.
The modified Euclidean distance G(x) is defined as:
$$G(x) = D(x) + \gamma F(x) \qquad (3)$$
where $\gamma$ is a positive coefficient whose value is determined experimentally to maximize the recognition ratio when character recognition is performed using the modified Euclidean distance G(x).
By adding $\gamma F(x)$ to D(x), the original weighted Euclidean distance value is modified in such a manner that the distance between the feature vector of the character pattern and the reference vector of the category to which the character pattern belongs is reduced, and the distance between the feature vector of the character pattern and the reference vectors of the categories to which the character pattern does not belong is increased. This enables a character pattern that was incorrectly recognized when the original weighted Euclidean distance D(x) was used to be recognized correctly when the modified Euclidean distance G(x) is used.
G(x) can be written as:
$$G(x) = \sum_{m=1}^{M} \omega'_m (x_m - \mu'_m)^2 + d = \sum_{m=1}^{M} (\omega_m + \Delta\omega_m)\bigl(x_m - (\mu_m + \Delta\mu_m)\bigr)^2 + d \qquad (4)$$
where:
$$\Delta\omega_m = \gamma a_m$$
$$\Delta\mu_m = -\frac{\gamma b_m}{2(\omega_m + \Delta\omega_m)} \approx -\frac{\gamma b_m}{2\omega_m}$$
$$d = \gamma c + \sum_{m=1}^{M} d_m, \text{ and}$$
$$d_m = -\frac{1}{4}\,\frac{(\gamma b_m)^2}{\omega_m + \Delta\omega_m}$$
The form of the modified Euclidean distance function G(x) is the same as that of the original weighted Euclidean distance function D(x), except for the addition of the constant term d, and the m-th components of the reference vector and the weighting vector being modified by $\Delta\mu_m$ and $\Delta\omega_m$, respectively. This means that the parameters of the distance function can be learned using the rival pattern set for the category. The modified reference vectors, weighting vectors, and constant terms are stored in the recognition dictionary when determining G(x).
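The equivalence between D(x) + γF(x) and the modified form with updated parameters can be checked numerically by completing the square in each component. The sketch below uses the exact shift Δμ_m = −γb_m / (2(ω_m + Δω_m)) and the constant d = γc − Σ (γb_m)² / (4(ω_m + Δω_m)), rather than the approximation; all numeric values and function names are hypothetical.

```python
def D(x, mu, omega):
    """Weighted Euclidean distance."""
    return sum(w * (xi - mi) ** 2 for xi, mi, w in zip(x, mu, omega))


def F(x, mu, a, b, c):
    """Discriminant function with quadratic and linear terms."""
    return sum(am * (xi - mi) ** 2 + bm * (xi - mi)
               for xi, mi, am, bm in zip(x, mu, a, b)) + c


def merged_parameters(mu, omega, a, b, c, gamma):
    """Fold gamma * F into the reference vector, weights and a constant."""
    omega_new = [w + gamma * am for w, am in zip(omega, a)]
    mu_new = [mi - gamma * bm / (2.0 * wn)
              for mi, bm, wn in zip(mu, b, omega_new)]
    d = gamma * c - sum((gamma * bm) ** 2 / (4.0 * wn)
                        for bm, wn in zip(b, omega_new))
    return mu_new, omega_new, d


# Hypothetical values for a two-dimensional feature vector.
x, mu, omega = [1.0, -2.0], [0.5, 0.0], [2.0, 1.0]
a, b, c, gamma = [0.3, -0.1], [0.4, 0.2], -0.7, 0.5

direct = D(x, mu, omega) + gamma * F(x, mu, a, b, c)
mu_new, omega_new, d = merged_parameters(mu, omega, a, b, c, gamma)
merged = D(x, mu_new, omega_new) + d
```

`direct` and `merged` agree up to rounding, which is why only the modified reference vector, weighting vector and constant need to be stored. The sketch assumes every modified weight ω_m + Δω_m is nonzero.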
The quadratic coefficients $a_m$, $b_m$ and c that define F(x) are determined as follows. A vector a is defined as:
$a = (a_1, \ldots, a_M, b_1, \ldots, b_M)^t$
$a = \Sigma^{-1}(\mu_R - \mu_N)$
$c = -a^t(\mu_R + \mu_N)/2$
where:
$\Sigma$ is the in-category covariance matrix for the vector y,
$\mu_N$ is the average feature vector, i.e., the reference vector, of the category, and
$\mu_R$ is the average feature vector of the rival pattern set of the category.
The following is an example of the way in which xcexa3 may be defined:
$\Sigma = \{(n_S - 1)S_S + (n_R - 1)S_R\}/(n_S + n_R - 2)$
where:
$S_S$ is the covariance matrix of the training patterns belonging to the category,
$S_R$ is the covariance matrix of the training patterns constituting the rival pattern set of the category,
$n_S$ is the number of training patterns belonging to the category, and
$n_R$ is the number of training patterns constituting the rival pattern set.
Once values for the quadratic coefficients $a_m$, $b_m$ and c of F(x) have been calculated, the value of F(x) can then be calculated for each category.
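A minimal sketch of the coefficient computation follows, using the pooled covariance matrix and a small Gaussian-elimination solver so that it needs no external libraries. Several points are assumptions of the sketch rather than statements of the method: the averages and covariances are taken over the 2M-dimensional vectors y, the constant is given the sign that places the threshold midway between the two projected means (so that F is negative at the in-category mean and positive at the rival mean), and all function names and data are hypothetical.

```python
def to_y(x, mu):
    """2M-dimensional vector: squared deviations, then deviations."""
    dev = [xi - mi for xi, mi in zip(x, mu)]
    return [u * u for u in dev] + dev


def mean(vs):
    return [sum(v[i] for v in vs) / len(vs) for i in range(len(vs[0]))]


def covariance(vs, m):
    k = len(m)
    return [[sum((v[i] - m[i]) * (v[j] - m[j]) for v in vs) / (len(vs) - 1)
             for j in range(k)] for i in range(k)]


def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))) / aug[i][i]
    return z


def lda_coefficients(in_cat, rivals, mu):
    ys = [to_y(x, mu) for x in in_cat]
    yr = [to_y(x, mu) for x in rivals]
    mu_n, mu_r = mean(ys), mean(yr)
    s_s, s_r = covariance(ys, mu_n), covariance(yr, mu_r)
    n_s, n_r, k = len(ys), len(yr), len(mu_n)
    pooled = [[((n_s - 1) * s_s[i][j] + (n_r - 1) * s_r[i][j]) / (n_s + n_r - 2)
               for j in range(k)] for i in range(k)]
    a = solve(pooled, [r - n for r, n in zip(mu_r, mu_n)])
    # Assumption: midpoint threshold, hence the negative sign.
    c = -sum(ai * (r + n) / 2.0 for ai, r, n in zip(a, mu_r, mu_n))
    return a, c


def discriminant(x, mu, a, c):
    return sum(ai * yi for ai, yi in zip(a, to_y(x, mu))) + c


# Hypothetical one-dimensional patterns (M = 1, so y has two components).
mu = [0.0]
in_cat = [[-1.0], [0.0], [1.0], [0.5], [-0.5]]
rivals = [[2.0], [3.0], [2.5], [3.5]]
a, c = lda_coefficients(in_cat, rivals, mu)
mean_f_in = sum(discriminant(x, mu, a, c) for x in in_cat) / len(in_cat)
mean_f_rival = sum(discriminant(x, mu, a, c) for x in rivals) / len(rivals)
```

The first M entries of `a` are the coefficients a_m and the last M entries are b_m. With the midpoint threshold, the average of F over the in-category set is negative and the average over the rival set is its positive mirror image.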
The optimum value of $\gamma$ is determined by performing successive character recognition operations on the training patterns in the training pattern set using D(x), F(x) and a different value of $\gamma$, and determining the recognition ratio of each character recognition operation. The optimum value of $\gamma$ is that which gives the greatest recognition ratio.
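The experimental search for γ amounts to a sweep over candidate values. The sketch below assumes a hypothetical one-dimensional, two-category setup with hand-picked coefficients; the candidate list, parameter layout and function names are illustrative only.

```python
def weighted_distance(x, mu, omega):
    return sum(w * (xi - mi) ** 2 for xi, mi, w in zip(x, mu, omega))


def discriminant(x, mu, a, b, c):
    return sum(am * (xi - mi) ** 2 + bm * (xi - mi)
               for xi, mi, am, bm in zip(x, mu, a, b)) + c


def recognition_ratio(gamma, training, params):
    """Recognition ratio on the training set when G = D + gamma * F is used."""
    correct = 0
    for true_cat, x in training:
        def modified_distance(cat):
            mu, omega, a, b, c = params[cat]
            return (weighted_distance(x, mu, omega)
                    + gamma * discriminant(x, mu, a, b, c))
        if min(params, key=modified_distance) == true_cat:
            correct += 1
    return correct / len(training)


# Per category: (mu, omega, a, b, c). Category B has no rivals, so F is zero.
params = {
    "A": ([0.0], [1.0], [0.5], [0.0], -1.0),
    "B": ([4.0], [1.0], [0.0], [0.0], 0.0),
}
training = [("A", [0.5]), ("A", [1.0]), ("B", [1.8]), ("B", [4.2])]
candidates = [0.0, 1.0, 2.0, 4.0]
ratios = {g: recognition_ratio(g, training, params) for g in candidates}
best_gamma = max(candidates, key=lambda g: ratios[g])
```

In this toy setup the category B pattern at 1.8 is mis-read as A until γ is large enough for the penalty γF(x) to outweigh the shorter original distance, so the largest candidate wins.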
The key to successfully using LDA is to determine F(x) correctly. To obtain F(x) using LDA, the pattern sets of two categories are projected onto the one-dimensional axis z = F(x), and the discriminant function F(x) that maximizes the Fisher criterion is determined. The Fisher criterion is defined as the ratio of the squared distance $T^2$ between the averages of the two category distributions on the z-axis to the within-category variance, which is defined as the sum of the variances of the two categories on the z-axis, $s_1^2 + s_2^2$; i.e., the criterion is $T^2/(s_1^2 + s_2^2)$.
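For two sets of projected values on the z-axis, the Fisher criterion can be computed as in the sketch below. The use of the population variance (`statistics.pvariance`) is an assumption, since the text does not say which variance estimator is intended; the projected values are hypothetical.

```python
from statistics import mean, pvariance


def fisher_criterion(z_a, z_b):
    """T^2 / (s1^2 + s2^2) for two sets of projected values z = F(x)."""
    t_squared = (mean(z_a) - mean(z_b)) ** 2
    return t_squared / (pvariance(z_a) + pvariance(z_b))


z_in_category = [-2.0, -1.0, 0.0]   # hypothetical projections, F(x) mostly negative
z_rival = [1.0, 2.0, 3.0]           # hypothetical projections, F(x) positive
criterion = fisher_criterion(z_in_category, z_rival)
```

The means are −1 and 2, so T² = 9; each set has variance 2/3, giving a criterion of 6.75.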
One problem encountered when applying the above-described technique concerns the symmetry of the distributions of the pattern sets of the two categories when the pattern sets are projected onto the z-axis. It is believed that the pattern distribution on the z-axis cannot be symmetrical because the features used for character recognition are distributed asymmetrically. It is well known that many elements of the feature vectors used in character recognition are distributed with a positive skew rather than a negative skew. Even if the pattern sets of two categories are symmetrically distributed in the feature space, the linear discriminant analysis uses quadratic terms which cause asymmetric distributions on the z-axis.
This is illustrated by a simple example. Suppose that the pattern set of a given category is normally distributed with a covariance of 1. Consider the case of $a_m = 1$ and $b_m = 0$. In this case, F(x) represents the Euclidean distance, so that the distribution on the z-axis is equivalent to the distribution of the Euclidean distance. The probability density distribution of the Euclidean distance p(z) is expressed as:
$$p(z) = \frac{z^{(M-2)/2}\, e^{-z/2}}{2^{M/2}\,\Gamma(M/2)} \qquad (5)$$
This formula is found in K. Fukunaga, Introduction to Statistical Pattern Recognition, Second Edition, Academic Press Inc. (1990). $\Gamma$ represents the gamma function. Equation (5) indicates that the probability density has a gamma distribution. The probability density of a gamma distribution is known to be asymmetric.
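The asymmetry can be seen numerically. Equation (5) has the form of a chi-square density with M degrees of freedom, which is a special case of the gamma distribution. The sketch below evaluates it for a hypothetical small M, checks by simple numerical integration that it is a proper density, and compares its value at equal distances on either side of the mean; the integration range and step are assumptions chosen for the example.

```python
import math


def density(z, M):
    """Equation (5) evaluated at z for M feature dimensions."""
    return z ** ((M - 2) / 2) * math.exp(-z / 2) / (2 ** (M / 2) * math.gamma(M / 2))


M = 4          # hypothetical small dimension; real feature vectors are far larger
step = 0.01
points = [(i + 0.5) * step for i in range(6000)]   # midpoint rule on [0, 60]

area = sum(density(z, M) * step for z in points)
mean = sum(z * density(z, M) * step for z in points)

# Density at the same distance below and above the mean.
below = density(mean - 2.0, M)
above = density(mean + 2.0, M)
```

The area is close to 1 and the mean is close to M, but the density two units below the mean is clearly larger than the density two units above it, so the distribution is not symmetric about its mean and has a longer right tail.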
In the general case for arbitrary values of $a_m$ and $b_m$, it is difficult to determine the distribution analytically. However, it is thought that the distribution is never symmetric. When distributions are asymmetric on the z-axis, Fisher's criterion can be maximized for the total of all character patterns, but it is not necessarily maximized for character patterns that are easily confused, even though it is highly desirable to distinguish between such character patterns. In other words, the original LDA technique does not generate an optimum discriminant function for character patterns that are easily confused, and the ability of the modified Euclidean distance G(x) generated by the original LDA technique to increase the recognition ratio of character recognition is therefore limited.
In Zur verstärkten Berücksichtigung schlecht erkennbarer Zeichen in der Lernstichprobe, 45 WISSENSCHAFTLICHE BERICHTE AEG-TELEFUNKEN No. 1, 97-105, D. Becker and J. Schürmann describe a method for improving the effectiveness of the recognition dictionary of a non-LDA-based character recognition system. In the character recognition system described by Becker et al., unlike in an LDA-based character recognition system, the effectiveness of the recognition dictionary depends entirely on the constituents of the training pattern set. Consequently, the method described in the article increases the effectiveness of the recognition dictionary by artificially generating a new training pattern set. The new training pattern set is generated by including an error sample composed of hard-to-recognize patterns in the existing training pattern set with increased weighting and generating a new recognition dictionary from the new training pattern set. The new recognition dictionary is generated by (1) using an existing recognition dictionary to perform a pattern recognition operation on the training pattern set; (2) forming an error sample composed of (a) all patterns from all categories misrecognized by the pattern recognition operation, and (b) some of the patterns nearly misrecognized by the pattern recognition operation; (3) modifying the training pattern set by including the members of the error sample with an increased weighting; (4) generating the new recognition dictionary using the modified training pattern set; and (5) repeating steps (1), (2), (3) and (4) using the new recognition dictionary in step (1) until the recognition ratio converges.
Unfortunately, there is nothing in the disclosure of Becker et al. that indicates that the method described therein could offer an effective solution to the above-mentioned asymmetrical distribution problem of LDA-based recognition systems.
It would be advantageous to provide a way of improving the discrimination ability of the discriminant function F(x) and the modified Euclidean distance G(x) by enabling Fisher's criterion to be maximized with respect to character patterns that are easily confused.
The invention solves the above problem by enabling Fisher's criterion to be maximized with respect to patterns that are easily confused, and therefore improves the discrimination ability of F(x) and G(x). As described above, formalizing the distribution on the z-axis is difficult, and this problem is hard to solve analytically. To solve this problem, the invention provides the following method:
A discriminant function F(x) is defined by performing a conventional Learning by Discriminant Analysis operation and a value of the discriminant function is determined for all the training patterns in the in-category pattern set of each category and for all the training patterns in the rival pattern set of the category. The in-category pattern set is composed of all the training patterns defined as belonging to the category. The rival pattern set is composed of the training patterns that belong to other categories and that are incorrectly recognized as belonging to the category.
An in-category pattern subset and a rival pattern subset are then formed for each category. The in-category pattern subset for the category is formed by selecting a predetermined number of the training patterns that belong to the in-category pattern set and that, among the training patterns that belong to the in-category pattern set, have the largest values of the discriminant function F(x). The rival pattern subset for the category is formed by selecting a predetermined number of the training patterns that belong to the rival pattern set of the category and that, among the training patterns that belong to the rival pattern set, have the smallest values of the discriminant function F(x).
A linear discriminant analysis operation is then performed on the in-category pattern subset and the rival pattern subset to obtain a new discriminant function.
This processing increases the discrimination ability of the discriminant function. Consequently, a significant improvement in the pattern recognition ratio is achieved.
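The subset selection step can be sketched as follows. The discriminant function used here is a hypothetical one-dimensional stand-in for the F(x) produced by the first LDA pass, and the subset sizes and function names are likewise assumptions; the second linear discriminant analysis would then be run on the two returned subsets.

```python
def select_subsets(in_category, rivals, F, n_in, n_rival):
    """Pick the patterns closest to confusion according to F.

    In-category patterns with the largest F and rival patterns with the
    smallest F are the ones nearest the decision boundary.
    """
    in_subset = sorted(in_category, key=F, reverse=True)[:n_in]
    rival_subset = sorted(rivals, key=F)[:n_rival]
    return in_subset, rival_subset


def F(x):
    """Hypothetical discriminant from a first LDA pass (one feature)."""
    return 0.5 * x[0] ** 2 - 1.0


in_category = [[0.2], [1.0], [-1.3], [0.5]]
rivals = [[1.8], [3.0], [-2.0]]
in_subset, rival_subset = select_subsets(in_category, rivals, F, 2, 2)
```

With these values the in-category subset is the two patterns farthest from the reference vector, and the rival subset is the two rivals closest to it, which are the patterns most easily confused.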