1. Field of the Invention
The present invention relates to character recognition of a language having many characters, such as Japanese, Chinese, or Korean.
2. Description of the Related Art
In character recognition of a language having many characters, such as Japanese, Chinese, or Korean, a commonly adopted method extracts features from an input pattern to form a feature vector, calculates the distances between that feature vector and reference vectors extracted in advance for all target characters to be recognized, and selects the character corresponding to the reference vector having the smallest distance value as the recognized character.
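The nearest-reference scheme described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method of any cited publication: the function names `squared_distance` and `recognize`, the dictionary layout, and the toy two-feature vectors are all hypothetical.

```python
def squared_distance(x, r):
    """Sum of squared per-feature differences between two equal-length vectors."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r))


def recognize(feature_vector, reference_vectors):
    """Return the character whose reference vector is closest to feature_vector.

    reference_vectors maps each target character to its reference vector
    (a hypothetical layout chosen for this sketch).
    """
    return min(
        reference_vectors,
        key=lambda ch: squared_distance(feature_vector, reference_vectors[ch]),
    )


# Toy dictionary with two features per character (illustrative values only).
references = {"A": [0.0, 0.0], "B": [1.0, 1.0], "C": [5.0, 5.0]}
result = recognize([0.9, 1.2], references)
```

Every reference vector is visited once per input pattern, so the work grows with the size of the dictionary.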
Japanese Patent Unexamined Publication No. Hei. 2-186490 discloses a system for performing character recognition by calculating distances between a vector extracted from a pattern of an input character and reference vectors extracted from patterns of previously stored target characters. In this system, the Euclidean distance between the vector of the input pattern and the reference vector is calculated and is compared with a predetermined threshold to perform character recognition.
Japanese Patent Unexamined Publication No. Hei. 4-286087 discloses a system for performing character recognition by extracting a feature vector from an input character pattern and calculating the Euclidean distance from a reference vector stored in a feature dictionary, in which the feature dictionary is divided into clusters for respective similar character categories, a distance between the feature vector of the input pattern and a reference vector representing each cluster is calculated, and detailed recognition processing is performed as to reference vectors of similar characters belonging to a cluster having a smallest distance.
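A two-stage search of this general kind might look like the sketch below. It is an assumption-laden illustration rather than the cited system itself: the name `recognize_two_stage`, the `(representative, members)` tuple layout, and the toy vectors are hypothetical, and the detailed recognition step is reduced to a plain nearest-reference search within the chosen cluster.

```python
def squared_distance(x, r):
    """Sum of squared per-feature differences between two equal-length vectors."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r))


def recognize_two_stage(feature_vector, clusters):
    """clusters: list of (representative_vector, {character: reference_vector})."""
    # Stage 1: coarse search over one representative vector per cluster.
    _, members = min(
        clusters, key=lambda c: squared_distance(feature_vector, c[0])
    )
    # Stage 2: detailed search restricted to the members of the chosen cluster.
    return min(
        members, key=lambda ch: squared_distance(feature_vector, members[ch])
    )


# Two toy clusters of similar characters (illustrative values only).
clusters = [
    ([0.5, 0.5], {"A": [0.0, 0.0], "B": [1.0, 1.0]}),
    ([5.5, 5.5], {"C": [5.0, 5.0], "D": [6.0, 6.0]}),
]
result = recognize_two_stage([5.8, 6.1], clusters)
```

Only the cluster representatives and the members of one cluster are compared with the input, so the result depends on the first stage picking the right cluster.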
In general, for a language having many characters, such as Japanese, Chinese, or Korean, a very large number of features, for example, several hundred or several thousand, is used to improve recognition accuracy. In a character recognition system based on the distance from a reference vector corresponding to each candidate character, the calculation time is generally proportional to the number of candidate characters and the number of features, so that a drop in recognition speed becomes a problem. Specifically, the distance may be the Euclidean distance, the weighted Euclidean distance, the city block distance, or the like:

Euclidean distance:
$$\sum_{i=1}^{m} (x_i - r_i)^2$$

weighted Euclidean distance:
$$\sum_{i=1}^{m} w_i (x_i - r_i)^2$$

city block distance:
$$\sum_{i=1}^{m} |x_i - r_i|$$

where
    $X = (x_1, \ldots, x_m)$: feature vector of input pattern
    $R_j = (r_{j1}, \ldots, r_{jm})$: reference vector of j-th candidate character
    $W = (w_1, \ldots, w_m)$: weight vector of feature
    $m$: the number of features
    $n$: the number of target characters
In any case, the distance element, $(x_i - r_i)^2$ or $|x_i - r_i|$, must be calculated for each feature of each candidate character, that is, n×m times: (the number of candidate characters)×(the number of features).
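As a worked illustration of the three distances and the n×m element count, consider the sketch below. The function names and the small integer vectors are assumptions chosen for the example. The Euclidean distance is computed in its squared form, without a square root, which does not change which reference vector is nearest.

```python
def euclidean(x, r):
    """Squared Euclidean distance: sum of (x_i - r_i)^2 over all features."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r))


def weighted_euclidean(x, r, w):
    """Weighted squared Euclidean distance: sum of w_i * (x_i - r_i)^2."""
    return sum(wi * (xi - ri) ** 2 for xi, ri, wi in zip(x, r, w))


def city_block(x, r):
    """City block distance: sum of |x_i - r_i| over all features."""
    return sum(abs(xi - ri) for xi, ri in zip(x, r))


x = [1, 2, 3]                                   # input feature vector, m = 3
references = [[1, 2, 3], [2, 4, 6], [0, 0, 0], [3, 2, 1]]   # n = 4 candidates
w = [1, 2, 1]                                   # illustrative feature weights

n, m = len(references), len(x)
element_count = n * m   # distance elements evaluated per input pattern

euclid = [euclidean(x, r) for r in references]
weighted = [weighted_euclidean(x, r, w) for r in references]
block = [city_block(x, r) for r in references]
```

With 4 candidates and 3 features, each distance type requires 12 element calculations. The same product for thousands of characters and hundreds of features reaches hundreds of thousands of element calculations per input pattern.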
In the foregoing Japanese Patent Unexamined Publication No. Hei. 4-286087, the calculation of the Euclidean distance is restricted to the similar character cluster, so that the processing speed is improved. However, it is expected to be difficult to properly determine a representative vector that serves as the standard for selecting the similar character cluster, and recognition accuracy is expected to be lowered depending on the quality of the representative vector.