1. Field of the Invention
The invention relates to a clustering system, and more particularly to a clustering system which is useful in classifying character image data into a predetermined number of font classes in an optical character reader (hereinafter, referred to as "an OCR").
2. Description of the Prior Art
In an OCR, character image data are generally clustered by the K-means method or another method used in multivariate analysis. These methods are roughly classified into two types. One is a method in which the number of classes and the representative vectors of the respective classes are input as initial values. The other is a method in which the number of classes and the representative vectors are obtained automatically.
Among these methods, the K-means method, in which initial values are to be input, is most frequently used. In the K-means method, the distance between each of the feature vectors of the character image data and each of the representative vectors input as initial values is calculated. For each of the feature vectors, the representative vector which is closest to that feature vector is obtained from the result of this distance calculation. Then, each of the feature vectors is assigned to the class to which the closest representative vector belongs. Thereafter, an average vector is calculated for each of the classes, and the average vector is set as the new representative vector of the class. The distance between each of the feature vectors and each of the new representative vectors is calculated, and the feature vectors are classified again. The calculation of the distances between the feature vectors and the new representative vectors, and the classification of the feature vectors, are repeated until no feature vector is exchanged between the classes as a result of the classification (i.e., until the convergence is obtained).
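The procedure described above may be sketched, by way of illustration only, as follows. The sketch assumes the Euclidean distance as the distance measure and assumes that a class to which no feature vector is assigned keeps its previous representative vector; neither assumption is stated in the description above. The names `k_means`, `nearest`, `mean` and `max_iter` are hypothetical.

```python
import math


def nearest(vec, reps):
    """Return the index of the representative vector closest to vec.

    Assumption: Euclidean distance is used as the distance measure.
    """
    return min(range(len(reps)), key=lambda k: math.dist(vec, reps[k]))


def mean(vectors):
    """Return the component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def k_means(features, initial_reps, max_iter=100):
    """Cluster features starting from the given initial representative vectors."""
    reps = [list(r) for r in initial_reps]
    labels = None
    for _ in range(max_iter):
        # Assign each feature vector to the class of its closest representative.
        new_labels = [nearest(v, reps) for v in features]
        # Convergence: no feature vector was exchanged between the classes.
        if new_labels == labels:
            break
        labels = new_labels
        # The average vector of each class becomes its new representative.
        for k in range(len(reps)):
            members = [v for v, lab in zip(features, labels) if lab == k]
            if members:  # assumption: an empty class keeps its old representative
                reps[k] = mean(members)
    return labels, reps
```

For example, `k_means([(0, 0), (0, 1), (10, 10), (10, 11)], [(1, 1), (9, 9)])` assigns the first two feature vectors to one class and the last two to the other, and the returned representative vectors are the averages of those classes rather than the initial values that were input.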
In the K-means method, however, the average vector for a class is obtained from the feature vectors which are assigned to the class on the basis of the distance calculation with the representative vectors, and this average vector is used as the new representative vector. Therefore, if an irregular vector is accidentally input or assigned to a class, the average vector (i.e., the new representative vector) of the class shifts. As a result, in the next classification, another irregular vector may be assigned to the class. This causes a problem in that, after the convergence is obtained, the initial representative vector (the initial value) no longer exists for the class.
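The shift of the average vector caused by a single irregular vector may be illustrated numerically as follows. The two-dimensional feature vectors and all names in this sketch are hypothetical values chosen only for illustration, and the Euclidean distance is assumed as the distance measure.

```python
import math


def mean(vectors):
    """Return the component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Hypothetical initial representative vector of one class.
initial_rep = [0.0, 0.0]

# Hypothetical regular feature vectors scattered evenly around it.
regular = [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

# One hypothetical irregular vector accidentally assigned to the same class.
irregular = [8.0, 6.0]

# Without the irregular vector, the average coincides with the initial value.
rep_without = mean(regular)

# With the irregular vector, the average is pulled toward it.
rep_with = mean(regular + [irregular])

# Distance by which the new representative has moved from the initial value.
shift = math.dist(initial_rep, rep_with)

print(rep_without, rep_with, shift)
```

In this illustration, one irregular vector among five moves the representative vector by about 2.0, which is twice the distance of any regular vector from the initial value, so that other distant vectors become more likely to be assigned to the class in the next classification.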
The clustering process using font names as the unit of classification can be effectively applied to some kinds of characters which have a small number of fonts, such as Japanese characters (in printing Japanese characters, a small number of fonts such as Ming, Gothic and textbook types are usually employed). In contrast, alphanumeric characters have a huge number of fonts (1,000 or more), and handwritten characters have countless fonts (the number of which corresponds to the number of writers). Therefore, it is difficult to cluster alphanumeric or handwritten characters by using font names. Even in a process of clustering such characters, however, one can easily determine in an intuitive manner the number of classes and the representative vectors of the respective classes. When printed alphanumeric characters are to be clustered, for example, one can define a class in which the representative vector is an ordinary character as used in ordinary books, and another class in which the representative vector is a thick character (e.g., a bold type one). When handwritten characters are to be clustered, a class in which the representative vector is a left-inclined "1", another class in which the representative vector is a right-inclined "1", etc. may be defined.
Even when the number of classes and the initial values of the representative vectors are determined intuitively, as long as the K-means method is used, there still arises the problem that the initial value of the representative vector for a class does not exist after the convergence. The person who determined the initial values will therefore experience this disadvantage of the conventional method. Furthermore, even if the representative vectors obtained for the respective classes after the convergence are registered as standard fonts, and the recognition process is performed using these standard fonts, the recognition rate is not improved.