1. Field of the Invention
The present invention relates to a data classifier for classifying various data, such as sensory data including image and voice information.
2. Description of the Related Art
In recent years, the amount of information people receive has rapidly increased with the spread and advancement of information devices. In this environment, in order to facilitate selection of desired information, there is a strong demand for techniques for recognizing and classifying information without any human intervention.
To address this demand, a clustering method is known wherein data to be classified are compared and similar data are grouped together. Various methods are known for determining similarity, such as a maximum likelihood method, a K-means method, a merge method, and an MDS (Multi-Dimensional Scaling) method. These clustering methods all require manual execution of processes such as parameter setting.
On the other hand, as a method for performing the clustering process relatively autonomously, a method is known wherein input image data, which is one type of pattern data, is classified and sorted on a lattice space map. For this classification and sorting, for example, self-organizing feature mapping (hereinafter abbreviated simply as “SOM”) is used (T. Kohonen, Self-organizing formation of topologically correct feature maps, Biological Cybernetics, 1982). The SOM is a two-layer network consisting of an input layer to which data is input and a competitive layer formed as a lattice space map. The input is weighted and supplied to each lattice. The group of weights that a lattice applies to the input components is called a weight vector.
At first, the weight vectors are initialized through the following process. As described in the Kohonen reference cited above, input vectors equal in number to the prototypes are selected at random from among the plurality of input vectors (corresponding to feature sets at this point) which are the target of learning, and the weight vectors for the lattices are initialized with them. According to Kohonen, it is also possible to set the initial values of the weight vectors randomly.
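The two initialization options described above can be sketched as follows. This is a minimal illustration and not code from the Kohonen reference; the function names, the `rng` argument, and the uniform range `[low, high]` are assumptions made for the example.

```python
import random


def init_weights_from_samples(inputs, n_nodes, rng=random):
    """Copy n_nodes distinct input vectors, chosen at random, as initial weights.

    `inputs` is a list of feature vectors; n_nodes must not exceed len(inputs).
    """
    return [list(v) for v in rng.sample(inputs, n_nodes)]


def init_weights_random(n_nodes, dim, low=0.0, high=1.0, rng=random):
    """Alternative: draw every weight component uniformly from [low, high]."""
    return [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_nodes)]
```

Passing a seeded `random.Random` instance as `rng` makes either initialization reproducible.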
Next, a learning process is performed for the weight vectors. During the learning steps, a feature set for learning is generated and a predetermined measure (for example, the Euclidean distance) is calculated between the learning feature set and the weight vector of each lattice in the lattice space. From among the lattices, the lattice having the maximum correlation (minimum measure) is found (this lattice is called a “winning node”). For lattices located in the neighborhood of the winning node in the lattice space, the weight vector of each lattice is adjusted such that the measure between the learning feature set and the lattice is reduced. After the learning process is repeated while the weight vectors are adjusted in this manner, lattices having minimum measures with respect to feature sets made of mutually similar features become concentrated in a particular area, so that a condition applicable to data classification is obtained. In this process, the lattices whose weight vectors are to be adjusted are selected according to their distance on the map from the winning node. It is preferable that the amount of adjustment vary with the distance from the winning node c and that the magnitude of the adjustment also be changeable. In general, the weight vector w is adjusted based on the following equation (1) so that the vector becomes more similar to the input vector I (the learning feature set):
$$w_j(t+1) = w_j(t) + h_{cj}\,[I(t) - w_j(t)] \qquad (1)$$

wherein

$$h_{cj} = \alpha(t) \cdot \exp\left[-\frac{\lVert r_c - r_j \rVert^{2}}{2\,(\sigma(t))^{2}}\right]$$

in which α(t) represents a parameter known as a learning coefficient, which controls the magnitude of the adjustment, and σ(t) represents a function, referred to as a neighborhood function, which determines how the range for adjusting the weight vectors varies; both decrease monotonically with respect to time t. Adjustment according to equation (1) is performed for all lattices within an inter-node distance of Rmax on the map from the winning node, wherein

$$R_{\max} \geq \lVert r_c - r_j \rVert$$

With repetition of learning, the value of Rmax decreases under the influence of the neighborhood function σ(t). As the neighborhood function σ(t), a function such as a triangular type function, a rectangular (quadrangular) type function, or a Mexican hat type function can be used. It is also known that the selection of the neighborhood function σ(t) influences the learning results. The parameter “t” represents a “time step” and is incremented every time a feature set is input. The factor ∥rc−rj∥ represents a norm (distance) between the winning node and the node whose weight vector is to be adjusted.
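A single learning step according to equation (1) can be sketched as below. This is only an illustration under stated assumptions: the function name `som_step`, the list-based data layout, and passing α, σ, and Rmax as plain numbers for the current time step are choices made for the example, and a Gaussian form of the neighborhood term is assumed.

```python
import math


def som_step(weights, positions, x, alpha, sigma, r_max):
    """Apply one update of equation (1) for input vector x.

    weights[j]   : weight vector of node j (list of floats, updated in place)
    positions[j] : coordinates r_j of node j on the map
    alpha, sigma : values of alpha(t) and sigma(t) at the current time step
    r_max        : nodes farther than this from the winner are not adjusted
    Returns the index c of the winning node.
    """
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Winning node: minimum Euclidean distance between x and the weight vector.
    c = min(range(len(weights)), key=lambda j: sqdist(weights[j], x))

    for j, w in enumerate(weights):
        d2 = sqdist(positions[c], positions[j])  # ||r_c - r_j||^2 on the map
        if d2 > r_max ** 2:
            continue  # outside the adjustment range R_max
        h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))
        weights[j] = [wi + h * (xi - wi) for wi, xi in zip(w, x)]
    return c
```

In a full learning loop, `alpha`, `sigma`, and `r_max` would be reduced as the time step t advances; the schedule is left to the caller here.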
Simple application of the above technique, however, does not allow immediate execution of autonomous data classification. In order to realize autonomous data classification, the appropriateness of the lattice space map must be determined after completion of the learning process. In other words, (1) a method for obtaining an optimum lattice space map is required. In addition, when data is to be classified using the lattice space map after the learning process, it is appropriate to create, in the lattice space, boundaries which form the basis for classification and to classify data given as the classification target based on where the lattice having the minimum measure with respect to the feature set corresponding to the data is located relative to the boundaries (regions in the lattice space separated by the boundaries will be referred to simply as “clusters” hereinafter). That is, (2) a method for determining the boundaries of clusters is also required.
Among these required methods, as (1) a method for obtaining an optimum lattice space map, Kohonen proposes a method for selecting a map in which the average quantization error is minimum. That is, from among a plurality of lattice space maps formed using different learning conditions, a map having the minimum average quantization error is selected and is used as an approximated optimum lattice space map. In this method, the topology of the space of the input feature set is not reflected in the topology of the map. In other words, the degree of preservation of topology is low. This may lead to erroneous classification depending on the method for clustering.
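The map selection by minimum average quantization error can be sketched as follows. The sketch assumes that the quantization error of an input vector is its Euclidean distance to the closest weight vector; the function names are hypothetical and each candidate map is represented simply as a list of weight vectors.

```python
import math


def average_quantization_error(weights, data):
    """Mean distance from each input vector to its closest weight vector."""
    total = 0.0
    for x in data:
        total += min(math.dist(w, x) for w in weights)
    return total / len(data)


def select_map(candidate_maps, data):
    """Return the candidate map with the smallest average quantization error."""
    return min(candidate_maps, key=lambda m: average_quantization_error(m, data))
```

Because this criterion looks only at distances in the input space, it says nothing about whether neighboring lattices on the map hold similar weight vectors, which is the topology preservation issue noted above.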
As a method which takes into consideration the preservation of topology, a technique for forming an appropriate map by monitoring a predetermined indication called a topological function (topographic function) to control the learning conditions (Auto-SOM) has also been developed. However, the calculation of the topographic function itself is a heavily loaded process, and therefore, there is a problem in that the learning time increases.
As (2) a method for autonomously determining the boundaries of clusters, a method known as a U-matrix method (Unified Distance Matrix Method) and a method known as a potential method are both under development. The U-matrix method is described in detail in A. Ultsch et al., “Knowledge Extraction from Artificial Neural Networks and Applications”, Proc. Transputer Anwender Treffen/World Transputer Congress TAT/WTC 93 Aachen, Springer 1993. In the U-matrix method, the distance between two adjacent lattices on a map is defined as either the sum of the absolute values of the differences between corresponding components of the weight vectors of the two lattices or the root-mean-square of those differences. With such a definition, the distance between adjacent lattices that are each strongly associated with feature sets having a high similarity (that is, lattices whose weight vectors are close to the feature set; such lattices will hereinafter be described as “prototyped to the feature set”) is small. In contrast, the distance between adjacent lattices that are prototyped to two feature sets having a low similarity is large. Considering a three-dimensional surface whose height represents the magnitude of the distance, the surface is low between lattices prototyped to feature sets having a high similarity, forming a “valley”, whereas the surface is high between lattices prototyped to feature sets having a low similarity, forming a “hill”. Therefore, by forming the boundaries along the “hills”, it is possible to define a group (cluster) of lattices that are prototyped to feature sets having a high similarity.
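The height surface used by the U-matrix method can be sketched as below for a rectangular map. The sketch uses the sum-of-absolute-differences distance mentioned above; assigning each lattice the mean distance to its four direct neighbors is an assumption made for the example, as are the function names.

```python
def l1_distance(a, b):
    """Sum of absolute differences between corresponding components."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def u_matrix(weights):
    """weights[r][c] is the weight vector of lattice (r, c).

    Returns a grid of heights: for each lattice, the mean distance between its
    weight vector and those of its up/down/left/right neighbors.
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = [
                l1_distance(weights[r][c], weights[nr][nc])
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights
```

Low values in the returned grid correspond to “valleys” and high values to “hills”; where to draw a boundary along the hills is not decided by this computation.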
The U-matrix method can be regarded as a method for compensating for a disadvantage of the self-organizing map, namely that distances in the input space are not preserved.
The U-matrix method, however, suffers a problem in that although it is possible to define the boundaries when the height differences between the “hills” and “valleys” are significant, in many actual information processes, the height differences between the “hills” and “valleys” are not as significant as desired, and the height of the three-dimensional surface varies rather gradually. In such cases, manual setting of the boundaries is necessary. Therefore, the U-matrix method in some cases does not allow autonomous determination of boundaries.
The “potential method” is disclosed in D. Coomans, D. L. Massart, Anal. Chem. Acta., 5-3, 225-239 (1981). In the potential method, a probability density function of a population which approximately represents the input data is estimated by superposing, for each input datum, the value of a predetermined potential function, and the regions where the amount of superposition is small are determined as the boundaries. As the potential function, a Gaussian type function is commonly used. More specifically, for a group of input data made of N input vectors each having K dimensions, the average potential Ψl received by the l-th input data from the other input data (the contribution of the l-th input to the overall input group) is defined using the following equations (2) and (3).

$$\Psi_l = N^{-1} \sum_{g=1}^{N} \Phi_{l,g} \qquad (2)$$

$$\Phi_{l,g} = \left[(2\pi)^{K/2} \cdot \alpha^{K}\right]^{-1} \exp\left[-\left(2\alpha^{2}\right)^{-1} \sum_{k=1}^{K} \left(x'_{kl} - x'_{kg}\right)^{2}\right] \qquad (3)$$

wherein

$$x'_{kl} = \frac{x_{kl} - \bar{x}_k}{\sigma_k}, \qquad \bar{x}_k = N^{-1} \sum_{l=1}^{N} x_{kl}, \qquad \sigma_k = \left[\frac{\sum_{l=1}^{N} \left(x_{kl} - \bar{x}_k\right)^{2}}{N-1}\right]^{1/2}$$

In these equations, xkl represents the k-th component of the l-th input and α represents a smoothing parameter which affects the number of clusters to be classified. Therefore, in the potential method, optimization of the distribution function whose shape is to be assumed and optimization of various parameters are required for each input vector group; that is, knowledge concerning the characteristics of the data to be classified is required in advance, and manual adjustment is therefore required. In addition, in the potential method, as the dimension of the feature set obtained from the input data becomes higher, more samples are required for determining an appropriate probability density distribution, and the potential method therefore suffers from a problem in that it is difficult to apply to a map having only a small number of lattices. In other words, the potential method also does not always ensure autonomous determination of boundaries.
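The average potential of equations (2) and (3) can be sketched as follows. The sketch assumes a Gaussian potential with exponent −Σ(x′kl − x′kg)²/(2α²), standardizes each component with the sample standard deviation (divisor N − 1), and includes the term g = l in the sum; the function name is hypothetical and every component is assumed to have nonzero spread.

```python
import math


def average_potentials(data, alpha):
    """data: list of N vectors with K components each; returns [Psi_1, ..., Psi_N]."""
    n, k = len(data), len(data[0])
    means = [sum(x[i] for x in data) / n for i in range(k)]
    stds = [
        math.sqrt(sum((x[i] - means[i]) ** 2 for x in data) / (n - 1))
        for i in range(k)
    ]
    # Standardized components x'_kl = (x_kl - mean_k) / sigma_k.
    z = [[(x[i] - means[i]) / stds[i] for i in range(k)] for x in data]

    norm = 1.0 / ((2.0 * math.pi) ** (k / 2.0) * alpha ** k)
    psi = []
    for zl in z:
        total = 0.0
        for zg in z:
            d2 = sum((a - b) ** 2 for a, b in zip(zl, zg))
            total += norm * math.exp(-d2 / (2.0 * alpha ** 2))
        psi.append(total / n)
    return psi
```

Inputs lying in dense regions receive a larger average potential than isolated inputs, and the smoothing parameter `alpha` has to be chosen by hand, which is the manual adjustment discussed above.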
To solve the above-described problems, various techniques have been studied, such as the techniques disclosed in Japanese Patent Laid-Open Publication No. Hei 7-234854, Japanese Patent Laid-Open Publication No. Hei 8-36557, and “Unsupervised Cluster Classification using Data Density Histogram on Self-Organizing Feature Map”, papers of the Institute of Electronics, Information, and Communication Engineers, D-II Vol. J79-DII No. 7, pp. 1280-1290, July, 1996. However, each of these techniques presumes that the features to be used for the classification are prototyped to lattices that are sufficiently separated, either in the structure of the input data or in the mapping results. When the distribution shapes of the features to be classified differ or overlap to varying degrees, or when the distances between the centers of mass of the map positions of the lattices prototyped to each feature vary, as is common in image data classification, for example, the boundaries of clusters become intermingled in a complicated manner on the map and an appropriate clustering process cannot be performed.
In addition, in the related art methods, the number of lattices on the map is determined through research and experience, and no consideration has been given to selecting a number of lattices suitable for actual usage. However, when the number of lattices is less than an appropriate number, there are cases where lattices near the cluster boundaries become strongly associated with a feature set which should belong to another cluster, in which case classification errors tend to occur more frequently. In this regard, a technique for increasing or decreasing the number of lattices such that the average quantization error becomes lower than a predetermined value is disclosed in James S. Kirk et al., “A Self-Organized Map with Dynamic Architecture for Efficient Color Quantization”, IJCNN '01, 2128-2132. In this technique, however, lattices are added, for example, so as to mirror the data distribution in the space of the feature set corresponding to the input data, and no consideration is given to increasing, for example, the number of lattices in the neighborhood of the cluster boundaries, which is important in data classification. It is also possible to use a large number of lattices from the beginning of the process, but this configuration inevitably leads to an increase in calculation time and is therefore not practical.
Similarly, when, for example, input data (pattern data) is to be directly classified into clusters without the use of prototypes, there is a method for classifying a group of pattern data into clusters based on statistical characteristics in the group of pattern data. Regarding the statistical characteristics, for example, various methods are known such as a method wherein the statistical distribution parameters are sequentially estimated through Bayes' learning and a method using a potential function. However, estimation of the statistical characteristics in this manner requires that information (for example, label) which acts as a hint for clustering be added to the input pattern data, because the pattern data must be provisionally classified for each hint information and the estimation for the statistical distribution is calculated for each classification.
To this end, it is also possible to calculate degrees of similarity between individual pattern data using a predetermined function, analyze the structure of pattern data space, and apply a clustering process according to the structure resulting from the analysis. As this type of method, a K-means method and a dividing and merging method (commonly referred to as the “ISODATA method”) are known, but these methods require manual setting of parameters. More specifically, in the K-means method, a final cluster number indicating the number of clusters into which the group of pattern data is to be divided must be manually set. There is also a problem in that the clustering result is highly sensitive to the setting of a parameter known as a cluster center value and that the quality of the clustering results is determined based on the set values.
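The dependence of the K-means method on manually set values can be illustrated with the following sketch, which uses the standard iterative assign-and-update procedure. The function name, the fixed iteration count, and the four sample points are assumptions made for the example; the number of clusters is fixed by the number of initial centers supplied.

```python
def kmeans(data, centers, iterations=20):
    """K-means with caller-supplied initial centers. Returns (labels, centers)."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point takes the label of its nearest center.
        labels = [
            min(range(len(centers)),
                key=lambda j: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[j])))
            for x in data
        ]
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Four corners of a 4 x 1 rectangle, clustered twice with k = 2.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
left_right, _ = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])  # [0, 0, 1, 1]
bottom_top, _ = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])  # [0, 1, 0, 1]
```

The same data and the same cluster count yield two different stable partitions depending only on the initial center values, which is the sensitivity described above.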
Similarly, in the dividing and merging method, parameter settings for a number of parameters such as a cluster removal threshold value, a cluster division threshold value, and a cluster merge threshold value are required and the clustering results are significantly affected by the setting of these parameters.
The present invention was conceived to solve the above-described problems, and an advantage of the present invention is that a data classifier is provided in which an autonomous clustering process can be performed.