1. Field of the Invention
The present invention relates to a learning method and apparatus, a recognition method and apparatus, a generating method and apparatus, and a computer program and, in particular, to a learning method and apparatus, a recognition method and apparatus, a generating method and apparatus, and a computer program for learning time series data, such as a voice, in an unsupervised fashion, and for recognizing and generating time series data using the learning results.
2. Description of the Related Art
Techniques for recognizing a pattern through learning are generally called pattern recognition. Learning techniques in pattern recognition are divided into supervised learning and unsupervised learning.
In supervised learning, information concerning the class to which the learning data of each pattern belongs is provided. Such information is called a correct-answer label. The learning data belonging to a given pattern is learned on a per-pattern basis. Numerous learning methods, including template matching, neural networks, and the hidden Markov model (HMM), have been proposed.
FIG. 1 illustrates a known supervised learning process.
In supervised learning, learning data for use in learning is prepared according to assumed categories (classes), such as phoneme categories, phonological categories, or word categories. For example, to learn voice data of pronunciations of “A”, “B”, and “C”, a great deal of voice data of pronunciations of “A”, “B”, and “C” is prepared.
A model used in learning (a model that learns the data of each category) is prepared on a per-category basis. The model is defined by parameters. For example, to learn voice data, an HMM is used as a model. The HMM is defined by state transition probabilities of transitioning from one state to another state (including the original state) and an output probability density function representing the probability density of an observed value output from the HMM.
In the supervised learning, the learning of each category (class) is performed using learning data of that category alone. As shown in FIG. 1, a model of category “A” is learned using learning data of “A” only, and a model of category “B” is learned using learning data of “B” only. Likewise, a model of category “C” is learned using learning data of category “C” only.
In supervised learning, the learning of the model of a category needs to be performed using the learning data of that category. The learning data is thus prepared on a category-by-category basis, and the learning data of each category is provided for learning the model of that category. A model is thus obtained on a per-category basis. More specifically, in supervised learning, a template (a model of a class (category) represented by a correct-answer label) is obtained on a per-class basis.
During recognition, the correct-answer label of the template most appropriately matching the data to be recognized (i.e., the template having the maximum likelihood) is output.
Unsupervised learning is performed with no correct-answer label provided for the learning data of each pattern. For example, learning methods using template matching and neural networks are available in unsupervised learning. Unsupervised learning is thus substantially different from supervised learning in that no correct-answer label is provided.
Pattern recognition can be considered as a quantization of the signal space in which the data (signal) to be recognized is observed. If the data to be recognized is a vector, the pattern recognition is referred to as vector quantization.
In vector quantization learning, a representative vector corresponding to a class (referred to as a centroid vector) is arranged in the signal space in which the data to be recognized is observed.
The K-means clustering method is one typical vector quantization technique based on unsupervised learning. In the K-means clustering method, the centroid vectors are placed appropriately in the initial state of the process. A vector serving as the learning data is assigned to the centroid vector closest in distance thereto, and each centroid vector is updated with the mean vector of the learning data assigned to it. This process is iterated.
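The assignment-and-update iteration described above can be sketched as follows. This sketch is illustrative only and is not part of the related art itself; the sample data, the deterministic initialization from the first k vectors, and the iteration count are assumptions made for brevity:

```python
def k_means(data, k, iterations=10):
    """Minimal K-means sketch: assign each learning-data vector to the
    centroid vector closest in distance to it, then update each centroid
    with the mean vector of the data assigned to it, and iterate."""
    centroids = [list(v) for v in data[:k]]  # simple deterministic initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            # winner-take-all: only the closest centroid receives the vector
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(v, centroids[i])))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids

data = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.1, 0.8)]
centroids = k_means(data, k=2)  # one centroid settles in each cluster
```

Because only the nearest centroid receives each vector, this procedure exhibits the winner-take-all behavior discussed below.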
Batch learning is also known. In batch learning, a large number of learning data units is stored, and all learning data units are used in learning. The K-means clustering method is classified as batch learning. In online learning, as opposed to batch learning, learning is performed using learning data each time the learning data is observed, and the parameters are updated bit by bit. The parameters include the components of the centroid vectors and the output probability density functions defining an HMM.
The self-organization map (SOM), proposed by T. Kohonen, is well known as an online learning technique. In SOM learning, the weight of the connection between an input layer and an output layer is updated (corrected) bit by bit.
In the SOM, the output layer has a plurality of nodes, and each node of the output layer is provided with a connection weight representing the degree of connection with the input layer. If the connection weight is treated as a vector, vector quantization learning can be performed.
More specifically, the node having the shortest distance between its connection-weight vector and the vector serving as the learning data is determined as the winner node from among the nodes of the output layer of the SOM. The connection-weight vector of the winner node is updated so as to become closer to the learning-data vector. The connection weights of nodes in the vicinity of the winner node are also updated so that they become slightly closer to the learning data. As learning progresses, nodes having similar connection-weight vectors come to be arranged close to each other in the output layer, while dissimilar nodes are arranged far apart from each other. A map corresponding to the patterns contained in the learning data is thus organized. This process is referred to as self-organization.
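The winner-node determination and neighborhood update described above can be sketched as follows. This is a minimal one-dimensional SOM; the node count, the decaying learning rate, and the Gaussian neighborhood width are arbitrary illustrative choices, not values taken from the related art:

```python
import math
import random

def train_som(data, num_nodes=5, dim=2, epochs=50, seed=1):
    """Minimal 1-D SOM sketch: for each input vector, find the winner
    node and pull the winner and its map neighbors toward the input."""
    rng = random.Random(seed)
    weights = [[rng.random() for _ in range(dim)] for _ in range(num_nodes)]
    for epoch in range(epochs):
        progress = epoch / epochs
        lr = 0.5 * (1.0 - progress)           # decaying learning rate
        sigma = 0.5 + 2.0 * (1.0 - progress)  # shrinking neighborhood width
        for v in data:
            # winner node: shortest distance between weight vector and input
            winner = min(range(num_nodes),
                         key=lambda i: sum((a - b) ** 2
                                           for a, b in zip(v, weights[i])))
            for i in range(num_nodes):
                # soft-max adaptation: neighbors of the winner are also
                # updated, more weakly the farther they are on the map
                h = math.exp(-((i - winner) ** 2) / (2 * sigma ** 2))
                weights[i] = [w + lr * h * (x - w)
                              for w, x in zip(weights[i], v)]
    return weights

data = [(x / 10.0, x / 10.0) for x in range(11)]  # inputs along a diagonal
weights = train_som(data)
```

Because every update is a convex step toward an input vector, the connection weights remain within the range spanned by the learning data and gradually organize along it.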
The vectors of the connection weights obtained as a result of learning can be considered as centroid vectors arranged in the signal space. In the K-means clustering technique, only the vector closest in distance to the learning data is updated; this method of updating is referred to as winner-take-all (WTA). In contrast, in SOM learning, not only the node closest to the learning data (the winner node) but also the nodes in the vicinity of the winner node are updated in connection weight; this method of updating is referred to as soft-max adaptation (SMA). The results of WTA learning tend to be trapped in local solutions, whereas SMA learning alleviates this problem.
Besides SOM learning, the neural gas algorithm is well known as an SMA technique. In the neural gas algorithm, the output layer used in SOM learning is not used; instead, closeness is defined by a ranking based on the distance to the learning data. The parameters are learned online in a manner similar to SOM learning.
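One online update of such a rank-based scheme can be sketched as follows; the neighborhood decay constant and the learning rate are arbitrary illustrative choices:

```python
import math

def neural_gas_step(weights, v, lam=1.0, lr=0.1):
    """One online neural-gas update (sketch): rank all nodes by distance
    to the input, then pull each node toward the input with a strength
    that decays exponentially with its rank (no output-layer topology)."""
    dists = [sum((a - b) ** 2 for a, b in zip(v, w)) for w in weights]
    ranking = sorted(range(len(weights)), key=lambda i: dists[i])
    for rank, i in enumerate(ranking):
        h = math.exp(-rank / lam)  # rank-based neighborhood function
        weights[i] = [w + lr * h * (x - w) for w, x in zip(weights[i], v)]
    return weights

weights = [[0.0, 0.0], [1.0, 1.0]]
weights = neural_gas_step(weights, (0.2, 0.0))
```

Here the closest node (rank 0) takes a full-strength step toward the input, while the farther node (rank 1) takes a step attenuated by exp(-1), illustrating SMA without any map topology.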
The classification of learning described above is discussed by H. Aso, H. Murata, and N. Murata in the book entitled “pattern ninshiki to gakushu no toukeigaku (Pattern Recognition and Statistics of Learning),” published by Iwanami Shoten. SOM learning is described in the book entitled “Self-Organizing Feature Maps,” authored by T. Kohonen, Springer-Verlag Tokyo. The neural gas algorithm is described by T. M. Martinetz, S. G. Berkovich, and K. J. Schulten in the paper entitled “‘Neural-Gas’ Network for Vector Quantization and its Application to Time-Series Prediction,” IEEE Trans. Neural Networks, Vol. 4, No. 4, pp. 558-569, 1993.
The above-described SOM and neural gas algorithms provide unsupervised learning applicable to a vector serving as a static signal pattern, namely, data having a fixed length. The SOM and the neural gas algorithm cannot be directly applied to time series data such as voice data, because voice data is variable in length and dynamic in signal pattern.
In one proposed technique, a higher-dimensional vector is defined by concatenating consecutive vectors in a series, so that time series vectors serving as time series data are handled as a static signal pattern. However, such a technique still cannot be directly applied to variable-length time series data, such as voice data.
Techniques using a recurrent neural network with a feedback circuit attached thereto have been proposed as methods of learning time series data in a self-organizing manner (Japanese Unexamined Patent Application Publications Nos. 4-156610 and 6-231106). However, the back propagation method, widely used in learning the parameters of a recurrent neural network, causes a dramatic increase in the amount of calculation as the size of the recurrent neural network increases, and as a result, the time required for learning is also substantially prolonged. Applying a single recurrent neural network to the learning of a variety of patterns of time series data, such as voice data, is therefore not an effective way of learning.
The HMM technique is one of the most widely used techniques for pattern recognition of time series data, such as recognizing voice data in voice recognition (as disclosed by Lawrence Rabiner and Biing-Hwang Juang in the book entitled “Fundamentals of Speech Recognition,” NTT Advanced Technologies).
The HMM is a state transition probability model, i.e., a model having state transitions. As previously discussed, an HMM is defined by state transition probabilities and an output probability density function at each state. In the HMM technique, the statistical characteristics of the time series data to be learned are modeled. A mixture of normal distributions is typically used as the output probability density function defining the HMM. The Baum-Welch algorithm is widely used to estimate the parameters of an HMM (namely, the state transition probabilities and the output probability density functions).
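To make the definition above concrete, the likelihood that an HMM assigns to an observation sequence can be sketched with the forward algorithm. A discrete output distribution is assumed here for brevity, whereas the model described above uses a mixture of normal distributions as the output probability density function; the toy parameters are illustrative only:

```python
def hmm_forward(obs, pi, A, B):
    """Forward-algorithm sketch for a discrete-output HMM.
    pi[i]: initial probability of state i
    A[i][j]: state transition probability from state i to state j
    B[i][o]: probability of observing symbol o in state i
    Returns P(obs | model)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Toy model: state 0 always emits symbol 0, state 1 always emits symbol 1
pi = [1.0, 0.0]
A = [[0.5, 0.5], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
likelihood = hmm_forward([0, 1], pi, A, B)  # 0.5
```

During recognition, such a likelihood would be computed for the model of each category, and the correct-answer label of the maximum-likelihood model would be output; the Baum-Welch algorithm iteratively re-estimates pi, A, and B so as to increase this likelihood on the learning data.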
The HMM technique finds applications in a wide range, from isolated word recognition, already put to practical use, to large vocabulary recognition. HMM learning is typically supervised learning, and as shown in FIG. 1, learning data with a correct-answer label attached thereto is used in learning. For example, HMM learning for recognizing a given word is performed using learning data corresponding to that word (voice data obtained as a result of pronouncing that word).
Since HMM learning is supervised learning, performing HMM learning on learning data having no correct-answer label attached thereto, i.e., unsupervised HMM learning, is difficult.
Japanese Unexamined Patent Application Publication No. 2002-311988 discloses an HMM learning technique that minimizes the number of categories so as to maximize the amount of mutual information between voice and video, derived from voice data having no correct-answer label attached thereto and corresponding video data. In accordance with this technique, HMM learning cannot be performed if video corresponding to the voice data is not provided. In a strict sense, the disclosed technique is thus not considered unsupervised learning.
Another technique is disclosed by S. Yoshizawa, A. Baba, K. Matsunami, Y. Mera, M. Yamada, and K. Shikano in the paper entitled “Unsupervised Learning of Phonological Model Using Sufficient Statistics and Distance to Speaker,” Technical Report of IEICE (Institute of Electronics, Information and Communication Engineers), SP2000-89, pp. 83-88, 2000.