The present invention relates to a reference pattern learning system in speech recognition based on pattern matching with a reference pattern wherein a plurality of parameters which characterize reference patterns of each category are determined on the basis of a plurality of learning utterance data.
A Hidden Markov Model (to be referred to as an HMM hereinafter) is the most popular system for recognizing a pattern represented as a feature vector time series of, e.g., speech signals. Details of the HMM are described in "Speech Recognition by Probability Model", Seiichi Nakagawa, the Institute of Electronics and Communication Engineers of Japan, 1988 (to be referred to as Reference 1 hereinafter). Further background on the HMM, as well as on dynamic programming matching (hereinafter DP matching), is found in "Structural Methods in Automatic Speech Recognition" by Stephen E. Levinson, Proceedings of the IEEE 1985, Vol. 73, No. 11, pp. 1625-50 (to be referred to as Reference 2 hereinafter). In the HMM, modeling is performed on the assumption that a feature vector time series is generated by a Markov probability process. An HMM reference pattern is represented by a plurality of states and transitions between these states. Each state outputs a feature vector in accordance with a predetermined probability density profile, and each transition between the states is accompanied by a predetermined transition probability. A likelihood value representing the degree of matching between an input pattern and a reference pattern is given by the probability that the Markov probability model serving as the reference pattern generates the input pattern vector sequence. The interstate transition probabilities characterizing each reference pattern and the parameters defining the probability density profile functions can be determined by the "Baum-Welch algorithm" using a plurality of learning utterance data.
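By way of illustration only, the likelihood described above can be computed with the well-known forward recursion. The following is a minimal sketch, not the implementation of any cited reference. It assumes diagonal-covariance Gaussian output densities and sums over all final states; the names `gaussian_density` and `forward_likelihood` and their parameters are hypothetical.

```python
import math


def gaussian_density(x, mean, var):
    """Diagonal-covariance Gaussian density of feature vector x (assumed output density)."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return density


def forward_likelihood(observations, initial, transitions, means, variances):
    """Probability that the HMM generates the given feature vector sequence.

    observations: list of feature vectors (one per frame)
    initial:      initial[j] is the probability of starting in state j
    transitions:  transitions[i][j] is the transition probability from state i to j
    means, variances: per-state parameters of the output density
    """
    n_states = len(initial)
    # alpha[j]: probability of having emitted the frames so far and being in state j
    alpha = [
        initial[j] * gaussian_density(observations[0], means[j], variances[j])
        for j in range(n_states)
    ]
    for x in observations[1:]:
        alpha = [
            sum(alpha[i] * transitions[i][j] for i in range(n_states))
            * gaussian_density(x, means[j], variances[j])
            for j in range(n_states)
        ]
    # Assumption: any state may be the final state, so all paths are summed.
    return sum(alpha)
```

In such a sketch, recognition would amount to evaluating `forward_likelihood` for the reference pattern of each category and selecting the category with the largest value. For long utterances the products underflow, so a practical version would work with scaled or logarithmic probabilities.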
The "Baum-Welch algorithm", as a statistical learning algorithm, requires a large volume of learning data to determine model parameters. A new user must therefore utter a large number of speech inputs, which is inconvenient and makes the system impractical. In order to reduce the load on a new user, several speaker adaptive systems are available which adapt a recognition apparatus to a new speaker by using a relatively small number of utterances by the new speaker. Details of the speaker adaptive system are described in "Speaker Adaptive Techniques for Speech Recognition", Sadaoki Furui, The Journal of the Television Society, Vol. 43, No. 9, 1989, pp. 929-934 (to be referred to as Reference 3 hereinafter). See also "Speaker Adaptation for Demi-Syllable Based Continuous Density HMM" by Koichi Shinoda et al., Proc. ICASSP 1991, pp. 857-860.
The most important points in a speaker adaptive modeling system are how to estimate the parameters of a model representing an acoustic event not contained in the small number of adaptive utterances by a new user, and how to perform adaptive modeling using these parameters. Each of the existing speaker adaptive modeling systems uses reference patterns prepared in advance and adaptive utterance data of a new user. In these systems, a similarity between acoustic events is basically defined using a physical distance between feature vectors as a measure, parameters of a model representing acoustic events not appearing in the adaptive utterances are estimated on the basis of the similarity, and adaptive modeling is performed using these parameters.
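The distance-based estimation described above may be sketched as follows, purely as an illustration. This is a generic interpolation scheme assumed for the example, not the specific method of any cited reference. Each acoustic event is assumed to be represented by one mean vector; the shift of an event not appearing in the adaptive utterances is estimated as an average of the shifts of the observed events, weighted by the Euclidean distance between the reference mean vectors. The names `adapt_means` and `smoothing` are hypothetical.

```python
import math


def euclidean(a, b):
    """Physical (Euclidean) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adapt_means(reference_means, adapted_seen, smoothing=1.0):
    """Estimate adapted mean vectors for every acoustic event.

    reference_means: dict event -> mean vector prepared in advance
    adapted_seen:    dict event -> mean vector re-estimated from the
                     adaptive utterances (a subset of the events)
    smoothing:       hypothetical constant controlling how quickly the
                     influence of an observed event decays with distance
    """
    # Shift of each observed event relative to its reference mean.
    shifts = {
        event: [a - r for a, r in zip(adapted_seen[event], reference_means[event])]
        for event in adapted_seen
    }
    adapted = {}
    for event, ref in reference_means.items():
        if event in adapted_seen:
            adapted[event] = list(adapted_seen[event])
            continue
        # Similarity to each observed event, based on physical distance.
        weights = {
            seen: math.exp(-euclidean(ref, reference_means[seen]) / smoothing)
            for seen in shifts
        }
        total = sum(weights.values())
        if total == 0.0:
            # No usable observed event: keep the reference mean unchanged.
            adapted[event] = list(ref)
            continue
        shift = [
            sum(weights[seen] * shifts[seen][k] for seen in shifts) / total
            for k in range(len(ref))
        ]
        adapted[event] = [r + s for r, s in zip(ref, shift)]
    return adapted
```

In this sketch an unobserved event simply inherits the movement of its nearest observed neighbours in the feature space, so its estimate depends only on acoustic proximity.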
In adaptive modeling based on estimation in accordance with the above physical distance, recognition precision can be improved as compared with that prior to adaptive modeling. However, the recognition result falls far short of the recognition performance obtained with reference patterns of a specific speaker which are constituted by a sufficient amount of utterance data, as is apparent from the experimental results described in the above references.