The present invention relates to a pattern recognition system and, more particularly, to a pattern adaptation system for adapting "a reference pattern" constituting a plurality of different categories using "an input pattern" as an aggregate of input samples. It is presently understood that the best field of utilization of the present invention is the speaker adaptation system in a speech recognition system. This system is based on a Hidden Markov model (HMM) of a mixed continuous distribution model type or the like, in which the output probability distribution of the reference pattern is a mixed Gaussian distribution.
Recently, research and investigations concerning mechanical recognition of speech patterns have been made, and various methods (i.e., speech recognition methods) have been proposed. One typical method that is extensively applied is based on a method called dynamic programming (DP) matching.
Particularly, in the field of speech recognition systems using HMM, speaker-independent speech recognition systems, which are capable of recognizing the speech of any person, have recently been extensively studied and developed.
The speaker-independent type of recognition system has an advantage over the speaker-dependent type, which is used by one definite user, because the user of a speaker-independent system need not register any speech in advance. However, the following problems in the speaker-independent recognition system are pointed out. A first problem is that the recognition performance of the speaker-independent system is inferior to that of the speaker-dependent system for almost all speakers. A second problem is that the speaker-independent recognition performance is greatly deteriorated for some "particular speakers" (i.e., unique speakers).
In order to solve these problems, research and investigations have recently been started, which concern the application of the speaker adaptation techniques that are used mainly in speaker-dependent systems to speaker-independent systems as well. The speaker adaptation techniques have a concept of adapting a speech recognition system to new users (i.e., unknown speakers) by using a lesser amount of adaptation data than is used for the initial training. The speaker adaptation techniques are detailed in Sadaoki Furui, "Speaker Adaptation Techniques in Speech Recognition", Television Study Association, Vol. 43, No. 9, 1989, pp. 929-934.
Speaker adaptation can be classified into two methods. One is "supervised speaker adaptation," and the other is "unsupervised speaker adaptation." Here, the "supervising signal" is a vocal sound expression series representing the speech contents of the input speech. "Supervised speaker adaptation" thus refers to an adaptation method used when the vocal sound expression series for the input speech is known, and it requires that the speech vocabulary be instructed to the unknown speaker in advance of the adaptation. "Unsupervised adaptation," on the other hand, is an adaptation method used when the vocal sound expression series for the input speech is unknown, and it places no limit on the speech contents of the unknown speaker's input speech, i.e., no speech vocabulary has to be instructed to the unknown speaker. In fact, unsupervised adaptation that uses the input speech being recognized can be performed without the unknown speaker being aware that the adaptation is taking place. Generally, however, the recognition rate after "unsupervised adaptation" is low as compared to that after "supervised adaptation." For this reason, "supervised adaptation" is presently used frequently.
From the above viewpoint, the need for the speaker adaptation system in the speech recognition system is increasing. The "adaptation" techniques as described are important not only in speech recognition systems but also in pattern recognition systems, the concept of which involves the speech recognition system. The "speaker adaptation system" in the speech recognition system can be generalized as the "pattern adaptation system" in the pattern recognition system.
In the prior art pattern adaptation systems of the type described, adaptation is executed in the same mode irrespective of whether the number of input samples for adaptation is large or small. Therefore, when the input samples are few in number, the data amount may be insufficient, and the accuracy of parameter estimation for the pattern adaptation may deteriorate.
The process of the speech recognition system, which is the most extensive application of the present invention, will now be described. A speech recognition system using HMM is described as an example, and the speaker adaptation techniques in this speech recognition system will also be mentioned with reference to FIG. 4.
A speaker's speech (i.e., input speech) is supplied to an input pattern generation device 42 for conversion to a feature vector time series for each unit, also called a "frame," having a certain time length through such processes as analog-to-digital conversion and speech analysis. The "feature vector time series" is referred to as an input pattern. The time length of the frame is usually 10 to 100 ms. The feature vectors are obtained by extracting the feature quantity of the speech spectrum at corresponding instants, usually 10-dimensional to 100-dimensional (10-d to 100-d).
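The conversion of input speech into an input pattern can be sketched as follows. This is only a minimal illustration, not the analysis performed by the input pattern generation device 42: the function names are hypothetical, the frames are non-overlapping, and the two-dimensional feature (log energy and zero-crossing rate) is a toy stand-in for the 10-dimensional to 100-dimensional spectral feature vectors mentioned above.

```python
import math


def split_into_frames(samples, sample_rate_hz, frame_ms=10.0):
    """Split digitized speech into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate_hz * frame_ms / 1000.0)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[k * frame_len:(k + 1) * frame_len] for k in range(n_frames)]


def frame_features(frame):
    """Toy feature vector (assumption): [log energy, zero-crossing rate]."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [log_energy, crossings / max(len(frame) - 1, 1)]


def make_input_pattern(samples, sample_rate_hz, frame_ms=10.0):
    """Return the input pattern X = x_1, ..., x_T as a list of feature vectors."""
    frames = split_into_frames(samples, sample_rate_hz, frame_ms)
    return [frame_features(f) for f in frames]
```

With a frame length of 10 ms, one second of speech yields T = 100 feature vectors.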
HMMs are stored as reference patterns in a reference pattern memory device 41. The HMMs are speech (sound) information source models, and the HMM parameters may be trained by using input speech. The HMMs will be mentioned in the description of a recognition device 43 given hereunder. An HMM is usually prepared for each recognition unit. Here, the case where the recognition unit is a sound element is taken as an example. In the speaker-independent recognition system, the HMMs stored in the reference pattern memory device 41 have been obtained in advance, for use with an unknown speaker, through training on the speech of many speakers.
A case is now assumed, where 1,000 words are the subjects of recognition, that is, a case where a correct answer of one word is obtained among a set of recognition candidates of 1,000 words. For word recognition, HMMs of individual sound elements are coupled together to produce an HMM of a recognition candidate word (word HMM). When 1,000 words are recognized, word HMMs for 1,000 words are produced.
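The coupling of sound element HMMs into a word HMM can be sketched as below. The data layout is an assumption made for illustration only: each sound element HMM is a list of states, each state carries a hypothetical `self_loop` probability, and the topology is strictly left-to-right, so that the probability of leaving the last state of one sound element becomes the probability of entering the first state of the next.

```python
def build_word_hmm(sound_element_hmms, pronunciation):
    """Concatenate sound element HMMs into one left-to-right word HMM.

    sound_element_hmms: dict mapping a sound element label to its list of states.
    pronunciation: sequence of sound element labels making up the word.
    """
    states = []
    for label in pronunciation:
        states.extend(sound_element_hmms[label])

    n = len(states)
    trans = [[0.0] * n for _ in range(n)]
    for i, state in enumerate(states):
        trans[i][i] = state["self_loop"]              # stay in state i
        if i + 1 < n:
            trans[i][i + 1] = 1.0 - state["self_loop"]  # advance to state i+1
    # Probability of leaving the final state, i.e. of ending the word.
    exit_prob = 1.0 - states[-1]["self_loop"]
    return {"states": states, "trans": trans, "exit_prob": exit_prob}


def build_vocabulary(sound_element_hmms, lexicon):
    """Build one word HMM per recognition candidate word."""
    return {word: build_word_hmm(sound_element_hmms, pron)
            for word, pron in lexicon.items()}
```

For a vocabulary of 1,000 words, `build_vocabulary` would be called with a lexicon of 1,000 entries and would return 1,000 word HMMs.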
The recognition device 43 recognizes the input pattern using the word HMMs. This "pattern recognition" will now be described. In the HMM, a statistical concept is introduced into the description of the reference pattern to cope with variations of the speech pattern. The HMM is detailed in Seiichi Nakagawa, "Speech Recognition with Probability Models", the Electronic Information Communication Engineer's Association, 1987 (hereinafter referred to as the Nakagawa Literature), pp. 40-44, 55-60 and 69-74.
Each sound element HMM usually comprises 1 to 10 states and inter-state transitions. Usually, the start (i.e., first) and last states are defined, and a symbol is taken out from each state for every unit time for inter-state transition. The speech of each sound element is expressed as a time series of symbols produced from individual states during the inter-state transition interval from the start state to the last state. For each state the symbol appearance probability (output probability) is defined, and for each inter-state transition the transition probability is defined. The HMM thus has an output probability parameter and a transition probability parameter. The output probability parameter represents a "sound color" sway of the speech pattern. The transition probability parameter represents a "time-wise" sway of the speech pattern. The generation probability of speech from the model (i.e., HMM) thereof, can be obtained by setting the start state probability to a certain value and multiplying the value by the output probability and also by the transition probability for each inter-state transition.
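The multiplication described above can be illustrated for a single state sequence. The sketch below is an assumption-laden illustration with hypothetical names: the output probabilities are supplied as a precomputed table rather than evaluated from a distribution, and the path lists the state occupied before the first symbol followed by the state reached at each unit time.

```python
def path_probability(start_prob, trans, output_probs, path):
    """Generation probability of an observation sequence along one state path.

    start_prob[i]      : probability of starting in state i
    trans[j][i]        : transition probability from state j to state i
    output_probs[t][i] : output probability of the (t+1)-th symbol in state i
    path               : states s_0, s_1, ..., s_T (length T + 1)
    """
    if len(path) != len(output_probs) + 1:
        raise ValueError("path must contain one more state than there are symbols")
    prob = start_prob[path[0]]
    for t, (prev, cur) in enumerate(zip(path, path[1:])):
        # One inter-state transition: multiply by the transition probability
        # and by the output probability of the symbol produced in the new state.
        prob *= trans[prev][cur] * output_probs[t][cur]
    return prob
```

Summing this quantity over every possible state path gives the generation probability of the speech from the model.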
Conversely, when a speech element is observed, its generation probability can be calculated by assuming that it is generated from a certain HMM.
In the HMM speech recognition, an HMM is prepared for each recognition candidate, and upon the input of speech the generation probability thereof is obtained in each HMM. The maximum generation probability HMM is determined to be a source of generation, and the recognition candidate corresponding to that HMM is made to be the result of recognition.
The output probability parameter is expressed by either a discrete probability distribution expression or a continuous probability distribution expression. Here, the case where the continuous probability distribution expression is adopted is taken as an example. The continuous probability distribution expression uses a mixed continuous distribution, i.e., a distribution obtained by adding together a plurality of Gaussian distributions with weighting.
The output probability parameter, the transition probability parameter, and such parameters as the weighting of the plurality of Gaussian distributions are trained in advance by giving training speech to the model and applying an algorithm called the Baum-Welch algorithm. The Baum-Welch algorithm is detailed in the Nakagawa Literature.
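A weighted sum of Gaussian distributions of this kind can be evaluated as in the following sketch. It assumes diagonal covariance matrices, a simplification chosen here so that the example needs only the standard library; the text itself does not restrict the covariance matrix in this way, and the function names are hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """Gaussian density with a diagonal covariance (var holds the diagonal)."""
    n = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return math.exp(-0.5 * (n * math.log(2.0 * math.pi) + log_det + quad))


def mixture_output_probability(x, weights, means, variances):
    """Output probability of x: weighted sum of the element Gaussian densities."""
    return sum(w * gaussian_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

The mixture weights of one state are expected to sum to one, so that the result is itself a probability density.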
The process of word recognition of the input pattern will now be described mathematically. The input pattern X, which is expressed as a time series of feature vectors, is given as

$$X = x_1, x_2, \ldots, x_t, \ldots, x_T \qquad (1)$$

wherein T represents the total number of feature vectors $x_t$ (i.e., frames) in the input pattern.
Recognition candidate words are denoted by $W_1, W_2, \ldots, W_n, \ldots, W_N$. The total number of recognition candidate words is denoted by N. Matching between the word HMM of each word $W_n$ and the input pattern X is made as follows, with the subscripts omitted unless they are needed for clarity. In the word HMM, the transition probability from state j to state i is denoted by $a_{ji}$, the mixture weight of the output probability distribution by $\lambda_{im}$, the mean vector of each element Gaussian distribution in the output probability distribution by $\mu_{im}$, and the covariance matrix of the output probability distribution by $\Sigma_{im}$. Also, t denotes the instant of input, i and j denote the states of the HMM, and m denotes the mixed element serial number.
The forward probability $\alpha(i, t)$ is calculated by the following recurrence formulae:

$$\alpha(i, 0) = \pi_i, \quad i = 1, \ldots, I \qquad (2)$$

$$\alpha(i, t) = \sum_{j} \alpha(j, t-1)\, a_{ji}\, b_i(x_t), \quad i = 1, \ldots, I;\ t = 1, \ldots, T \qquad (3)$$

wherein $\pi_i$ represents the probability that the initial state is i, and $b_i(x_t)$ and $N(x_t; \mu_{im}, \Sigma_{im})$ are represented by the following formulae:

$$b_i(x_t) = \sum_{m} \lambda_{im}\, N(x_t; \mu_{im}, \Sigma_{im}) \qquad (4)$$

$$N(x_t; \mu_{im}, \Sigma_{im}) = (2\pi)^{-n/2}\, |\Sigma_{im}|^{-1/2} \exp\left(-(\mu_{im} - x_t)^{\top} \Sigma_{im}^{-1} (\mu_{im} - x_t)/2\right) \qquad (5)$$

In formula (5), n denotes the dimension of the feature vector $x_t$.
The likelihood $P^n(X)$ of the input pattern X for the word $W_n$ is obtainable as:

$$P^n(X) = \alpha(I, T) \qquad (6)$$

wherein I represents the final state. Through execution of this processing for the word HMM of each word, the recognized word $W_{\hat{n}}$ is given as:

$$\hat{n} = \operatorname{argmax}_n P^n(X) \qquad (7)$$
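The forward recurrence of formulae (2) and (3), the likelihood of formula (6), and the selection of formula (7) can be sketched together as follows. The function names and the dictionary layout of a word model are hypothetical, the final state is assumed to be the last state in the list, and the output probability is passed in as a callable so that any distribution may be used.

```python
def forward_likelihood(pi, a, b, X):
    """Likelihood of input pattern X under one word HMM, per formulae (2), (3), (6).

    pi[i]   : initial probability of state i
    a[j][i] : transition probability from state j to state i
    b(i, x) : output probability of feature vector x in state i
    X       : input pattern, a sequence of feature vectors
    """
    n_states = len(pi)
    alpha = list(pi)  # formula (2): alpha(i, 0) = pi_i
    for x in X:       # formula (3), applied for t = 1, ..., T
        alpha = [
            sum(alpha[j] * a[j][i] for j in range(n_states)) * b(i, x)
            for i in range(n_states)
        ]
    return alpha[-1]  # formula (6): alpha(I, T), with I taken as the last state


def recognize(word_models, X):
    """Formula (7): pick the word whose HMM gives the largest likelihood."""
    scores = {
        word: forward_likelihood(m["pi"], m["a"], m["b"], X)
        for word, m in word_models.items()
    }
    best_word = max(scores, key=scores.get)
    return best_word, scores
```

For long input patterns the product of probabilities underflows in floating point, so practical implementations usually carry out the same recurrence with scaling or in the logarithmic domain.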
Such recognition result words are supplied from the recognition device 43 to the recognition result output (i.e., output of recognized result) device 44.
The recognition result output device 44 executes, for example, such processes as outputting recognition result words to a display and sending control commands corresponding to recognition result words to different systems or apparatuses. Examples of these displays, systems and apparatuses are omitted from the drawings for clarity.
In the speaker adaptation by a speaker adaptation device 45 (see the broken lines with arrows in FIG. 4), the reference pattern in the reference pattern memory device 41 is corrected to improve the performance with respect to unknown speakers. Specifically, training that uses the speaker's speech while the speech recognition system is in use is allowed for the adaptation of the reference pattern to the speaker, thus providing a high recognition rate. In this case, the adaptation process is not changed depending on whether the amount of input speech data (i.e., the number of input samples) is large or small, and a certain number of input samples are necessary for adequate speaker adaptation.
In the prior art pattern adaptation system described above, with a lesser number of input samples the accuracy of the parameter estimation for the pattern adaptation is deteriorated. This deterioration is due to the insufficient data amount, resulting in insufficient effect of the reference pattern adaptation.
For example, in the speaker adaptation system in the speech recognition system, in the case of a very small amount of input speech data, the parameter estimation accuracy is deteriorated due to the insufficient data amount. The result of this insufficient amount is that an adequate effect of the speaker adaptation of the reference pattern cannot be obtained, that is, the recognition performance is not improved.