1. Field of the Invention
This invention relates to a pattern production system for pattern recognition, and more particularly to a standard pattern production system for a speech recognition system which employs a standard pattern and for a speech recognition system which employs an HMM of the continuous mixture distribution model type.
2. Description of the Related Art
In recent years, machine recognition of speech patterns has been studied and various methods have been proposed. A representative and widely used method employs a hidden Markov model (HMM).
Among speech recognition systems which employ an HMM, speaker-independent recognition systems, by which the voice of any person can be recognized, are being actively investigated and developed. In a method which employs an HMM, a state transition diagram (Markov model) which includes a small number of states is produced for each word or each phoneme, and recognition is performed by determining which of the models is most likely to have produced the input voice. Since what is observed is a spectrum sequence produced by the transitions, and the states themselves are not observed, the model is called "hidden". For each of the models, the occurrence probabilities of spectrum parameters in the individual states and the transition probabilities between the states are estimated in advance using learning samples. Upon recognition, the input voice is matched against the models, and the model which is most likely to have produced the input voice is selected and outputted as the result of recognition.
In the following, a speech recognition system will be described with reference to FIG. 1, taking an HMM as an example. Referring to FIG. 1, the speech recognition system shown includes an input pattern production section 101, a recognition section 102, a standard pattern storage section 103 and a recognition result outputting section 104.
The uttered voice of a speaker is inputted to the input pattern production section 101, where it is subjected to such processes as analog-to-digital conversion and speech analysis and is thereby converted into a time sequence of feature vectors, one for each unit of a certain time length called a "frame". This time sequence of feature vectors is called the "input pattern".
The length of a frame is usually approximately 10 ms to 100 ms. A feature vector is obtained by extracting feature amounts of the voice spectrum at that point of time, and normally has 10 to 100 dimensions.
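As a rough illustration of the framing step, the following sketch cuts a sampled signal into fixed-length frames and computes one toy feature per frame. The function names, the 16 kHz sampling rate, the 10 ms frame length and the log-energy feature are all assumptions chosen for illustration; the text does not specify the analysis that section 101 actually performs.

```python
import math

def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Cut a sample sequence into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def log_energy(frame):
    """Toy one-dimensional feature vector: log of the mean squared amplitude."""
    energy = sum(x * x for x in frame) / len(frame)
    return [math.log(energy + 1e-12)]  # small offset avoids log(0) on silence

def make_input_pattern(samples, sample_rate_hz=16000, frame_ms=10):
    """Return the input pattern: one feature vector per frame."""
    frames = split_into_frames(samples, sample_rate_hz, frame_ms)
    return [log_energy(frame) for frame in frames]
```

A real front end would typically produce a feature vector of 10 to 100 dimensions per frame, as stated above, rather than the single value used here.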
HMMs are stored in the standard pattern storage section 103. An HMM is a model of an information source of voice, and its parameters can be learned using the voice of a speaker. The HMM is described in detail below in the description of the recognition section. An HMM is usually prepared for each recognition unit. In the description herein, a phoneme is used as an example of the recognition unit.
For example, in a speaker-independent recognition system, speaker-independent HMMs learned using uttered voices of many speakers in advance are used as the HMMs of the standard pattern storage section 103.
Now, it is assumed that the vocabulary to be recognized consists of 1,000 words. In short, the correct word is to be identified from among 1,000 recognition candidate words. In order to recognize a word, the HMMs of the individual phonemes are connected to produce the HMM of each recognition candidate word. For recognition of 1,000 words, word HMMs for the 1,000 words are produced.
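The connection of phoneme HMMs into a word HMM can be pictured with the sketch below. It is only an illustration under assumptions: each phoneme model is reduced to a list of state labels, three states per phoneme are used, and the function name and the tiny phoneme inventory are hypothetical.

```python
def build_word_hmm(phoneme_models, pronunciation):
    """Chain the state lists of the phoneme models in pronunciation order.

    Each entry of the result is (phoneme, state index within phoneme, state).
    """
    word_states = []
    for phoneme in pronunciation:
        for k, state in enumerate(phoneme_models[phoneme]):
            word_states.append((phoneme, k, state))
    return word_states

# Hypothetical monophone inventory: three states per phoneme model.
phoneme_models = {p: [f"{p}_s{k}" for k in range(3)] for p in "kotba"}

# The same /o/ model is used twice, because monophone units ignore context.
word_hmm = build_word_hmm(phoneme_models, ["k", "o", "t", "o", "b", "a"])
```

Because the phoneme models are shared, 1,000 word HMMs can be assembled from a small phoneme inventory without learning 1,000 separate models.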
The recognition section 102 performs recognition of the input pattern using the word HMMs. An HMM is a model of an information source of voice, and in order to cope with various fluctuations of a speech pattern, a statistical approach is introduced into the description of a standard pattern. For a detailed description of the HMM, reference is made to Rabiner and Juang, "Fundamentals of Speech Recognition", Prentice Hall, pp. 321-389 (document 1).
An HMM of each phoneme is usually formed from one to ten states and the state transitions between them.
Usually, a start state and an end state are defined, and for each unit time a symbol is outputted from the current state and a state transition takes place.
The voice of each phoneme is represented as the time sequence of symbols outputted from the HMM during the state transitions from the start state to the end state. An output probability of a symbol is defined for each state, and a transition probability is defined for each transition between states.
The transition probability parameter represents the fluctuation in time of a voice pattern. The output probability parameter represents the fluctuation in the sound of a voice pattern, as opposed to its timing. By setting the probability of the start state to a certain value and successively multiplying by the output probability and the transition probability for each state transition, the probability that an utterance is generated from the model can be calculated.
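The multiplication described above can be sketched for a single, known state path as follows. This is a minimal illustration that assumes discrete output symbols and a hypothetical two-state model; the function name and all probability values are made up for the example.

```python
def path_probability(initial, transition, output, states, symbols):
    """Probability that the model follows `states` and outputs `symbols`.

    initial[s]        : probability of starting in state s
    transition[s][s2] : probability of moving from state s to state s2
    output[s][x]      : probability that state s outputs symbol x
    """
    prob = initial[states[0]] * output[states[0]][symbols[0]]
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        prob *= transition[prev][cur] * output[cur][sym]
    return prob

# Hypothetical two-state left-to-right model.
initial = {0: 1.0}
transition = {0: {0: 0.6, 1: 0.4}, 1: {1: 1.0}}
output = {0: {"a": 0.7, "b": 0.3}, 1: {"a": 0.1, "b": 0.9}}

p = path_probability(initial, transition, output, [0, 0, 1], ["a", "a", "b"])
# p = 1.0*0.7 * 0.6*0.7 * 0.4*0.9
```

Since the states are hidden, the probability of the utterance itself is obtained by summing such path probabilities over all possible state paths.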
Conversely, when an utterance is observed, its occurrence probability can be calculated on the assumption that the utterance was generated from a certain HMM.
Thus, in speech recognition based on HMMs, an HMM is prepared for each recognition candidate. When an utterance is inputted, its occurrence probability is calculated for each HMM, the HMM which exhibits the maximum occurrence probability is determined to be the generation source, and the recognition candidate corresponding to that HMM is determined as the result of recognition.
For the output probability parameter, both a discrete probability distribution representation and a continuous probability distribution representation are available; here, the continuous probability distribution representation is taken as an example. In the continuous probability distribution representation, a continuous mixture distribution, that is, a distribution in which a plurality of Gaussian distributions are added with weights, is used. In the following example, the output probability has a continuous mixture probability distribution. Such parameters as the output probability parameters, the transition probability parameters and the weight coefficients for the plurality of Gaussian distributions are learned in advance by the Baum-Welch algorithm, using learning utterances corresponding to each model.
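A weighted sum of Gaussian distributions of this kind can be evaluated as in the sketch below. It assumes diagonal covariance matrices for brevity, which is a simplification not stated in the text, and the function names are hypothetical.

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance (assumed simplification)."""
    log_p = 0.0
    for xk, mk, vk in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vk) + (xk - mk) ** 2 / vk)
    return math.exp(log_p)

def mixture_output_probability(x, weights, means, variances):
    """Weighted sum of Gaussian densities; the weights are expected to sum to 1."""
    return sum(
        c * gaussian_diag(x, mu, var)
        for c, mu, var in zip(weights, means, variances)
    )

# Two-component mixture over two-dimensional feature vectors (made-up values).
density = mixture_output_probability(
    [0.0, 0.0],
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
```

With a full covariance matrix the per-component density would instead involve the determinant and inverse of that matrix.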
In the following, processing upon word recognition will be described using numerical expressions, and then, learning of parameters will be described.
First, processing upon recognition will be described. An input pattern O represented as a time sequence of feature vectors is given by the following expression (1):
$$O = o_1, o_2, \ldots, o_t, \ldots, o_T \qquad (1)$$
where T is the total number of frames of the input pattern. Recognition candidate words are represented by $W_1, W_2, \ldots, W_N$, where N is the number of recognition candidate words.
Matching between the word HMM of each word $W_n$ and the input pattern O is performed in the following manner. In the following description, the suffix n is omitted unless required.
First, for a word HMM, the transition probability from a state j to another state i is represented by $a_{ji}$, the mixture weight of the output probability distribution by $c_{im}$, the average vector of each factor Gaussian distribution by $\mu_{im}$, and the covariance matrix by $\Sigma_{im}$, where t is an input time, i and j are states of the HMM, and m is a mixture factor number. The recurrence formulas of the following expressions (2) and (3) are calculated for the forward probability $\alpha_t(i)$, which is the probability that the partial observation sequence $o_1, o_2, \ldots, o_t$ is outputted and the state i is occupied at the time t:
$$\alpha_1(i) = \pi_i, \qquad i = 1, \ldots, I \qquad (2)$$
##EQU1##
where $\pi_i$ is the probability that the initial state is i, and $b_i(o_t)$ is defined by the following expression (4):
##EQU2##
where K is the number of dimensions of an input frame and of an average vector.
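The forward recursion can be sketched as follows. Because expressions (3) to (5) are not reproduced in this text, the code uses the textbook form of the forward algorithm, including the textbook initialization $\alpha_1(i) = \pi_i b_i(o_1)$, which may differ in detail from expression (2). The function name and the representation of the output probability as a callable `b(i, o)` are assumptions.

```python
def forward(pi, a, b, observations):
    """Textbook forward algorithm.

    pi[i]   : probability that the initial state is i
    a[j][i] : transition probability from state j to state i
    b(i, o) : output probability of observation o in state i
    Returns alpha, where alpha[t][i] is the forward probability at frame t
    (frames are 0-based here, so alpha[0] corresponds to t = 1 in the text).
    """
    n_states = len(pi)
    alpha = [[pi[i] * b(i, observations[0]) for i in range(n_states)]]
    for o in observations[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[j] * a[j][i] for j in range(n_states)) * b(i, o)
            for i in range(n_states)
        ])
    return alpha

# Hypothetical two-state model with a discrete output table for brevity.
table = [{"a": 0.7, "b": 0.3}, {"a": 0.1, "b": 0.9}]
alpha = forward(
    pi=[1.0, 0.0],
    a=[[0.6, 0.4], [0.0, 1.0]],
    b=lambda i, o: table[i][o],
    observations=["a", "a", "b"],
)
```

For long utterances the products underflow, so practical implementations usually work with scaled or logarithmic probabilities.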
The likelihood of the word $W_n$ with respect to the input pattern O is calculated by the following expression (6):
$$P^n(O) = \alpha_T(I) \qquad (6)$$
where I is the last state. This processing is performed for each word model, and the recognition result word $W_{\hat{n}}$ for the input pattern O is given by the following expression (7):
$$\hat{n} = \operatorname{argmax}_n P^n(O) \qquad (7)$$
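The selection of the recognition result can be sketched as a simple maximum over the word likelihoods. The function name, the candidate words and the likelihood values below are made up for illustration.

```python
def recognize(word_likelihoods):
    """Return the candidate word whose model gives the highest likelihood."""
    if not word_likelihoods:
        raise ValueError("no recognition candidates")
    return max(word_likelihoods, key=word_likelihoods.get)

# Hypothetical likelihoods computed from the word HMMs for one input pattern.
result = recognize({"kotoba": 1.0e-5, "kodomo": 3.0e-6, "tokoro": 7.5e-7})
```

In a 1,000-word task the dictionary would hold one likelihood per word HMM.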
The recognition result word is sent to the recognition result outputting section 104. The recognition result outputting section 104 performs such processing as outputting of the recognition result on a screen or sending of a control command corresponding to the recognition result to another apparatus.
Subsequently, learning will be described. First, the following backward probability is introduced: ##EQU3## where $\beta_t(i)$ is the probability of the partial observation sequence from the time t+1 to the end, given the state i at the time t. Using the forward probability and the backward probability, the probability $\gamma_t(i)$ that the state i is occupied at the time t when the observation sequence O is given is expressed by the following expression (10): ##EQU4##
Further, the probability $\xi_t(i, j)$ that the state i is occupied at the time t and the state j is occupied at the time t+1 is given by the following expression (11): ##EQU5##
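The quantities $\gamma_t(i)$ and $\xi_t(i, j)$ can be computed as in the sketch below. Since the expressions themselves are not reproduced in this text, the code follows the textbook forward-backward formulation, in which every state may be the final state; the text instead takes the likelihood at the last state I only. Function names and the callable `b(i, o)` are assumptions.

```python
def forward(pi, a, b, obs):
    n = len(pi)
    alpha = [[pi[i] * b(i, obs[0]) for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * a[j][i] for j in range(n)) * b(i, o)
                      for i in range(n)])
    return alpha

def backward(a, b, obs, n):
    # beta at the last frame is 1 for every state (textbook convention).
    beta = [[1.0] * n for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(a[i][j] * b(j, obs[t + 1]) * beta[t + 1][j]
                             for j in range(n))
    return beta

def posteriors(pi, a, b, obs):
    """Return (gamma, xi); assumes the observation likelihood is non-zero."""
    n = len(pi)
    alpha = forward(pi, a, b, obs)
    beta = backward(a, b, obs, n)
    total = sum(alpha[-1])  # likelihood of the whole observation sequence
    gamma = [[alpha[t][i] * beta[t][i] / total for i in range(n)]
             for t in range(len(obs))]
    xi = [[[alpha[t][i] * a[i][j] * b(j, obs[t + 1]) * beta[t + 1][j] / total
            for j in range(n)] for i in range(n)]
          for t in range(len(obs) - 1)]
    return gamma, xi
```

Here `a[i][j]` is the transition probability from state i to state j, so summing `xi[t][i][j]` over j recovers `gamma[t][i]`.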
Meanwhile, in the case of the continuous output probability distribution, the probability $\gamma'_t(i, k)$ that the kth mixture factor of the state i is occupied at the time t is given by the following expression (12): ##EQU6##
Based on the foregoing calculated values, estimated values of $\pi$, a, c, $\mu$ and $\Sigma$ are given by the following expressions (13) to (17), respectively: ##EQU7##
In the Baum-Welch algorithm, the parameters are updated based on these estimated values, and new estimated values are then calculated using the updated parameters; this procedure is repeated. It has been proved that the probability of the observation sequence does not decrease with each repetition.
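One update step for the initial and transition probabilities can be sketched as follows. Because expressions (13) to (17) are not reproduced in this text, the code uses the textbook re-estimation formulas for a single utterance, and it omits the updates of the mixture weights, average vectors and covariance matrices. The function names and the posterior values are hypothetical.

```python
def reestimate_initial(gamma):
    """New initial-state probabilities: the state posteriors at the first frame."""
    return list(gamma[0])

def reestimate_transitions(gamma, xi):
    """New a[i][j]: expected i-to-j transitions over expected visits to i.

    gamma[t][i] and xi[t][i][j] are the posteriors for one utterance;
    xi has one entry fewer than gamma because it spans frame pairs.
    """
    n_states = len(gamma[0])
    new_a = []
    for i in range(n_states):
        visits = sum(gamma[t][i] for t in range(len(xi)))
        row = []
        for j in range(n_states):
            moves = sum(xi[t][i][j] for t in range(len(xi)))
            row.append(moves / visits if visits > 0.0 else 0.0)
        new_a.append(row)
    return new_a

# Hand-made posteriors for a two-state model and three frames.
gamma = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75]]
xi = [
    [[0.5, 0.5], [0.0, 0.0]],
    [[0.25, 0.25], [0.0, 0.5]],
]
new_pi = reestimate_initial(gamma)
new_a = reestimate_transitions(gamma, xi)
```

Repeating the cycle of computing posteriors and re-estimating parameters until the likelihood stops improving gives the iteration described above.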
The speech recognition apparatus has been described above taking an HMM as an example.
While a case wherein a standard pattern is produced for each monophone unit as a speaker-independent HMM has been described here, various other units, such as a demi-syllable unit or a triphone unit, may be used.
A demi-syllable unit is one of the two halves obtained when a syllable is cut into two, and a triphone unit is a phoneme unit in which the phonemes directly preceding and following the phoneme in the uttered voice are both taken into consideration. For example, in a word transcribed as "kotoba", the first /o/ is preceded by /k/ and followed by /t/, while the second /o/ is preceded by /t/ and followed by /b/. Since their preceding and following phonemes differ, the first /o/ and the second /o/ are regarded as different phonemes, and different standard patterns are produced for them.
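The "kotoba" example can be reproduced with a short sketch that labels each phoneme with its neighbours. The `left-center+right` label notation and the `#` word-boundary symbol are assumed conventions for this illustration, not something the text defines.

```python
def to_triphones(phonemes, boundary="#"):
    """Label each phoneme with its preceding and following phoneme."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

triphones = to_triphones(["k", "o", "t", "o", "b", "a"])
# The two /o/ phonemes receive different labels: "k-o+t" and "t-o+b".
```

Each distinct label would get its own standard pattern, which is why the number of triphone units is so much larger than the number of monophone units.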
Ordinary Japanese has 30 to 50 different monophone units, approximately 260 different demi-syllable units and 3,000 to 5,000 different triphone units. If a sufficient amount of uttered voice data for learning is available, the recognition performance improves as the number of kinds of units increases.
However, learning of speaker-independent HMMs normally requires the uttered voice of many speakers, and a sufficient amount of uttered voice cannot always be obtained.
For example, where a demi-syllable is used as the recognition unit, each recognition unit has four states and each state has two factor Gaussian distributions, utterances of approximately 250 words from 85 speakers are required in order to obtain sufficient recognition performance.
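To give a sense of scale, the sketch below counts the Gaussian distributions and output-probability parameters implied by such a configuration. It combines the figure of approximately 260 demi-syllable units with four states and two Gaussians per state; the feature dimension of 10 and the choice between diagonal and full covariance matrices are assumptions, as are the function names.

```python
def count_gaussians(n_units, states_per_unit, mixtures_per_state):
    """Total number of Gaussian distributions in the model set."""
    return n_units * states_per_unit * mixtures_per_state

def count_output_parameters(n_units, states_per_unit, mixtures_per_state,
                            dim, diagonal=True):
    """Average vector + covariance + one mixture weight per Gaussian."""
    covariance = dim if diagonal else dim * (dim + 1) // 2
    per_gaussian = dim + covariance + 1
    return count_gaussians(n_units, states_per_unit, mixtures_per_state) * per_gaussian

n_gaussians = count_gaussians(260, 4, 2)                 # 2080 Gaussians
n_params = count_output_parameters(260, 4, 2, dim=10)    # diagonal covariance
```

Moving from demi-syllables to several thousand triphone units multiplies these counts accordingly, which is why the amount of learning data becomes the limiting factor.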
Where a comparatively small number of different units is used, a comparatively small amount of utterances suffices for learning, but where a comparatively large number of different units is used, a comparatively large amount of utterances is required. Where the amount of utterances for learning is too small for the number of units, parameter estimation becomes unstable or parameters for which there is no corresponding learning data appear, and the recognition performance is degraded.
As described above, in the conventional method, a large amount of speech data is used for learning of speaker-independent HMMs. However, no criterion has so far been proposed or realized for determining, given a certain amount of speech data for learning, what number of kinds of recognition units is suitable, or, given a certain number of kinds of recognition units, what amount of speech data is required for learning.
Therefore, a trial-and-error technique is conventionally used, wherein recognition evaluation experiments are conducted on test data while the recognition unit is varied, and the optimum recognition unit is selected based on the results of the experiments.
However, this conventional method requires a sufficient amount of speech data for testing in addition to the speech data for learning, and requires much calculation time for the production of standard patterns and the repetition of recognition experiments.