1. Field of the Invention
The present invention relates to a method of preparing a speech model using a continuous distribution type HMM (Hidden Markov Model) and to a speech recognition apparatus using this method. More particularly, this invention relates to a speech model preparing method capable of preparing the HMM of a new input speech from very few utterances, such as one or two, and to a speech recognition apparatus using this method.
2. Description of Background Information
One known type of probability-model based speech recognition apparatus uses an HMM. An HMM is a Markov model which has a plurality of states linked by state transition probabilities and is characterized by outputting a predetermined symbol in accordance with a symbol output probability when a transition from one state to another occurs. Generally, speech recognition uses a left-to-right model, in which no transition goes backward in the time sequence.
A speech recognition apparatus using such HMMs is designed to prepare HMMs for all the words to be recognized and register them in a dictionary in advance. At the time of speech recognition, the speech recognition apparatus sequentially reads the HMMs from the dictionary, computes for each HMM the probability (likelihood) of outputting the same observing symbol series as the input speech, and outputs the word corresponding to the HMM which gives the highest probability as the recognition result.
FIG. 1 exemplifies an HMM, which can output two symbols a and b and has three internal states $S_1$ to $S_3$. The model starts from $S_1$ and ends at $S_3$. The lines with arrowheads connecting the individual states $S_1$ to $S_3$ represent the state transitions, and $a_{ij}$ along each arrow line indicates the state transition probability while $c_{ij}$ indicates the symbol output probability for that transition. The upper element in the brackets "[ ]" of the symbol output probability $c_{ij}$ is the output probability of the symbol a, and the lower element is the output probability of the symbol b.
Given that the observing symbol series of the input speech is (aab), the probability (likelihood) that the HMM in FIG. 1 outputs this observing symbol series (aab) is computed as follows (see "Markov Model Based Voice Recognition", by Masaaki Okouchi, Journal of Electronic Information Communication Society of Japan, April 1987, for example).
The observing symbol series (aab) has a length of three symbols, /a/a/b/, so that the transition paths which allow the observing symbol series (aab) to be output by the HMM in FIG. 1 are limited to three routes: $S_1 \to S_1 \to S_2 \to S_3$, $S_1 \to S_2 \to S_2 \to S_3$, and $S_1 \to S_1 \to S_1 \to S_3$.
Because the probability that the observing symbol series (aab) is output along a given transition path is the product of all the state transition probabilities $a_{ij}$ and symbol output probabilities $c_{ij}$ along that path, the probabilities for the three transition paths take the following values.
For $S_1 \to S_1 \to S_2 \to S_3$: $0.3 \times 0.8 \times 0.5 \times 1.0 \times 0.6 \times 0.5 = 0.036$
For $S_1 \to S_2 \to S_2 \to S_3$: $0.5 \times 1.0 \times 0.4 \times 0.3 \times 0.6 \times 0.5 = 0.018$
For $S_1 \to S_1 \to S_1 \to S_3$: $0.3 \times 0.8 \times 0.3 \times 0.8 \times 0.2 \times 1.0 = 0.01152$
Since any of the three transition paths can output the observing symbol series (aab), the sum of those three probabilities, 0.036+0.018+0.01152=0.06552, becomes the probability (likelihood) of outputting the observing symbol series (aab) for the HMM in FIG. 1. For simplicity, the maximum value, "0.036", in the computed three probabilities may be treated as the probability for that HMM.
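The arithmetic above can be checked with a short script. This is only an illustrative sketch: the path labels and the pairing of each factor as (transition probability, output probability) are read off the worked example, and names such as `PATHS` and `path_probability` are hypothetical.

```python
from math import prod

# Each path lists, per transition, the pair
# (state transition probability, symbol output probability)
# as quoted in the worked example for the series (aab).
PATHS = {
    ("S1", "S1", "S2", "S3"): [(0.3, 0.8), (0.5, 1.0), (0.6, 0.5)],
    ("S1", "S2", "S2", "S3"): [(0.5, 1.0), (0.4, 0.3), (0.6, 0.5)],
    ("S1", "S1", "S1", "S3"): [(0.3, 0.8), (0.3, 0.8), (0.2, 1.0)],
}

def path_probability(arcs):
    """Product of transition and output probabilities along one path."""
    return prod(a * c for a, c in arcs)

path_probs = {path: path_probability(arcs) for path, arcs in PATHS.items()}

# Sum over all paths: the likelihood of (aab) for the model.
total_likelihood = sum(path_probs.values())

# Largest single-path probability: the simplified score mentioned in the text.
best_path_likelihood = max(path_probs.values())

print(round(total_likelihood, 5), round(best_path_likelihood, 5))
```

The summed value corresponds to the full likelihood, and the maximum corresponds to the simplified single-path score.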
The input speech can be recognized by performing this probability computation for all the HMMs registered in the dictionary and outputting a word corresponding to the HMM that gives the highest value in the computed probabilities as the recognition result.
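The dictionary search described above amounts to an argmax over per-word likelihoods. The following is a minimal sketch under stated assumptions: `recognize` and `toy_dictionary` are hypothetical names, and the lambda scorers with made-up values merely stand in for real per-word HMM likelihood computations.

```python
def recognize(observation, dictionary):
    """Return (word, likelihood) for the model scoring the observation highest."""
    if not dictionary:
        raise ValueError("dictionary is empty")
    # Evaluate every registered model on the same observation.
    scores = {word: model(observation) for word, model in dictionary.items()}
    best_word = max(scores, key=scores.get)
    return best_word, scores[best_word]

# Hypothetical stand-ins for registered HMMs; the numbers are illustrative only.
toy_dictionary = {
    "word_a": lambda obs: 0.06552 if obs == "aab" else 0.001,
    "word_b": lambda obs: 0.02 if obs == "aab" else 0.001,
}

result = recognize("aab", toy_dictionary)
print(result)
```

In a real apparatus each dictionary entry would be an HMM and the scorer would be a likelihood computation over the observing symbol series.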
As shown in FIGS. 2A and 2B, there are a discrete distribution type HMM, for which the symbol output probability $c_{ij}$ changes discontinuously, and a continuous distribution type HMM, for which the symbol output probability $c_{ij}$ changes continuously. Because discrete distribution type HMMs suffer from quantization error, continuous distribution type HMMs are frequently used in speech recognition.
It is apparent from FIG. 2B that the symbol output probability $c_{ij}$ for the continuous distribution type is defined by the average vector $\mu$ and the variance $\Sigma$ of a symbol. Therefore, a continuous distribution type HMM is entirely described by three parameters: the state transition probability $a_{ij}$ (see FIG. 1), the average vector $\mu$ and the variance $\Sigma$.
To prepare an HMM for each registered word, learning must be performed using many samples from a population representing the speech model in order to estimate the three associated parameters. Several algorithms are known for this estimation, including the forward and backward algorithms.
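As an illustration of an output probability defined by an average vector and a variance, the sketch below evaluates a Gaussian log-density. It assumes a single Gaussian with a diagonal variance (one variance value per vector component); the text does not specify the exact form of the distribution, and `log_gaussian_diag` is a hypothetical helper name.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of vector x under a Gaussian with diagonal variance.

    x, mean and var are equal-length sequences; var holds the
    per-component variances (all must be positive).
    """
    if not (len(x) == len(mean) == len(var)):
        raise ValueError("x, mean and var must have the same length")
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        if vi <= 0.0:
            raise ValueError("variances must be positive")
        # Per-component term: -0.5 * (log(2*pi*var) + (x - mean)^2 / var)
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return log_p

# A feature vector at the mean scores higher than one far from it.
at_mean = log_gaussian_diag([1.0, 2.0], [1.0, 2.0], [0.5, 0.5])
off_mean = log_gaussian_diag([3.0, 0.0], [1.0, 2.0], [0.5, 0.5])
print(at_mean > off_mean)
```

Working in the log domain is a common design choice because products of many small probabilities underflow quickly.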
With regard to the computation of the probability that an HMM exemplified in FIG. 1 outputs the observing symbol series (aab), there are likewise known several algorithms which include the forward algorithm and Viterbi algorithm.
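A compact sketch of the forward and Viterbi computations for the FIG. 1 example follows. The arc table is an assumption reconstructed from the factors quoted in the worked example: output probabilities not quoted there are taken as the complement of the quoted one (the model outputs only a and b), and it is assumed that FIG. 1 has no arcs other than those listed. These unquoted values do not influence the result for (aab). The names `ARCS` and `score` are hypothetical.

```python
# (from_state, to_state) -> (transition probability, {symbol: output probability})
# States are indexed 0, 1, 2 for the first, second and third state.
ARCS = {
    (0, 0): (0.3, {"a": 0.8, "b": 0.2}),
    (0, 1): (0.5, {"a": 1.0, "b": 0.0}),
    (0, 2): (0.2, {"a": 0.0, "b": 1.0}),
    (1, 1): (0.4, {"a": 0.3, "b": 0.7}),
    (1, 2): (0.6, {"a": 0.5, "b": 0.5}),
}
N_STATES, START, FINAL = 3, 0, 2

def score(symbols, combine):
    """Propagate scores along the arcs, one observed symbol per transition.

    combine=sum gives the forward algorithm (sum over all paths);
    combine=max gives the Viterbi score (best single path).
    """
    alpha = [0.0] * N_STATES
    alpha[START] = 1.0
    for sym in symbols:
        incoming = [[] for _ in range(N_STATES)]
        for (i, j), (a, c) in ARCS.items():
            incoming[j].append(alpha[i] * a * c[sym])
        alpha = [combine(v) if v else 0.0 for v in incoming]
    return alpha[FINAL]

forward_likelihood = score("aab", sum)
viterbi_likelihood = score("aab", max)
print(round(forward_likelihood, 5), round(viterbi_likelihood, 5))
```

Both computations run in time proportional to the series length times the number of arcs, rather than enumerating every path explicitly.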
Because the HMM-based speech recognition apparatus executes speech recognition using the above-discussed probability scheme, it is well suited to recognizing the speech of unspecified speakers and is being adopted in various fields, such as a voice command system in a vehicular navigation apparatus.
However, the current HMM-based speech recognition apparatus is not yet perfect and causes a recognition error or recognition failure when words not registered in the dictionary are input or when uttered words, even if registered in the dictionary, differ greatly from the standard patterns.
When a recognition error or recognition failure occurs, it is necessary to prepare a new HMM for that input speech and add it to the dictionary in order to enable the recognition of that input speech the next time. According to the conventional learning method, however, a single word must be uttered ten to twenty times to estimate the three parameters, namely the state transition probability, average vector and variance, and prepare a new HMM for the input speech. This method requires a great deal of effort and time to register each additional word.
If the registration of additional words takes a lot of effort and time, it is difficult to perform the learning while an automobile or the like is running, so that the vehicle must be stopped every time additional registration is needed. When a user is in a hurry, additional registration is put off until much later, so that the user may forget the input speech (word) to be additionally registered or may even forget the task of adding the word itself.