1. Field of the Invention
The present invention relates to a method and an apparatus for training a Hidden Markov Model (HMM) used in speech recognition.
2. Description of the Related Art
As one speech recognition technique, a method for recognizing speech based on the Hidden Markov Model (HMM) (hereinafter referred to as the "HMM method") is known in the field.
Now, the speech recognition executed by this HMM method will be summarized with reference to FIG. 3.
In the HMM method, for instance, an HMM is prepared for each word to be recognized. Each HMM is constituted by a finite set of states (indicated as "S1" to "S4" in FIG. 3), a set of state transition probabilities (symbols a11, a12, . . ., aij denote the transition probability from state Si to state Sj), a set of output probability distributions, and so on. It should be noted that the general form of the output probability distribution is expressed as "bi(y)". When an acoustic parameter itself, such as the LPC cepstrum, is used, the signals outputted when a transition occurs from the state Si are in many cases expressed by a Gaussian distribution, so that the output probability distributions are defined by parameters such as (μ1, σ1), (μ2, σ2), . . ., as illustrated in FIG. 3.
Subsequently, the HMM training is carried out by employing training data to define the respective HMMs in such a manner that the probability at which the word of interest is produced becomes maximum. In other words, the transition probabilities among the states and the output probability distributions for the respective states are determined so as to maximize the probability at which the word of interest is produced. Then, when speech recognition is actually executed, the HMM that maximizes the probability of producing the observation result of the pronounced word is specified, and the word corresponding to the specified HMM is derived as the recognition result.
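The recognition step described above can be illustrated with a minimal sketch. It assumes one-dimensional observations, a single Gaussian per state, and the standard forward algorithm for computing the probability of an observation sequence; the dictionary layout, the function names, and the two toy word models are hypothetical and are not taken from the cited references.

```python
import math

def gaussian_pdf(y, mean, std):
    """Density of a 1-D Gaussian output distribution b_i(y)."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def forward_likelihood(hmm, observations):
    """Probability of the observation sequence under one HMM (forward algorithm)."""
    init, trans, outputs = hmm["init"], hmm["trans"], hmm["outputs"]
    n = len(init)
    # alpha[i]: probability of the observations so far, ending in state i
    alpha = [init[i] * gaussian_pdf(observations[0], *outputs[i]) for i in range(n)]
    for y in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * gaussian_pdf(y, *outputs[j])
            for j in range(n)
        ]
    return sum(alpha)

def recognize(word_hmms, observations):
    """Return the word whose HMM gives the observations the highest probability."""
    return max(word_hmms, key=lambda w: forward_likelihood(word_hmms[w], observations))

# Hypothetical two-state, left-to-right word models; outputs are (mean, std) pairs.
word_hmms = {
    "low": {"init": [1.0, 0.0], "trans": [[0.5, 0.5], [0.0, 1.0]],
            "outputs": [(0.0, 1.0), (1.0, 1.0)]},
    "high": {"init": [1.0, 0.0], "trans": [[0.5, 0.5], [0.0, 1.0]],
             "outputs": [(5.0, 1.0), (6.0, 1.0)]},
}
observations = [0.1, 0.2, 0.9, 1.1]
result = recognize(word_hmms, observations)
```

With these toy values the observations lie close to the means of the "low" model, so that word is selected. A practical implementation would work with log probabilities to avoid numerical underflow on long utterances.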
It should be noted that the above-described word HMM is described in, for example, The Bell System Technical Journal, 62, 4, 1983 Apr., pages 1053 to 1074, and Electronic Information Institute in Japan, July 1988, pages 55 to 61.
It has been confirmed that speech recognition employing such a word HMM can achieve a high recognition precision. However, with word HMMs, no speech recognition can be done for words other than those for which HMMs have been prepared. Therefore, when a speech recognizing apparatus with a large vocabulary is realized, both a large amount of training data and as many HMMs as there are words in the vocabulary must be prepared.
To address this, it is conceivable to prepare an HMM for each phoneme so as to recognize a word. However, speech is strongly influenced by the phoneme context, an effect called "articulation coupling". In other words, speech sometimes contains phonemes which are distinguishable as "allophones" in view of phonetics, although these phonemes are identical to each other in view of phonemics. As a practical matter, it is difficult to represent, with a single phoneme HMM, a phoneme whose observed pattern distribution is spread this widely. Therefore, the high recognition precision achieved with the word HMM could not be achieved in speech recognition using the phoneme HMM.
On the other hand, in order to remove the adverse influences caused by the articulation coupling, use of the diphone HMM and the triphone HMM has been proposed, which are phoneme models that depend on the forward and backward phonemic environments. In the diphone HMM, a model is prepared for every two phonemes, whereas in the triphone HMM, a model is prepared for every three phonemes. If these phonemic-environment-dependent phoneme HMMs are employed, the adverse influences caused by the articulation coupling can be removed. Accordingly, speech recognition with these HMMs can have higher recognition precision than recognition with a phoneme HMM that does not consider the phonemic environment. However, since a still larger number of models must be prepared, a large amount of training data is required to train these models.
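The growth in the number of models can be made concrete with a short sketch. It assumes that a triphone is labeled by the tuple (left context, phoneme, right context), that a boundary symbol "#" pads the ends of a word, and that every combination of phonemes may occur, which gives an upper bound rather than the number of models actually needed. The function names and the inventory size of 40 are hypothetical.

```python
def to_triphones(phonemes, boundary="#"):
    """Expand a phoneme sequence into (left, center, right) context-dependent units."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1]) for i in range(1, len(padded) - 1)]

def max_model_counts(inventory_size):
    """Upper bound on the number of models for each unit type."""
    n = inventory_size
    return {"phoneme": n, "diphone": n ** 2, "triphone": n ** 3}

units = to_triphones(["k", "a", "t"])
counts = max_model_counts(40)
```

For a hypothetical inventory of 40 phonemes the bound rises from 40 phoneme models to 1,600 diphone models and 64,000 triphone models, which is why the amount of training data available per model becomes the limiting factor.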
To address this, a technical report of The Telecommunication Institute in Japan, SP 95-21, pages 23 to 30, proposes sharing the states of the triphone HMMs in order to train the HMMs with high precision while using a relatively small quantity of training data. According to this technical idea, the states of the triphone HMMs are clustered, and a centroid state (SA, SB) representing the states belonging to each cluster (A, B) is calculated, as schematically illustrated in FIG. 4. Then, the HMM training is carried out by employing these centroid states.
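The clustering of states into centroid states can be sketched as follows. The cited report's actual clustering procedure and distance measure are not given here, so this sketch substitutes a plain k-means procedure over the Gaussian mean vectors of the states with squared Euclidean distance; the function names, the initial centroids, and the sample values are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two mean vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cluster_states(state_means, initial_centroids, iterations=20):
    """Assign each state to its nearest centroid, then recompute the centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(m, centroids[k]))
            for m in state_means
        ]
        for k in range(len(centroids)):
            members = [m for m, a in zip(state_means, assignment) if a == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = mean_vector(members)
    return centroids, assignment

# Hypothetical mean vectors of five triphone HMM states.
state_means = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [4.0, 4.0], [4.2, 4.0]]
centroids, assignment = cluster_states(state_means, [[0.0, 0.0], [4.0, 4.0]])
```

After clustering, every state in a cluster is replaced by the centroid of that cluster, so the training data of all member states contribute to estimating a single shared state.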
In accordance with this proposed technique, the HMM training can be performed with high precision while using a relatively small amount of training data. However, as indicated by the state SA1 in FIG. 4, when there is a state that belongs to the cluster A (namely, is close to the centroid state SA) but is also close to the centroid state SB of the cluster B, there is a problem that the recognition precision achieved by the HMM obtained by the training would deteriorate.
To solve the problem, the Applicant has proposed the multi-sharing of states (Japanese Patent Application No. 7-34062). That is, as illustrated in FIG. 5, in the multi-sharing of states, a state is expressed by a linear combination of centroid states. With this multi-sharing method, the states can be represented more accurately than with single sharing, so that an HMM with high recognition precision can be obtained.
However, in the HMM training method which employs such state sharing, the state sharing is performed for all of the triphone HMMs. As a consequence, there is a risk that the recognition precision of a triphone HMM which has been sufficiently trained is lowered. That is, when a sufficiently large amount of training data is prepared for some of the triphone HMMs but not for the others, there is a risk that, if state sharing is used for all of them, the effect expected from employing the state sharing could not be achieved.
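The difference between single sharing and multi-sharing can be sketched numerically. The sketch assumes that each centroid state has a one-dimensional Gaussian output distribution and that the linear combination uses non-negative weights summing to one; how the weights are determined in the referenced application is not stated here, so the weights, the centroid parameters, and the function names are hypothetical.

```python
import math

def gaussian_pdf(y, mean, std):
    """Density of a 1-D Gaussian output distribution."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def multi_shared_output(y, weights, centroid_states):
    """Output probability of a state expressed as a linear combination of centroid states."""
    return sum(
        w * gaussian_pdf(y, mean, std)
        for w, (mean, std) in zip(weights, centroid_states)
    )

# Hypothetical centroid states SA and SB, given as (mean, std).
centroid_states = [(0.0, 1.0), (4.0, 1.0)]

# A state such as SA1 that belongs to cluster A but also lies near SB.
y = 3.0
single = multi_shared_output(y, [1.0, 0.0], centroid_states)  # single sharing: SA only
multi = multi_shared_output(y, [0.6, 0.4], centroid_states)   # multi-sharing: SA and SB
```

With single sharing, an observation near SB receives almost no probability from a state tied to SA alone, whereas the weighted combination assigns it a much larger value, which is the sense in which multi-sharing represents such a boundary state more accurately.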