Generally, technology for recognizing fixed words common to unspecified users is known as speaker independent speech recognition. In speaker independent speech recognition, information about feature parameters of fixed words common to unspecified users is accumulated in a storage unit such as a ROM.
Known methods for converting speech samples into a feature parameter sequence include cepstrum analysis and linear predictive analysis. Methods employing hidden Markov models are generally used to prepare information (data) about feature parameters of fixed words common to unspecified users and compare the information with the feature parameter sequence converted from input speech.
Speaker independent speech recognition by means of hidden Markov models is described in detail in “Digital Signal Processing for Speech and Sound Information” (by Kiyohiro Shikano, Tetsu Nakamura, and Shiro Ise (Shokodo, Ltd.)).
For example, in the case of the Japanese language, a phoneme set described in Chapter 2 of “Digital Signal Processing for Speech and Sound Information” is used as a speech unit and each phoneme is modeled using a hidden Markov model. FIG. 6 shows a list of phoneme set labels. The word “Hokkaido,” for example, may be modeled using a network (sequence of fixed-word labels) of phoneme labels common to speakers.
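A fixed word modeled this way can be represented in software simply as a lookup from the word to its speaker-independent phoneme label sequence. The following is a minimal sketch; the label set and the particular spelling of "Hokkaido" are illustrative assumptions, not the actual phoneme labels of FIG. 6.

```python
# Hypothetical registry of fixed words, each mapped to a network
# (sequence) of speaker-independent phoneme labels.  The labels
# below are illustrative, not the actual label set of FIG. 6.
FIXED_WORDS = {
    "hokkaido": ["h", "o", "ts", "k", "a", "i", "d", "o", "u"],
}

def word_to_labels(word):
    """Look up the phoneme label sequence registered for a fixed word."""
    return FIXED_WORDS[word]

print(word_to_labels("hokkaido"))
```

In an actual device this table would be stored in ROM, one label sequence per vocabulary item.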
If the sequence of fixed-word labels shown in FIG. 7(A) and phoneme model data based on corresponding hidden Markov models as shown in FIG. 7(B) are provided, those skilled in the art can easily construct a speaker independent speech recognition device using the Viterbi algorithm described in Chapter 4 of “Digital Signal Processing for Speech and Sound Information.”
In FIG. 7(B), a(I, J) represents the transition probability of transition from state I to state J. For example, a(1, 1) in the figure represents the transition probability of transition from state 1 to state 1. Also, b(I, x) represents an output probability of state I given acoustic parameter (feature parameter) x. Thus, b(1, x) in the figure represents the output probability of state 1 when acoustic parameter x is detected.
In FIG. 7(B), pI represents the probability of state I and is updated according to Equation (1) below.

pI=max(p(I−1)×a(I−1, I), pI×a(I, I))×b(I, x)  (1)
Incidentally, max( ) on the right side of Equation (1) means that the largest product is selected from among the products in max( ). The same applies hereinafter.
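The state update of Equation (1) can be sketched in Python as follows. This is a minimal illustration under assumed data structures (a list of state probabilities, a transition matrix `a`, and an output-probability function `b_out`); it is not the device's actual implementation.

```python
def update_state_probs(p, a, b_out, x):
    """One Viterbi time step for a left-to-right HMM (Equation (1)):
        p[I] = max(p[I-1] * a[I-1][I], p[I] * a[I][I]) * b(I, x)
    p     : list of current state probabilities
    a     : a[i][j], transition probability from state i to state j
    b_out : b_out(i, x), output probability of state i for feature x
    x     : current feature (acoustic parameter) vector
    Updated in place from the last state backwards so that p[I-1]
    still holds the previous time step's value when p[I] is computed.
    """
    for i in range(len(p) - 1, -1, -1):
        stay = p[i] * a[i][i]                             # remain in state i
        enter = p[i - 1] * a[i - 1][i] if i > 0 else 0.0  # come from state i-1
        p[i] = max(stay, enter) * b_out(i, x)
    return p
```

Applying this step once per feature vector propagates probability forward through the left-to-right state sequence.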
Next, an overall flow of speech recognition using the above-mentioned hidden Markov models common to both males and females will be described with reference to FIG. 8.
First, feature parameters are detected in (extracted from) a speech signal. Occurrence probabilities of the feature parameter sequence are calculated using Equation (1) for each of the common hidden Markov models for both males and females. The common hidden Markov models, M1, M2, . . . Mn are determined in advance of the speech recognition process. The highest probability is selected from the calculated occurrence probabilities. The input speech is recognized by selecting the phoneme label sequence having the highest occurrence probability.
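The overall flow above can be sketched as follows: each of the common models M1..Mn is scored against the feature parameter sequence with the Viterbi update of Equation (1), and the label of the best-scoring model is returned. The model representation (initial probabilities, transition matrix, output-probability function) is an assumption for illustration.

```python
import math

def viterbi_step(p, a, b_out, x):
    # Equation (1): p[I] = max(p[I-1]*a[I-1][I], p[I]*a[I][I]) * b(I, x)
    new_p = []
    for i in range(len(p)):
        stay = p[i] * a[i][i]
        enter = p[i - 1] * a[i - 1][i] if i > 0 else 0.0
        new_p.append(max(stay, enter) * b_out(i, x))
    return new_p

def recognize(features, models):
    """Return the label of the model with the highest occurrence
    probability for the feature sequence (hypothetical sketch).
    `models` maps a label to (initial_probs, a, b_out)."""
    best_label, best_score = None, -math.inf
    for label, (p0, a, b_out) in models.items():
        p = list(p0)
        for x in features:
            p = viterbi_step(p, a, b_out, x)
        score = max(p)  # occurrence probability of the feature sequence
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A practical recognizer would work with log probabilities to avoid numerical underflow over long feature sequences, but the selection logic is the same.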
Acoustic conditions generally differ between adult males and females due to differences in vocal-tract length. Thus, in a method (multi-template) sometimes used to improve speech recognition rates, an acoustic model for males and an acoustic model for females are prepared separately from male voice data and female voice data, as shown in FIG. 9(A). Hidden Markov model state sequences composing the vocabulary to be recognized are then prepared both for males and for females, as shown in FIG. 9(B).
In FIG. 9(B), a(I, J) represents the transition probability of a model for females transitioning from state I to state J while A(I, J) represents the transition probability of a model for males transitioning from state I to state J. Also, b(I, x) represents an output probability in state I when acoustic parameter x of the model for females is obtained while B(I, x) represents an output probability in state I when acoustic parameter x of the model for males is obtained.
In FIG. 9(B), pI represents the probability of state I of the model for females and is updated according to Equation (2) below.

pI=max(p(I−1)×a(I−1, I), pI×a(I, I))×b(I, x)  (2)
Also in FIG. 9(B), PI represents the probability of state I of the model for males and is updated according to Equation (3) below.

PI=max(P(I−1)×A(I−1, I), PI×A(I, I))×B(I, x)  (3)
Next, an overall flow of speech recognition using the above-mentioned two types of hidden Markov models, hidden Markov models for males and females, will be described with reference to FIG. 10.
First, feature parameters are detected in (extracted from) a speech signal. Next, with reference to the detected feature parameters, hidden Markov models (words) Ma1, Ma2, . . . Man for males determined in advance, and hidden Markov models (words) Mb1, Mb2, . . . Mbn for females determined in advance, occurrence probabilities of the feature parameter sequence are calculated using Equations (2) and (3). Then, the highest probability is selected from the calculated probabilities and the phoneme label sequence which gives the highest probability is obtained as a recognition result of the input speech.
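The multi-template flow can be sketched by scoring both model sets and taking the overall best label. The state update is the same for both sets (Equations (2) and (3) have identical form); only the model parameters differ. The data structures are illustrative assumptions.

```python
import math

def viterbi_step(p, a, b_out, x):
    # State update of Equations (2)/(3), applied to one model.
    new_p = []
    for i in range(len(p)):
        stay = p[i] * a[i][i]
        enter = p[i - 1] * a[i - 1][i] if i > 0 else 0.0
        new_p.append(max(stay, enter) * b_out(i, x))
    return new_p

def recognize_multi(features, male_models, female_models):
    """Score the male models Ma1..Man and the female models Mb1..Mbn
    and return the best label over both sets (hypothetical sketch)."""
    best_label, best_score = None, -math.inf
    for models in (male_models, female_models):
        for label, (p0, a, b_out) in models.items():
            p = list(p0)
            for x in features:
                p = viterbi_step(p, a, b_out, x)
            score = max(p)
            if score > best_score:
                best_label, best_score = label, score
    return best_label
```

Note that both model sets, and both sets of state probabilities, must be held in memory at once, which is the source of the memory increase discussed next.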
In this case, the speech recognition rate is improved compared to a single acoustic model (hidden Markov model) prepared from combined male and female voice data. However, the memory used to store the vocabulary doubles compared to the common model for both males and females. In addition, the memory used to hold the probabilities of the various states also increases when gender-specific hidden Markov models are used.
As described above, the use of multi-template, gender-specific acoustic models for speaker independent speech recognition improves the speech recognition rate compared to when one acoustic model is prepared from male voice data and female voice data, but introduction of the multi-template practically doubles the vocabulary, resulting in increased memory usage.
Recently, demand for speech recognition in application programs has been growing among an increasingly wide range of age groups, and a high speech recognition rate is desired irrespective of age group. Thus, it is conceivable that separate acoustic models for adult males, adult females, children of elementary school age and younger, aged males, and aged females may be used in the future. In such a situation, the vocabulary data may increase by a factor of five, further increasing memory requirements.
The larger the vocabulary, the more serious the increase in memory requirements will be. The increased memory requirements for the larger vocabulary creates a large cost (production cost) disadvantage, for example, when incorporating speech recognition into a portable telephone. Thus, it is desired to curb increases in memory requirements and reduce production costs while improving speech recognition rates using multiple acoustic models.
Incidentally, even when a common acoustic model for both males and females is used, a single vocabulary item (word) is treated as two vocabulary items if it has different colloquial pronunciations. For example, the word “Hokkaido” may be pronounced in two ways: “hotskaidou” and “hotskaidoo.” This case can be handled using the Viterbi algorithm as shown in FIG. 11.
In FIG. 11(B), au(I, J) represents the transition probability of the phoneme u transitioning from state I to state J while ao(I, J) represents the transition probability of the phoneme o transitioning from state I to state J. Also, bu(I, x) represents an output probability in state I when acoustic parameter x of the phoneme u is obtained while bo(I, x) represents an output probability in state I when acoustic parameter x of the phoneme o is obtained.
In FIG. 11(B), uI represents the probability of state I of the phoneme u and is updated according to Equation (4) below.

uI=max(u(I−1)×au(I−1, I), uI×au(I, I))×bu(I, x)  (4)
Also in FIG. 11(B), oI represents the probability of state I of the phoneme o and is updated according to Equation (5) below.

oI=max(o(I−1)×ao(I−1, I), oI×ao(I, I))×bo(I, x)  (5)
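The pronunciation-variant search of FIG. 11 can be sketched as follows: each variant branch (the u branch of Equation (4) and the o branch of Equation (5)) is scored independently, and the better branch probability is kept. The per-branch representation is an illustrative assumption.

```python
def branch_step(p, a, b_out, x):
    # Equations (4)/(5): pI = max(p(I-1)*a(I-1,I), pI*a(I,I)) * b(I, x)
    return [max(p[i] * a[i][i],
                (p[i - 1] * a[i - 1][i]) if i > 0 else 0.0) * b_out(i, x)
            for i in range(len(p))]

def score_variants(features, branches):
    """Score each pronunciation variant branch (e.g. the 'u' and 'o'
    branches of FIG. 11) and keep the better one.  Each branch is
    (initial_probs, transition_matrix, output_prob_fn).  Sketch only."""
    best = 0.0
    for p0, a, b_out in branches:
        p = list(p0)
        for x in features:
            p = branch_step(p, a, b_out, x)
        best = max(best, max(p))
    return best
```

Because every variant keeps its own state probabilities, each additional pronunciation costs memory much like an additional vocabulary item.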
Again, memory requirements increase in this case, just as they do when multi-template, gender-specific acoustic models are used.
Thus, an object of the present invention is to provide a speech recognition device and speech recognition method that can improve speech recognition rates without substantially increasing the memory requirements of working memory or the like for speech recognition.