There is a conventional word speech recognition apparatus that deals with unnecessary words that do not need translation, using a garbage acoustic model that has learned a collection of unnecessary words (refer to “The Processing Method of Unnecessary Words in Free Conversational Sentences Using the Garbage HMM” by Naoki Inoue and two others, Academic Journal of Electronic Information and Communication A, Vol. J77-A, No. 2, pp. 215-222, February 1994).
FIG. 1 shows a structure of the conventional speech recognition apparatus.
As FIG. 1 shows, the conventional speech recognition apparatus comprises: the feature value calculation unit 1201, the network dictionary storage unit 1202, the path calculation unit 1203, the path candidate storage unit 1204, the recognition result output unit 1205, the language model storage unit 1206, the language score calculation unit 1207, the word acoustic model storage unit 1208, the word acoustic score calculation unit 1209, the garbage acoustic model storage unit 1210, and the garbage acoustic score calculation unit 1211.
The feature value calculation unit 1201 analyzes the unidentified input speech, and calculates the feature parameter necessary for recognition. The network dictionary storage unit 1202 stores the network dictionary wherein the list of the words that the speech recognition apparatus can accept is recorded. The path calculation unit 1203 calculates the cumulative score of the path for finding the optimum word string of the unidentified input speech, using the record of the network dictionary. The path candidate storage unit 1204 stores the information of the path candidate. The recognition result output unit 1205 outputs the word string whose final score is the highest as the recognition result.
Also, the language model storage unit 1206 stores the language model in which the appearance probabilities of words have been statistically learned in advance. The language score calculation unit 1207 calculates the language score, which is the probability of a word appearing in link with the previous word. The word acoustic model storage unit 1208 stores, in advance, the word acoustic model corresponding to the recognition subject vocabulary. The word acoustic score calculation unit 1209 calculates the word acoustic score by comparing the feature parameter and the word acoustic model.
In addition, the garbage acoustic model storage unit 1210 stores the garbage acoustic model that has learned a collection of unnecessary words that do not need translation such as “Ehmm” and “Uhmm”. The garbage acoustic score calculation unit 1211 calculates the garbage acoustic score which is the appearing probability of the unnecessary words (the garbage model) by comparing the feature parameter and the garbage acoustic model.
Next, the operations performed by each unit of the conventional speech recognition apparatus will be explained as follows.
First, the unidentified input speech a user has uttered is inputted into the feature value calculation unit 1201. Then, the feature value calculation unit 1201 calculates the feature parameter by analyzing the speech of each frame that is a time unit for speech analysis. Here, the length of a frame is 10 ms.
The path calculation unit 1203 refers to the network dictionary, where the acceptable word connections are recorded, stored in the network dictionary storage unit 1202. Then, the path calculation unit 1203 calculates the cumulative score of the path candidate to the corresponding frame, and registers the information of the path candidate in the path candidate storage unit 1204.
FIG. 2 shows the path candidates in the case where the input speech is “Sore wa, da, dare”. FIG. 2(a) shows the input speech with the word boundaries. FIG. 2(b) shows the path candidates in the case where the input frame is “t-1”. FIG. 2(c) shows the path candidates in the case where the input frame is “t”. The horizontal axis shows the frames. Here, the unnecessary stuttering word, “da”, is recognized as a garbage model. The garbage model is provided with a path in the same way as a word.
Here, the paths 511, 512, 513 and 52 are non-optimum paths that are at some midpoint of a word. The paths 521 and 522 are optimum paths that have reached the end of a word. The paths 531 and 532 are non-optimum paths that have reached the end of a word. The path 54 is an optimum path that is at some midpoint of a word.
The path calculation unit 1203 calculates the cumulative score for each path by extending the paths from each path candidate of the frame which precedes the corresponding frame by one frame.
FIG. 2(b) shows the path candidates in the “t-1” frame which is the frame preceding the corresponding frame “t” by one frame. This information of the path candidates is stored in the path candidate storage unit 1204. The paths are extended from these path candidates as shown in the frame “t” of FIG. 2(c). Some paths extend the words of the path candidates in the preceding frame; others finish the words of the path candidates in the preceding frame, and start the new words connectable to the previous words. Here, the connectable words are the ones recorded in the network dictionary.
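As one way to make the frame-by-frame procedure concrete, the path extension described above may be sketched as follows. This is only an illustrative sketch, not the conventional apparatus itself; the dictionary contents, the `<garbage>` token, and the function and variable names are assumptions introduced for illustration.

```python
# Allowed word connections, as the network dictionary would record them
# (illustrative entries based on the example of FIG. 2).
NETWORK_DICTIONARY = {
    "sore": ["wata", "wa"],
    "wata": ["tane", "gashi"],
    "wa": ["dare", "<garbage>"],
    "<garbage>": ["dare"],
}

def extend_paths(prev_candidates):
    """Extend each path candidate of frame t-1 into frame t.

    A candidate is (finished_words, current_word, at_word_end).
    Each path either continues its current word for one more frame,
    or, if the word has reached its end, also starts every word that
    the network dictionary records as connectable to it."""
    extended = []
    for words, current, at_end in prev_candidates:
        # Continue the current word by one more frame.
        extended.append((words, current, False))
        if at_end:
            # Start each word the network dictionary allows after it.
            for nxt in NETWORK_DICTIONARY.get(current, []):
                extended.append((words + [current], nxt, False))
    return extended

# One candidate in frame t-1: "sore" finished, "wa" at its word end.
print(extend_paths([(["sore"], "wa", True)]))
```

With this input, the sketch produces three candidates in frame t: the continued word “wa”, and the newly started words “dare” and the garbage model, mirroring the branching shown in FIG. 2(c).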
In FIG. 2(b) in the frame “t-1”, there are (i) the word, “wata”, of the non-optimum path 511 that is at some midpoint of the word, and (ii) the word, “wata”, of the optimum path 521 that has reached the end of the word. In FIG. 2(c) in the frame “t”, (i) the word, “wata”, of the non-optimum path 511 is further extended, (ii) the word, “wata”, of the optimum path 521 is connected to the word, “tane”, of the optimum path 54 that is at some midpoint of the word, and also to the word, “gashi” of the non-optimum path 512 that is at some midpoint of the word.
Next, the language score and the acoustic score are calculated for each of the extended path candidates.
The language score is calculated by the language score calculation unit 1207 using the language model stored in the language model storage unit 1206. As the language score, the logarithm value of the bigram probability is used, said bigram probability being the probability of a word connecting to the previous word. Here, in the optimum path 522 that has reached the end of the word, wherein “wata” connects to “sore”, the appearance probability of “wata” after “sore” is used. The language score is calculated once per word.
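The bigram language score described above can be sketched as follows. The probability values and the `BIGRAM` table are illustrative assumptions; an actual language model would have learned them statistically in advance.

```python
import math

# Hypothetical bigram probabilities P(word | previous word),
# standing in for the statistically learned language model.
BIGRAM = {
    ("sore", "wata"): 0.20,
    ("sore", "wa"): 0.10,
    ("wa", "dare"): 0.05,
}

def language_score(prev_word, word):
    """Language score: the logarithm of the bigram probability of
    `word` appearing in link with `prev_word`, calculated once per word."""
    return math.log(BIGRAM[(prev_word, word)])

print(language_score("sore", "wata"))  # log 0.20, about -1.609
```

Because logarithms are used, the per-word language scores along a path can simply be summed when the cumulative score of the path is formed.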
The acoustic score is calculated in relation to the input feature parameter vector (i) by the word acoustic score calculation unit 1209 using the word acoustic model stored in the word acoustic model storage unit 1208, in the case where the corresponding path candidate is a word and (ii) by the garbage acoustic score calculation unit 1211 using the garbage acoustic model stored in the garbage acoustic model storage unit 1210, in the case where the corresponding path candidate is an unnecessary word (a garbage model).
For instance, in FIG. 2(b) in the frame “t-1”, the paths for calculating the acoustic score are the four paths. The paths which use the word acoustic model are: “wata” of the path 511 connecting to “sore” of the path 522, “wata” of the path 521 connecting to “sore” of the path 522, and “dare” of the path 513 connecting to “wa” of the path 531. The path which uses the garbage acoustic model is “the garbage model” of the path 532 connecting to “wa” of the path 531.
As the acoustic model, in general, the hidden Markov model (HMM) which has stochastically modeled the acoustic features is used. The HMM that represents the acoustic features of words is called the word acoustic model. The HMM that represents a collection of the acoustic features of the unnecessary words that do not need translation, such as “Ehmm” and “Uhmm”, as one model is called the garbage acoustic model. The word acoustic score and the garbage acoustic score are the logarithm values of the probability acquired from the HMM, and show the appearance probability of the word and the garbage models.
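The acoustic score as a log probability from an HMM can be illustrated with a minimal discrete-observation HMM and the forward algorithm. The two-state left-to-right topology, the probability tables, and the symbolic observations are toy assumptions; a real word or garbage acoustic model would be trained on feature parameter vectors.

```python
import math

# Toy 2-state left-to-right HMM standing in for a word acoustic model.
TRANS = [[0.6, 0.4],          # transition probabilities
         [0.0, 1.0]]
EMIT = [{"a": 0.7, "b": 0.3},  # emission probabilities per state
        {"a": 0.2, "b": 0.8}]
INIT = [1.0, 0.0]              # start in state 0

def acoustic_score(observations):
    """Log probability of the observation sequence under the HMM,
    computed with the forward algorithm; this logarithm value is
    what serves as the acoustic score."""
    alpha = [INIT[s] * EMIT[s][observations[0]] for s in range(2)]
    for obs in observations[1:]:
        alpha = [sum(alpha[p] * TRANS[p][s] for p in range(2)) * EMIT[s][obs]
                 for s in range(2)]
    return math.log(sum(alpha))

print(acoustic_score(["a", "a", "b"]))
```

A separate HMM of the same form, trained on a collection of unnecessary words, would play the role of the garbage acoustic model, scored by the garbage acoustic score calculation unit in exactly the same way.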
The language score and the acoustic score acquired, as described above, are combined as a comparative score, and the cumulative score of each path is calculated by the Viterbi algorithm (refer to “Speech Recognition by the Probability Model” by Seiichi Nakagawa, edited by the Association of the Electronic Information and Communications, pp. 44-46, first published in 1988).
However, it is not preferable to simply record all of the extended path candidates, because the amount of calculation and the amount of memory increase enormously. Therefore, a beam search is used, which leaves, for each frame, only the “K” (“K” is a natural number) extended path candidates with the highest cumulative scores. The information of the “K” path candidates is registered in the path candidate storage unit 1204.
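The beam pruning step can be sketched as follows. The candidate list and its scores are illustrative values taken from the example of FIG. 2; the function name is an assumption.

```python
# Sketch of the beam search described above: keep only the K path
# candidates with the highest cumulative scores in each frame.
def prune_beam(path_candidates, k):
    """path_candidates: list of (word_string, cumulative_score).
    Returns the K best candidates, which would then be registered
    in the path candidate storage unit."""
    return sorted(path_candidates, key=lambda c: c[1], reverse=True)[:k]

candidates = [
    (["sore", "wata", "tane"], 18),
    (["sore", "wata"], 17),
    (["sore", "wata", "gashi"], 17),
    (["sore", "wa", "dare"], 16),
    (["sore", "wa", "<garbage>", "dare"], 15),
]
print(prune_beam(candidates, 3))
```

With K = 3, the two lowest-scoring paths, including the correct path that passes through the garbage model, would be discarded in this frame.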
The processes as described above are repeated, advancing the input frame one frame at a time.
Finally, after all the frame processing is completed, the recognition result output unit 1205 outputs the word string of the path candidate as the recognition result, said word string having the highest cumulative score among the path candidates stored in the path candidate storage unit 1204.
However, in the conventional example described above, there is a problem that the speech recognition apparatus makes a wrong recognition in the case where there is a word string acoustically similar to a non-language speech such as a stuttering speech.
Here, stuttering speech means disfluent speech production in which the beginning or middle of an utterance is blocked, the same sounds are repeated many times, or some sounds are stretched.
In FIG. 2(c) the number in the parentheses above each word is the comparative score for each word.
In FIG. 2(c) the right answer is that the stuttering part of the unidentified input speech, “da”, passes through the garbage model, and the path 52 connecting “dare” to “da” becomes the optimum path in the frame “t”. In the case of “sore”+“wata”, 7+10=17 points, in the case of “sore”+“wata”+“tane”, 7+9+2=18 points, in the case of “sore”+“wata”+“gashi”, 7+9+1=17 points, in the case of “sore”+“wa”+“dare”, 7+5+4=16 points, and in the case of “sore”+“wa”+the garbage model+“dare”, 7+5+2+1=15 points. Thus, the word string of “sore”+“wata”+“tane” has the highest score in the frame.
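The cumulative-score arithmetic of this example can be reproduced directly: the per-word comparative scores are simply summed along each path. The path labels are illustrative shorthand; the score values are those given above (note that “wata” scores 10 on the mid-word path 511 but 9 on the ended path 521).

```python
# Per-word comparative scores for each path candidate in frame "t",
# taken from the example of FIG. 2(c).
paths = {
    "sore+wata":            [7, 10],
    "sore+wata+tane":       [7, 9, 2],
    "sore+wata+gashi":      [7, 9, 1],
    "sore+wa+dare":         [7, 5, 4],
    "sore+wa+garbage+dare": [7, 5, 2, 1],
}
# The cumulative score of a path is the sum of its per-word scores.
totals = {name: sum(scores) for name, scores in paths.items()}
best = max(totals, key=totals.get)
print(totals)
print(best)  # "sore+wata+tane" with 18 points
```

As the text states, the wrong word string “sore”+“wata”+“tane” wins with 18 points, while the correct path through the garbage model trails with only 15 points because of its low garbage acoustic score.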
The reason for the result described above is as follows. The garbage acoustic model learns from all of the acoustic data containing unnecessary words, including stuttering speeches. Such a broad distribution of the unnecessary words prevents the speech production of unnecessary words, that is, non-language speeches, from acquiring high acoustic scores.
In order to solve this problem, there is a method of uniformly boosting all of the garbage acoustic scores. However, with this method, the garbage acoustic score is high even in frames where the optimum path does not include unnecessary words. Therefore, the speech recognition apparatus makes a wrong recognition.
An object of the present invention, in view of the above problem, is to provide a speech recognition apparatus that can correctly recognize unidentified input speeches even if they include unnecessary words, in particular, non-language speeches such as stuttering speeches.