FIG. 1 illustrates a conventional speech recognition apparatus.
A speech uttered by a user is input to a microphone 1 and converted into an electrical speech signal. The speech signal is supplied to an A/D (analog-to-digital) converter 2. The A/D converter 2 samples and quantizes the speech signal supplied from the microphone 1 thereby converting the speech signal into digital speech data. The resultant speech data is supplied to a feature extracting unit 3.
The feature extracting unit 3 performs acoustic processing on the speech data supplied from the A/D converter 2 on a frame-by-frame basis to extract feature values such as MFCCs (Mel-Frequency Cepstral Coefficients). The extracted feature values are supplied to a matching unit 4. The feature values extracted by the feature extracting unit 3 are not limited to MFCCs; other types of feature values such as spectra, linear prediction coefficients, cepstrum coefficients, or line spectra may also be extracted.
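The frame-by-frame processing described above can be illustrated by a minimal sketch. The frame length, the hop length, and the use of mean energy as the per-frame feature are assumptions made only for illustration; an actual feature extracting unit would compute MFCCs or one of the other feature types listed above. The function names are hypothetical.

```python
def split_into_frames(samples, frame_len, hop_len):
    """Cut the sampled speech data into overlapping frames.

    Each frame holds `frame_len` samples and successive frames start
    `hop_len` samples apart. Trailing samples that do not fill a whole
    frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def frame_energy(frame):
    """Mean squared amplitude of one frame (a stand-in for a real feature)."""
    return sum(x * x for x in frame) / len(frame)


# Ten toy samples, frames of 4 samples advancing by 2 samples.
samples = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
frames = split_into_frames(samples, frame_len=4, hop_len=2)
energies = [frame_energy(f) for f in frames]
```

Each element of `energies` plays the role of one feature value in the time sequence supplied to the matching unit 4.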
The matching unit 4 analyzes the feature values supplied from the feature extracting unit 3 while referring to an acoustic model database 5, a dictionary database 6, and a grammar database 7 as required, thereby recognizing the speech input via the microphone 1 using the continuous-distribution HMM method or the like.
The acoustic model database 5 stores an acoustic model representing acoustic characteristics of respective phonemes and syllables of the speech, in a particular language, to be recognized. When the speech recognition is performed using the continuous-distribution HMM algorithm, an acoustic model based on the HMM (Hidden Markov Model) is used. The dictionary database 6 stores a word dictionary describing information about pronunciations of the respective words (vocabulary) to be recognized. The grammar database 7 stores a grammar (language model) describing how the respective words stored in the word dictionary of the dictionary database 6 can be concatenated (connected) with each other. As for the grammar, a context-free grammar (CFG), a statistical word concatenation probability model (N-gram), or the like is used.
The matching unit 4 creates acoustic models of words (word models) by applying the acoustic models stored in the acoustic model database 5 to the words described in the word dictionary stored in the dictionary database 6. Furthermore, the matching unit 4 concatenates some word models with each other on the basis of the grammar stored in the grammar database 7 and recognizes the speech input via the microphone 1 using the concatenated word models, in accordance with the continuous-distribution HMM algorithm. That is, the matching unit 4 detects the series of word models that results in the highest score (highest likelihood) when applied to the feature values output in time sequence from the feature extracting unit 3 and employs the series of words corresponding to the detected series of word models as the result of speech recognition.
More specifically, the matching unit 4 calculates the sum of the occurrence probabilities of respective feature values for a series of words corresponding to the concatenated word models, and employs the sum as the score of the series of words. Of various series of words, one which has a highest score is employed as the speech recognition result.
In general, the score is determined by comprehensively evaluating the acoustic score calculated on the basis of the acoustic model stored in the acoustic model database 5 (hereinafter referred to simply as an acoustic score) and the language score calculated on the basis of the grammar stored in the grammar database 7 (hereinafter referred to simply as a language score).
More specifically, for example, in the case where the HMM method is used, the acoustic score is calculated for each word on the basis of the probabilities of occurrence, determined from the acoustic models, of a series of feature values output from the feature extracting unit 3. On the other hand, in the case where a bigram is used, the language score is determined on the basis of the probability of connection between a word of interest and the immediately preceding word. The score of the series of words is then determined by comprehensively evaluating the acoustic scores and the language scores of the respective words (hereinafter, a score determined in such a manner will be referred to simply as an overall score), and the speech recognition result is determined on the basis of the overall score.
More specifically, when a series of N words is given, if the kth word is represented by wk and the acoustic and language scores of the word wk are represented by A(wk) and L(wk), respectively, the overall score S of that series of words can be calculated according to, for example, the following equation:
S = Σ(A(wk) + Ck × L(wk))   (1)
where Σ represents the summation for k=1 to N, and Ck represents the weighting factor for the language score L(wk) of the word wk.
The matching unit 4 performs a matching process to determine N and the series of words w1, w2 , . . . , wN which result in the maximum score calculated according to, for example, equation (1), and the resultant series of words w1, w2, . . . , wN is output as the speech recognition result.
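The score calculation of equation (1) and the selection of the maximum-score word series can be sketched as follows. All numerical scores and the weighting factors are hypothetical values chosen only to make the example concrete, and the function names are not taken from the apparatus of FIG. 1.

```python
def overall_score(word_scores, weights):
    """Equation (1): sum over k of A(w_k) + C_k * L(w_k).

    word_scores: list of (acoustic score A, language score L), one per word.
    weights:     list of weighting factors C_k, one per word.
    """
    return sum(a + c * l for (a, l), c in zip(word_scores, weights))


def best_series(candidates, weight):
    """Return the candidate word series with the maximum overall score.

    The same weighting factor is used for every word here, which is an
    assumption of this sketch; equation (1) allows a different C_k per word.
    """
    def score(series):
        scores = candidates[series]
        return overall_score(scores, [weight] * len(scores))
    return max(candidates, key=score)


# Hypothetical (A, L) pairs, expressed as log-probability-like values.
candidates = {
    ("New York", "ni", "yukitai", "desu"):
        [(-10.0, -2.0), (-3.0, -1.0), (-8.0, -2.0), (-4.0, -1.0)],
    ("New York", "ni", "desu"):
        [(-10.0, -2.0), (-3.0, -1.0), (-15.0, -4.0)],
}

result = best_series(candidates, weight=2.0)
```

Note that the two candidates contain different numbers of words, so the search determines N as well as the words themselves.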
For example, if a user utters speech “New York ni yukitai desu” (“I want to go to New York.”), the speech recognition apparatus shown in FIG. 1 calculates the acoustic scores and the language scores of respective words “New York”, “ni”, “yukitai”, and “desu”. When the calculated acoustic scores and language scores result in a highest overall score, the series of words “New York”, “ni”, “yukitai”, and “desu” is output as the speech recognition result.
In this specific example, if the word dictionary of the dictionary database 6 includes five words “New York”, “ni”, “yukitai”, and “desu” and if they are all the word dictionary includes, then these five words can be arranged into a word series in as many as 5^5 ways. Thus, in the simplest manner of evaluation, the matching unit 4 evaluates 5^5 series of words and selects, from the 5^5 series of words, the one series of words that best matches the speech uttered by the user (i.e., the series of words having the highest overall score). The number of ways that words can be arranged into a word series is given by the number of words raised to the power of the number of words, and thus the number of word series to be evaluated increases tremendously with the number of words registered in the word dictionary.
Because the number of words included in speech is generally unknown, not only series of five words but also series of a different number of words such as a series of one word, series of two words, and so on, have to be evaluated. This results in a further increase in the number of word series that should be evaluated. Thus, from the viewpoint of the amount of computation and the memory space used in the calculation, it is very important to efficiently determine a most likely word series as the speech recognition result from the huge number of word series.
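The growth in the number of word series can be made concrete with a short calculation. The sketch assumes the counting rule stated above, namely that any dictionary word may occupy any position of a series; the function names are hypothetical.

```python
def count_series_of_length(vocab_size, length):
    """Number of word series of exactly `length` words."""
    return vocab_size ** length


def count_series_up_to(vocab_size, max_length):
    """Number of word series of 1, 2, ..., max_length words."""
    return sum(vocab_size ** n for n in range(1, max_length + 1))


five_word_series = count_series_of_length(5, 5)   # 3125
all_series_up_to_five = count_series_up_to(5, 5)  # 3905
```

Even for a five-word dictionary the count reaches the thousands, and doubling the dictionary to ten words already gives 10**10 series of ten words.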
One technique of reducing the amount of computation and the memory space is to terminate the calculation of the score when the acoustic score becomes lower than a predetermined threshold value in the middle of the acoustic score calculation process. This is called an acoustic pruning method. Another technique is to linguistically prune words on the basis of the language score to reduce the number of words to be evaluated.
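The acoustic pruning described above can be sketched as follows. The per-frame log-probabilities and the threshold are hypothetical, and a fixed threshold is used as in the description; the function name is an assumption of this sketch.

```python
def acoustic_score_with_pruning(frame_log_probs, threshold):
    """Accumulate the acoustic score frame by frame.

    Returns the total score, or None if the running score drops below
    `threshold` partway through, in which case the remaining frames are
    never evaluated.
    """
    total = 0.0
    for log_prob in frame_log_probs:
        total += log_prob
        if total < threshold:
            return None  # pruned: stop calculating this word
    return total


# Hypothetical per-frame log-probabilities for two candidate words.
frame_scores = {
    "wa": [-1.0, -1.5, -1.0],
    "yoi": [-4.0, -5.0, -6.0],
}

survivors = {}
for word, log_probs in frame_scores.items():
    score = acoustic_score_with_pruning(log_probs, threshold=-8.0)
    if score is not None:
        survivors[word] = score
```

Here "yoi" is abandoned after its second frame, so its third frame costs nothing. Lowering the threshold keeps more words alive at the price of more computation.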
By using such a pruning method, it is possible to limit the calculation of the score to particular words selected in accordance with a predetermined criterion (such as the acoustic scores or the language scores of words obtained in the middle of the calculation), thereby reducing the amount of computation. However, if the pruning is performed to too great an extent, that is, if the criterion is too strict, a correct word that should be included in the speech recognition result may be discarded, and the speech recognition result then becomes wrong. Therefore, when the pruning method is employed, there should be a sufficiently large margin in the pruning process so that correct words to be included in the speech recognition result are not discarded. This makes it difficult to greatly reduce the amount of computation.
If the calculation of the acoustic score is performed independently for every word to be evaluated, a large amount of computation is required. To avoid this problem, it has been proposed to partially commonize (share) the calculation of the score among a plurality of words. One method of commonizing the calculation for a plurality of words whose phonemes in the beginning part are equal to each other is to construct a tree-structured network by applying the same acoustic models to the beginning part having the same phonemes and applying individual acoustic models to the following different phonemes, and to determine the acoustic score using the tree-structured network. By way of example, when the word dictionary includes a word “Akita” whose pronunciation information is registered as “akita” and also includes a word “Akebono” whose pronunciation information is registered as “akebono”, the acoustic scores of the words “Akita” and “Akebono” are calculated in common for the first and second phonemes a and k. The acoustic scores of the remaining phonemes i, t, and a of the word “Akita” and the acoustic scores of the remaining phonemes e, b, o, n, and o of the word “Akebono” are calculated independently.
This technique allows a great reduction in the amount of computation required to determine the acoustic scores.
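The tree-structured network for “Akita” and “Akebono” can be sketched as a prefix tree over phonemes. Treating each letter of the registered pronunciation as one phoneme, and representing a node as a nested dictionary, are assumptions of this sketch.

```python
WORD_KEY = "#word"  # marks the node at which a complete word ends


def build_prefix_tree(lexicon):
    """Build a tree in which words sharing leading phonemes share nodes."""
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for phoneme in phonemes:
            node = node.setdefault(phoneme, {})
        node[WORD_KEY] = word
    return root


def count_phoneme_nodes(node):
    """Count phoneme nodes, i.e. acoustic models that must be evaluated."""
    return sum(
        1 + count_phoneme_nodes(child)
        for key, child in node.items()
        if key != WORD_KEY
    )


lexicon = {"Akita": list("akita"), "Akebono": list("akebono")}
tree = build_prefix_tree(lexicon)

shared_nodes = count_phoneme_nodes(tree)                   # with sharing
separate_nodes = sum(len(p) for p in lexicon.values())     # without sharing
branches_after_ak = sorted(tree["a"]["k"].keys())
```

With sharing, 10 phoneme nodes are evaluated instead of 12, and the tree branches into "i" and "e" only after the common prefix a, k.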
However, in this technique, when the acoustic score of a common part of words is being calculated, it is impossible to identify which word is being subjected to the calculation of the acoustic score. In the specific example of calculation of acoustic scores for words “Akita” and “Akebono”, when the acoustic scores of the first and second phonemes a and k are being calculated, it is impossible to identify whether the acoustic score is being calculated for “Akita” or “Akebono”.
As for “Akita”, in this specific case, when the calculation of the acoustic score of the third phoneme “i” is started, it becomes possible to identify that the word being calculated is “Akita”. Similarly, in the case of the word “Akebono”, when the calculation of the acoustic score of the third phoneme “e” is started, it becomes possible to identify that the word being calculated is “Akebono”.
That is, if the calculation of the acoustic scores is performed in common for overlapping parts of a plurality of words, it is impossible, at the point of time at which the calculation of the acoustic score of a word is started, to identify which word is currently being calculated, and thus it is impossible to apply a corresponding language score to that word. Therefore, it is difficult to perform the word pruning process before starting the calculation of the acoustic scores of words. This causes unnecessary calculation to be performed.
Furthermore, in the case where the calculation of acoustic scores is partially commonized, a tree-structured network including all words of the word dictionary is formed, and a large memory space is needed to store the tree-structured network.
Another technique of reducing the amount of computation and the memory space is to, instead of calculating the acoustic scores for all words registered in the word dictionary, preliminarily select words and calculate the acoustic scores for only those preliminarily selected words. The preliminary selection is performed on the basis of an acoustic model or grammar which is simple but not very strict.
An example of the manner of preliminary selection can be found in “A Fast Approximate Acoustic Match for Large Vocabulary Speech Recognition” (V. De Gennaro, P. S. Gopalakrishnan and R. L. Mercer, IEEE Trans. Speech and Audio Proc., vol. 1, pp. 59-67, 1993).
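The two-stage structure of preliminary selection followed by detailed scoring can be sketched as follows. This is a generic illustration and not the method of the cited article; the score tables, the shortlist size, and the function names are hypothetical.

```python
def preliminary_select(vocabulary, cheap_score, keep):
    """Rank every word with the cheap model and keep the best `keep` words."""
    ranked = sorted(vocabulary, key=cheap_score, reverse=True)
    return ranked[:keep]


def recognize_word(vocabulary, cheap_score, detailed_score, keep):
    """Run the detailed (expensive) scoring only on the shortlist."""
    shortlist = preliminary_select(vocabulary, cheap_score, keep)
    return max(shortlist, key=detailed_score), shortlist


# Hypothetical scores: a rough model and a precise model for five words.
CHEAP = {"kyo": -7.0, "wa": -2.0, "yoi": -2.5, "tenki": -9.0, "desune": -8.0}
DETAILED = {"kyo": -9.0, "wa": -3.0, "yoi": -6.0, "tenki": -12.0, "desune": -11.0}

detailed_calls = []


def detailed_score(word):
    detailed_calls.append(word)  # record how often the costly model is used
    return DETAILED[word]


best, shortlist = recognize_word(list(CHEAP), CHEAP.get, detailed_score, keep=2)
```

The costly model is evaluated for two words instead of five. If the cheap model had ranked the correct word below the cut, the detailed model would never have seen it.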
When the acoustic score of a word is calculated using a series of feature values of a given speech, the acoustic score changes depending on the location of the start point or the end point of the series of feature values used in the calculation. This change in the acoustic score affects the overall score determined from the acoustic scores and the language scores in accordance with equation (1).
The start and end points of a series of feature values corresponding to a word, i.e., boundaries between adjacent words (word boundaries) may be determined, for example, by a dynamic programming method. In this technique, arbitrary points of a series of feature values are taken as candidates for word boundaries, and the acoustic score and the language score are calculated for each word of the series of words taken as a candidate for the speech recognition result, and the total score of the acoustic score and the language score (hereinafter, such a total score will be referred to as a word score) is cumulatively added from one word to another. A maximum cumulative sum of word scores and candidate word boundaries which result in the maximum cumulative sum are stored.
When the maximum cumulative sum of word scores is finally determined, the word boundaries that result in the best cumulative sum, i.e., the maximum cumulative sum, are also determined.
The method of determining the word boundaries in the above-described manner is called Viterbi decoding or one-pass decoding. A more detailed description of this method can be found, for example, in an article entitled “Speech Recognition Using a Stochastic Model” by Seiichi Nakagawa published in the Journal of the Institute of Electronics, Information and Communication Engineers, pp. 20-26, Jul. 1, 1988.
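The dynamic programming search over candidate word boundaries can be sketched as follows. This is a simplified segment-level program, not a full frame-level HMM Viterbi decoder. The toy frame labels, the match/mismatch acoustic score, and the constant per-word language score are all assumptions made for illustration.

```python
# Each toy frame is a single label; each word "expects" one label.
TEMPLATE = {"kyo": "k", "wa": "w", "yoi": "y"}
LANGUAGE_SCORE = -0.5  # hypothetical constant language score per word


def word_score(word, frames, start, end):
    """Acoustic score (+1 per matching frame, -1 otherwise) plus language score."""
    acoustic = sum(1.0 if f == TEMPLATE[word] else -1.0 for f in frames[start:end])
    return acoustic + LANGUAGE_SCORE


def best_segmentation(frames, vocabulary):
    """Find the word series and boundaries with the maximum cumulative word score."""
    n = len(frames)
    best = [float("-inf")] * (n + 1)  # best[t]: best cumulative sum ending at t
    back = [None] * (n + 1)           # back[t]: (start time, word) achieving it
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):      # every earlier time is a candidate boundary
            for word in vocabulary:
                score = best[start] + word_score(word, frames, start, end)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, word)
    # The boundaries are known only now, by tracing back from the final frame.
    words, boundaries = [], []
    t = n
    while t > 0:
        start, word = back[t]
        words.append(word)
        boundaries.append(t)
        t = start
    return best[n], words[::-1], boundaries[::-1]


total, words, boundaries = best_segmentation("kkkwwyyy", list(TEMPLATE))
```

The trace-back starts at the last frame, which illustrates why the word boundaries are fixed only after the whole utterance has been scored.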
In order to efficiently perform the preliminary selection described above, it is very important to properly determine word boundaries. That is, it is very important to properly select a start point of each word in a series of feature values (feature value series).
In a specific example, shown in FIG. 2(A), of the speech “Kyo wa yoi tenki desune.” (which is equivalent as a whole to “What a beautiful day!”; on a stricter word-for-word basis, “kyo” corresponds to “today”, “yoi” to “good”, “tenki” to “weather”, and “desune” to “isn't it”, while “wa” is a particle having no corresponding English word), let us assume that the correct boundary between the words “kyo” and “wa” is located at a point of time t1. When the word “wa” following the word “kyo” is preliminarily selected, if a time t1−1 before the correct time t1 is employed as the start point thereof, then not only the feature values of the word “wa” but also those of an end part of the word “kyo” immediately before the word “wa” affect the preliminary selection. Conversely, if the preliminary selection is performed such that a point of time t1+1 after the correct time t1 is employed as the start point, then the feature values of a beginning part of the word “wa” are not used in the preliminary selection.
In any case, the incorrect starting point adversely affects the preliminary selection and further affects a matching process performed thereafter.
In FIG. 2 (and also FIGS. 4 and 6 which will be referred to later), time elapses from left to right, and speech starts at a time 0 and ends at a time T.
In the dynamic programming method, the final word boundaries cannot be determined until the calculation of the word score (the acoustic score and the language score) is completed for the last feature value, for example, the feature value at the end time T of the speech duration in the specific example shown in FIG. 2. It is therefore difficult to uniquely determine the start points of preliminarily selected words, i.e., the word boundaries, at the stage at which the preliminary selection is performed.
In view of the above, candidate word boundaries may be retained until the calculation of word scores using a feature value series in a speech duration is completed.
In this method, for example, when the word score for “kyo” is calculated by employing the start time, 0, of the speech as the start point of the word, if times t1−1, t1, and t1+1 are obtained as candidates for the end point of the uttered word “kyo”, then these three times t1−1, t1, and t1+1 are retained, and preliminary selections of the next word are made by employing these three times as the start point of the next word.
Herein, let us assume that when the time t1−1 is employed as the start point of the preliminary selection, two words “wa” and “yoi” are obtained, when the time t1 is employed as the start point, one word “wa” is obtained, and when the time t1+1 is employed as the start point, two words “wa” and “yoi” are obtained. Let us further assume that by calculating the word scores for the respective words described above, candidates for a partial word series are obtained as shown in FIGS. 2(B) to 2(G).
That is, in FIG. 2(B), the word score for the word “wa” is calculated by employing the time t1−1 as the start point, and a time t2 is obtained as a candidate for the end point thereof. In FIG. 2(C), the word score for the word “yoi” is calculated by employing the time t1−1 as the start point, and a time t2+1 is obtained as a candidate for the end point thereof. In FIG. 2(D), the word score for the word “wa” is calculated by employing the time t1 as the start point, and a time t2+1 is obtained as a candidate for the end point thereof. In FIG. 2(E), the word score for the word “wa” is calculated by employing the time t1 as the start point, and a time t2 is obtained as a candidate for the end point thereof. In FIG. 2(F), the word score for the word “wa” is calculated by employing the time t1+1 as the start point, and a time t2 is obtained as a candidate for the end point thereof. In FIG. 2(G), the word score for the word “yoi” is calculated by employing the time t1+1 as the start point, and a time t2+2 is obtained as a candidate for the end point thereof. In FIG. 2, t1−1<t1<t1+1<t2<t2+1<t2+2.
Of the calculations shown in FIGS. 2(B) to 2(G), those shown in FIGS. 2(B), 2(E), and 2(F) have the same series of words “kyo” and “wa” as a candidate for the speech recognition result, and the end point of the last word “wa” of the series of words is equally located at the time t2. Thus, it is possible to select the best one from those shown in FIGS. 2(B), 2(E), and 2(F) on the basis of the cumulative sum of word scores calculated for the series of words ending at the time t2, and to discard the others.
However, at this point of time, it is impossible to select a correct one from the group consisting of the candidate selected above and the remaining three candidates shown in FIGS. 2(C), 2(D), and 2(G). Thus these four candidates are retained, and a further preliminary selection of a following word is made for each of these four candidates.
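The merging of candidates that share both the word series and the end point can be sketched as follows. The cumulative scores are hypothetical, and the candidate labeled G is taken here to be the word “yoi” ending at t2+2, which is an assumption based on the words obtained when t1+1 is the start point.

```python
# (label, word series, end time of last word, cumulative word score)
hypotheses = [
    ("B", ("kyo", "wa"), "t2", -10.0),
    ("C", ("kyo", "yoi"), "t2+1", -14.0),
    ("D", ("kyo", "wa"), "t2+1", -11.0),
    ("E", ("kyo", "wa"), "t2", -9.0),
    ("F", ("kyo", "wa"), "t2", -12.0),
    ("G", ("kyo", "yoi"), "t2+2", -15.0),  # assumed word and end time
]


def merge_hypotheses(hyps):
    """Keep only the best-scoring candidate per (word series, end time)."""
    best = {}
    for hyp in hyps:
        _, words, end, score = hyp
        key = (words, end)
        if key not in best or score > best[key][3]:
            best[key] = hyp
    return list(best.values())


retained = merge_hypotheses(hypotheses)
retained_labels = sorted(h[0] for h in retained)
```

Only B, E, and F collapse into one candidate. The other three differ in word series or end time, so four candidates must still be carried forward.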
Thus, in the calculation of word scores according to this method, it is necessary to retain a large number of candidates for word boundaries until the calculation of word scores for the series of feature values in a speech duration is finally completed. This is undesirable from the standpoint of the amount of computation and the efficient use of the memory space.
In this technique, if all correct word boundaries are included in retained candidates for word boundaries, it is theoretically possible to finally obtain the correct word boundaries as can be obtained by the dynamic programming method. However, if a correct word boundary is not included in the retained candidates, a word starting from that boundary or a word ending at that boundary is recognized wrongly, and this can further cause following words to be wrongly recognized.
In general, preliminary selections are made on the basis of an acoustic model or a grammar which is simple but not very strict. Because a preliminary selection is made from all words registered in the word dictionary, employing a high-precision acoustic model or grammar in the preliminary selection would require a large amount of calculation and a large memory space in real time. To avoid this problem, the preliminary selection is performed using a simple acoustic model and a simple grammar, thereby making it possible to perform the preliminary selection at a high speed using a relatively small amount of resources even when the preliminary selection is made from a very large set of words.
In the preliminary selection, after determining a likely end point of a word by means of a matching process using a series of feature values (feature value series), a preliminary selection of a following word starting at the end point of the preliminarily selected previous word is made using a feature value series starting at a point of time corresponding to the start point of the following word. That is, the preliminary selection is made at a processing stage at which boundaries (word boundaries) between words included in a speech utterance have not been finally determined.
Therefore, if the start point or the end point of a series of feature values used in the preliminary selection deviates from the start point or the end point of the corresponding word, then the series of feature values employed in the preliminary selection includes some feature values of phonemes of the word immediately before or after the present word and lacks some feature values at the beginning or the end of the present word. Thus, the preliminary selection is performed using a series of feature values which is acoustically unstable.
Therefore, in the case where a simple acoustic model is used in the preliminary selection, there is a possibility that a word included in a speech utterance is not selected. If a word included in a speech utterance is missed in the preliminary selection, the matching process is not performed for that missed word, and thus the resultant speech recognition becomes wrong.
One technique of avoiding the above problem is to reduce the rigidity of the acoustic and/or linguistic criteria used in the preliminary selection so that a greater number of words are selected. Another technique is to employ a high-precision acoustic model and/or grammar.
However, if the rigidity of the acoustic and/or linguistic criteria used in the preliminary selection is reduced, it becomes necessary to perform the matching process on a large number of words having a low possibility of being selected in the final speech recognition result. Because the matching process needs a greater amount of calculation and a greater memory space per word than the preliminary selection process, this results in great increases in the amount of calculation and the memory space needed to perform the matching process.
On the other hand, if a high-precision acoustic model and/or grammar is employed in the preliminary selection, the result is a great increase in the resources needed in the preliminary selection.