1. Field of the Invention
The present invention relates to speech recognition apparatuses, speech recognition methods and recording media and more particularly, to a speech recognition apparatus, a speech recognition method, and a recording medium which allow highly precise speech recognition to be applied to a large vocabulary.
2. Description of the Prior Art
FIG. 1 shows an example structure of a conventional speech recognition apparatus.
Speech uttered by the user is input to a microphone 1, and the microphone 1 converts the input speech to an audio signal, which is an electric signal. The audio signal is sent to an analog-to-digital (AD) conversion section 2. The AD conversion section 2 samples, quantizes, and converts the audio signal, which is an analog signal sent from the microphone 1, into audio data which is a digital signal. The audio data is sent to a feature extracting section 3.
The feature extracting section 3 applies acoustic processing to the audio data sent from the AD conversion section 2 in units of an appropriate number of frames to extract a feature amount, such as a mel frequency cepstrum coefficient (MFCC), and sends it to a matching section 4. The feature extracting section 3 can extract other feature amounts, such as spectra, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs.
The matching section 4 uses the feature amount sent from the feature extracting section 3 and refers to an acoustic-model data base 5, a dictionary data base 6, and a grammar data base 7, if necessary, to apply speech recognition, for example, by a continuous-distribution HMM method to the speech (input speech) input to the microphone 1.
More specifically, the acoustic-model data base 5 stores acoustic models indicating acoustic features of each phoneme and each syllable in a linguistic aspect of the speech to which speech recognition is applied. Since speech recognition is applied according to the continuous-distribution hidden-Markov-model (HMM) method, an HMM is, for example, used as an acoustic model. The dictionary data base 6 stores a word dictionary in which information (phoneme information) related to the pronunciation of each word (vocabulary) to be recognized is described. The grammar data base 7 stores a grammar rule (language model) which describes how each word registered in the word dictionary of the dictionary data base 6 is chained (connected). For example, the grammar rule may be a context free grammar (CFG) or a rule based on statistical word chain probabilities (N-gram).
The matching section 4 connects acoustic models stored in the acoustic-model data base 5 by referring to the word dictionary of the dictionary data base 6 to constitute word acoustic models (word models). The matching section 4 further connects several word models by referring to the grammar rule stored in the grammar data base 7, and uses the connected word models to recognize the speech input to the microphone 1, by the continuous-distribution HMM method according to feature amounts. In other words, the matching section 4 detects a series of word models having the highest of scores (likelihoods) indicating probabilities of observing the time-sequential feature amounts output from the feature extracting section 3, and outputs the word string corresponding to the series of word models as the result of speech recognition.
In other words, the matching section 4 accumulates the probability of occurrence of each feature amount for word strings corresponding to connected word models, uses an accumulated value as a score, and outputs the word string having the highest score as the result of speech recognition.
A score is generally obtained by the total evaluation of an acoustic score (hereinafter called acoustics score) given by acoustic models stored in the acoustic-model data base 5 and a linguistic score (hereinafter called language score) given by the grammar rule stored in the grammar data base 7.
More specifically, the acoustics score is calculated, for example, by the HMM method, for each word from acoustic models constituting a word model according to the probability (probability of occurrence) by which a series of feature amounts output from the feature extracting section 3 is observed. The language score is obtained, for example, by bigram, according to the probability of chaining (linking) between an aimed-at word and a word disposed immediately before the aimed-at word. The result of speech recognition is determined according to the score (hereinafter called final score) obtained from a total evaluation of the acoustics score and the language score for each word.
Specifically, the final score S of a word string formed of N words is, for example, calculated by the following expression, where wk indicates the k-th word in the word string, A(wk) indicates the acoustics score of the word wk, and L(wk) indicates the language score of the word wk.
S = Σ(A(wk) + Ck × L(wk))  (1)
Here, Σ indicates a summation obtained when k is changed from 1 to N. Ck indicates a weight applied to the language score L(wk) of the word wk. The matching section 4 performs, for example, matching processing for obtaining N which makes the final score represented by the expression (1) highest and a word string w1, w2, . . . , and wN, and outputs the word string w1, w2, . . . , and wN as the result of speech recognition.
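As a minimal sketch of expression (1), the following Python function sums the acoustics score and the weighted language score over the words of one candidate string. The function name `final_score` and all numeric values are hypothetical and chosen only for illustration; the scores are assumed to be in the log domain so that they can be added.

```python
from typing import Sequence


def final_score(acoustic: Sequence[float],
                language: Sequence[float],
                weights: Sequence[float]) -> float:
    """Expression (1): S = sum over k of (A(wk) + Ck * L(wk))."""
    if not (len(acoustic) == len(language) == len(weights)):
        raise ValueError("one acoustics score, language score and weight per word is required")
    return sum(a + c * l for a, l, c in zip(acoustic, language, weights))


# Hypothetical scores for a four-word candidate string.
acoustic_scores = [-12.0, -3.5, -9.0, -4.0]   # A(wk)
language_scores = [-2.0, -0.5, -1.5, -0.25]   # L(wk)
language_weights = [2.0, 2.0, 2.0, 2.0]       # Ck

print(final_score(acoustic_scores, language_scores, language_weights))
```

The matching processing then amounts to finding the word string for which this value is highest.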
With the above-described processing, when the user utters “New York ni ikitai desu,” the speech recognition apparatus shown in FIG. 1 calculates an acoustics score and a language score for each word, “New York,” “ni,” “ikitai,” or “desu.” When their final score obtained from a total evaluation is the highest, the word string, “New York,” “ni,” “ikitai,” and “desu,” is output as the result of speech recognition.
In the above case, when five words, “New,” “York,” “ni,” “ikitai,” and “desu,” are stored in the word dictionary of the dictionary data base 6, there are 5^5 (3,125) kinds of five-word arrangement which can be formed of these five words. Therefore, it can be said in a simple way that the matching section 4 evaluates 5^5 word strings and determines the most appropriate word string (word string having the highest final score) for the user's utterance among them. If the number of words stored in the word dictionary increases, the number of word strings formed of the words is the number of words multiplied by itself the-number-of-words times. Consequently, a huge number of word strings should be evaluated.
In addition, since the number of words included in an utterance is generally unknown, not only word strings formed of as many words as are stored in the word dictionary but also word strings formed of one word, two words, and so on should be evaluated. Therefore, the number of word strings to be evaluated becomes much larger. It is very important to efficiently determine the most likely word string among a huge number of word strings as the result of speech recognition in terms of the amount of calculation and a memory capacity to be used.
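The growth described above can be checked with a few lines of Python. This is only an illustrative count under the simple assumption that every word may appear at every position; the function names are hypothetical.

```python
def strings_of_length(vocabulary_size: int, length: int) -> int:
    """Number of word strings of exactly `length` words."""
    return vocabulary_size ** length


def strings_up_to_length(vocabulary_size: int, max_length: int) -> int:
    """Number of word strings of 1, 2, ..., `max_length` words."""
    return sum(vocabulary_size ** n for n in range(1, max_length + 1))


# Five dictionary words arranged in five positions.
print(strings_of_length(5, 5))
# Adding the shorter strings, since the number of uttered words is unknown.
print(strings_up_to_length(5, 5))
```

For the five-word dictionary, the first call gives 3,125 strings and the second gives 3,905, which shows why exhaustive evaluation quickly becomes impractical as the dictionary grows.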
To make an efficient use of the amount of calculation and the memory capacity to be used, some measures are taken such as an acoustic branch-cutting technique for stopping score calculation when an acoustics score obtained during a process for obtaining an acoustics score becomes equal to or less than a predetermined threshold, or a linguistic branch-cutting technique for reducing the number of words for which score calculation is performed, according to language scores.
According to these branch-cutting techniques, since the words for which score calculation is performed are reduced according to a predetermined determination reference (such as an acoustics score obtained during calculation, described above, and a language score given to a word), the amount of calculation is reduced. If many words are removed, namely, if a severe determination reference is used, however, even a word which is to be correctly obtained as a result of speech recognition is also removed, and erroneous recognition occurs. Therefore, in the branch-cutting techniques, word reduction needs to be performed with a margin provided to some extent so as not to remove a word which is to be correctly obtained as a result of speech recognition. Consequently, it is difficult to largely reduce the amount of calculation.
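The trade-off in acoustic branch-cutting can be sketched as follows. This is a simplified illustration, not the procedure of any particular recognizer: it assumes that an acoustics score is accumulated frame by frame in the log domain, and the function name, the per-frame values, and the thresholds are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def score_with_pruning(frame_scores: Sequence[float],
                       threshold: float) -> Tuple[Optional[float], int]:
    """Accumulate per-frame scores; stop once the running score is <= threshold.

    Returns (score, frames_used); score is None when the word was cut.
    """
    total = 0.0
    for used, s in enumerate(frame_scores, start=1):
        total += s
        if total <= threshold:
            return None, used          # branch cut: remaining frames are skipped
    return total, len(frame_scores)


# Hypothetical per-frame log scores; "wa" is assumed to be the correct word.
candidates = {
    "wa": [-1.0, -6.0, -1.0, -1.0],
    "ga": [-4.0, -4.0, -4.0, -4.0],
}

for threshold in (-10.0, -6.0):        # a lenient and a severe reference
    results = {w: score_with_pruning(f, threshold) for w, f in candidates.items()}
    print(threshold, results)
```

With the lenient threshold, “ga” is cut after three frames and “wa” survives. With the severe threshold, the correct word “wa” is also cut at its second frame, which is the erroneous-recognition case described above.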
When acoustics scores are obtained independently for all words for which score calculation is to be performed, the amount of calculation is large. Therefore, a method has been proposed for making a common use of (sharing) a part of acoustics-score calculation for a number of words. In this sharing method, a common acoustic model is applied to words stored in the word dictionary, having the same first phoneme, from the first phoneme to the last phoneme which is the same among them, and acoustic models are independently applied to the subsequent phonemes to constitute one tree-structure network as a whole and to obtain acoustics scores. More specifically, for example, the words, “akita” and “akebono,” are considered. When it is assumed that the phoneme information of “akita” is “akita” and that of “akebono” is “akebono,” the acoustics scores of the words, “akita” and “akebono,” are calculated in common for the first to second phonemes “a” and “k.” Acoustics scores are independently calculated for the remaining phonemes “i,” “t,” and “a” of the word “akita” and the remaining phonemes “e,” “b,” “o,” “n,” and “o” of the word “akebono.” Therefore, according to this method, the amount of calculation performed for acoustics scores is largely reduced.
In this method, however, when a common part is calculated (acoustics scores are calculated in common), the word for which acoustics scores are being calculated cannot be determined. In other words, in the above example of the words, “akita” and “akebono,” when acoustics scores are being calculated for the first and second phonemes “a” and “k,” it cannot be determined whether acoustics scores are calculated for the word “akita” or the word “akebono.”
In this case, as for “akita,” when the calculation of an acoustics score starts for its third phoneme, “i,” it can be determined that the word for which the calculation is being performed is “akita.” Also as for “akebono,” when the calculation of an acoustics score starts for its third phoneme, “e,” it can be determined that the word for which the calculation is being performed is “akebono.”
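The sharing of the first phonemes, and the point at which a word becomes identifiable, can be illustrated with a small prefix tree. This is a sketch under the assumption that each word is stored as a list of phoneme symbols; the dictionary layout and function names are hypothetical.

```python
from typing import Dict, List, Set


def build_tree(lexicon: Dict[str, List[str]]) -> dict:
    """Build a tree in which words sharing leading phonemes share nodes."""
    root = {"children": {}, "words": set()}
    for word, phonemes in lexicon.items():
        node = root
        for p in phonemes:
            node = node["children"].setdefault(p, {"children": {}, "words": set()})
            node["words"].add(word)    # words whose path passes through this node
    return root


def count_nodes(node: dict) -> int:
    """Number of phoneme nodes, i.e. acoustic-model instances to evaluate."""
    return sum(1 + count_nodes(child) for child in node["children"].values())


def words_sharing_prefix(root: dict, prefix: List[str]) -> Set[str]:
    """Words that are still possible after the given phonemes."""
    node = root
    for p in prefix:
        node = node["children"][p]
    return node["words"]


lexicon = {
    "akita": ["a", "k", "i", "t", "a"],
    "akebono": ["a", "k", "e", "b", "o", "n", "o"],
}
tree = build_tree(lexicon)

print(count_nodes(tree), sum(len(p) for p in lexicon.values()))
print(words_sharing_prefix(tree, ["a", "k"]))
print(words_sharing_prefix(tree, ["a", "k", "i"]))
```

The tree holds 10 phoneme nodes instead of the 12 needed when the two words are scored independently. After “a” and “k” both words are still possible, and only at the third phoneme does the set shrink to a single word.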
Therefore, when a part of acoustics-score calculation is shared, a word for which the calculation is being performed cannot be identified when the acoustics-score calculation starts. As a result, it is difficult to use the above-described linguistic branch-cutting method before the start of acoustics-score calculation. Wasteful calculation may be performed.
In addition, when a part of acoustics-score calculation is shared, the above-described tree-structure network is formed for all words stored in the word dictionary. A large memory capacity is required to hold the network. To make an efficient use of the amount of calculation and the memory capacity to be used, another technique may be taken in which acoustics scores are calculated not for all words stored in the word dictionary but only for words preliminarily selected.
Since the preliminary selection is generally applied to many words, simple acoustic models or a simple grammar rule which does not have very high precision is used in terms of a processing speed.
A method for preliminary selection is described, for example, in “A Fast Approximate Acoustic Match for Large Vocabulary Speech Recognition,” IEEE Trans. Speech and Audio Proc., vol. 1, pp. 59-67, 1993, written by L. R. Bahl, S. V. De Gennaro, P. S. Gopalakrishnan and R. L. Mercer.
The acoustics score of a word is calculated by using a series of feature amounts of speech. When the starting point or the ending point of a series of a feature amount to be used for calculation is different, an acoustics score to be obtained is also changed. This change affects the final score obtained by the expression (1), in which an acoustics score and a language score are totally evaluated.
The starting point and the ending point of the series of feature amounts corresponding to a word, namely, the boundaries (word boundaries) of words, can be obtained, for example, by a dynamic programming method. A point in the series of a feature amount is set to a candidate for a word boundary, and a score (hereinafter called a word score) obtained by totally evaluating an acoustics score and a language score is accumulated for each word in a word string, which serves as a candidate for a result of speech recognition. The candidates for word boundaries which give the highest accumulated values are stored together with the accumulated values.
When the accumulated values of word scores have been obtained, word boundaries which give the highest accumulated values, namely, the highest scores, are also obtained.
The method for obtaining word boundaries in the above way is called Viterbi decoding or one-pass decoding, and its details are described, for example, in “Voice Recognition Using Probability Model,” the Journal of the Institute of Electronics, Information and Communication Engineers, pp. 20-26, Jul. 1, 1988, written by Seiichi Nakagawa.
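The following is a deliberately simplified sketch of obtaining word boundaries by dynamic programming. It is not the HMM-based procedure itself: the feature-amount series is replaced by a string of symbols, every word is assumed to span exactly as many frames as it has letters, and the word score is a hypothetical match count. All names are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def word_score(word: str, segment: Sequence[str]) -> int:
    """Toy word score: +1 per matching frame, -1 per mismatching frame."""
    return sum(1 if a == b else -1 for a, b in zip(word, segment))


def segment_series(frames: Sequence[str],
                   vocabulary: List[str]) -> Optional[Tuple[List[str], List[int], float]]:
    """Return (words, boundaries, accumulated score) for the best segmentation."""
    T = len(frames)
    neg_inf = float("-inf")
    best = [neg_inf] * (T + 1)         # best accumulated score ending at each time
    back = [None] * (T + 1)            # (start time, word) giving that score
    best[0] = 0.0
    for end in range(1, T + 1):
        for word in vocabulary:
            start = end - len(word)
            if start < 0 or best[start] == neg_inf:
                continue
            score = best[start] + word_score(word, frames[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = (start, word)
    if back[T] is None:
        return None
    # Boundaries become known only here, by tracing back from the ending time T.
    words, boundaries, t = [], [T], T
    while t > 0:
        start, word = back[t]
        words.append(word)
        boundaries.append(start)
        t = start
    return words[::-1], boundaries[::-1], best[T]


print(segment_series("kyouwa", ["kyou", "wa", "ii"]))
```

For the toy input, the trace-back yields the words “kyou” and “wa” with boundaries at times 0, 4, and 6. Note that the candidate boundary stored at each time is only provisional until the accumulated score at the final time has been computed.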
To effectively perform the above-described preliminary selection, it is very important to determine word boundaries, that is, to determine a starting point in a series (feature-amount series) of a feature amount.
Specifically, in a feature-amount series obtained from a speech “kyouwaiitenkidesune” shown in FIG. 2(A), for example, when a correct word boundary is disposed at time t1 between “kyou” and “wa,” if time t1−1, which precedes the correct time t1, is selected as a starting point in preliminary selection for the word “wa” following the word “kyou,” not only the feature amount of the word “wa” but also the last portion of the feature amount of the word “kyou” affects the preliminary selection. If time t1+1, which follows the correct time t1, is selected as a starting point in preliminary selection for the word “wa,” the beginning portion of the feature amount of the word “wa” is not used in the preliminary selection.
In either case, if a starting point is erroneously selected, an adverse effect is given to preliminary selection and then to matching processing performed thereafter.
In FIG. 2 (also in FIG. 5 and FIG. 7, described later), time passes in a direction from the left to the right. The starting time of a speech zone is set to 0, and the ending time is set to time T.
In the dynamic programming method, described above, since final word boundaries cannot be determined until word scores (acoustics scores and language scores) have been calculated to the end of a feature-amount series, that is, to the ending time T of the speech zone in FIG. 2, it is difficult to uniquely determine word boundaries which serve as starting points in preliminary selection when the preliminary selection is performed.
To solve this issue, a technique has been proposed in which candidates for word boundaries are held until word scores have been calculated by using a feature-amount series in a speech zone.
In this technique, when a word score is calculated for the word “kyou” with the starting time 0 of the speech zone being used as a starting point, and times t1−1, t1, and t1+1 are obtained as candidates for the ending point of the utterance of the word “kyou,” for example, these three times t1−1, t1, and t1+1 are held and preliminary selection for the next word is executed with each of these times being used as a starting point.
In the preliminary selection, it is assumed that, when the time t1−1 is used as a starting point, two words “wa” and “ii” are obtained; when the time t1 is used as a starting point, one word “wa” is obtained; and when the time t1+1 is used as a starting point, two words “wa” and “ii” are obtained. It is also assumed that a word score is calculated for each of these words and results shown in FIG. 2(B) to FIG. 2(G) are obtained.
Specifically, FIG. 2(B) shows that a word score is calculated for the word “wa” with the time t1−1 being used as a starting point and time t2 is obtained as a candidate for an ending point. FIG. 2(C) shows that a word score is calculated for the word “ii” with the time t1−1 being used as a starting point and time t2+1 is obtained as a candidate for an ending point. FIG. 2(D) shows that a word score is calculated for the word “wa” with the time t1 being used as a starting point and time t2+1 is obtained as a candidate for an ending point. FIG. 2(E) shows that a word score is calculated for the word “wa” with the time t1 being used as a starting point and time t2 is obtained as a candidate for an ending point. FIG. 2(F) shows that a word score is calculated for the word “wa” with the time t1+1 being used as a starting point and time t2 is obtained as a candidate for an ending point. FIG. 2(G) shows that a word score is calculated for the word “ii” with the time t1+1 being used as a starting point and time t2+2 is obtained as a candidate for an ending point. In FIG. 2, t1−1<t1<t1+1<t2<t2+1<t2+2.
Among FIG. 2(B) to FIG. 2(G), FIG. 2(B), FIG. 2(E), and FIG. 2(F) show that the same word string, “kyou” and “wa,” is obtained as a candidate for a result of speech recognition, and that the ending point of the last word “wa” of the word string is at the time t2. Therefore, it is possible that the most appropriate case is selected among them, for example, according to the accumulated values of the word scores obtained up to the time t2 and the remaining cases are discarded.
At the current point of time, however, a correct case cannot be identified among a case selected from those shown in FIG. 2(B), FIG. 2(E), and FIG. 2(F), plus cases shown in FIG. 2(C), FIG. 2(D), and FIG. 2(G). Therefore, these four cases need to be held. Preliminary selection is again executed for these four cases.
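The reduction from six cases to four can be sketched as a merge of hypotheses that share both the word string and the ending time. The record layout, the frame indices assumed for t1 and t2, and the accumulated scores below are all hypothetical; only the word strings and the relative start and end times follow FIG. 2(B) to FIG. 2(G) as described above.

```python
from typing import Dict, List, Tuple

T1, T2 = 10, 20   # hypothetical frame indices standing in for t1 and t2

# One record per case in FIG. 2(B) to FIG. 2(G); scores are invented.
hypotheses = [
    {"fig": "B", "words": ("kyou", "wa"), "start": T1 - 1, "end": T2,     "score": -50.0},
    {"fig": "C", "words": ("kyou", "ii"), "start": T1 - 1, "end": T2 + 1, "score": -55.0},
    {"fig": "D", "words": ("kyou", "wa"), "start": T1,     "end": T2 + 1, "score": -51.0},
    {"fig": "E", "words": ("kyou", "wa"), "start": T1,     "end": T2,     "score": -48.0},
    {"fig": "F", "words": ("kyou", "wa"), "start": T1 + 1, "end": T2,     "score": -52.0},
    {"fig": "G", "words": ("kyou", "ii"), "start": T1 + 1, "end": T2 + 2, "score": -57.0},
]


def merge_hypotheses(hyps: List[dict]) -> List[dict]:
    """Keep only the best-scoring case for each (word string, ending time)."""
    best: Dict[Tuple[tuple, int], dict] = {}
    for h in hyps:
        key = (h["words"], h["end"])
        if key not in best or h["score"] > best[key]["score"]:
            best[key] = h
    return list(best.values())


kept = merge_hypotheses(hypotheses)
print(sorted(h["fig"] for h in kept))
```

With these invented scores, the case of FIG. 2(E) is kept in place of FIG. 2(B) and FIG. 2(F), and the cases of FIG. 2(C), FIG. 2(D), and FIG. 2(G) remain, so four cases still have to be carried forward.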
Therefore, in this technique, word scores need to be calculated while many word-boundary candidates are held until word-score calculation using a feature-amount series in a speech zone is finished. It is not preferred in terms of an efficient use of the amount of calculation and the memory capacity.
Also in this case, when truly correct word boundaries are held as candidates for word boundaries, the same correct word boundaries are finally obtained in principle as those obtained in a case in which the above-described dynamic programming technique is used. If a truly correct word boundary is not held as a candidate for a word boundary, a word having the word boundary as its starting point or as its ending point is erroneously recognized and, in addition, due to this erroneous recognition, a word following the word may be erroneously recognized.
In recent years, acoustic models which depend on (consider) contexts have been used. Acoustic models depending on contexts refer to acoustic models even for the same syllable (or phoneme) which have been modeled as different models according to a syllable disposed immediately before or immediately after. Therefore, for example, a syllable “a” is modeled by different acoustic models between cases in which a syllable disposed immediately before or immediately after is “ka” and “sa.”
Acoustic models depending on contexts are divided into those depending on contexts within words and those depending on contexts which extend over words.
In a case in which acoustic models depending on contexts within words are used, when a word model “kyou” is generated by coupling acoustic models “kyo” and “u,” an acoustic model “kyo” depending on the syllable “u” coming immediately thereafter (acoustic model “kyo” with the syllable “u” coming immediately thereafter being considered) is used, or an acoustic model “u” depending on the syllable “kyo” coming immediately therebefore is used.
In a case in which acoustic models depending on contexts which extend over words are used, when a word model “kyou” is generated by coupling acoustic models “kyo” and “u,” if the word coming immediately thereafter is “wa,” an acoustic model “u” depending on the first syllable “wa” of the word coming immediately thereafter is used. Acoustic models depending on contexts which extend over words are called cross-word models.
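The difference between the two kinds of context dependence can be illustrated by the model labels that would be selected for the word “kyou.” The label notation used here, left-center+right, is an assumption made for this sketch and is not taken from the description above; the function name is likewise hypothetical.

```python
from typing import List, Optional


def context_labels(syllables: List[str],
                   next_word_first: Optional[str] = None) -> List[str]:
    """Label each syllable with its left and right neighbours.

    With `next_word_first` left as None, only contexts within the word are
    used. When it is given, the last syllable also depends on the first
    syllable of the following word (a cross-word model).
    """
    labels = []
    for i, center in enumerate(syllables):
        left = syllables[i - 1] if i > 0 else None
        right = syllables[i + 1] if i + 1 < len(syllables) else next_word_first
        label = center
        if left is not None:
            label = f"{left}-{label}"
        if right is not None:
            label = f"{label}+{right}"
        labels.append(label)
    return labels


print(context_labels(["kyo", "u"]))                        # within-word contexts only
print(context_labels(["kyo", "u"], next_word_first="wa"))  # cross-word context added
```

The second call cannot be made until the following word is known, which is the difficulty that arises when preliminary selection is combined with cross-word models.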
When cross-word models are applied to speech recognition which performs preliminary selection, a relationship with a word disposed immediately before a preliminarily selected word can be taken into account, but a relationship with a word disposed immediately after the preliminarily selected word cannot be considered because the word coming immediately thereafter is not yet determined.
To solve this problem, a method has been developed in which a word which is highly likely to be disposed immediately after a preliminarily selected word is obtained in advance, and a word model is created with the relationship with the obtained word taken into account. More specifically, for example, when words “wa,” “ga,” and “no” are highly likely to be disposed immediately after the word “kyou,” the word model is generated by using acoustic models “u” depending on “wa,” “ga,” and “no,” which correspond to the last syllable of word models for the word “kyou.”
Since unnecessary contexts are always taken into account, however, this method is not desirable in terms of an efficient use of the amount of calculation and the memory capacity.
For the same reason, it is difficult to calculate the language score of a preliminarily selected word with the word disposed immediately thereafter being taken into account.
As a speech recognition method in which not only a word preceding an aimed-at word but also a word following the aimed-at word are taken into account, there has been proposed a two-pass decoding method, described, for example, in “The N-Best Algorithm: An Efficient and Exact Procedure for Finding The Most Likely Sentence Hypotheses,” Proc. ICASSP, pp. 81-84, 1990, written by R. Schwartz and Y. L. Chow.
FIG. 3 shows an outlined structure of a conventional speech recognition apparatus which executes speech recognition by the two-pass decoding method.
In FIG. 3, a matching section 41 performs, for example, the same matching processing as the matching section 4 shown in FIG. 1, and outputs a word string obtained as the result of the processing. The matching section 41 does not output only one word string serving as the final speech-recognition result among a number of word strings obtained as the results of the matching processing, but outputs a number of likely word strings as candidates for speech-recognition results.
The outputs of the matching section 41 are sent to a matching section 42. The matching section 42 performs matching processing for re-evaluating the probability of determining each word string among the number of word strings output from the matching section 41, as the speech-recognition result. In a word string output from the matching section 41 as a speech-recognition result, since a word has not only a word disposed immediately therebefore but also a word disposed immediately thereafter, the matching section 42 uses cross-word models to obtain a new acoustics score and a new language score with not only the word disposed immediately therebefore but also the word disposed immediately thereafter being taken into account. The matching section 42 determines and outputs a likely word string as the speech-recognition result according to the new acoustics score and language score of each word string among the number of word strings output from the matching section 41.
In the two-pass decoding, described above, generally, simple acoustic models, a word dictionary, and a grammar rule which do not have high precision are used in the matching section 41, which performs first matching processing, and acoustic models, a word dictionary, and a grammar rule which have high precision are used in the matching section 42, which performs subsequent matching processing. With this configuration, in the speech recognition apparatus shown in FIG. 3, the amounts of processing performed in the matching sections 41 and 42 are both reduced and a highly precise speech-recognition result is obtained.
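The division of work between the two matching sections can be sketched as N-best selection followed by re-scoring. The scoring tables below merely stand in for the low-precision and high-precision models; the candidate strings, the scores, and the function names are hypothetical.

```python
from typing import Callable, List, Tuple

WordString = Tuple[str, ...]


def first_pass(candidates: List[WordString],
               coarse_score: Callable[[WordString], float],
               n: int) -> List[WordString]:
    """Matching section 41: keep the n most likely word strings."""
    return sorted(candidates, key=coarse_score, reverse=True)[:n]


def second_pass(n_best: List[WordString],
                fine_score: Callable[[WordString], float]) -> WordString:
    """Matching section 42: re-evaluate the kept strings with better models."""
    return max(n_best, key=fine_score)


candidates = [
    ("kyou", "wa", "ii"),
    ("kyou", "ga", "ii"),
    ("kyo", "wa", "ii"),
    ("kyou", "no", "i"),
]
# Hypothetical scores from simple models (first pass) and precise models (second pass).
coarse = {candidates[0]: -11.0, candidates[1]: -10.0, candidates[2]: -15.0, candidates[3]: -30.0}
fine = {candidates[0]: -9.0, candidates[1]: -12.0, candidates[2]: -14.0, candidates[3]: -25.0}

n_best = first_pass(candidates, coarse.__getitem__, n=3)
print(n_best)
print(second_pass(n_best, fine.__getitem__))
```

In this toy run, the first pass ranks (“kyou,” “ga,” “ii”) highest, and the second pass, which evaluates only the three retained strings, selects (“kyou,” “wa,” “ii”) instead. The second pass can start only after the first pass has produced its list.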
FIG. 3 shows a two-pass-decoding speech recognition apparatus, as described above. There has also been proposed a speech-recognition apparatus which performs multi-pass decoding, in which the same matching sections are added after the matching section 42 shown in FIG. 3.
In two-pass decoding and multi-pass decoding, however, until the first matching processing has been finished, the next matching processing cannot be achieved. Therefore, a delay time measured from when a speech is input to when the final speech-recognition result is output becomes long.
To solve this problem, there has been proposed a method in which, when first matching processing has been finished for several words, subsequent matching processing is performed for the several words with cross-word models being used, and this operation is repeated for other words. The method is described, for example, in “Evaluation of a Stack Decoder on a Japanese Newspaper Dictation Task,” Onkoron, 1-R-12, pp. 141-142, 1997, written by M. Schuster.
In the speech recognition apparatuses shown in FIG. 1 and FIG. 3, when continuous speech recognition is performed, words to be recognized are limited due to the calculation speeds and the memory capacities of the apparatuses. For example, ViaVoice (trademark) GOLD, speech recognition software developed by IBM, recognizes about 42,000 words in a default condition. The user can add about 20,000 words to be recognized. Therefore, ViaVoice GOLD can recognize more than 60,000 words. Even in this condition, a great number of words, such as many proper nouns, are not to be recognized.
When only a limited number of words are to be speech-recognized, if the user utters a word (hereinafter called an unknown word, as required) which is not to be recognized, various problems occur.
Since the phoneme information of the unknown word has not been input into any used word dictionary, its acoustics score cannot be correctly calculated. In addition, since the unknown word is not handled in any used grammar rule, its language score cannot be correctly calculated either. Therefore, when a word string serving as the result of recognition of the user's speech is determined, an error occurs at the unknown word. Furthermore, this error causes another error to occur at a different portion.
Specifically, when the user utters “New York ni ikitai desu” as described above, for example, if “New York” are unknown words, the correct acoustics scores and language scores of “New York” cannot be calculated. In addition, since the correct acoustics scores of “New York” cannot be calculated, an error occurs when a word boundary between “New York” and “ni” following them is determined. The error affects the calculation of the acoustics score of another portion.
Words which are frequently used in newspapers and novels are generally selected as words to be recognized in a speech recognition apparatus. There is no guarantee, however, that the user will not utter words which are not frequently used. Therefore, it is necessary to take some measure for unknown words, or to reduce the number of unknown words as much as possible.
There is a method, for example, in which a topic which the user will talk about is presumed from the user's utterance; words to be recognized are changed according to the result of presumption, and unknown words are nominally reduced. In “Reducing the OOV rate in broadcast news speech recognition,” Proceedings of International Conference on Spoken Language Processing, 1998, written by Tomas Kemp and Alex Waibel, for example, a method is described in which a sentence data base is searched for a sentence which includes a word (known word) uttered by the user, and words included in the sentence are added to words to be recognized.
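The sentence-data-base search summarized above can be sketched in a few lines. This is only an illustration of the idea as stated in the preceding paragraph, not a reproduction of the cited method; the function name, the toy vocabulary, and the toy sentences are hypothetical.

```python
from typing import Iterable, List, Set


def expand_vocabulary(vocabulary: Iterable[str],
                      recognized_words: Iterable[str],
                      sentence_db: List[List[str]]) -> Set[str]:
    """Add every word of each sentence that contains a recognized (known) word."""
    expanded = set(vocabulary)
    recognized = set(recognized_words)
    for sentence in sentence_db:
        if recognized & set(sentence):
            expanded.update(sentence)
    return expanded


vocabulary = {"weather", "today"}
sentence_db = [
    ["weather", "forecast", "typhoon"],
    ["stock", "market"],
]

print(sorted(expand_vocabulary(vocabulary, ["weather"], sentence_db)))
```

Only the first sentence contains the recognized word, so “forecast” and “typhoon” are added while “stock” and “market” are not.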
To highly precisely presume a topic which the user will talk about from the user's utterance, however, complicated and heavy-load processing is required. In addition, when the presumption of the topic is erroneous, it is possible that many words which the user will utter are removed from words to be recognized. It is also difficult to highly precisely presume all topics which the user will talk about.
In “OOV-detection in large vocabulary system using automatically defined word-fragments as fillers,” Proceedings of the 6th European Conference on Speech Communication and Technology, 1999, written by Dietrich Klakow, Georg Rose, and Xavier Aubert, for example, a method is described in which a word which is not to be recognized is divided into fragments such as phonemes constituting the word or a phoneme string formed of several phonemes, and speech recognition is applied to the fragments serving as pseudo-words.
Since there are not so many phonemes constituting words or not so many phoneme strings, the number of unknown words nominally becomes zero when speech recognition is applied to such phonemes and phoneme strings serving as pseudo-words.
In this case, however, since each phoneme or each phoneme string serves as a unit to be recognized, when a word formed of a series of such units to be recognized is unknown, a grammar rule cannot be applied to the word. This reduces the precision of speech recognition.
In addition, in a case in which matching processing is performed after preliminary selection, when phonemes or phoneme strings are preliminarily selected as pseudo-words, if an erroneous preliminary selection of phonemes or phoneme strings occurs, the error reduces the precision of a score obtained in matching processing which is performed thereafter. The reduction of the precision of the score reduces the precision of speech recognition.
The present invention has been made in consideration of the above conditions. Accordingly, an object of the present invention is to allow highly precise, high-speed speech recognition to be applied to a large vocabulary.