1. Field of the Invention
The present invention relates to a speech recognition apparatus, a speech recognition method, and a recording medium. More particularly, the present invention relates to a speech recognition apparatus and a speech recognition method which are capable of reducing degradation of speech recognition accuracy, for example, in a case where an unknown word is contained in an utterance, and to a recording medium therefor.
2. Description of the Related Art
FIG. 1 shows the construction of an example of a conventional speech recognition apparatus for performing continuous speech recognition.
Speech produced by a user is input to a microphone 1. In the microphone 1, the input speech is converted into an audio signal as an electrical signal. This audio signal is supplied to an AD (Analog-to-Digital) conversion section 2. In the AD conversion section 2, the audio signal, which is an analog signal, from the microphone 1 is sampled and quantized, and is converted into audio data which is a digital signal. This audio data is supplied to a feature extraction section 3.
The feature extraction section 3 performs, for each appropriate frame, acoustic processing, such as Fourier transforming and filtering, on the audio data from the AD conversion section 2, thereby extracting features, such as, for example, MFCCs (Mel Frequency Cepstrum Coefficients), and supplies the features to a matching section 4. Alternatively, the feature extraction section 3 can extract other features, such as a spectrum, a linear prediction coefficient, a cepstrum coefficient, and a line spectrum pair.
The matching section 4 performs speech recognition of speech input to the matching section 4 (input speech) based on, for example, the continuous distribution HMM method, while referring to a sound model database 5, a dictionary database 6, and a grammar database 7 as necessary by using the features from the feature extraction section 3.
More specifically, the sound model database 5 stores therein a sound model showing acoustic features of individual sound elements and syllables in a spoken language for which speech recognition is performed. Here, since speech recognition is performed based on the continuous distribution HMM method, for the sound model, for example, HMM (Hidden Markov Model) is used. The dictionary database 6 stores therein word dictionaries in which information for the pronunciation (phonological information) of each word (vocabulary) which is the object of speech recognition is described. The grammar database 7 stores therein grammar rules (language models) for the way in which each word entered in the word dictionary of the dictionary database 6 is connected (chained). Here, as the grammar rule, for example, a rule based on context free grammar (CFG), statistical word sequencing probability (N-gram), etc., can be used.
The matching section 4 connects sound models stored in the sound model database 5 by referring to the word dictionary of the dictionary database 6, thereby forming a sound model (word model) of the word. Furthermore, the matching section 4 connects several word models by referring to the grammar rules stored in the grammar database 7, and uses the word model which is connected in that manner in order to recognize, based on the features, the speech input to the microphone 1 by the continuous distribution HMM method. That is, the matching section 4 detects a series of word models in which the score (likelihood) at which the features of the time series output by the feature extraction section 3 are observed is greatest, and outputs a word sequence corresponding to that series of word models as the speech recognition result.
More specifically, the matching section 4 accumulates the appearance probability of each feature for the word sequence corresponding to the connected word model, assumes the accumulated value as a score, and outputs the word sequence which maximizes the score as a speech recognition result.
The score calculation is generally performed by jointly evaluating an acoustic score given by the sound model stored in the sound model database 5 and a linguistic score given by the grammar rule stored in the grammar database 7.
More specifically, for example, in the case of the HMM method, the acoustic score is calculated, for each word from the acoustic models which form a word model, based on the probability at which the sequence of features output by the feature extraction section 3 is observed (appearance probability). Also, for example, in the case of a bigram, the linguistic score is determined based on the probability at which a particular word and a word immediately before that word are connected (chained). Then, the speech recognition result is determined based on a final score (hereinafter referred to as a “final score” where appropriate) obtained by jointly evaluating the acoustic score and the linguistic score for each word.
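The bigram-based linguistic score described above can be illustrated with a minimal sketch. The function names, the toy corpus, the sentence-start marker "<s>", and the add-one smoothing are assumptions made for illustration only; the text does not specify how the chaining probabilities are estimated.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count contexts and word pairs over tokenized sentences (hypothetical toy corpus)."""
    contexts, pairs = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words)  # "<s>" is an assumed sentence-start marker
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1
            pairs[(prev, cur)] += 1
    return contexts, pairs

def linguistic_score(prev, cur, contexts, pairs, vocab_size):
    """Log probability that `cur` is chained after `prev`.

    Add-one smoothing is an assumption of this sketch, used so that
    unseen word pairs still receive a finite score.
    """
    return math.log((pairs[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
```

With such a model, a word pair that was frequently observed in the training data receives a larger linguistic score than a pair that was never observed.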
Specifically, when the k-th word in a word sequence composed of N words is denoted as w_k, the acoustic score of the word w_k is denoted as A(w_k), and the linguistic score is denoted as L(w_k), the final score S of that word sequence is calculated, for example, based on the following equation:
$$S = \sum_{k=1}^{N} \bigl( A(w_k) + C_k \times L(w_k) \bigr) \qquad (1)$$
where Σ represents summation by varying k from 1 to N, and C_k represents a weight applied to the linguistic score L(w_k) of the word w_k.
The matching section 4 performs a matching process for determining, for example, the number of words N and the word sequence w1, w2, . . . , wN by which the final score shown in equation (1) is maximized, and outputs the word sequence w1, w2, . . . , wN as the speech recognition result.
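The calculation of equation (1) and the selection of the maximizing word sequence can be sketched as follows. The function names and the candidate data layout (a dictionary with the keys "words", "A", "L", and "C") are hypothetical; scores are assumed to be log-domain values so that they can be added.

```python
def final_score(acoustic, linguistic, weights):
    """Equation (1): sum over k of A(w_k) + C_k * L(w_k)."""
    if not (len(acoustic) == len(linguistic) == len(weights)):
        raise ValueError("one acoustic score, linguistic score, and weight per word")
    return sum(a + c * l for a, l, c in zip(acoustic, linguistic, weights))

def best_sequence(candidates):
    """Return the candidate word sequence whose final score is greatest.

    Each candidate is an assumed dict: {"words": [...], "A": [...], "L": [...], "C": [...]}.
    Candidates may have different lengths N.
    """
    return max(candidates, key=lambda c: final_score(c["A"], c["L"], c["C"]))
```

For example, a two-word candidate with acoustic scores -10.0 and -5.0, linguistic scores -2.0 and -1.0, and weights of 2.0 has a final score of (-10.0 - 4.0) + (-5.0 - 2.0) = -21.0.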
As a result of processing such as that described above being performed, in the speech recognition apparatus in FIG. 1, for example, when a user utters a sentence meaning “I want to go to New York”, an acoustic score and a linguistic score are given to each word which is a constituent of the utterance. When the final score obtained by jointly evaluating those scores is greatest, the sequence of those words is output as a speech recognition result.
If the acoustic score is calculated independently for every word entered in the word dictionary of the dictionary database 6, the amount of calculation is large; therefore, a method of sharing portions of the acoustic score calculation among a plurality of words may be used. That is, there is a method in which, for words of the word dictionary whose phonemes at the start are the same, a common acoustic model is used from the start phoneme up to the last phoneme that the words share, and individual acoustic models are used for the phonemes thereafter, thereby forming one tree-structured network as a whole, and an acoustic score is determined by using this network.
In this case, for example, as shown in FIG. 2, the word dictionary is formed by a network of words of a tree structure (word network), which is obtained by sequentially connecting branches corresponding to the phonemes from the start of each word which is the object of speech recognition, from a root node which is a starting point.
When the word network is formed, for the words whose phonemes at the start thereof are the same, in the manner described above, the branches from the start phoneme up to the last phoneme that the words share are used in common. In FIG. 2, an alphabetic character surrounded by slashes (/) attached to each branch indicates a phoneme, and a portion enclosed by a rectangle indicates a word. For example, for the words “I”, “ice”, “icy”, and “up”, the phoneme /A/ at the start thereof is the same and, therefore, a common branch corresponding to the phoneme /A/ is made. Also, for the words “I”, “ice”, and “icy”, since the second phoneme /I/ thereof is also the same, in addition to the start phoneme /A/, a common branch corresponding to the second phoneme /I/ is also made. Furthermore, for the words “ice” and “icy”, since the third phoneme /S/ thereof is the same, a common branch corresponding to the third phoneme /S/ thereof, in addition to the start phoneme /A/ and the second phoneme /I/, is also made.
Furthermore, for the words “be” and “beat”, since the first phoneme /B/ thereof and the second phoneme /I/ thereof are the same, common branches corresponding to the start phoneme /B/ and the second phoneme /I/ are made.
In a case where the word dictionary which forms the word network of FIG. 2 is used, the matching section 4 reads, from the sound model database 5, an acoustic model of phonemes corresponding to a series of branches extending from the root node of the word network, connects them, and calculates, based on the connected acoustic model, an acoustic score by using the series of features from the feature extraction section 3.
Consequently, the acoustic scores of the words “I”, “ice”, “icy”, and “up” are calculated in a common manner for the first phoneme /A/ thereof. Also, the acoustic scores of the words “I”, “ice”, and “icy” are calculated in a common manner for the first and second phonemes /A/ and /I/. In addition, the acoustic scores of the words “ice” and “icy” are calculated in a common manner for the first to third phonemes /A/, /I/, and /S/. For the remaining phoneme (second phoneme) /P/ of the word “up” and the remaining phoneme (fourth phoneme) /I/ of the word “icy”, the acoustic score is calculated separately.
The acoustic scores of the words “be” and “beat” are calculated in a common manner for the first and second phonemes /B/ and /I/ thereof. Then, for the remaining phoneme (third phoneme) /T/ of the word “beat”, the acoustic score is calculated separately.
Consequently, by using the word dictionary which forms the word network, the amount of calculations of acoustic scores can be greatly reduced.
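The sharing of branches can be illustrated with a minimal sketch that builds a tree-structured word network from the six words mentioned above and counts its branches. The node layout and function names are hypothetical, and the lexicon contains only the phoneme sequences stated in the text, not the whole of FIG. 2.

```python
# Phoneme sequences as described in the text for the example words.
LEXICON = {
    "I": ["A", "I"],
    "ice": ["A", "I", "S"],
    "icy": ["A", "I", "S", "I"],
    "up": ["A", "P"],
    "be": ["B", "I"],
    "beat": ["B", "I", "T"],
}

def build_word_network(lexicon):
    """Build a prefix tree: one branch per phoneme, shared by words with the same start."""
    root = {"children": {}, "words": []}
    for word, phonemes in lexicon.items():
        node = root
        for phoneme in phonemes:
            # Reuse the branch if another word already created it.
            node = node["children"].setdefault(phoneme, {"children": {}, "words": []})
        node["words"].append(word)  # the word ends at this node
    return root

def count_branches(node):
    """Number of branches (phoneme models to evaluate) below this node."""
    return sum(1 + count_branches(child) for child in node["children"].values())
```

For this lexicon, evaluating each word independently involves 16 phoneme models in total, whereas the word network contains only 8 branches, since the branches for /A/, /A/-/I/, /A/-/I/-/S/, /B/, and /B/-/I/ are each evaluated once.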
In the matching section 4, in the manner described above, when acoustic scores are calculated using a series of features on the basis of acoustic models which are connected along a series of branches (hereinafter referred to as a “path” where appropriate) extending from the root node of the word network, eventually, the end node (in FIG. 2, the end of the final branch in a case where movement occurs from the root node to the right along the branches) of the word network is reached. That is, for example, in a case where an HMM is used as an acoustic model, when acoustic scores are calculated using the series of features on the basis of the HMMs connected along the series of branches which form the path, there is a time when the acoustic score becomes large to a certain degree (hereinafter referred to as a “local maximum time” where appropriate) in the final state of the connected HMMs.
In this case, in the matching section 4, it is assumed that the region from the time of the features at the start, used for the calculation of the acoustic scores, to the local maximum time is a speech region in which a word corresponding to the path is spoken, and the word is assumed to be a candidate for a word which is a constituent of the word sequence as the speech recognition result. Then, based on the acoustic models connected along the series of the branches (path) extending from the root node of the word network, the calculations of the acoustic scores of the candidate for the word which is connected after the candidate of that word are performed again using the series of features after the local maximum time.
In the matching section 4, as a result of the above processing being repeated, a word sequence as a candidate of a large number of speech recognition results is obtained. The matching section 4 discards words with a low acoustic score among the candidates of such a large number of word sequences, that is, performs acoustic pruning, thereby selecting (leaving) only a word sequence whose acoustic score is equal to or greater than a predetermined threshold value, that is, only a word sequence which has a certain degree of certainty, from an acoustic point of view, as a speech recognition result, and the processing continues.
In addition, in the process in which a candidate of a word sequence as a speech recognition result is created while calculating the acoustic score in the manner described above, the matching section 4 calculates the linguistic score of a word which is a constituent of the candidates of the word sequence as a speech recognition result, on the basis of the grammar rule, such as N-gram, entered in the grammar database 7. Then, the matching section 4 discards words having a low linguistic score, that is, performs linguistic pruning, thereby selecting (leaving) only a word sequence whose linguistic score is equal to or greater than a predetermined threshold value, that is, only a word sequence which has a certain degree of certainty, from a linguistic point of view, as a speech recognition result, and the processing continues.
As described above, the matching section 4 calculates the acoustic score and the linguistic score of a word, and performs acoustic and linguistic pruning on the basis of the acoustic score and the linguistic score, thereby selecting one or more word sequences which seem likely as a speech recognition result. Then, by repeating the calculations of the acoustic score and the linguistic score of a word connected after the connected word sequence, eventually, one or more word sequences which have a certain degree of certainty are obtained as candidates of the speech recognition result. Then, the matching section 4 determines, from among such word sequences, a word sequence having the greatest final score, for example, as shown in equation (1), as the speech recognition result.
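The acoustic and linguistic pruning followed by the final selection can be sketched as follows. The hypothesis layout (a dictionary with the keys "words", "acoustic", and "linguistic"), the function names, and the use of a single weight for all words are assumptions of this sketch; in an actual apparatus the pruning would be applied repeatedly while the candidates are being extended.

```python
def prune(hypotheses, acoustic_threshold, linguistic_threshold):
    """Keep only hypotheses whose scores are equal to or greater than both thresholds."""
    return [
        h for h in hypotheses
        if h["acoustic"] >= acoustic_threshold      # acoustic pruning
        and h["linguistic"] >= linguistic_threshold  # linguistic pruning
    ]

def select_result(hypotheses, weight):
    """Pick the surviving hypothesis with the greatest jointly evaluated score.

    A single weight for the linguistic score is an assumption of this sketch.
    Raises ValueError if no hypothesis survived pruning.
    """
    return max(hypotheses, key=lambda h: h["acoustic"] + weight * h["linguistic"])
```

Note that thresholds which are too strict may discard the correct word sequence before the final scores are compared, so the threshold values trade accuracy against the amount of calculation.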
In the speech recognition apparatus, the number of words, as the object of speech recognition, to be entered in the word dictionary of the dictionary database 6 is limited, for example, due to the computation speed of the apparatus, the memory capacity, etc.
When the number of words as the object of speech recognition is limited, various problems occur if a user speaks a word which is not the object of speech recognition (hereinafter referred to as an “unknown word” where appropriate).
More specifically, in the matching section 4, even when an unknown word is spoken, the acoustic score of each word entered in the word dictionary is calculated using the features of the speech of the unknown word, and a word whose acoustic score is large to a certain degree is erroneously selected as a candidate of the speech recognition result of the unknown word.
As described above, when an unknown word is spoken, an error occurs at the portion of that unknown word, and furthermore, this error may cause an error at other portions.
More specifically, for example, in a case where the user utters the sentence meaning “I want to go to New York” in the manner described above, when the word corresponding to “New York” is an unknown word, an erroneous word is selected in the portion corresponding to “New York”, and it is therefore difficult to precisely determine the boundary between the unknown word corresponding to “New York” and the following word corresponding to “to”. As a result, an error occurs at the boundary between the words, and this error affects the calculation of the acoustic score of the other portions.
Specifically, in the manner described above, after an erroneous word, which is not the word corresponding to “New York”, is selected, the acoustic score of the next word is calculated using a series of features whose starting point is the end point of the series of features used for the calculation of the acoustic score of that erroneous word. Consequently, the calculation of the acoustic score is performed, for example, using the features of the end portion of the speech corresponding to “New York”, or is performed without using the features of the initial portion of the next speech corresponding to “to”. As a result, there are cases in which the acoustic score of the correct word corresponding to “to” as the speech recognition result becomes smaller than that of the other words.
In addition, in this case, even if the acoustic score of the word which was wrongly recognized as the speech recognition result does not become very large, the linguistic score of the word becomes large. As a result, there are cases in which the score obtained by jointly evaluating the acoustic score and the linguistic score of a word (hereinafter referred to as a “word score” where appropriate) becomes greater for the wrongly recognized word than for the correct word corresponding to “to” as the speech recognition result.
As described above, as a result of making a mistake in the speech recognition of the unknown word, the speech recognition of a word at a position close to the unknown word is also performed mistakenly.
As a word which is the object of speech recognition in the speech recognition apparatus, generally, for example, a word with a high appearance incidence in newspapers, novels, etc., is often selected, but there is no guarantee that a word with a low appearance incidence will not be spoken by a user. Therefore, it is necessary to somehow cope with an unknown word.
An example of a method for coping with an unknown word is one in which an unknown word, that is, a word which is not the object of speech recognition, is divided into segments, such as the sound elements which form the word or sound element sequences composed of several sound elements, and each segment is treated as a word in a pseudo manner (what is commonly called a “sub-word”, hereinafter referred to as a “pseudo-word” where appropriate) so that it is made an object of speech recognition.
Since the number of types of sound elements and sound element sequences which form words is not very large, even if such sound elements and sound element sequences are made objects of speech recognition as pseudo-words, this does not exert a very large influence on the amount of calculations and the memory capacity. In this case, the unknown word is recognized as a series of pseudo-words, and as a result, the number of unknown words apparently becomes zero.
In this case, even if not only an unknown word, but also a word entered in the word dictionary is spoken, it can be recognized as a series of pseudo-words. Whether the spoken word will be recognized as a word entered in the word dictionary or as an unknown word as a series of pseudo-words, is determined based on the score calculated for each word.
However, in a case where pseudo-words are used, since the unknown word is recognized as a series of pseudo-words, that is, as a series of sound elements or sound element sequences, the unknown word cannot be processed by using an attribute thereof. That is, for the unknown word, since, for example, the part of speech as the attribute thereof cannot be known, the grammar rule cannot be applied, causing the speech recognition accuracy to be degraded.
Also, there are some types of speech recognition apparatuses in which the word dictionary for each of a plurality of languages is prestored in the dictionary database 6, and the word dictionary is, for example, switched according to an operation by a user so that speech recognition of a plurality of languages is made possible. In this case, the words of the languages other than the language of the word dictionary which is currently used become unknown words; however, if the language, as the attribute, of the unknown word is known, it is possible to automatically switch to the word dictionary of that language, and furthermore, in this case, the word which was an unknown word can be recognized correctly.
Specifically, for example, in a case where English and French word dictionaries are stored in the dictionary database 6 and the English word dictionary is in use, if it is known that the unknown word is a French word, the word dictionary may be switched from the English dictionary to the French dictionary on the assumption that the speaker has changed to a French speaker, so that speech recognition with a higher accuracy is made possible.
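The dictionary switching described above can be sketched as follows. The class name, method names, and language codes are hypothetical, and how the language attribute of the unknown word is estimated is outside the scope of this sketch; the method simply receives the estimated language, or None when it is not known.

```python
class MultilingualRecognizer:
    """Holds one word dictionary per language and tracks which one is in use."""

    def __init__(self, dictionaries, active_language):
        if active_language not in dictionaries:
            raise KeyError(f"no word dictionary for {active_language!r}")
        self.dictionaries = dictionaries
        self.active_language = active_language

    @property
    def active_dictionary(self):
        return self.dictionaries[self.active_language]

    def on_unknown_word(self, language):
        """Switch dictionaries when the unknown word's language attribute is known.

        Returns True if the word dictionary was switched, False otherwise.
        """
        if language is None or language == self.active_language:
            return False
        if language not in self.dictionaries:
            return False  # no dictionary stored for that language
        self.active_language = language
        return True
```

For example, a recognizer created with English and French dictionaries and the English dictionary in use switches to the French dictionary when an unknown word is reported with the language attribute "fr", and stays unchanged when the language is not known.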