FIG. 1 shows a typical conventional speech recognition apparatus.
The speech uttered by a user is input to a microphone 1, which then converts the input speech into speech signals as electrical signals. These speech signals are fed to an A/D (analog/digital) converter 2, which then samples and quantizes the speech signals, as analog signals, output from the microphone 1, to convert the signals into speech data as digital signals. These speech data are sent to a characteristic value extraction unit 3.
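The quantization step of the A/D converter 2 can be pictured with the following minimal sketch. The 16-bit uniform quantizer, the normalized input range of -1.0 to 1.0 and the function name `quantize` are assumptions made for illustration only; the description above does not specify them.

```python
def quantize(samples, bits=16):
    """Uniformly quantize sampled amplitudes in [-1.0, 1.0] to signed integers.

    Assumption: a plain uniform quantizer; real converters may differ.
    """
    max_code = 2 ** (bits - 1) - 1              # 32767 for 16 bits
    codes = []
    for x in samples:
        x = max(-1.0, min(1.0, x))              # clip out-of-range amplitudes
        codes.append(int(round(x * max_code)))  # map to the nearest integer code
    return codes
```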
The characteristic value extraction unit 3 acoustically processes the speech data from the A/D converter 2 frame by frame, using frames of a suitably selected length, to extract characteristic values, such as MFCCs (Mel Frequency Cepstrum Coefficients), and sends the extracted values to a matching unit 4. The characteristic value extraction unit 3 is also able to extract other characteristic values, such as a spectrum, linear prediction coefficients or line spectral pairs.
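Frame-by-frame processing of this kind can be sketched as follows. The sketch computes only a log energy per frame as a deliberately simple stand-in for a characteristic value; it is not an MFCC computation. The function names, the frame length and the hop size are hypothetical.

```python
import math

def split_into_frames(samples, frame_len, hop):
    """Return overlapping frames of frame_len samples, advancing by hop samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    """A simple per-frame characteristic value: log of the mean squared amplitude."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(energy + 1e-10)   # small floor avoids log(0) on silent frames

def extract_characteristic_values(samples, frame_len=4, hop=2):
    """One value per frame, forming a time sequence of characteristic values."""
    return [log_energy(f) for f in split_into_frames(samples, frame_len, hop)]
```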
Using the characteristic values from the characteristic value extraction unit 3, the matching unit 4 recognizes the speech input to the microphone 1 (input speech) based on, for example, the continuous distribution Hidden Markov Model method, referencing an acoustic model database 5, a dictionary database 6 and a grammar database 7 as necessary.
That is, the acoustic model database 5 stores acoustic models representing acoustic features of, for example, each phoneme or syllable in the language of the speech being recognized. Since the speech recognition here is based on the continuous distribution Hidden Markov Model method, the acoustic model used is the Hidden Markov Model. The dictionary database 6 stores a word dictionary stating the information on the pronunciation (phonemic information) of each word (vocabulary entry) being recognized. The grammar database 7 stores a set of grammatical rules (language models) stating how the words registered in the word dictionary of the dictionary database 6 are linked together. As the set of grammatical rules, rules based on a context free grammar (CFG) or on statistical word concatenation probabilities (N-gram), for example, may be used.
The matching unit 4 references the word dictionary of the dictionary database 6 and connects the acoustic models stored in the acoustic model database 5 to construct an acoustic model of each word (word model). The matching unit 4 also references the grammatical rules stored in the grammar database 7 to couple several word models and, using the so-connected word models, recognizes the speech input to the microphone 1, based on the characteristic values, in accordance with the continuous distribution Hidden Markov Model method. That is, the matching unit 4 detects the sequence of word models having the maximum score (likelihood) of observing the time sequence of characteristic values output by the characteristic value extraction unit 3, and outputs the sequence of words corresponding to that sequence of word models as the result of speech recognition.
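The construction of a word model from the word dictionary and the per-phoneme acoustic models can be sketched as below. The toy dictionary, the string placeholders standing in for Hidden Markov Models and the function name `build_word_model` are all hypothetical and serve only to show the concatenation step.

```python
# Hypothetical stand-ins: each phoneme "model" is only a label here,
# where a real system would hold a Hidden Markov Model per phoneme.
ACOUSTIC_MODELS = {p: f"HMM({p})" for p in "adeiknstu"}

# Hypothetical word dictionary: word -> phonemic information.
WORD_DICTIONARY = {
    "ni": ["n", "i"],
    "ikitai": ["i", "k", "i", "t", "a", "i"],
    "desu": ["d", "e", "s", "u"],
}

def build_word_model(word, dictionary, acoustic_models):
    """Concatenate the phoneme models named by the word's pronunciation."""
    return [acoustic_models[p] for p in dictionary[word]]

def build_sequence_model(words, dictionary, acoustic_models):
    """Couple several word models into one model for a word sequence."""
    model = []
    for word in words:
        model.extend(build_word_model(word, dictionary, acoustic_models))
    return model
```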
Specifically, the matching unit 4 accumulates the probabilities of occurrence of the respective characteristic values for a word sequence corresponding to the coupled word models. The accumulated value is the score, and the word sequence which maximizes the score is output as the result of speech recognition.
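One common way to accumulate such probabilities is a Viterbi search in the log domain. The sketch below assumes a left-to-right model whose states emit one-dimensional Gaussian values; this is a simplification chosen for illustration and is not stated in the description above. All names and parameters are hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a one-dimensional Gaussian (continuous output distribution)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_score(observations, states, log_self, log_next):
    """Best-path log score through a left-to-right model.

    states: list of (mean, var) pairs, one per state.
    The path must start in the first state and end in the last one.
    """
    neg_inf = float("-inf")
    scores = [neg_inf] * len(states)
    scores[0] = log_gaussian(observations[0], *states[0])
    for x in observations[1:]:
        new_scores = []
        for j, (mean, var) in enumerate(states):
            stay = scores[j] + log_self                            # remain in state j
            move = scores[j - 1] + log_next if j > 0 else neg_inf  # advance from j-1
            new_scores.append(max(stay, move) + log_gaussian(x, mean, var))
        scores = new_scores
    return scores[-1]
```

A sequence of characteristic values that fits the model states in order receives a higher accumulated score than one that does not, which is the property the matching relies on.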
The score is in general calculated by comprehensively evaluating an acoustic score, accorded by the acoustic model stored in the acoustic model database 5, and a language score, accorded by the set of grammatical rules stored in the grammar database 7.
That is, if, for example, the Hidden Markov Model method is applied, the acoustic score is calculated, from word to word, from the acoustic models forming the word model, based on the probability of observation (probability of occurrence) of the sequence of characteristic values output by the characteristic value extraction unit 3. If a bi-gram is applied, the language score is found based on the probability of concatenation (coupling) of the word under consideration and the word directly preceding it. The result of speech recognition is finalized based on the final score obtained on comprehensive evaluation of the acoustic score and the language score for each word.
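A bi-gram language score can be sketched as follows, assuming maximum-likelihood estimates from a small hypothetical corpus and no smoothing. The function names, the sentence-start marker `<s>` and the choice of a log probability as the score are assumptions for illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count word pairs; history_counts[w] is how often w precedes another word."""
    history_counts, pair_counts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words               # "<s>" marks the sentence start
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return history_counts, pair_counts

def language_score(prev, cur, history_counts, pair_counts):
    """log P(cur | prev) by maximum likelihood; -inf for an unseen pair."""
    count = pair_counts[(prev, cur)]
    if count == 0:
        return float("-inf")
    return math.log(count / history_counts[prev])
```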
Specifically, if, for the k-th word w_k in a word sequence made up of N words, the acoustic score of the word w_k is expressed as A(w_k) and its language score as L(w_k), the final score S of the word sequence is calculated in accordance with equation (1):

$$S = \sum_{k=1}^{N} \bigl( A(w_k) + C_k \times L(w_k) \bigr) \qquad (1)$$

where Σ denotes taking the sum as k is changed from 1 to N, and C_k denotes a weighting applied to the language score L(w_k) of the word w_k.
The matching unit 4 performs matching processing of finding N and the word sequence w_1, w_2, . . . , w_N which maximize the final score shown in equation (1). This word sequence w_1, w_2, . . . , w_N is output as the result of speech recognition.
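Equation (1) and the selection of the maximizing word sequence can be written out as a short sketch. The function names and the candidate scores used when calling them are hypothetical; only the formula itself comes from the description above.

```python
def final_score(acoustic_scores, language_scores, weights):
    """S = sum over k of A(w_k) + C_k * L(w_k), following equation (1)."""
    return sum(a + c * l
               for a, l, c in zip(acoustic_scores, language_scores, weights))

def best_sequence(candidates):
    """Pick the word sequence with the maximum final score.

    candidates maps a word sequence (tuple of words) to a triple of lists:
    (acoustic scores, language scores, weights), one entry per word.
    """
    return max(candidates, key=lambda seq: final_score(*candidates[seq]))
```

Because the candidate sequences may differ in length, the maximization implicitly selects N as well as the words.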
The result of the above processing is that, if a user has uttered, e.g., “new york ni ikitai desu” (“I would like to go to New York”), the speech recognition device of FIG. 1 accords the acoustic and language scores to the respective words, such as “new york” (“New York”), “ni” (“to”), “ikitai” (“would like to go”) and “desu”. If the final score obtained on comprehensive evaluation of these scores is maximum, the word sequence “new york”, “ni”, “ikitai”, “desu” is output as the result of the speech recognition.
It should be noted that if, in the above case, five words are registered in the word dictionary of the dictionary database 6, there are 5^5 (= 3125) possible arrays of five words that can be formed from these five words. Thus, in simple terms, the matching unit 4 has to evaluate these 5^5 word sequences to determine the word sequence which is best suited to the utterance made by the user, that is, the word sequence which maximizes the final score. If the number of words registered in the word dictionary is increased, the number of possible arrays of words equals the number of words raised to the power of the number of words, so that the number of word sequences to be evaluated becomes enormous.
Moreover, since the number of words contained in the utterance is unknown, not only the word sequences made up of five words but also word sequences made up of one, two, . . . words need to be evaluated, so that the number of word sequences to be evaluated is increased further. It is therefore crucial, from the viewpoint of the volume of calculations and the memory capacity to be used, to determine efficiently which of this enormous number of word sequences is most probable as the result of the speech recognition.
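The growth in the number of candidate word sequences can be checked with a short calculation. The sketch assumes that words may repeat freely within a sequence, which is what the count of five to the fifth power for a five-word dictionary implies; the function names are hypothetical.

```python
def count_sequences(vocab_size, length):
    """Number of word sequences of exactly the given length, repeats allowed."""
    return vocab_size ** length

def count_sequences_up_to(vocab_size, max_length):
    """Number of word sequences of length 1 up to max_length."""
    return sum(vocab_size ** n for n in range(1, max_length + 1))

# Five registered words, sequences of exactly five words: 5**5 = 3125.
# Allowing every length from one to five words: 5 + 25 + 125 + 625 + 3125 = 3905.
```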
Among the methods for improving the efficiency in terms of the volume of calculations and the memory capacity, there are an acoustic truncating method, which truncates the score calculations when the acoustic score found in the course of the calculations falls below a pre-set threshold value, and a linguistic truncating method, which narrows down the words that are the object of score calculations based on the language score.
With these truncating methods, the objects of the score calculations are narrowed down based on a pre-set standard for judgment, such as the acoustic score in the course of the calculations as described above, or the language score accorded to each word, to diminish the volume of the calculations. However, if the standard for judgment is too severe, even correct results of speech recognition are truncated, causing mistaken recognition. Therefore, if a truncating method is applied, the narrowing down needs to be performed with a pre-set margin so that correct results of speech recognition are not truncated. The result is that it is difficult to diminish the volume of the calculations significantly.
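The acoustic truncation described above can be sketched as an early exit from score accumulation. The fixed absolute threshold, the use of log probabilities and the function name are assumptions made for illustration.

```python
def score_with_truncation(frame_log_probs, threshold):
    """Accumulate per-frame log probabilities for one hypothesis.

    Returns the accumulated score, or None if the running score fell
    below the threshold and the calculation was truncated.
    """
    total = 0.0
    for log_prob in frame_log_probs:
        total += log_prob
        if total < threshold:
            return None        # truncated: remaining frames are never scored
    return total
```

With the hypothetical frame scores `[-1.0, -2.0, -3.0]`, a threshold of -10.0 keeps the hypothesis, whereas a stricter threshold of -5.0 truncates it on the third frame. This is the sense in which too severe a standard can discard a correct result.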
If, in finding an acoustic score, the acoustic score is found independently for the totality of words to be calculated, the processing volume is increased. In view of this, a method has been proposed in which a portion of the calculations of the acoustic scores is used in common by plural words. As one such sharing method, it is known to use an acoustic model in common for those words in a word dictionary having the same leading phoneme, from the leading phoneme up to the last common phoneme, and to use individual acoustic models from the phoneme following the last common phoneme onward, thereby constructing a single tree-structured network, and to find the acoustic score using this network. Specifically, for the words “akita” (“autumnal field”) and “akebono” (“dawn”), whose phonemic information is [akita] and [akebono], respectively, the acoustic scores of “akita” and “akebono” can be calculated in common for the first and second phonemes a and k. For the remaining phonemes i, t and a of the word “akita” and the remaining phonemes e, b, o, n and o of the word “akebono”, the acoustic scores are calculated independently.
So, with this method, the processing volume for the acoustic score can be diminished significantly.
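The saving can be seen by building such a tree for the two example words. The nested-dictionary representation, the `<word>` end marker and the function names below are hypothetical choices made for this sketch.

```python
DICTIONARY = {
    "akita": ["a", "k", "i", "t", "a"],
    "akebono": ["a", "k", "e", "b", "o", "n", "o"],
}

def build_prefix_tree(dictionary):
    """Share one node per common leading phoneme; branch where words diverge."""
    root = {}
    for word, phonemes in dictionary.items():
        node = root
        for p in phonemes:
            node = node.setdefault(p, {})   # reuse an existing branch if present
        node["<word>"] = word               # mark the node where the word ends
    return root

def count_nodes(tree):
    """Number of phoneme nodes, i.e. acoustic score calculations needed."""
    return sum(1 + count_nodes(child)
               for key, child in tree.items() if key != "<word>")

TREE = build_prefix_tree(DICTIONARY)
# The tree holds 10 phoneme nodes, against 5 + 7 = 12 without sharing.
```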
With this method, however, it is not possible to determine the word, the acoustic score of which is being calculated, from the common portion for which the acoustic score is calculated in common. In the above example of the words “akita” and “akebono”, while the acoustic score is being calculated for the first and second phonemes a and k, it is not possible to identify whether the word being processed is “akita” or “akebono”.
In this case, for “akita”, the word being processed can be identified to be “akita” when the calculations of the acoustic score are started for the third phoneme i. Similarly, for “akebono”, the word being processed can be identified to be “akebono” when the calculations of the acoustic score are started for the third phoneme e.
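The point at which a word becomes identifiable can be computed directly from the pronunciations. The sketch below uses the two example words with the phonemic information [akita] and [akebono]; the function name and the 1-based indexing are assumptions for illustration.

```python
PRONUNCIATIONS = {
    "akita": ["a", "k", "i", "t", "a"],
    "akebono": ["a", "k", "e", "b", "o", "n", "o"],
}

def identification_index(word, dictionary):
    """1-based position of the first phoneme at which only this word remains.

    Returns None if the whole word is a prefix of another word.
    """
    phonemes = dictionary[word]
    for i in range(1, len(phonemes) + 1):
        prefix = phonemes[:i]
        sharing = [w for w, p in dictionary.items() if p[:i] == prefix]
        if sharing == [word]:
            return i
    return None
```

For both example words the function returns 3, so a language score tied to the word identity cannot be applied during the first two phonemes.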
Thus, if a part of the calculations of the acoustic score is used in common, a word cannot be identified when the calculations of its acoustic score begin, so that the language score cannot be taken into account for the word at that point. As a result, it is difficult to apply the above-mentioned linguistic truncating method before starting the calculations of the acoustic score for the word, and unnecessary calculations are performed.
Further, if a part of the calculations of the acoustic score is used in common, the above-described network of the tree structure is formed for the totality of words in a word dictionary, and hence a large memory capacity is required for holding the network.
For improving the efficiency in terms of the memory capacity and the processing volume, there is known a method of preliminarily selecting the words the acoustic score of which is to be calculated, instead of calculating the acoustic score for the totality of words in the dictionary, and of calculating the acoustic score only for the preliminarily selected words.
The method for preliminary selection is described in, for example, L. R. Bahl, S. V. De Gennaro, P. S. Gopalakrishnan and R. L. Mercer, “A Fast Approximate Acoustic Match for Large Vocabulary Speech Recognition”, IEEE Trans. Speech and Audio Proc., vol. 1, pp. 59-67, 1993.
This preliminary selection is generally performed using simpler acoustic models or a set of grammatical rules not particularly high in precision. That is, since the preliminary selection is performed for the totality of words in the word dictionary, performing it with acoustic models or a set of grammatical rules high in precision would require a large amount of resources, such as processing volume or memory capacity, to maintain real-time operation. With the preliminary selection employing a simplified acoustic model or set of grammatical rules, high-speed processing is possible with a smaller amount of resources even if a large vocabulary is to be dealt with.
In a speech recognition apparatus to which the preliminary selection is applied, it is sufficient if the matching processing is performed only for the pre-selected words, so that, even if acoustic models or a set of grammatical rules high in precision are used, the matching processing can be carried out speedily with a small amount of resources. Thus, a speech recognition apparatus performing preliminary selection is particularly useful in speech recognition for a large vocabulary.
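The two-stage arrangement can be sketched as a cheap scoring pass over every word followed by a detailed pass over the survivors. The scoring callables, the fixed number of retained words `top_k` and the function names are hypothetical; the description above does not prescribe how many words are selected.

```python
def preliminary_select(words, cheap_score, top_k):
    """Score every word with the simple model and keep the top_k best."""
    return sorted(words, key=cheap_score, reverse=True)[:top_k]

def recognize_word(words, cheap_score, detailed_score, top_k):
    """Run the costly matching only on the preliminarily selected words."""
    candidates = preliminary_select(words, cheap_score, top_k)
    return max(candidates, key=detailed_score)
```

Only `top_k` calls to the detailed scorer are made, however large the dictionary is. The sketch also shows the weakness of the scheme: a word dropped by the simple model is never seen by the precise one.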
Meanwhile, the preliminary selection is performed after an end point likely to be true has been found on completion of the matching processing employing a sequence of characteristic values for a given word, and uses the sequence of characteristic values beginning at the time point corresponding to that end point, which is now treated as a start point. That is, the preliminary selection is performed at a time point when the boundaries between words contained in the continuously uttered speech have not yet been finalized.
Therefore, if the start point or the end point of the sequence of characteristic values used in the preliminary selection is offset from the start point or the end point of the word in question, the preliminary selection is carried out using a sequence of characteristic values that contains characteristic values of the word directly preceding or directly following the word in question, or a sequence that lacks the characteristic values of the beginning or trailing portion of the word in question, that is, using what may be termed an acoustically unstable sequence of characteristic values.
Thus, in the preliminary selection employing a simple acoustic model, it may occur that a certain word contained in the speech is not selected. Such failure in selection is likely to occur for words with a small number of phonemes, such as particles or auxiliary verbs in Japanese, or articles or prepositions in English.
If the correct word has not been selected in the preliminary selection, no matching processing is carried out for that word, so that the result of speech recognition is in error.
To cope with such selection failures, there are such methods as relaxing the standard for acoustic or linguistic judgment in word selection to increase the number of selected words, and employing an acoustic model or a set of grammatical rules high in precision.
However, if, in the preliminary selection, the standard for acoustic or linguistic judgment in word selection is relaxed, a large number of words not particularly probable as the result of the speech recognition become the object of the matching processing. Since the matching processing imposes a heavier load per word than the preliminary selection, the resources necessary for the matching processing are increased significantly.
On the other hand, if an acoustic model or a set of grammatical rules high in precision is used in the preliminary selection, the resource necessary for preliminary selection is increased significantly.