1. Field of the Invention
The present invention relates to a method for recognizing speech.
2. Description of the Related Art
A modeling method using a subword smaller than a word, such as a triphone, is known for speech recognition. In particular, a method of modeling in detail by separating a model such as a triphone according to its context is widely used. An example triphone “SIL−a+k” indicates that, for a sound “a”, the immediately preceding sound is “SIL” (silent sound) and the immediately following sound is “k”. This achieves more detailed modeling than a phoneme “a” alone, thereby achieving a high recognition rate.
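The triphone notation described above can be sketched in a few lines of Python. This is only an illustration, not part of any cited method: the function name `to_triphones` and its parameters are hypothetical, the word is assumed to be spoken in isolation with silence on both sides, and an ASCII hyphen stands in for the “−” sign used in the labels.

```python
def to_triphones(phonemes, left="SIL", right="SIL"):
    """Return 'left-center+right' labels for a phoneme sequence.

    Hypothetical helper: `left` and `right` are the contexts assumed
    outside the sequence (silence by default).
    """
    padded = [left] + list(phonemes) + [right]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# "aka" (red) spoken in isolation, with silence before and after.
print(to_triphones(["a", "k", "a"]))
# ['SIL-a+k', 'a-k+a', 'k-a+SIL']
```

The two occurrences of “a” receive different labels, which is what allows a context-dependent model to be more detailed than a single phoneme model.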
Unfortunately, when a context-dependent model such as a triphone is used and a plurality of contexts are present (e.g., at word boundaries in continuous word recognition), the hypotheses must be extended according to the number of contexts.
FIG. 5 illustrates likelihood computation with subword series and the corresponding hypotheses according to a recognition grammar that allows repeated utterances of “white (shiro)”, “black (kuro)”, “chestnut (kuri)”, and “red (aka)” to be recognized. A subword 501 shown in FIG. 5A is represented by a triphone composed of a center phoneme and its left-context and right-context phonemes.
The subword 501 is generally modeled with a hidden Markov model (HMM) having at least one state, as shown in FIG. 5B. A hypothesis 502 corresponds to a state of the subword 501, and, in likelihood computation, a likelihood S(a, b) is computed for each hypothesis. A link 503 links the hypotheses with each other. The likelihood is computed from the output probability of the speech input signal in the HMM state of each hypothesis and the transition probability of moving between the states along the links.
According to the above-described grammar, the subword 501 depends on a plurality of contexts at the word boundaries of the respective words. Hence, hypotheses must be prepared for each of those contexts. More particularly, for the left context of the subwords at the front ends of the respective words (in FIG. 5, “*−sh+i”, “*−k+u”, “*−k+u”, and “*−a+k”), the subwords and the hypotheses must be extended to take into account “SIL” and the rear-end phonemes “o”, “o”, “i”, and “a” of the respective words. Likewise, for the right context of the subwords at the rear ends of the respective words (in FIG. 5, “r−o+*”, “r−o+*”, “r−i+*”, and “k−a+*”), the subwords and the hypotheses must be extended to take into account “SIL” and the front-end phonemes “sh”, “k”, and “a” of the respective words. FIG. 6 illustrates the subword series generated through such hypothesis extension at the word boundaries. As seen from the figure, the subwords and the hypotheses expand at the word boundaries, which increases the time needed to compute the likelihoods of the expanded hypotheses.
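The growth at the word boundaries can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not a reproduction of FIG. 6: the phoneme spellings of the four words are inferred from the subwords quoted above, each word is expanded separately (identical subwords of different words are not merged), every subword is counted once regardless of how many HMM states it has, and the names `WORDS`, `boundary_contexts`, and `expand_word` are hypothetical. An ASCII hyphen replaces the “−” sign.

```python
# Assumed phoneme spellings of the four grammar words.
WORDS = {
    "shiro": ["sh", "i", "r", "o"],
    "kuro": ["k", "u", "r", "o"],
    "kuri": ["k", "u", "r", "i"],
    "aka": ["a", "k", "a"],
}


def boundary_contexts(words):
    """Contexts possible at a word boundary when any word may follow any word."""
    left = {"SIL"} | {p[-1] for p in words.values()}   # rear-end phonemes
    right = {"SIL"} | {p[0] for p in words.values()}   # front-end phonemes
    return sorted(left), sorted(right)


def expand_word(phonemes, left_ctx, right_ctx):
    """For each position, list every triphone needed to cover all contexts."""
    last = len(phonemes) - 1
    series = []
    for i, center in enumerate(phonemes):
        lefts = left_ctx if i == 0 else [phonemes[i - 1]]
        rights = right_ctx if i == last else [phonemes[i + 1]]
        series.append([f"{l}-{center}+{r}" for l in lefts for r in rights])
    return series


left_ctx, right_ctx = boundary_contexts(WORDS)
before = sum(len(p) for p in WORDS.values())
after = sum(
    len(alternatives)
    for p in WORDS.values()
    for alternatives in expand_word(p, left_ctx, right_ctx)
)
print(left_ctx, right_ctx)  # ['SIL', 'a', 'i', 'o'] ['SIL', 'a', 'k', 'sh']
print(expand_word(WORDS["aka"], left_ctx, right_ctx)[0])
# ['SIL-a+k', 'a-a+k', 'i-a+k', 'o-a+k']
print(before, after)  # 15 39
```

Under these assumptions, 15 subwords grow to 39 once each word-initial and word-final subword is duplicated for its four possible contexts, and every added subword brings its own hypotheses whose likelihoods must be computed.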
Japanese Patent Laid-Open No. 05-224692 proposes a countermeasure against this problem. That is, by arranging subwords so as to depend only on context within the words, hypothesis extension at the word boundaries is inhibited. FIG. 7A illustrates subword series formed by using a phoneme model at the word boundaries, and FIG. 7B illustrates subword series formed by using a model that depends only on either the left or the right context at the word boundaries. While the hypothesis extension shown in FIG. 6 is inhibited by using such models at the word boundaries, using models that are less detailed at the word boundaries than in the other parts of the words results in a lower recognition rate. In view of this problem, Japanese Patent Laid-Open No. 11-045097 proposes a method for generating hypotheses by separating word boundaries from the corresponding words as word-to-word words and linking the hypotheses with each other. However, the hypotheses still expand at the word-to-word words, and this method is advantageous only when a word-to-word word is shared by a large number of words. Also, Japanese Patent Laid-Open No. 2003-208195 (corresponding to US Appl. No. 2005-075876) proposes a method for representing subwords of words with a tree structure by arranging an internal state of a context-dependent model so as to be shared by the words. However, the hypotheses still expand in the internal states that depend on the context subwords, so the hypothesis extension is not satisfactorily inhibited.
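The first countermeasure above, using less detailed models at the word boundaries, might look roughly like the following Python sketch. It is a guess at the general idea rather than the cited publication's procedure: the function `word_internal_series`, its `mode` values, and the one-sided label forms such as “sh+i” and “r-o” are hypothetical notation, and the labels actually shown in FIG. 7A and FIG. 7B may differ.

```python
def word_internal_series(phonemes, mode="phoneme"):
    """Build a subword series that never looks outside the word.

    mode="phoneme":   plain phoneme models at the word boundaries.
    mode="one_sided": boundary models that keep only the in-word context.
    Both modes and the label forms are assumptions made for illustration.
    """
    last = len(phonemes) - 1
    labels = []
    for i, center in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else None
        right = phonemes[i + 1] if i < last else None
        if left is not None and right is not None:
            labels.append(f"{left}-{center}+{right}")  # full triphone
        elif mode == "phoneme" or (left is None and right is None):
            labels.append(center)                      # context-independent
        elif left is None:
            labels.append(f"{center}+{right}")         # right context only
        else:
            labels.append(f"{left}-{center}")          # left context only
    return labels


shiro = ["sh", "i", "r", "o"]
print(word_internal_series(shiro, mode="phoneme"))
# ['sh', 'sh-i+r', 'i-r+o', 'o']
print(word_internal_series(shiro, mode="one_sided"))
# ['sh+i', 'sh-i+r', 'i-r+o', 'r-o']
```

Either way each word keeps exactly one subword per phoneme, so no extension occurs at the boundaries, but the boundary models ignore the neighboring word, which is the source of the lower recognition rate noted above.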