Speech recognition has attracted attention as a user interface which allows anyone to easily input a command by speech. Recently, speaker-independent speech recognition using a Hidden Markov Model (HMM) has become mainstream.
Speech recognition in an embedded system, in particular, has a serious problem in terms of processing speed. Speech recognition processing is divided into acoustic analysis for obtaining a speech feature parameter and a process for calculating the likelihood of each recognition target word from the feature parameter by using a decoder. When the number of recognition target words increases or continuous speech recognition is to be performed to recognize a sentence comprising a plurality of words, in particular, a long processing time is required to perform likelihood calculation by using this decoder.
As a widely used method of increasing the recognition processing speed, a technique called beam search is available. In this technique, when likelihood calculation is to be performed time-synchronously, candidates with low likelihoods are excluded at each time of calculation so that they are omitted from subsequent calculation. In general, any candidates that do not reach the value obtained by subtracting a predetermined value from the maximum likelihood within the same time range are excluded.
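The pruning rule described above can be sketched in Python as follows. This is only a minimal illustration, not the implementation of any particular recognizer: the function name `beam_prune`, the parameter `beam_width` (standing in for the "predetermined value"), the candidate labels, and the scores are all hypothetical, and the likelihoods are assumed to be log-likelihoods so that the threshold is formed by subtraction.

```python
def beam_prune(scores, beam_width):
    """Keep only candidates whose score reaches (best score - beam_width).

    scores: dict mapping a candidate label to its log-likelihood
            at the current time of calculation (hypothetical data layout).
    beam_width: the predetermined value subtracted from the maximum.
    """
    if not scores:
        return {}
    threshold = max(scores.values()) - beam_width
    # Candidates that do not reach the threshold are excluded.
    return {cand: s for cand, s in scores.items() if s >= threshold}


# Hypothetical scores for four candidates at one time of calculation.
frame_scores = {"s-a": -10.0, "s-u": -12.5, "k-a": -31.0, "t-o": -55.0}
survivors = beam_prune(frame_scores, beam_width=20.0)
print(sorted(survivors))
```

With a maximum of -10.0 and a beam width of 20.0, the threshold is -30.0, so only the first two candidates are carried into subsequent calculation.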
Other than this method, methods of decreasing the number of candidates have been studied. For example, V. Steinbiss, B. H. Tran, H. Ney, “Improvements in Beam Search”, Proceedings ICSLP, Yokohama, 1994, vol. 4, pp. 2143-2146 proposes a method of decreasing the number of candidates by setting a limitation on the number of candidates at each time of calculation.
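A limitation on the number of candidates can be illustrated with a short Python sketch. It simply keeps the highest-scoring candidates at one time of calculation; it is not taken from the cited paper, and the function name `limit_candidates`, the parameter `max_candidates`, and the example scores are hypothetical.

```python
import heapq


def limit_candidates(scores, max_candidates):
    """Keep at most max_candidates candidates with the highest scores.

    scores: dict mapping a candidate label to its log-likelihood
            (hypothetical data layout).
    """
    best = heapq.nlargest(max_candidates, scores.items(), key=lambda kv: kv[1])
    return dict(best)


# Hypothetical scores at one time of calculation.
frame_scores = {"a": -1.0, "b": -5.0, "c": -3.0, "d": -9.0}
kept = limit_candidates(frame_scores, max_candidates=2)
print(sorted(kept))
```

Unlike a likelihood threshold, this bound fixes the worst-case amount of work per time of calculation regardless of how close the candidate scores are to one another.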
In addition, Japanese Patent Application Laid-Open No. 2002-215187 (corresponding to US2002/128836A1) discloses a technique of decreasing the calculation amount, while maintaining high precision, by performing this candidate count limitation only at a word end without performing it for calculation inside the word.
Furthermore, Japanese Patent Application Laid-Open No. 2001-312293 discloses, as a method of decreasing the calculation amount by devising an acoustic model, a technique of generating a merged phoneme tree by merging similar phonemes, performing likelihood calculation based on this tree, and, when a unique solution cannot be obtained, performing collation again with the original phonemes. The same reference also discloses a technique of roughly performing likelihood calculation from the word start of a vocabulary to the Nth phoneme by using a rough acoustic model, and accurately performing likelihood calculation for the remaining phonemes by using a precise acoustic model, thereby decreasing the calculation amount.
FIG. 15 shows an example of a tree formed from recognition target words. Referring to FIG. 15, “SIL−s+a” represents a triphone with SIL (silence), s, and a respectively representing a forward phoneme, a central phoneme, and a backward phoneme.
According to Japanese Patent Application Laid-Open No. 2001-312293, calculation near a word start is performed by using a rough model to reduce tree branching, and a solution is determined later by re-collation.
In general, however, tree branching tends to occur many times near a word start. In this case, if a triphone is replaced with a rough model, e.g., a monophone independent of neighboring phonemes, at a position where forward branching often occurs, both SIL−s+u and SIL−s+a become s. As a result, there is no considerable likelihood difference at branches, and the precision of the model deteriorates.
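The loss of distinction described above can be seen in a small Python sketch that strips the context from triphone labels. The function name `to_monophone` is hypothetical, and the label format "forward-central+backward" is assumed from the notation used for FIG. 15; both an ASCII hyphen and the minus sign are accepted as the left separator.

```python
def to_monophone(triphone):
    """Return the central phoneme of a label such as "SIL-s+a".

    Assumes the notation forward-central+backward; a label without
    context (a plain monophone) is returned unchanged.
    """
    label = triphone.replace("\u2212", "-")  # accept the minus sign as well
    left, sep, rest = label.partition("-")
    if not sep:
        rest = left  # no forward context present
    return rest.partition("+")[0]


# Two distinct branches near a word start.
branches = ["SIL-s+u", "SIL-s+a"]
collapsed = {to_monophone(t) for t in branches}
print(collapsed)
```

Both triphones map to the same monophone, so a rough model of this kind cannot assign different likelihoods to the two branches.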
Assume that each reference phoneme pattern of an acoustic model is expressed by a plurality of Gaussian distributions. In this case, if a rough model with a small number of Gaussian distributions is used at a word start, the phoneme cannot be sufficiently expressed, so the likelihood deteriorates and the precision of the likelihood calculation decreases.
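For reference, a pattern expressed by a plurality of Gaussian distributions is commonly written as a weighted mixture. The formula below is the standard form of such a mixture rather than one taken from the cited references, and all symbols are assumptions introduced here: o_t is the feature parameter at time t, j indexes the reference pattern (state), M is the number of Gaussian distributions, and w, mu, and Sigma are the mixture weights, means, and covariances.

```latex
b_j(\mathbf{o}_t) = \sum_{m=1}^{M} w_{jm}\,
  \mathcal{N}\!\left(\mathbf{o}_t;\ \boldsymbol{\mu}_{jm},\ \boldsymbol{\Sigma}_{jm}\right),
\qquad \sum_{m=1}^{M} w_{jm} = 1 .
```

A rough model in this sense corresponds to a small M: fewer terms are evaluated per time of calculation, but the mixture has less freedom to fit the distribution of the phoneme's feature parameters.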
It is therefore necessary to develop another technique of reducing the amount of likelihood calculation while avoiding the above problems and maintaining the precision of the calculation.