1. Field of the Invention
The present invention relates to a system for and methods of speech recognition which permit utterances to be recognized with high accuracy.
2. Description of the Related Art
Recently, speech recognition systems employing the hidden Markov model (HMM) have achieved success. Such a system transforms an utterance into a sequence of symbols (this transformation is referred to as vector quantization) and then models the utterance as transitions of the symbol sequence. The table that is referred to in transforming the utterance into symbols is called a phonetic segment (PS) table (hereinafter also referred to as a PS dictionary). The HMM is represented by a transition network having more than one state, in which the probability of occurrence of each symbol in each state and the inter-state transition probabilities are embedded.
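By way of illustration only, the discrete-HMM approach described above may be sketched as follows. The sketch assumes a toy two-entry codebook standing in for the symbol table, a squared Euclidean distance for the quantization, and the standard forward algorithm for scoring. All names and numerical values (quantize, forward_probability, the codebook, and the probabilities) are hypothetical and are not taken from any particular system.

```python
def quantize(frames, codebook):
    """Vector quantization: map each feature vector to the index
    of its nearest codebook entry (squared Euclidean distance)."""
    symbols = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, entry))
            for entry in codebook
        ]
        symbols.append(distances.index(min(distances)))
    return symbols


def forward_probability(symbols, initial, transition, emission):
    """Forward algorithm: probability that a discrete HMM
    generates the given symbol sequence."""
    n_states = len(initial)
    # Initialization with the first symbol.
    alpha = [initial[s] * emission[s][symbols[0]] for s in range(n_states)]
    # Recursion over the remaining symbols.
    for sym in symbols[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states))
            * emission[s][sym]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical toy data: two codebook entries, two HMM states.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.2, 0.1], [0.9, 1.2]]
symbols = quantize(frames, codebook)

initial = [1.0, 0.0]                   # always start in state 0
transition = [[0.5, 0.5], [0.0, 1.0]]  # left-to-right network
emission = [[0.9, 0.1], [0.2, 0.8]]    # P(symbol | state)
prob = forward_probability(symbols, initial, transition, emission)
```

In this sketch, the only information about each frame that reaches the HMM is the index of the nearest codebook entry; the difference between the frame and that entry is the quantization distortion discussed below.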
When the PS dictionary uses steady coefficients (for example, spectrum coefficients or cepstrum coefficients), speech events depend only on state information embedded in the HMM (there is no time relationship within one state). For this reason, differential information, such as the Δ cepstrum, is introduced. That is, a method is adopted which replaces an utterance with a symbol sequence, taking into account not only its spectrum but also its time variations. However, with a PS dictionary having a large number of dimensions, the distortion introduced by quantization inevitably becomes very great. For this reason, use is made of two or more PS dictionaries whose numbers of dimensions are decreased by dividing the parameter space (in the above example, by separating the spectrum and the time variation information).
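The use of differential information with divided dictionaries may be illustrated by the following minimal sketch. It assumes that the time variation is approximated by a central difference over neighboring frames (one of several possible definitions) and that the static and differential streams are quantized against separate codebooks. The function names and the one-dimensional toy values are hypothetical.

```python
def delta(frames):
    """Approximate the time variation of each coefficient by a
    central difference; the first and last frames are repeated
    so that every frame has two neighbors."""
    padded = [frames[0]] + frames + [frames[-1]]
    return [
        [(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
        for t in range(len(frames))
    ]


def nearest(vector, codebook):
    """Index of the codebook entry closest to the vector."""
    distances = [
        sum((v - c) ** 2 for v, c in zip(vector, entry))
        for entry in codebook
    ]
    return distances.index(min(distances))


# Hypothetical one-dimensional cepstrum track and codebooks.
cepstra = [[0.0], [1.0], [4.0]]
deltas = delta(cepstra)

static_codebook = [[0.0], [4.0]]  # dictionary for static coefficients
delta_codebook = [[0.5], [2.0]]   # dictionary for time variations

# Each frame yields one symbol per dictionary, not one joint symbol.
static_symbols = [nearest(c, static_codebook) for c in cepstra]
delta_symbols = [nearest(d, delta_codebook) for d in deltas]
```

Each dictionary here covers only its own low-dimensional subspace, which is the purpose of dividing the parameter space; the cost is that each frame is described by two symbols rather than one.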
Besides those methods, there is a method which directly quantizes a sequence of spectra (or cepstra), i.e., two-dimensional patterns; this method is called matrix quantization. Matrix quantization has, on the one hand, the advantage that speech patterns can be handled directly without approximation and, on the other hand, the drawback that quantization distortion increases. Thus, a method of decreasing the distortion by using a statistical technique at the time of quantization has been proposed.
However, even if those methods are used, the distortion introduced by quantizing an utterance still remains great. Thus, a means of further decreasing the distortion is desired. To solve the distortion problem, it suffices to express a speech spectrum (or cepstrum) directly within the HMM without replacing it with symbols (i.e., without quantizing). Such a method is called "continuous HMM" as opposed to "discrete HMM," which involves quantization. In general, the continuous HMM requires a huge amount of computation. The reason is that a covariance matrix corresponding to each state must be obtained from an input vector sequence to the HMM, and then the products of the input vectors and the covariance matrices must be calculated at the time of speech recognition.
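The source of the computational burden may be illustrated by the following sketch, which assumes that each state of the continuous HMM is modeled by a single multivariate Gaussian density (a common choice, although the density is not specified above) and that the inverse covariance matrix and the logarithm of its determinant have been precomputed during training. The function name and the values are hypothetical.

```python
import math


def gaussian_log_density(x, mean, inv_cov, log_det_cov):
    """Log density of a multivariate Gaussian with a full
    covariance matrix, given its precomputed inverse and
    log-determinant (assumed to come from training)."""
    d = len(x)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Quadratic form diff^T * inv_cov * diff: on the order of d*d
    # multiplications, repeated for every state and every frame.
    quad = sum(
        diff[i] * inv_cov[i][j] * diff[j]
        for i in range(d)
        for j in range(d)
    )
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det_cov + quad)


# Hypothetical two-dimensional state with identity covariance.
mean = [0.0, 0.0]
inv_cov = [[1.0, 0.0], [0.0, 1.0]]
log_det_cov = 0.0  # log(det(identity)) = 0

at_mean = gaussian_log_density([0.0, 0.0], mean, inv_cov, log_det_cov)
off_mean = gaussian_log_density([1.0, 0.0], mean, inv_cov, log_det_cov)
```

Under these assumptions, the quadratic form grows with the square of the feature dimension and must be evaluated for every state at every input frame, whereas a discrete HMM needs only a table lookup per state once the frame has been quantized.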
When an utterance is expressed by an HMM, a phoneme, a syllable, a word, a clause, or a sentence may be taken as its unit. Whatever the unit, it is important that an input utterance and its model agree well with each other at the time of recognition; in other words, that the distortion be as low as possible. As described above, the best approach is the continuous HMM, which directly enters into the HMM two-dimensional patterns that contain the variations of the speech spectra with time. A problem with the continuous HMM is that it is difficult to put to practical use because of the huge amount of computation required.