This invention relates to a speech-recognition system, in particular to a speech-recognition system using pattern-matching technology.
In performing speech recognition, it is extremely important, for the sake of improving the recognition performance, to extract stably and accurately the characteristics of the steady vowel portions of the input utterance. This is so for the following reasons. Among the speech sounds uttered by human beings, the steady vowel portions occupy a larger percentage of time than the transitory (non-steady) portions, i.e. the parts changing from consonants or vowels to vowels, or from vowels to consonants. Moreover, since the steady vowel portions have a relatively long duration, they undergo little variation as a result of factors such as the timing of enunciation, and their characteristics can thus be extracted stably. Therefore, a system of recognition using chiefly the characteristics of the steady vowel portions is most effective.
The technology of extracting local peaks has been proposed as one which would be effective when used in the devices of the prior art for extracting the characteristics of the steady vowel portions. This technology aims at detecting the formant positions of the steady vowel portions.
FIGS. 1A to 1C are explanatory drawings of this technology. With this technology, input speech signals are converted from analog to digital signals and are then subjected to frequency analysis by means of band-pass filters with different central frequencies (the central frequencies are numbered by assigning channel numbers k, in which k is a positive integer, to the channels corresponding to each central frequency) and to logarithmic conversion, in sequence at prescribed time intervals (hereinafter called "frames"). The frequency spectra are calculated in this way (FIG. 1A), and the spectra are normalized by subtracting from these frequency spectra the least-squares fit line of the spectra (FIG. 1B). The local peak patterns are then extracted in terms of one-bit characteristic quantities. That is, among the channels which have a normalized spectrum value larger than 0, the channels which have maximum output signal values are assigned local peak values of 1, and the remaining channels are all assigned local peak values of 0 (FIG. 1C).
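The normalization and one-bit extraction described above can be sketched for a single frame as follows. This is only an illustrative sketch and not the implementation of any prior-art device. The function names `least_squares_line` and `local_peaks` are hypothetical. The code assumes that a "maximum" means a channel whose normalized value exceeds that of its left neighbor and is not below that of its right neighbor, and that the end channels are compared against negative infinity. The text above specifies neither point.

```python
def least_squares_line(values):
    """Evaluate the least-squares fit line of `values` at each channel index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx if sxx else 0.0
    return [mean_y + slope * (x - mean_x) for x in range(n)]


def local_peaks(log_spectrum):
    """Return a one-bit local peak pattern (list of 0/1) for one frame."""
    # Normalize by subtracting the least-squares fit line (FIG. 1B).
    fit = least_squares_line(log_spectrum)
    norm = [y - f for y, f in zip(log_spectrum, fit)]

    peaks = []
    last = len(norm) - 1
    for k, v in enumerate(norm):
        # Assumption: end channels have no neighbor, so compare with -inf.
        left = norm[k - 1] if k > 0 else float("-inf")
        right = norm[k + 1] if k < last else float("-inf")
        # A channel is a local peak only if its normalized value is
        # positive and is a maximum relative to its neighbors (FIG. 1C).
        is_peak = v > 0 and v > left and v >= right
        peaks.append(1 if is_peak else 0)
    return peaks


# Hypothetical seven-channel log spectrum with two clear maxima.
print(local_peaks([1.0, 3.0, 1.0, 1.0, 4.0, 1.0, 1.0]))  # [0, 1, 0, 0, 1, 0, 0]
```

A perfectly flat spectrum yields no local peaks under this sketch, because no normalized value exceeds 0.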
Then the degrees of similarity are calculated between the local peak patterns extracted in the manner described above and reference patterns which have been prepared in advance. The similarity is calculated for each category to be recognized, and the name of the category giving the largest degree of similarity among all the categories to be recognized is output as the result of the recognition.
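The matching step can be sketched in the same hedged way. The text above does not say how the degree of similarity is computed. The sketch therefore assumes, purely for illustration, that the similarity is the number of frame and channel positions at which both the input pattern and the reference pattern have a local peak value of 1, and that the two patterns are already aligned frame by frame. The names `similarity` and `recognize` and the category names are hypothetical.

```python
def similarity(pattern, reference):
    """Count coinciding local peaks between two patterns.

    Each pattern is a list of frames; each frame is a list of 0/1 values,
    one per channel. Assumption: this overlap count stands in for the
    similarity measure, which the source does not define.
    """
    return sum(
        p & r
        for frame_p, frame_r in zip(pattern, reference)
        for p, r in zip(frame_p, frame_r)
    )


def recognize(pattern, references):
    """Return the name of the category with the largest similarity."""
    # `references` maps a category name to its reference pattern.
    return max(references, key=lambda name: similarity(pattern, references[name]))


# Hypothetical two-frame, three-channel reference patterns.
references = {
    "category_a": [[0, 1, 0], [1, 0, 0]],
    "category_b": [[0, 0, 1], [0, 0, 1]],
}
print(recognize([[0, 1, 0], [1, 0, 1]], references))  # category_a
```

If two categories tie, `max` returns the one encountered first in the dictionary. A real system would need an explicit rule for that case.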
Local peaks are characteristics by which the formant zones of the steady vowel portions are extracted, and the steady vowel portions can therefore be recognized by them with high stability.
However, it is difficult to extract stably the characteristics of the consonant portions, for example, of fricatives such as /s/ or /ch/. This is so for the following reason. The local peak technology is a technology for extracting the zones in which the normalized spectra reach their maxima. In steady vowel portions, if the formant, which is the main characteristic of the steady vowel portions, is clear and stable, then the maximum channel which corresponds to the formant can be derived stably. On the other hand, in the consonant portions such as fricatives, the formant is not clear. Therefore, the positions where the local peaks appear in the consonant portions are unstable and cannot easily be established unequivocally.
Consequently, there has been the following problem. It is difficult, by means of the local peak technology alone, to recognize and identify two utterances both of which have the same steady vowel portions, such as "ichi" (a Japanese word meaning "one") and "shichi" (a Japanese word meaning "seven"), because the characteristics of the consonant parts cannot be extracted stably by this method. As a result, this has led to a deterioration of the recognition performance.