This invention relates to a speech recognition apparatus which recognizes speech signals.
A speech recognition apparatus is known which can recognize, with high reliability, several hundred words spoken by a specific person using word template matching. The apparatus compares feature parameter patterns of the input speech with previously registered reference parameter patterns. Where an apparatus using word template matching is applied to speaker-independent speech recognition or to the recognition of thousands of words, it is extremely difficult to change vocabularies and to collect the large amount of data needed to constitute word reference patterns. Accordingly, it is desired that a speech recognition apparatus be provided which recognizes phonetic units, for example phonemes, to obtain phoneme sequences, and which performs symbolic pattern matching of each phoneme sequence thus obtained against a symbolically constructed lexicon, thereby recognizing each spoken word. Where the phoneme is used as the phonetic unit for speech recognition, the apparatus can theoretically recognize the speech of any person based on the recognized phoneme string, which is obtained using 20 to 30 phonemes. It is therefore extremely important for the apparatus to effect speech analysis and phoneme recognition with high accuracy.
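By way of illustration only, the symbolic matching of a recognized phoneme sequence against a lexicon may be sketched as follows. The text above does not specify a matching measure; the Levenshtein (edit) distance used here is an assumption, and the lexicon entries, phoneme symbols, and function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical symbolic lexicon: each word is stored as a phoneme string,
# so adding a word needs no new acoustic reference pattern.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "cap": ["k", "ae", "p"],
    "dog": ["d", "ao", "g"],
}


def recognize_word(phonemes, lexicon=LEXICON):
    """Return the lexicon word whose phoneme string is closest to the input."""
    return min(lexicon, key=lambda word: edit_distance(phonemes, lexicon[word]))
```

In such a sketch the vocabulary is changed simply by editing the lexicon, whereas the quality of the result depends entirely on the accuracy of the phoneme sequence supplied to `recognize_word`.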
There are two groups of phonemes: vowels and consonants. A vowel is relatively stable and its duration is long. Its feature hardly changes with time and may appear clearly in a frequency spectrum. By contrast, a consonant changes quickly, and its feature may appear clearly in a dynamic pattern of the frequency spectrum. In the known apparatus, input speech is analyzed for each frame, and the acoustic parameter patterns obtained for each frame, such as frequency spectra, are used as phoneme pattern vectors to recognize the input speech. The known apparatus can therefore recognize vowels with high accuracy, but cannot accurately recognize consonants.
Further, a method for recognizing both vowels and consonants is known. In this method, input speech is analyzed to provide a frequency spectrum pattern for each frame, and the frequency spectrum patterns for two or more frames are used together as a phoneme pattern vector. In this case, however, the order of the phoneme pattern vector, that is, its number of dimensions, is increased, and a large number of calculations must be performed in order to recognize phonemes. The number of necessary calculations becomes very large, particularly when statistical data processing is carried out to recognize phonemes. Thus, the above-mentioned method is not practicable.
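The growth in calculation described above may be illustrated by the following sketch. The figures of 16 spectral channels and 7 frames are assumed for illustration and are not taken from the known method, and a dense quadratic form is used merely as a stand-in for statistical data processing; the function names are hypothetical.

```python
def stack_frames(frames, center, context):
    """Concatenate the spectra of frames center-context .. center+context."""
    start, stop = center - context, center + context + 1
    if start < 0 or stop > len(frames):
        raise ValueError("context window runs past the utterance")
    vector = []
    for frame in frames[start:stop]:
        vector.extend(frame)
    return vector


def quadratic_form_multiplies(dim):
    """Multiplications needed for x^T A x with a dense dim-by-dim matrix A."""
    # dim * dim for the product A x, plus dim for the final dot product.
    return dim * dim + dim


# Assumed example: 10 frames, each a 16-channel spectrum.
frames = [[0.0] * 16 for _ in range(10)]

single = stack_frames(frames, center=5, context=0)  # one frame: 16 dimensions
multi = stack_frames(frames, center=5, context=3)   # seven frames: 112 dimensions

print(len(single), quadratic_form_multiplies(len(single)))  # 16 272
print(len(multi), quadratic_form_multiplies(len(multi)))    # 112 12656
```

Under these assumptions, a sevenfold increase in the order of the vector raises the multiplication count per quadratic form by a factor of roughly 46, since the count grows with the square of the order.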