This invention relates generally to apparatus for speech recognition, and more particularly to such apparatus which recognizes speech by discriminating phonemes.
In some conventional speech recognition apparatus, a spectrum analyzing portion comprising a filter bank of 29 channels converts an input speech signal into power values of consecutive frames, each having a time length of about 10 msec, so that 29-channel band power level values are obtained. A local peak extracting portion then extracts local peaks from the band power levels so as to obtain, for each frame, three values P1, P2, P3 in opposite order of frequency and another three values Pe1, Pe2, Pe3 in order of power. Meanwhile, a segmentation parameter extracting portion uses the band power information to extract, for each frame, the overall frequency range power, the slope of the spectrum, and the moments of the low-frequency and middle-frequency ranges. A segmentation portion then determines vowel periods, consonant periods and semivowel periods by using the time-dependent variation of the parameters obtained by the segmentation parameter extracting portion. A phoneme discriminating portion discriminates phonemes for the respective periods determined by the segmentation portion, using the local peaks obtained by the local peak extracting portion. Such phoneme discrimination is effected by applying the position of a local peak to a discriminant diagram stored in advance in a discriminant storing portion. Discriminant diagrams are prepared separately for vowels, consonants and semivowels, and one of them is selected by the segmentation portion. The above-mentioned P1, P2, P3 values are used for vowels and semivowels, while the Pe1, Pe2, Pe3 values are used for consonants.
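The local peak extraction described above may be sketched as follows. This is a minimal illustration only: the function names, the simple three-point peak test, and the use of ascending frequency order for P1, P2, P3 are assumptions for the sketch, not details taken from the apparatus itself (the text's "opposite order of frequency" may instead denote descending order).

```python
def local_peaks(band_powers):
    """Return indices (channel numbers) of local maxima in a list of
    29-channel band power levels, using a simple three-point test."""
    return [i for i in range(1, len(band_powers) - 1)
            if band_powers[i - 1] < band_powers[i] >= band_powers[i + 1]]


def extract_features(band_powers):
    """For one frame, pick three peaks by frequency (P1, P2, P3) and
    three peaks by power (Pe1, Pe2, Pe3), as the text describes."""
    peaks = local_peaks(band_powers)
    # P1..P3: the first three peaks counted along the frequency axis
    p_by_freq = sorted(peaks)[:3]
    # Pe1..Pe3: the three strongest peaks, in order of power
    p_by_power = sorted(peaks, key=lambda i: band_powers[i], reverse=True)[:3]
    return p_by_freq, p_by_power
```

In a full arrangement, `extract_features` would be applied once per 10 msec frame, giving the per-frame values used by the phoneme discriminating portion.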
A phoneme string producing portion puts together the frame-by-frame results of phoneme discrimination obtained by the phoneme discriminating portion for the respective segmentation periods, and assigns phoneme symbols to the respective periods. When vowels continue, as in /ao/ or /iu/, the segmentation portion cannot perform segmentation on such continued vowels, so segmentation is instead effected in accordance with the continuity of the frame-by-frame phoneme discrimination results. In this way, the input speech signal is converted into phoneme strings.
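The two operations of the phoneme string producing portion can be sketched as follows. The majority-vote rule for labelling a period and the run-length rule for splitting continued vowels are plausible assumptions for illustration; the source does not state the exact rules used.

```python
from collections import Counter
from itertools import groupby


def label_period(frame_results):
    """Assign one phoneme symbol to a segmentation period by majority
    vote over the per-frame discrimination results (an assumed rule)."""
    return Counter(frame_results).most_common(1)[0][0]


def split_continued_vowels(frame_results):
    """Split continued vowels such as /ao/ by the continuity of the
    frame results: collapse each run of identical labels into one symbol."""
    return [label for label, _ in groupby(frame_results)]
```

For example, a run of frames labelled a, a, o, o collapses into the two-vowel string /ao/, which is the segmentation-by-continuity behaviour the text describes.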
A word matching portion then compares the input phoneme strings obtained by the phoneme string producing portion with the respective items of a word dictionary by time warping matching, and outputs the dictionary item nearest to the input phoneme strings as the result of recognition.
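The word matching step can be sketched with a simple dynamic-programming string distance standing in for time warping matching. The unit insertion, deletion and substitution costs are assumptions; an actual apparatus would use costs tuned to phoneme confusions.

```python
def dp_match(input_str, dict_str, ins_cost=1, del_cost=1, sub_cost=1):
    """Dynamic-programming (Levenshtein-style) distance between two
    phoneme strings, as a stand-in for time warping matching."""
    m, n = len(input_str), len(dict_str)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if input_str[i - 1] == dict_str[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,      # delete input phoneme
                          d[i][j - 1] + ins_cost,      # insert dictionary phoneme
                          d[i - 1][j - 1] + cost)      # match or substitute
    return d[m][n]


def recognize(phoneme_string, dictionary):
    """Return the dictionary item nearest to the input phoneme string."""
    return min(dictionary, key=lambda word: dp_match(phoneme_string, dictionary[word]))
```

Here `dictionary` maps each word to its reference phoneme string, and the item with the smallest warping distance is output as the result of recognition.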
There are some phonemes which are difficult to detect by using the segmentation parameters extracted by the segmentation parameter extracting portion; this is especially so for nasal sounds, the "r" sound and semivowels. Since these phonemes have a high similarity to vowels, it is difficult to detect them by such parameters alone.
Another drawback of the conventional technique is that the phoneme recognition rate is low. In the conventional arrangement, the position of a local peak is used as a feature parameter, and phoneme recognition is effected by applying that position to the discriminant diagrams. Although such a method can be expected to have a high discrimination rate for vowels and some semivowels, there is a limit to the discrimination of consonants.