This invention relates to a speech recognition system and in particular to a speech recognition system using matching techniques.
A speech recognition system using a spectrum matching technique which has been proposed previously will first be described below with reference to FIG. 1, FIG. 2A and FIG. 2B.
Input speech signals D1 which have undergone analog to digital conversion are input to frequency analyzer 10. Frequency analyzer 10 analyzes the frequencies of input signals D1 by means of band pass filters with differing center frequencies (center frequency numbers are hereinafter referred to as channels) and calculates, for each fixed time interval (hereinafter referred to as a frame), frequency spectrum D2, which has undergone logarithmic conversion (see FIG. 2A). Frequency spectrum D2 is output to spectrum normalizer 11 and to voiced interval detector 12.
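The frame-by-frame analysis described above can be sketched as follows. This is only an illustration under stated assumptions: the Goertzel algorithm is swapped in as a simple stand-in for each band pass filter, frames do not overlap, and the names `goertzel_power`, `log_spectrum_frames`, `frame_len` and `floor` are hypothetical and do not appear in the system described.

```python
import math

def goertzel_power(samples, freq_hz, sample_rate_hz):
    """Signal power near freq_hz, computed with the Goertzel recurrence.

    Assumption: this stands in for one band pass filter (one channel).
    """
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate_hz)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

def log_spectrum_frames(signal, sample_rate_hz, center_freqs_hz, frame_len, floor=1e-10):
    """Return one log spectrum (a list with one value per channel) per frame.

    Assumption: frames are consecutive, non-overlapping blocks of frame_len
    samples; `floor` keeps the logarithm defined for silent channels.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        chunk = signal[start:start + frame_len]
        frames.append([
            math.log(max(goertzel_power(chunk, f, sample_rate_hz), floor))
            for f in center_freqs_hz
        ])
    return frames
```

Each inner list plays the role of frequency spectrum D2 for one frame, with the list index corresponding to the channel number.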
From the values of frequency spectrum D2, voiced interval detector 12 determines the start point and the end point of the speech, and outputs start point signal D3 and end point signal D4 to spectrum normalizer 11.
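The text does not state how the start point and end point are derived from frequency spectrum D2. As a minimal sketch, assuming a simple rule in which a frame counts as speech when its mean log channel level exceeds a fixed threshold, the detection could look like this. The function name `detect_voiced_interval` and the parameter `threshold` are hypothetical.

```python
def detect_voiced_interval(frames, threshold):
    """Return (start_frame, end_frame) of the speech, or None if none is found.

    Assumption: a frame belongs to the speech when the mean of its log
    channel values is above `threshold`. The actual rule used by voiced
    interval detector 12 is not specified in the text.
    """
    levels = [sum(frame) / len(frame) for frame in frames]
    above = [i for i, level in enumerate(levels) if level > threshold]
    if not above:
        return None
    # First and last frames above the threshold bound the interval.
    return above[0], above[-1]
```

The two returned indices correspond to start point signal D3 and end point signal D4.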
Spectrum normalizer 11 obtains the normalized spectrum by subtracting, for each of the frames from the start point to the end point, the least squares fit line for the spectrum from frequency spectrum D2 (see FIG. 2B), and outputs the result as normalized spectrum pattern D5 to spectrum similarity calculator 13.
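The subtraction of the least squares fit line for a single frame can be illustrated as below. The sketch assumes that the line is fitted over the channel number (0, 1, 2, ...) as the horizontal axis, and the function name `normalize_spectrum` is hypothetical.

```python
def normalize_spectrum(frame):
    """Subtract the least squares fit line from one frame of log channel values.

    Assumption: the line is fitted against the channel index 0..n-1.
    """
    n = len(frame)
    if n < 2:
        raise ValueError("need at least two channels to fit a line")
    mean_x = (n - 1) / 2.0
    mean_y = sum(frame) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(frame))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # What remains is the spectrum shape with overall tilt and level removed.
    return [y - (slope * i + intercept) for i, y in enumerate(frame)]
```

A frame whose values lie exactly on a straight line normalizes to all zeros, which reflects how the overall spectral tilt is removed while peaks and valleys are kept.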
The process described above is repeated for each fixed time interval from the speech start point to the speech end point.
Next, spectrum similarity calculator 13 calculates the similarity between normalized spectrum pattern D5 and each of the reference patterns which have been stored in spectrum reference pattern memory 14, and outputs the spectrum similarity D6 for each recognition category to an identifier 15.
The recognition result output by identifier 15 is the name of the category which contains the reference pattern which gives the largest similarity of all the reference patterns.
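The similarity calculation and identification steps can be sketched as follows. The similarity measure is not specified in the text, so normalized correlation (cosine similarity) between equal-length flattened patterns is assumed here purely for illustration. The names `cosine_similarity`, `category_similarities` and `identify`, and the dictionary layout of the reference patterns, are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Normalized correlation of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def category_similarities(pattern, references):
    """Best similarity per category.

    `references` maps a category name to a non-empty list of reference
    patterns, each flattened to the same length as `pattern` (assumption).
    """
    return {
        category: max(cosine_similarity(pattern, ref) for ref in refs)
        for category, refs in references.items()
    }

def identify(pattern, references):
    """Name of the category holding the reference pattern with the largest similarity."""
    scores = category_similarities(pattern, references)
    return max(scores, key=scores.get)
```

Here `category_similarities` corresponds to the output of spectrum similarity calculator 13 and `identify` to the decision made by identifier 15.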
The spectrum matching technique in the speech recognition system described above allows differences in vocal cord source characteristics produced by differences between speakers to be absorbed, and is effective in recognizing speech produced by an unspecified speaker.
The above spectrum matching technique extracts the shape of the spectrum for the whole input speech pattern and calculates its similarity with the spectrum reference pattern.
This spectrum matching technique, however, has the following problem. Consider categories whose whole patterns have similar spectrum shapes, for instance "iie" and "rei" (both words in the Japanese language). Although there is a clear difference between the positions of the formant frequencies of vowel "i" and vowel "e" within the same utterance, the distributions of the formant frequency positions of the two vowels partly overlap across utterances made on different occasions and by different speakers. Therefore, when determining the similarity of the utterance with the spectrum reference pattern, which is a standard value for the normalized spectrum information (for instance, formant frequency), it is difficult to distinguish these two vowels accurately. That is to say, recognition performance is low.
Another problem associated with the prior art spectrum matching is as follows. Consider categories whose whole patterns have similar spectrum shapes, for instance "ichi" and "ni" (the words for "one" and "two" in Japanese). Although there is a clear difference in that the former has a voiceless period between "i" and "chi" and the latter has no such voiceless period, the shapes of the spectra at the steady vowel parts are similar to each other. If the normalized spectrum output for the voiceless period before "chi" of "ichi" (the period in which the input signal level is about the same as the background noise and the normalized spectrum output is about the same as the background noise spectrum) is similar to that of "ni", it is difficult to distinguish these two words.