This invention relates to speech recognition and more particularly to speech recognition systems capable of recognizing the words in speech of unidentified speakers.
Speech recognition systems are well known as a means for analyzing input speech signals and recognizing the content and the speaker of the speech.
One prior art speech recognition system utilizes spoken words as the input speech signal and registers in advance (pre-registration) the spectrum envelope parameters of a plurality of words spoken by a particular speaker as the reference patterns. When spoken words are inputted, the system determines, in each analysis frame, which stored reference pattern has a spectrum envelope that is in the best conformity with the spectrum envelope of each input spoken word. The closest stored word is selected as the recognized word. The speech recognition system thus has, in a sense, the function of speech pattern recognition.
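The selection step described above can be sketched in outline as follows. This is a simplified illustration only, not the patented method: the Euclidean frame-by-frame distance and the assumption that all patterns contain the same number of frames are simplifications introduced here for clarity.

```python
import numpy as np

def best_match(input_pattern, reference_patterns):
    """Sketch: select the registered word whose pre-registered
    spectrum envelope parameters lie closest (Euclidean distance,
    frame by frame) to those of the input spoken word.
    Assumes all patterns have the same number of frames."""
    distances = {
        word: float(np.linalg.norm(np.asarray(input_pattern) - np.asarray(ref)))
        for word, ref in reference_patterns.items()
    }
    # the closest stored word is selected as the recognized word
    return min(distances, key=distances.get)
```

In practice the input and reference words differ in duration, which is precisely the problem that the time normalization discussed below addresses.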
In the speech recognition equipment of the type described above, the difficulty in recognition processing varies significantly, depending upon whether the input spoken words are from a specific known speaker or from an unidentified speaker. If the speaker is an unidentified speaker, recognition becomes extremely difficult for several reasons.
First, the acoustic features of speech can generally be regarded as time frequency patterns represented by spectral envelopes having a time variant property, but this time frequency pattern changes from speaker to speaker. Furthermore, even for the same word spoken by the same speaker, the time frequency pattern changes with the time of utterance. The main cause of this change is the difference in the time changing speed of each time frequency pattern. Therefore, when the speech of a specific, known speaker is to be recognized, recognition can be accomplished satisfactorily by executing a time normalization. Time normalization extends or contracts, time-wise, either the reference pattern of the words spoken in advance by the specific speaker or the words spoken at the time of analysis. The time-wise extension or contraction is made of one signal with respect to the other so that the two signals come into coincidence with each other.
As with the reference pattern parameters described above, time sequence parameters are also pre-registered. These parameters are analyzed in each analysis frame unit over the full duration of each of a plurality of words uttered by the specific speaker. On the other hand, the word uttered at the time of analysis is analyzed in each analysis frame to extract its time sequence parameter. The patterns are collated by executing time normalization using the "DP technique", which extends or contracts the reference pattern or the currently analyzed pattern so that their time change speeds are in the best conformity with each other. This DP technique generally provides an excellent recognition ratio and is known as "DP matching".
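The DP matching described above can be sketched as the classic dynamic programming recurrence over two time sequences of spectral envelope parameters. The code below is a minimal illustration under assumed conventions (Euclidean spectral distance per frame, unit step weights); it is not the specific formulation of any particular prior art system.

```python
import numpy as np

def spectral_distance(a, b):
    """Euclidean distance between two frames of spectral envelope
    parameters, treated as spatial vectors."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def dp_match(reference, pattern):
    """DP matching sketch: minimum accumulated spectral distance
    between two parameter time sequences, allowing non-linear
    time-wise extension and contraction of one against the other."""
    n, m = len(reference), len(pattern)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = spectral_distance(reference[i - 1], pattern[j - 1])
            # each step may advance the reference, the input, or both,
            # which realizes the time-wise extension/contraction
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Recognition then selects the registered word whose reference pattern yields the smallest accumulated distance against the analyzed pattern.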
The "spectral distance" used as the measure of conformity represents the spatial distance between the time sequence parameters treated as spatial vectors. The matching of patterns by spectral distance, through the use of spatial vectors, is described in detail in Sugamura and Itakura, "Voice Information Compression in Pattern Matching Encoding", a document of the Acoustic Research Group, The Acoustic Society, S-79-08, May 1979.
If the words spoken by an unidentified speaker are to be recognized by utilizing the reference pattern of the specific speaker, the recognition ratio drops significantly. As noted above, the distribution of the spectral envelope parameter of the analyzed speech with respect to time and frequency varies from speaker to speaker and with the time of utterance. Accordingly, the only portion that can be absorbed by the DP matching technique is the change in component relating to the time change speed of the spectral envelope parameter. The portion relating to the frequency distribution resulting from the difference of speakers is not corrected.
Accordingly, a correction must be made for each speaker to compensate for the differences between speakers. Correcting the reference pattern of the specific speaker in this way can provide a high recognition ratio even for unidentified speakers.
Such a correction can be made, in principle, by the use of two processing techniques, i.e., time normalization and frequency normalization.
Of these two normalization processes, time normalization determines the correspondence, which is non-linear time-wise, that occurs even in the same spoken word. The correspondence is determined by extending and contracting the time distribution of the characteristic parameters occurring in the words spoken by both the specific and the unidentified speakers. In particular, this normalization is conducted in order to identify a mapping function that enables the analysis pattern and the reference pattern to correspond to each other on the time coordinates. The reference pattern capable of producing the mapping function that minimizes the spectral distance under the DP matching technique is then selected as the pattern in best conformity with the analysis pattern.
Frequency normalization normalizes the time frequency pattern, which changes with different speakers and with the time of utterance. It also normalizes the difference in the vocal cord waveforms, reflected in the gradient of the spectral envelope, and the difference in vocal tract length. Frequency normalization is accomplished by extending and contracting the spectral envelope along the frequency axis in order to normalize the difference between speakers with respect to the reference pattern. This technique uses the spectral distance as the measure for evaluating the analysis pattern against the reference pattern under the DP matching technique, and selects the reference pattern having the optimum time frequency pattern.
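The extension and contraction of the spectral envelope along the frequency axis can be sketched as re-sampling the envelope at linearly warped frequency positions. The linear warp factor and the interpolation scheme below are assumptions for illustration; actual frequency normalization schemes may use non-linear warps chosen by DP.

```python
import numpy as np

def frequency_normalize(envelope, alpha):
    """Sketch of frequency-axis normalization: re-sample a spectral
    envelope at frequency positions scaled by the factor `alpha`
    (alpha > 1 shifts envelope features toward lower bins, i.e.
    contraction; alpha < 1 extends them toward higher bins)."""
    envelope = np.asarray(envelope, dtype=float)
    bins = np.arange(len(envelope))        # original frequency bins
    # evaluate the envelope at the warped positions alpha * f,
    # using linear interpolation between known bins
    return np.interp(bins * alpha, bins, envelope)
```

A recognizer could search over a small range of warp factors per reference pattern and keep the factor giving the smallest spectral distance.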
Besides the techniques described above, an identification function technique that does not utilize the DP matching technique as the speech recognition means for unidentified speakers has been considered relatively effective.
The conventional speech recognition technique of the type described above for unidentified speakers involves several problems.
Processing involving both time normalization and frequency normalization on the basis of DP matching requires an enormous processing capability. Therefore, though this processing is possible in principle, it is not easy to employ this method in practice.
Although the identification function method has been implemented, approximately one month's processing time is necessary (even with the use of a large-scale computer) to preprocess only ten words.
The present invention is directed to providing speech recognition equipment for unidentified speakers which solves the problems described above. It also eliminates the adverse influences encountered due to the differences in speaker word patterns through the use of a three-dimensional, polar coordinate expression of the first to third formant frequencies extracted at the optimum analysis order. The invention also eliminates the need for pre-registration of spoken words and, hence, can drastically reduce the amount of calculation required in prior art systems.
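One way such a polar coordinate expression of the first to third formant frequencies could be formed is sketched below. The particular axis assignment and angle conventions are assumptions made here for illustration; the exact mapping used by the invention is defined in the detailed description. A plausible benefit of such a representation is that the two angles are invariant to a uniform scaling of all three formants, as occurs with differences in vocal tract length, while the radius isolates that scale.

```python
import math

def formants_to_polar(f1, f2, f3):
    """Hypothetical sketch: express the first three formant
    frequencies (Hz) as a point in three-dimensional polar
    (spherical) coordinates (r, theta, phi)."""
    r = math.sqrt(f1 ** 2 + f2 ** 2 + f3 ** 2)  # overall scale
    theta = math.acos(f3 / r)                   # polar angle from the F3 axis
    phi = math.atan2(f2, f1)                    # azimuth in the F1-F2 plane
    return r, theta, phi
```

For example, doubling all three formants doubles the radius but leaves both angles unchanged.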