This invention relates to speech recognition systems and more particularly to a noise compensation method employed in speech recognition systems.
Speech recognizers measure the similarity between segments of unknown and template speech by computing the Euclidean distance between the respective segment parameters. The Euclidean distance measure, as used here, is the sum of the squares of the differences between such parameters. In such systems, adding noise to the unknown speech, the template speech, or both causes the distance to become either too large or too small, and hence degrades the speech recognizing capability of the system.
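By way of illustration only, the distance measure described above and its sensitivity to added noise can be sketched as follows. The function name `squared_distance` and all numerical values are hypothetical and chosen for the example; following the text, the square root is omitted and the distance is taken as the sum of squared parameter differences.

```python
def squared_distance(unknown, template):
    """Sum of squared differences between two equal-length parameter vectors."""
    if len(unknown) != len(template):
        raise ValueError("parameter vectors must have the same length")
    return sum((u - t) ** 2 for u, t in zip(unknown, template))


# Hypothetical segment parameters (for example, channel energies).
unknown = [1.0, 2.0, 3.0]
template = [1.0, 2.5, 2.0]

clean = squared_distance(unknown, template)

# Noise that pushes the unknown away from the template: distance grows.
noise_up = [0.5, -0.5, 1.0]
noisy_up = squared_distance([u + n for u, n in zip(unknown, noise_up)], template)

# Noise that happens to push the unknown onto the template: distance shrinks.
noise_down = [0.0, 0.5, -1.0]
noisy_down = squared_distance([u + n for u, n in zip(unknown, noise_down)], template)

print(clean, noisy_up, noisy_down)
```

In this example the same pair of segments yields a larger distance under one noise pattern and a smaller distance under another, which is the behavior the preceding paragraph identifies as the source of recognition errors.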
As is known, speech may be represented as a sequence of frequency spectra having different power levels across the frequency spectrum associated with speech. In order to recognize speech, spectra from unknown words are compared with known spectra. The storage of the known spectra occurs in the form of templates or references. Basically, in such systems, unknown speech is processed into, for example, digital signals, and these signals are compared with stored templates indicative of different words.
By comparing the unknown speech with the stored templates, one can recognize the unknown speech and thereby assign a word to the unknown speech. Speech recognition systems are being widely investigated and eventually will enable a user to communicate with a computer or other electronic device by means of speech. A major problem for speech recognition systems in general is dealing with interfering noise, such as background noise and sounds made by the speaker which are not indicative of true words, for example lip smacking and tongue clicks. These and other environmental noise sources produce interfering spectra which prevent the recognition system from operating reliably.
In order to provide recognition with a noise background, the prior art has attempted to implement various techniques. One technique is referred to as noise masking. In this technique, one masks those parts of the spectrum which are due to noise and leaves other parts of the spectrum unchanged. In these systems both the input and template spectra are masked with respect to a spectrum made up of the maximum values of an input noise spectrum estimate and a template noise spectrum estimate. In this way, the spectral distance between input and template may be calculated as though the input and template speech signals were obtained in the same noise background. Such techniques have many disadvantages. For example, a high noise level in one spectrum can be cross-coupled so as to mask speech signals in the other spectrum.
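For purposes of illustration, the noise masking technique described above may be sketched as follows. The sketch assumes that masking a channel means raising it to the mask level when it lies below that level, which is one common reading of the technique and is not stated explicitly in the text; the function names and the channel levels are hypothetical.

```python
def noise_mask(input_noise, template_noise):
    """Mask spectrum: per-channel maximum of the two noise estimates."""
    return [max(a, b) for a, b in zip(input_noise, template_noise)]


def apply_mask(spectrum, mask):
    """Assumed masking rule: channels below the mask are raised to the mask;
    channels above the mask are left unchanged."""
    return [max(s, m) for s, m in zip(spectrum, mask)]


# Hypothetical three-channel levels (arbitrary units).
input_spectrum = [10.0, 40.0, 35.0]
template_spectrum = [12.0, 30.0, 20.0]
input_noise = [5.0, 5.0, 33.0]      # strong noise in the third input channel
template_noise = [8.0, 6.0, 4.0]

mask = noise_mask(input_noise, template_noise)
masked_input = apply_mask(input_spectrum, mask)
masked_template = apply_mask(template_spectrum, mask)

print(mask)
print(masked_input)
print(masked_template)
```

In the third channel the template level of 20.0 is replaced by the mask level of 33.0, even though the template itself was recorded with little noise in that channel. This is the cross-coupling disadvantage noted above: noise in the input spectrum masks speech information in the template spectrum.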
These systems require extensive mathematical computation and are therefore extremely expensive, yet remain relatively unreliable. In other techniques proposed in the prior art, one measures the instantaneous signal-to-noise ratio and replaces the noisy distance with a predetermined constant. This substitution has the effect of ignoring information in those frequency intervals where the signal-to-noise ratio is poor. This in turn creates other problems, in that the speech recognition system may ignore unknown speech segments by mistaking them for noise, or may match a template to a dissimilar unknown speech segment. Hence the above noted approach produces many undesirable errors.
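The constant-substitution technique described above may be sketched, by way of example only, as follows. The sketch assumes that levels are expressed in decibels so that the signal-to-noise ratio of a channel is the difference between the unknown level and the noise level, and that the substitution is made channel by channel; the threshold `snr_floor_db`, the value `constant`, and the function name are hypothetical and do not appear in the prior art description.

```python
def distance_with_constant(unknown, template, noise,
                           snr_floor_db=3.0, constant=1.0):
    """Sum of per-channel squared differences, except that channels with
    poor signal-to-noise ratio contribute a fixed constant instead."""
    total = 0.0
    for u, t, n in zip(unknown, template, noise):
        snr_db = u - n  # assumed: levels are in dB, so SNR is a difference
        if snr_db < snr_floor_db:
            total += constant        # channel information is discarded
        else:
            total += (u - t) ** 2    # ordinary squared difference
    return total


# Hypothetical three-channel levels in dB.
unknown = [40.0, 20.0, 35.0]
template = [38.0, 30.0, 35.0]
noise = [10.0, 19.0, 34.0]          # channels two and three are noisy

plain = sum((u - t) ** 2 for u, t in zip(unknown, template))
substituted = distance_with_constant(unknown, template, noise)

print(plain, substituted)
```

Here the second channel, which differs strongly from the template, and the third channel, which matches it exactly, both contribute the same constant. The substituted distance therefore carries no information from those channels, which illustrates how a dissimilar segment may be matched or a genuine speech segment may be disregarded.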
It is, therefore, an object of the present invention to provide an improved speech recognition system whereby the noisy distance is replaced with its expected value.
It is a further object to provide a speech recognition system which will reduce the above noted problems associated with prior art systems.
As will be explained, the system according to this invention replaces the noisy distance with its expected value. In this manner incorrect low scores are increased and incorrect high scores are decreased. The procedures according to this invention require neither operator intervention nor empirically determined thresholds. The system can be used with any set of speech parameters and is relatively independent of a specific speech recognition apparatus structure.