1. Technical Field
The invention relates to a method and a means for performing robust feature extraction for speech recognition in a noisy environment.
2. Description of Related Art
In speech recognition, a noisy environment is a major obstacle to accurate recognition. Noise of many different types affects speech recognition and may degrade recognition accuracy drastically.
Speech recognition is becoming increasingly important, especially in mobile telephony and in access systems that grant access after recognising a spoken password. In these areas, the most problematic types of noise are additive stationary or non-stationary background noise. Another type of noise that degrades recognition accuracy is the influence of the frequency characteristics of a transmission channel, if the speech to be recognised is transmitted via such a channel. Additive noise may consist of background noise in combination with noise generated on a transmission line.
It is therefore known from the prior art to provide so-called linear or non-linear spectral subtraction. Spectral subtraction is a noise suppression technique that reduces the effects of additive noise on speech. It estimates the magnitude or power spectrum of clean speech by explicitly subtracting the noise magnitude or power spectrum from the noisy magnitude or power spectrum. The technique was developed for enhancing speech in various communication situations.
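The subtraction described above can be illustrated with a minimal sketch. It assumes that the noisy power spectrum and a noise power estimate are already available as per-bin values; the function name `spectral_subtraction` and the `floor` parameter, which keeps the estimate from becoming negative, are illustrative assumptions and are not taken from the cited prior art.

```python
def spectral_subtraction(noisy_power, noise_power, floor=0.0):
    """Estimate the clean-speech power spectrum by subtracting a noise estimate.

    noisy_power, noise_power: per-bin power values of equal length.
    floor: hypothetical lower bound (an assumption of this sketch) that
           prevents negative power where the noise estimate exceeds the input.
    """
    if len(noisy_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    # Subtract bin by bin and clamp the result at the floor.
    return [max(y - n, floor) for y, n in zip(noisy_power, noise_power)]


noisy = [4.0, 9.0, 1.0]   # noisy power spectrum (made-up values)
noise = [1.0, 2.0, 3.0]   # noise power estimated during a speech pause
print(spectral_subtraction(noisy, noise))  # [3.0, 7.0, 0.0]
```

In the third bin the noise estimate is larger than the noisy input, so the result is clamped to the floor. Bins of this kind are one source of the residual noise discussed in this section.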
Because spectral subtraction requires the noise to be estimated during speech pauses, it also assumes that the noise characteristics change slowly, so that the noise estimate remains valid. The success of this method requires a robust endpoint or voice activity detector to separate speech from noise. Good separation of speech and noise is thus a necessary condition, but it is difficult to achieve at a low signal-to-noise ratio (SNR).
In addition, although spectral subtraction is computationally efficient, since the noise is estimated during speech pauses, and although the technique can be implemented as a pre-processing stage that leaves the other processing stages unchanged, its performance depends strongly on the noise and on how the noise is extracted. The associated problem is that even if the wide-band noise is reduced, some residual noise remains (Junqua et al.; “Robustness in automatic speech recognition”; Kluwer Academic Publishers; 1996; Section 9.2 Speech Enhancement, pages 277 ff.).
Thus, even though the above-mentioned methods may improve speech recognition, the estimation of the noise characteristics is crucial for these approaches. As mentioned above, speech-to-noise discrimination is needed to mark those segments of a speech signal that contain only noise. Such discrimination cannot be free of errors and is difficult to achieve. In addition, segments of the speech signal that contain a superposition of speech and stationary noise can be described by the superposition of corresponding distribution functions for a spectral noise component and a spectral speech component. These distribution functions overlap to a degree that depends on the SNR: the lower the SNR, the greater the overlap. In this case it therefore cannot be decided whether short-term spectra contain speech in spectral regions where the spectral magnitude of the speech is equal to or smaller than that of the noise.
The present invention provides a method and an apparatus that overcome these problems and allow more robust speech recognition in a noisy environment.
It is advantageous according to the invention that a short-term spectrum containing only noise is smoothed and that, in noisy speech segments, unreliable spectral components are interpolated from so-called reliable ones. This results in robust feature extraction, which in turn supports improved speech recognition.
It is advantageous to perform the interpolation based on at least one spectral component of an adjacent short-term spectrum and/or at least one spectral component preceding in time, since a so-called unreliable spectral component, i.e. one with a low probability of containing speech, can then be expected to be smoothed.
Improved speech recognition is achieved by using two adjacent spectral components and one preceding in time.
A further advantage according to the present invention is that the calculated probability is compared to a threshold in order to determine which spectral components have to be interpolated.
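The threshold decision and the interpolation from two adjacent components and one component preceding in time can be sketched as follows. This is only an illustration under stated assumptions: "adjacent" is taken to mean the neighbouring components in frequency within the same short-term spectrum, the interpolation is a plain unweighted average, and the names `interpolate_unreliable`, `speech_prob` and `threshold` are hypothetical. The text above does not specify the weighting or how the probability is calculated.

```python
def interpolate_unreliable(current, previous, speech_prob, threshold=0.5):
    """Replace unreliable components of `current` by an average of neighbours.

    current, previous: spectral magnitudes of the current and preceding frame.
    speech_prob: per-component probability of containing speech.
    threshold: hypothetical decision value; a component whose probability
               falls below it is treated as unreliable.
    """
    n = len(current)
    result = list(current)
    for k in range(n):
        if speech_prob[k] >= threshold:
            continue  # reliable component: keep it unchanged
        # Same component in the preceding frame (preceding in time).
        neighbours = [previous[k]]
        # Up to two adjacent components of the current frame; the original
        # (not yet interpolated) values are used on purpose.
        if k > 0:
            neighbours.append(current[k - 1])
        if k < n - 1:
            neighbours.append(current[k + 1])
        # Unweighted average: an assumption of this sketch.
        result[k] = sum(neighbours) / len(neighbours)
    return result


current = [2.0, 8.0, 4.0]
previous = [1.0, 3.0, 5.0]
speech_prob = [0.9, 0.1, 0.8]   # only the middle component is unreliable
print(interpolate_unreliable(current, previous, speech_prob))  # [2.0, 3.0, 4.0]
```

The sketch does not check whether the neighbours themselves are reliable. A fuller version would presumably restrict the average to reliable components.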
It is further advantageous to interpolate the spectral component on the basis of noiseless speech.
Performing two interpolations results in even better speech recognition.
It is advantageous according to the present invention to base the division of the short-term spectra on a MEL frequency range, as the MEL frequency range is based on the human ear.
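As an illustration of a division based on a MEL frequency range, the sketch below computes band edges that are equally spaced on the mel scale. It uses the widely used conversion m = 2595 log10(1 + f/700), which is an assumption here since the text does not state which mel formula is intended. The function names and the example values (0 to 4000 Hz, 8 bands) are likewise illustrative.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mel (common 2595*log10 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(f_low, f_high, num_bands):
    """Return num_bands + 1 band edges in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / num_bands
    return [mel_to_hz(lo + i * step) for i in range(num_bands + 1)]


# Example values (assumptions): 8 bands between 0 Hz and 4000 Hz.
edges = mel_band_edges(0.0, 4000.0, 8)
for low, high in zip(edges, edges[1:]):
    print(f"{low:8.1f} Hz .. {high:8.1f} Hz")
```

Because the bands are equally wide in mel, they become wider in Hz towards higher frequencies, which mirrors the decreasing frequency resolution of human hearing.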
Further, it is advantageous to use the method in speech recognition for controlling electronic devices, e.g. mobile phones, telephones or access systems that use speech to allow access, dialling, etc.