A common problem in speech signal processing is the enhancement of a speech signal from its noisy measurement. One approach for speech enhancement based on single channel (microphone) measurements is filtering in the frequency domain applying spectral subtraction techniques, 1!, 2!. Under the assumption that the background noise is long-time stationary (in comparison with the speech) a model of the background noise is usually estimated during time intervals with non-speech activity. Then, during data frames with speech activity, this estimated noise model is used together with an estimated model of the noisy speech in order to enhance the speech. For the spectral subtraction techniques these models are traditionally given in terms of the Power Spectral Density (PSD), that is estimated using classical FFT methods.
None of the abovementioned techniques give in their basic form an output signal with satisfactory audible quality in mobile telephony applications, that is
1. non distorted speech output PA1 2. sufficient reduction of the noise level PA1 3. remaining noise without annoying artifacts
In particular, the spectral subtraction methods are known to violate 1 when 2 is fulfilled or violate 2 when 1 is fulfilled. In addition, in most cases 3 is more or less violated since the methods introduce, so called, musical noise.
The above drawbacks with the spectral subtraction methods have been known and, in the literature, several ad hoc modifications of the basic algorithms have appeared for particular speech-in-noise scenarios. However, the problem how to design a spectral subtraction method that for general scenarios fulfills 1-3 has remained unsolved.
In order to highlight the difficulties with speech enhancement from noisy data, note that the spectral subtraction methods are based on filtering using estimated models of the incoming data. If those estimated models are close to the underlying "true" models, this is a well working approach. However, due to the short time stationarity of the speech (10-40 ms) as well as the physical reality surrounding a mobile telephony application (8000 Hz sampling frequency, 0.5-2.0 s stationarity of the noise, etc.) the estimated models are likely to significantly differ from the underlying reality and, thus, result in a filtered output with low audible quality.
EP, A1, 0 588 526 describes a method in which spectral analysis is performed either with Fast Fourier Transformation (FFT) or Linear Predictive Coding (LPC).