The present invention is in the field of audio signal processing and, particularly, in the field of speech enhancement of audio signals, so that a processed signal has speech content with improved objective or subjective speech intelligibility.
Speech enhancement is applied in different applications. A prominent application is the use of digital signal processing in hearing aids. Digital signal processing in hearing aids offers new, effective means for the rehabilitation of hearing impairment. Apart from higher acoustic signal quality, digital hearing aids allow for the implementation of specific speech processing strategies. For many of these strategies, an estimate of the speech-to-noise ratio (SNR) of the acoustical environment is desirable. Specifically, applications are considered in which complex algorithms for speech processing are optimized for specific acoustic environments, but such algorithms might fail in situations that do not meet the specific assumptions. This holds true especially for noise reduction schemes that might introduce processing artifacts in quiet environments or in situations where the SNR is below a certain threshold. An optimum choice for parameters of compression algorithms and amplification might depend on the speech-to-noise ratio, so that an adaptation of the parameter set depending on SNR estimates helps in improving the benefit. Furthermore, SNR estimates could directly be used as control parameters for noise reduction schemes, such as Wiener filtering or spectral subtraction.
Other applications are in the field of speech enhancement of movie sound. It has been found that many people have problems understanding the speech content of a movie, e.g., due to hearing impairments. In order to follow the plot of a movie, it is important to understand the relevant speech of the audio track, e.g. monologues, dialogues, announcements and narrations. People who are hard of hearing often experience that background sounds, e.g. environmental noise and music, are presented at too high a level with respect to the speech. In this case, it is desired to increase the level of the speech signals and to attenuate the background sounds or, generally, to increase the level of the speech signal with respect to the total level.
A prominent approach to speech enhancement is spectral weighting, also referred to as short-term spectral attenuation, as illustrated in FIG. 3. The output signal y[k] is computed by attenuating the sub-band signals X(ω) of the input signal x[k] depending on the noise energy within the sub-band signals.
In the following, the input signal x[k] is assumed to be an additive mixture of the desired speech signal s[k] and background noise b[k]:
$$x[k] = s[k] + b[k] \qquad (1)$$
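As an illustration of the additive model in Equation 1, the following minimal sketch builds a mixture at a chosen SNR. The function name `mix_at_snr`, the sine tone standing in for speech and the white noise standing in for the background are assumptions made for this example only; they do not come from the text.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Return x[k] = s[k] + g * b[k], with g chosen so the mixture has the requested SNR."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    g = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = g * noise
    return speech + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000.0)  # stand-in for speech (assumption)
b = rng.standard_normal(8000)                           # stand-in for background noise (assumption)
x, b_scaled = mix_at_snr(s, b, snr_db=5.0)
```

Such synthetic mixtures are convenient for testing because the true speech and noise components remain available for comparison.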
Speech enhancement is the improvement in the objective intelligibility and/or subjective quality of speech.
A frequency domain representation of the input signal is computed by means of a Short-term Fourier Transform (STFT), other time-frequency transforms or a filter bank, as indicated at 30. The input signal is then filtered in the frequency domain according to Equation 2, where the frequency response G(ω) of the filter is computed such that the noise energy is reduced. The output signal is computed by means of the inverse processing of the time-frequency transform or filter bank, respectively.
$$Y(\omega) = G(\omega)\,X(\omega) \qquad (2)$$
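The analysis, weighting and synthesis chain described above can be sketched as follows. This is only one possible realization: the frame length, the periodic Hann window with 50% overlap, the overlap-add synthesis and the callback `gain_fn` are assumptions chosen for the example, not details given in the text.

```python
import numpy as np

def spectral_weighting(x, gain_fn, frame_len=256):
    """Apply Y = G * X frame by frame and resynthesize by overlap-add."""
    x = np.asarray(x, dtype=float)
    hop = frame_len // 2
    n = np.arange(frame_len)
    # Periodic Hann window: shifted copies at 50% overlap sum to exactly one.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len)
    y = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        X = np.fft.rfft(window * x[start:start + frame_len])  # analysis
        G = gain_fn(X)                                        # spectral weights
        y[start:start + frame_len] += np.fft.irfft(G * X, n=frame_len)  # synthesis
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
# With unity gains the interior of the signal is reconstructed unchanged.
y = spectral_weighting(x, lambda X: np.ones(len(X)))
```

With unity gains, every sample covered by two overlapping frames is returned unchanged, which is a quick sanity check before plugging in an actual weighting rule.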
Appropriate spectral weights G(ω) are computed at 31 for each spectral value using the input signal spectrum X(ω) and an estimate of the noise spectrum B̂(ω) or, equivalently, using an estimate of the linear sub-band SNR R̂(ω)=Ŝ(ω)/B̂(ω). The weighted spectral values are transformed back to the time domain at 32. Prominent examples of noise suppression rules are spectral subtraction [S. Boll, “Suppression of acoustic noise in speech using spectral subtraction”, IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120, 1979] and Wiener filtering. Assuming that the input signal is an additive mixture of the speech and the noise signals and that speech and noise are uncorrelated, the gain values for the spectral subtraction method are given in Equation 3.
$$G(\omega) = 1 - \frac{|\hat{B}(\omega)|^{2}}{|X(\omega)|^{2}} \qquad (3)$$
Similar weights are derived from estimates of the linear sub-band SNR R(ω) according to Equation 4.
$$G(\omega) = \frac{\hat{R}(\omega)}{\hat{R}(\omega) + 1} \qquad (4)$$
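The two weighting rules of Equations 3 and 4 can be compared numerically with a short sketch. The function names and the clipping of negative gains to zero are assumptions added for the example; the text itself does not specify how negative values are handled. The inputs are power values, i.e. squared magnitudes, and R̂(ω) is taken as a ratio of powers.

```python
import numpy as np

def spectral_subtraction_gain(input_power, noise_power):
    """Gain in the form of Equation 3, computed from |X|^2 and the noise estimate |B|^2."""
    input_power = np.asarray(input_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    gain = 1.0 - noise_power / np.maximum(input_power, 1e-12)
    # Assumption: negative gains (noise estimate above input power) are clipped to zero.
    return np.maximum(gain, 0.0)

def snr_gain(snr_linear):
    """Gain in the form of Equation 4, computed from the linear sub-band SNR."""
    snr_linear = np.asarray(snr_linear, dtype=float)
    return snr_linear / (snr_linear + 1.0)

speech_power = np.array([3.0, 1.0, 0.0])
noise_power = np.array([1.0, 1.0, 2.0])
# For uncorrelated speech and noise, the input power is the sum of both powers.
g3 = spectral_subtraction_gain(speech_power + noise_power, noise_power)
g4 = snr_gain(speech_power / noise_power)
```

Under the stated assumption of uncorrelated, additive speech and noise, both rules give the same gains for these values, which is why the text calls the weights similar.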
Various extensions to spectral subtraction have been proposed in the past, namely the use of an oversubtraction factor and spectral floor parameter [M. Berouti, R. Schwartz, J. Makhoul, “Enhancement of speech corrupted by acoustic noise”, Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 1979], generalized forms [J. Lim, A. Oppenheim, “Enhancement and bandwidth compression of noisy speech”, Proc. of the IEEE, vol. 67, no. 12, pp. 1586-1604, 1979], the use of perceptual criteria (e.g. N. Virag, “Single channel speech enhancement based on masking properties of the human auditory system”, IEEE Trans. Speech and Audio Proc., vol. 7, no. 2, pp. 126-137, 1999) and multi-band spectral subtraction (e.g. S. Kamath, P. Loizou, “A multi-band spectral subtraction method for enhancing speech corrupted by colored noise”, Proc. of the IEEE Int. Conf. Acoust. Speech Signal Processing, 2002). However, the crucial part of a spectral weighting method is the estimation of the instantaneous noise spectrum or of the sub-band SNR, which is prone to errors especially if the noise is non-stationary. Errors of the noise estimation lead to residual noise, distortions of the speech components or musical noise (an artefact which has been described as “warbling with tonal quality” [P. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007]).
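The oversubtraction factor and spectral floor mentioned above can be illustrated with one common formulation in the power domain. The names `alpha` and `beta` and their default values are illustrative assumptions; the cited work should be consulted for the exact rule and for how the parameters are chosen.

```python
import numpy as np

def oversubtract(input_power, noise_power, alpha=2.0, beta=0.01):
    """Subtract alpha times the noise power, but never fall below beta times the noise power."""
    input_power = np.asarray(input_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    subtracted = input_power - alpha * noise_power  # oversubtraction (alpha >= 1)
    floor = beta * noise_power                      # spectral floor (0 < beta << 1)
    return np.maximum(subtracted, floor)

# First bin: enough input power, so the subtraction result is kept.
# Second bin: the subtraction would be negative, so the floor applies.
enhanced = oversubtract([4.0, 1.0], [1.0, 1.0])
```

A larger `alpha` removes more residual noise at the cost of more speech distortion, while `beta` keeps a small noise floor that tends to mask isolated spectral peaks.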
A simple approach to noise estimation is to measure and average the noise spectrum during speech pauses. This approach does not yield satisfying results if the noise spectrum varies over time during speech activity or if the detection of the speech pauses fails. Methods for estimating the noise spectrum even during speech activity have been proposed in the past and can be classified according to P. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007 as minimum tracking algorithms, time-recursive averaging algorithms, and histogram-based algorithms.
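The simple pause-based approach can be sketched as follows, assuming that a speech activity decision per frame is already available. The function name and the boolean mask `speech_active` are assumptions for this example; how the pauses are detected is outside its scope.

```python
import numpy as np

def noise_from_pauses(power_frames, speech_active):
    """Average the power spectrum over all frames marked as speech pauses."""
    power_frames = np.asarray(power_frames, dtype=float)  # shape: frames x bins
    speech_active = np.asarray(speech_active, dtype=bool)
    pauses = power_frames[~speech_active]
    if len(pauses) == 0:
        raise ValueError("no speech pause available to estimate the noise")
    return pauses.mean(axis=0)

frames = np.array([[1.0, 2.0],   # pause
                   [9.0, 8.0],   # speech
                   [3.0, 4.0]])  # pause
vad = np.array([False, True, False])
estimate = noise_from_pauses(frames, vad)
```

The estimate is only updated from frames without speech, which is exactly why it cannot follow noise that changes while speech is active.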
The estimation of the noise spectrum using minimum statistics has been proposed in R. Martin, “Spectral subtraction based on minimum statistics”, Proc. of EUSIPCO, Edinburgh, UK, 1994. The method is based on the tracking of local minima of the signal energy in each sub-band. A non-linear update rule for the noise estimate and faster updating have been proposed in G. Doblinger, “Computationally Efficient Speech Enhancement By Spectral Minima Tracking In Subbands”, Proc. of Eurospeech, Madrid, Spain, 1995.
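The idea of tracking local minima of the sub-band energy can be sketched in a strongly simplified form. This is not the cited minimum statistics algorithm: the first-order smoothing, the fixed search window and the absence of any bias compensation are simplifying assumptions made for the example.

```python
import numpy as np

def minimum_tracking(power_frames, window=50, alpha=0.8):
    """Noise estimate per sub-band: minimum of the smoothed power over the last `window` frames."""
    power_frames = np.asarray(power_frames, dtype=float)  # shape: frames x bins
    smoothed = np.empty_like(power_frames)
    noise = np.empty_like(power_frames)
    smoothed[0] = power_frames[0]
    # First-order recursive smoothing of the sub-band power.
    for t in range(1, len(power_frames)):
        smoothed[t] = alpha * smoothed[t - 1] + (1.0 - alpha) * power_frames[t]
    # Sliding minimum over the most recent frames.
    for t in range(len(power_frames)):
        first = max(0, t - window + 1)
        noise[t] = smoothed[first:t + 1].min(axis=0)
    return noise

# Synthetic example: constant noise power 1 with two 20-frame bursts of higher energy.
power = np.ones((200, 2))
power[50:70] += 9.0
power[120:140] += 9.0
noise = minimum_tracking(power)
```

In this synthetic case the estimate stays close to the noise floor during the bursts because the search window is longer than each burst; a window that is too short would let speech energy leak into the estimate.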
Time-recursive averaging algorithms estimate and update the noise spectrum whenever the estimated SNR at a particular frequency band is very low. This is done by recursively computing the weighted average of the past noise estimate and the present spectrum. The weights are determined as a function of the probability that speech is present or as a function of the estimated SNR in the particular frequency band, e.g. in I. Cohen, “Noise estimation by minima controlled recursive averaging for robust speech enhancement”, IEEE Signal Proc. Letters, vol. 9, no. 1, pp. 12-15, 2002, and in L. Lin, W. Holmes, E. Ambikairajah, “Adaptive noise estimation algorithm for speech enhancement”, Electronics Letters, vol. 39, no. 9, pp. 754-755, 2003.
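A minimal sketch of time-recursive averaging is given here, with a hard SNR threshold deciding whether a band is updated. The cited methods use soft, probability-based or SNR-dependent weights instead; the hard decision, the parameter names and their values are assumptions for illustration only.

```python
import numpy as np

def recursive_noise_estimate(power_frames, smoothing=0.9, snr_threshold=3.0):
    """Update the noise estimate per band only while the estimated SNR is low."""
    power_frames = np.asarray(power_frames, dtype=float)  # shape: frames x bins
    noise = np.empty_like(power_frames)
    noise[0] = power_frames[0]  # assumption: the first frame contains noise only
    for t in range(1, len(power_frames)):
        snr = power_frames[t] / np.maximum(noise[t - 1], 1e-12)
        # Weighted average of the past noise estimate and the present spectrum.
        candidate = smoothing * noise[t - 1] + (1.0 - smoothing) * power_frames[t]
        # Bands with a high estimated SNR keep their previous estimate.
        noise[t] = np.where(snr < snr_threshold, candidate, noise[t - 1])
    return noise

# A short high-energy burst is ignored ...
burst = np.ones((200, 2))
burst[50:70] += 9.0
noise_burst = recursive_noise_estimate(burst)

# ... while a moderate, lasting change of the noise level is followed.
step = np.ones((300, 1))
step[100:] = 2.0
noise_step = recursive_noise_estimate(step)
```

The threshold controls the trade-off: a low value protects the estimate from speech but also prevents it from following abrupt increases of the noise level.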
Histogram-based methods rely on the assumption that the histogram of the sub-band energy is often bimodal. A large low-energy mode accumulates energy values of segments without speech or with low-energy segments of speech. The high-energy mode accumulates energy values of segments with voiced speech and noise. The noise energy in a particular sub-band is determined from the low-energy mode [H. Hirsch, C. Ehrlicher, “Noise estimation techniques for robust speech recognition”, Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Detroit, USA, 1995]. For a comprehensive recent review, see P. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007.
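The histogram idea can be sketched for a single sub-band as follows. The sketch assumes that the low-energy mode is the largest mode, so that the most populated histogram bin lies in it; the number of bins and the use of a dB scale are further assumptions made for the example.

```python
import numpy as np

def histogram_noise_estimate(band_energy, num_bins=40):
    """Estimate the noise energy of one sub-band from the peak of its log-energy histogram."""
    band_energy = np.maximum(np.asarray(band_energy, dtype=float), 1e-12)
    energy_db = 10.0 * np.log10(band_energy)
    counts, edges = np.histogram(energy_db, bins=num_bins)
    k = np.argmax(counts)  # assumption: the largest mode is the noise mode
    centre_db = 0.5 * (edges[k] + edges[k + 1])
    return 10.0 ** (centre_db / 10.0)

# Synthetic bimodal data: 300 noise-only frames around 0 dB, 100 speech frames around 20 dB.
noise_db = np.linspace(-1.0, 1.0, 300)
speech_db = np.linspace(19.0, 21.0, 100)
energy = 10.0 ** (np.concatenate([noise_db, speech_db]) / 10.0)
estimate = histogram_noise_estimate(energy)
```

If speech dominates the analysed segment, the largest mode is no longer the noise mode and this simple peak picking fails, which is one reason practical methods treat the two modes more carefully.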
Methods for the estimation of the sub-band SNR based on supervised learning using amplitude modulation features are reported in J. Tchorz, B. Kollmeier, “SNR Estimation based on amplitude modulation analysis with applications to noise suppression”, IEEE Trans. On Speech and Audio Processing, vol. 11, no. 3, pp. 184-192, 2003, and in M. Kleinschmidt, V. Hohmann, “Sub-band SNR estimation using auditory feature processing”, Speech Communication: Special Issue on Speech Processing for Hearing Aids, vol. 39, pp. 47-64, 2003.
Other approaches to speech enhancement are pitch-synchronous filtering (e.g. in R. Frazier, S. Samsam, L. Braida, A. Oppenheim, “Enhancement of speech by adaptive filtering”, Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Philadelphia, USA, 1976), filtering of Spectro Temporal Modulation (STM) (e.g. in N. Mesgarani, S. Shamma, “Speech enhancement based on filtering the spectro-temporal modulations”, Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, Philadelphia, USA, 2005), and filtering based on a sinusoidal model representation of the input signal (e.g. J. Jensen, J. Hansen, “Speech enhancement using a constrained iterative sinusoidal model”, IEEE Trans. on Speech and Audio Processing, vol. 9, no. 7, pp. 731-740, 2001).
The methods for the estimation of the sub-band SNR based on supervised learning using amplitude modulation features as reported in J. Tchorz, B. Kollmeier, “SNR Estimation based on amplitude modulation analysis with applications to noise suppression”, IEEE Trans. On Speech and Audio Processing, vol. 11, no. 3, pp. 184-192, 2003, and in M. Kleinschmidt, V. Hohmann, “Sub-band SNR estimation using auditory feature processing”, Speech Communication: Special Issue on Speech Processing for Hearing Aids, vol. 39, pp. 47-64, 2003, are disadvantageous in that two spectrogram processing steps are needed. The first spectrogram processing step is to generate a time/frequency spectrogram of the time-domain audio signal. Then, in order to generate the modulation spectrogram, another “time/frequency” transform is needed, which transforms the spectral information from the spectral domain into the modulation domain. Due to the inherent systematic delay and the time/frequency resolution issue inherent to any transform algorithm, this additional transform operation incurs problems.
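The two processing steps can be made concrete with a small sketch that first computes a magnitude spectrogram and then applies a second Fourier transform along time in each sub-band. This is a generic illustration of a modulation spectrum, not the feature extraction of the cited papers; the frame length, hop size and test signal are assumptions.

```python
import numpy as np

def modulation_spectrum(x, fs, frame_len=256, hop=128):
    """Two-stage analysis: time signal -> spectrogram -> modulation spectrum per sub-band."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    # First transform: magnitude spectrogram, shape frames x bins.
    spec = np.array([np.abs(np.fft.rfft(window * x[s:s + frame_len])) for s in starts])
    # Second transform: FFT over time of each sub-band envelope (mean removed).
    envelopes = spec - spec.mean(axis=0)
    mod = np.abs(np.fft.rfft(envelopes, axis=0))
    frame_rate = fs / hop  # sampling rate of the envelopes
    mod_freqs = np.fft.rfftfreq(spec.shape[0], d=1.0 / frame_rate)
    return mod, mod_freqs

# A 1 kHz tone whose amplitude is modulated at 4 Hz.
fs = 8000
t = np.arange(2 * fs) / fs
x = (1.0 + 0.5 * np.cos(2 * np.pi * 4.0 * t)) * np.sin(2 * np.pi * 1000.0 * t)
mod, mod_freqs = modulation_spectrum(x, fs)
carrier_bin = 32  # 1000 Hz at fs = 8000 Hz and frame_len = 256
peak_hz = mod_freqs[np.argmax(mod[:, carrier_bin])]
```

In this sketch the second transform spans the whole two-second excerpt to resolve the 4 Hz modulation to within about half a hertz, which hints at the delay and resolution trade-off that the additional transform introduces.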
An additional consequence of this procedure is that noise estimates are rather inaccurate in conditions where the noise is non-stationary and where various noise signals may occur.