Research in signal processing systems, methods and algorithms for suppressing or removing noise signals from a noise-infected target signal, such as a speech signal, has been ongoing for decades. Important objectives of these efforts are to provide an improvement in the perceived sound quality and/or speech intelligibility for the listener. In voice communication apparatuses and systems it is known to represent a noisy speech signal in a time-frequency domain, e.g. as multiple sub-band signals. In many cases it is desirable to apply a time-frequency dependent gain value to the sub-band signals before the signal is reconstructed as a time domain signal. This is done to attenuate the undesired noise signal components that may be present in an audio signal. These time-frequency dependent gain values or time-varying sub-band gain values are sometimes derived from an estimate of the time-frequency dependent ratio of target signal to noise signal. The present multi-band noise reduction system and methodology may comprise processing multiple time-frequency signal-to-noise ratio estimates of respective sub-band signals to improve the sound quality and/or intelligibility of target speech for a listener or user in a manner that takes into account the statistical properties of a background noise signal and the nature of natural speech. The result of the processing may provide respective improved signal-to-noise ratio (SNR) estimates of the sub-band signals to be used for calculating appropriate time-frequency gain values.
The present multi-band noise reduction system and methodology have numerous applications in addition to the previously discussed sound quality and/or speech intelligibility improvements. The multi-band noise reduction system and methodology may form part of front-ends of voice control or speech recognition systems which benefit from the improved signal-to-noise ratio (SNR) of the noise-reduced digital audio output signal. The invention may e.g. be useful in applications such as hands-free systems, headsets, hearing aids, active ear protection systems, mobile telephones, teleconferencing systems, karaoke systems, public address systems, mobile communication devices, hands-free communication devices, voice control systems, car audio systems, navigation systems, audio capture, video cameras, and video telephony. The improved SNR of the noise-reduced digital audio output signal may be used to provide noise reduction, speech enhancement or suppression of residual echo signals in an echo cancellation system. The improved SNR of the noise-reduced digital audio output signal may also be exploited to improve the recognition rate in a voice control system.
Traditional methods for enhancing the quality of a noise-infected target signal include beamforming and noise reduction techniques. Single channel noise reduction algorithms can operate on a communication signal, for example a single microphone audio signal, or on a beam-formed signal which is the result of a beamforming operation on multiple microphone audio signals. This invention can be used as part of a noise reduction system in either case.
It is assumed in the following that an analysis filterbank is in place processing a time domain signal y(t). An example is a complex DFT filterbank according to:
$$Y_k(n) = \sum_{l=0}^{L-1} y(nD - l)\, w_A(l)\, e^{-2\pi j k l / L} \tag{1}$$

where k designates the subband index, n is the frame (time) index, $w_A(l)$ is the analysis window function, L is the frame length, and D is the filterbank decimation factor. In other implementations, the noisy subband signal $Y_k(n)$ may be available as a result of other processing steps, such as beamforming, echo cancellation, wind noise reduction, etc.
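By way of illustration only, equation (1) can be evaluated directly as a windowed DFT over decimated frames. The following is a minimal sketch and not an efficient or definitive implementation; the function names `hann_window` and `analysis_filterbank`, the default Hann window, and the treatment of samples before the start of the signal as zero are assumptions made for the example and are not taken from the description above.

```python
import cmath
import math


def hann_window(L):
    """Periodic Hann window of length L (an assumed choice for w_A)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * l / L) for l in range(L)]


def analysis_filterbank(y, L, D, window=None):
    """Return frames[n][k] = Y_k(n) following equation (1).

    y      : list of real time-domain samples
    L      : frame length (also used as the DFT size)
    D      : decimation factor (hop between frames)
    window : analysis window w_A of length L; defaults to Hann (assumption)
    """
    w = window if window is not None else hann_window(L)
    frames = []
    n = 0
    while n * D < len(y):
        Y_n = []
        for k in range(L):
            acc = 0j
            for l in range(L):
                t = n * D - l  # equation (1) reads y(nD - l)
                if t >= 0:     # samples before t = 0 are taken as zero (assumption)
                    acc += y[t] * w[l] * cmath.exp(-2j * math.pi * k * l / L)
            Y_n.append(acc)
        frames.append(Y_n)
        n += 1
    return frames
```

A practical system would typically replace the inner double loop by an FFT; the direct sum is kept here so that each term can be compared against equation (1).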
It is common for noise reduction systems to operate on the principle of estimating an SNR in the time-frequency domain; for example the maximum likelihood SNR estimate $\xi_k^{\mathrm{ML}}(n)$ is defined as
$$\xi_k^{\mathrm{ML}}(n) = \max\left(0,\ \frac{|Y_k(n)|^2}{\hat{\sigma}_k^2(n)} - 1\right) \tag{2}$$
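As a small illustration, equation (2) for a single time-frequency bin may be written as follows. The function name `ml_snr` and the guard against a non-positive noise power are assumptions of this sketch.

```python
def ml_snr(Y_k, noise_power):
    """Maximum likelihood SNR estimate of equation (2) for one bin.

    Y_k         : complex sub-band sample Y_k(n)
    noise_power : noise power estimate for the same bin (must be positive)
    """
    if noise_power <= 0.0:
        raise ValueError("noise_power must be positive")
    # Negative values of |Y|^2 / sigma^2 - 1 are truncated to zero.
    return max(0.0, abs(Y_k) ** 2 / noise_power - 1.0)
```

For instance, `ml_snr(3 + 4j, 5.0)` evaluates to 4.0, whereas `ml_snr(1 + 0j, 2.0)` is truncated to 0.0 because the observed power is below the noise power estimate.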
Here, $\hat{\sigma}_k^2(n)$ is a noise power density estimate obtained from a noise estimator algorithm; a multitude of such algorithms are known [4] and will not be described here. Because the maximum likelihood SNR estimate can fluctuate and because it is a biased (i.e. non-central) estimator, it is common to introduce a further processing step known as decision directed processing (DD) [1]. In DD, an a priori SNR estimate $\xi_k(n)$ is introduced as
$$\xi_k(n) = \alpha\, \frac{\hat{A}_k(n-1)^2}{\hat{\sigma}_k^2(n)} + (1 - \alpha)\, \xi_k^{\mathrm{ML}}(n) \tag{3}$$
Here, $\alpha$ is a weighting parameter (usually chosen in the range 0.94 to 0.99) and $\hat{A}_k(n)^2$ is the speech magnitude estimate, based on a speech estimator algorithm, of which a multitude exist [3][4]. In general,
$$\hat{A}_k(n)^2 = G\left(\xi_k(n),\ \frac{|Y_k(n)|^2}{\hat{\sigma}_k^2(n)}\right) |Y_k(n)| \tag{4}$$

where the function $G(\cdot, \cdot)$ is known as a gain function. Well known examples of gain functions are the Wiener filter, spectral subtraction, and more advanced methods such as STSA [1], LSA and MOSIE [2].
Because of their complexity, practical embodiments of such gain functions require storage of a two-dimensional lookup-table.
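The recursion formed by equations (2) to (4) can be sketched for a single sub-band as shown below. This is an illustrative sketch under stated assumptions and not a definitive implementation: the Wiener gain is used because it has a simple closed form, the gain is assumed to act on the noisy magnitude so that the squared magnitude estimate of the previous frame enters equation (3), and the initial condition of zero speech magnitude before the first frame is an assumption. The names `wiener_gain` and `decision_directed` are hypothetical.

```python
def wiener_gain(xi, gamma):
    """Wiener gain G(xi, gamma) = xi / (1 + xi); gamma is accepted but unused."""
    return xi / (1.0 + xi)


def decision_directed(Y_mag, noise_power, alpha=0.98, gain=wiener_gain):
    """Decision directed a priori SNR estimation for one sub-band.

    Y_mag       : list of noisy magnitudes |Y_k(n)|, one per frame
    noise_power : list of noise power estimates, same length
    alpha       : DD weighting parameter
    gain        : gain function G(xi, gamma)
    Returns (xi, A_hat): a priori SNR estimates and speech magnitude estimates.
    """
    xi_out, A_out = [], []
    A_prev = 0.0  # assumed start-up state: no speech before the first frame
    for mag, sigma2 in zip(Y_mag, noise_power):
        gamma = mag ** 2 / sigma2            # |Y|^2 / sigma^2
        xi_ml = max(0.0, gamma - 1.0)        # equation (2)
        xi = alpha * A_prev ** 2 / sigma2 + (1.0 - alpha) * xi_ml  # equation (3)
        A_hat = gain(xi, gamma) * mag        # gain applied to the noisy magnitude
        xi_out.append(xi)
        A_out.append(A_hat)
        A_prev = A_hat
    return xi_out, A_out
```

Passing a different callable as `gain` shows how the a priori SNR trajectory depends on the chosen gain function through the fed-back magnitude estimate.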
The output signal is reconstructed from the estimated spectral magnitudes $\hat{A}_k(n)^2$ and the noisy phases $\angle Y_k(n)$ using a synthesis filterbank. It is well known that the maximum likelihood SNR estimate $\xi_k^{\mathrm{ML}}(n)$ is not a central estimator, owing to the truncation of negative values. FIG. 4 shows the bias in an experiment where noise samples were generated at SNR values corresponding to the x-axis, and the average estimate $\xi_k^{\mathrm{ML}}(n)$ is graphed. When used in combination with certain gain functions, the DD approach is known to introduce a negative bias that to a certain extent counteracts the bias of the maximum likelihood estimator [3]. The DD approach is further known to effectively introduce temporal averaging of the SNR estimate when the SNR is low [3].
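The bias caused by truncating negative values can be illustrated with a simple Monte Carlo simulation. The sketch below is not a reproduction of the experiment of FIG. 4; it assumes that the ratio $|Y_k(n)|^2/\hat{\sigma}_k^2(n)$ is exponentially distributed with mean $1 + \xi$ (complex Gaussian speech and noise with a perfectly known noise power), and the function names `mean_ml_snr` and `expected_ml_snr` are hypothetical.

```python
import math
import random


def mean_ml_snr(true_snr, trials=50000, seed=0):
    """Average of the ML SNR estimate at a given true SNR (linear scale).

    Assumes |Y|^2 / sigma^2 is exponentially distributed with mean
    1 + true_snr; this statistical model is an assumption of the sketch.
    """
    rng = random.Random(seed)
    mean_gamma = 1.0 + true_snr
    total = 0.0
    for _ in range(trials):
        gamma = rng.expovariate(1.0 / mean_gamma)  # rate = 1 / mean
        total += max(0.0, gamma - 1.0)             # equation (2)
    return total / trials


def expected_ml_snr(true_snr):
    """Closed-form mean under the same model: (1 + xi) * exp(-1 / (1 + xi))."""
    m = 1.0 + true_snr
    return m * math.exp(-1.0 / m)
```

Under this model the average estimate at a true SNR of zero is exp(-1), i.e. clearly above zero, while at high SNR the average approaches the true value, which is the qualitative behaviour of a positively biased estimator at low SNR.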
One significant disadvantage of the DD approach is that the interaction between the chosen component algorithms, i.e. the particular type of speech and noise estimator applied and the gain function used, is unclear. It is not generally possible to compensate for any differences that arise if, say, the gain function is replaced. Even basic parameters such as the filterbank parameters D and L and the signal sample rate can have a large influence on the sound quality of the resulting output. The present invention has advantages over the traditional DD approach in that it allows compensation for system parameters and for noise estimator and speech estimator properties, and further allows the SNR processing to be adapted to properties of the noise environment. It is able to act in many respects similarly to the DD approach for a given setup, and it further allows tuning and extends support to filterbank configurations that would not work well using DD.