Capturing audio, and in particular speech, has become increasingly important in recent decades for a variety of applications including telecommunication, teleconferencing, gaming, etc. However, a problem in many scenarios and applications is that the desired speech source is typically not the only audio source in the environment. Rather, in typical audio environments there are many other audio/noise sources which are also captured by the microphone. One of the critical problems facing many speech capturing applications is how best to extract speech in a noisy environment. In order to address this problem, a number of different approaches for noise suppression have been proposed.
One of the most difficult tasks in speech enhancement is the suppression of non-stationary diffuse noise. Diffuse noise is for example an acoustic (noise) sound field in a room where the noise is coming from all directions. A typical example is so-called “babble”-noise in e.g. a cafeteria or restaurant in which there are many noise sources distributed across the room.
When recording a desired speaker in a room with a microphone or microphone array, the desired speech is captured in addition to background noise. Speech enhancement can be used to try to modify the microphone signal such that the background noise is reduced while the desired speech is as unaffected as possible. When the noise is diffuse, one proposed approach is to try to estimate the spectral amplitude of the background noise and to modify the spectral amplitude such that the spectral amplitude of the resulting enhanced signal resembles the spectral amplitude of the desired speech signal as much as possible. The phase of the captured signal is not changed in this approach.
FIG. 1 illustrates an example of a noise suppression system in accordance with prior art. In the example, input signals are received from two microphones with one being considered to be a reference microphone and the other being a main microphone capturing the desired audio source, and specifically capturing speech. Thus, a reference microphone signal x(n) and a primary microphone signal z(n) are received. The signals are converted to the frequency domain in transformers 101, 103, and the magnitudes in individual time frequency tiles are generated by magnitude units 105, 107. The resulting magnitude values are fed to a unit 109 for calculating gains. The frequency domain values of the primary signal are multiplied by the resulting gains in a multiplier 111, thereby generating a frequency spectrum compensated output signal which is converted to the time domain in another transform unit 113.
The approach can best be considered in the frequency domain. Frequency domain signals are first generated by computing a short-time Fourier transform (STFT) of e.g. overlapping Hanning windowed blocks of the time domain signal. The STFT is in general a function of both time and frequency, and is expressed by the two arguments tk and ωl with tk=kB being the discrete time, and where k is the frame index, B the frame shift, and ωl=l ω0 is the (discrete) frequency, with l being the frequency index and ω0 denoting the elementary frequency spacing.
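As an illustration of the framing described above, the following is a minimal sketch of an STFT using overlapping Hann windows. The function name `stft`, the frame length of 512 samples and the frame shift B of 256 samples are assumptions chosen for the example and are not taken from the described system. The sketch also assumes that the signal is at least one frame long.

```python
import numpy as np

def stft(signal, frame_len=512, frame_shift=256):
    """Short-time Fourier transform with overlapping Hann windows.

    Returns a complex array of shape (num_frames, frame_len // 2 + 1).
    Entry [k, l] corresponds to discrete time t_k = k * frame_shift and
    discrete frequency omega_l = l * omega_0.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    # Number of complete frames that fit in the signal.
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([
        signal[k * frame_shift : k * frame_shift + frame_len] * window
        for k in range(num_frames)
    ])
    # One-sided spectrum of each windowed frame.
    return np.fft.rfft(frames, axis=1)
```

With these example values, consecutive frames overlap by 50%, and the frame index k and frequency index l of the text map directly to the row and column indices of the returned array.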
Let Z(tk,ωl) be the (complex) microphone signal which is to be enhanced. It consists of the desired speech signal Zs(tk,ωl) and the noise signal Zn(tk,ωl):

$$Z(t_k,\omega_l) = Z_s(t_k,\omega_l) + Z_n(t_k,\omega_l).$$

The microphone signal is fed to a post-processor which performs noise suppression by modifying the spectral amplitude of the input signal while leaving the phase unchanged. The operation of the post-processor can be described by a gain function, which in the case of spectral amplitude subtraction typically has the form:

$$G(t_k,\omega_l) = \frac{|Z(t_k,\omega_l)| - |Z_n(t_k,\omega_l)|}{|Z(t_k,\omega_l)|},$$

where |.| is the modulus operation.

The output signal is then calculated as:

$$Q(t_k,\omega_l) = Z(t_k,\omega_l) \cdot G(t_k,\omega_l).$$

After being transformed back to the time domain, the time domain signal is reconstructed by combining the current and the previous frame, taking into account that the original time signal was windowed and time overlapped (i.e. an overlap-and-add procedure is performed).

The gain function can be generalized to:

$$G(t_k,\omega_l) = \left( \frac{|Z(t_k,\omega_l)|^{\alpha} - |Z_n(t_k,\omega_l)|^{\alpha}}{|Z(t_k,\omega_l)|^{\alpha}} \right)^{1/\alpha}.$$
For α=1, this describes a gain function for spectral amplitude subtraction; for α=2, it describes a gain function for spectral power subtraction, which is also often used. The following description will focus on spectral amplitude subtraction, but it will be appreciated that the provided reasoning can also be applied to, in particular, spectral power subtraction.
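The generalized gain function can be sketched as follows. The function name `subtraction_gain` is hypothetical, and the sketch assumes that the noise magnitude is known and that |Z| is non-zero in every tile. The clipping of the ratio at zero is a guard added for this example only, so that the root stays real if the noise magnitude exceeds |Z|.

```python
import numpy as np

def subtraction_gain(z_mag, noise_mag, alpha=1.0):
    """Generalized spectral subtraction gain per time-frequency tile.

    alpha=1 gives spectral amplitude subtraction,
    alpha=2 gives spectral power subtraction.
    """
    z_pow = np.asarray(z_mag, dtype=float) ** alpha
    n_pow = np.asarray(noise_mag, dtype=float) ** alpha
    ratio = (z_pow - n_pow) / z_pow
    # Guard (assumption of this sketch): keep the ratio non-negative
    # so that the 1/alpha root is real.
    ratio = np.maximum(ratio, 0.0)
    return ratio ** (1.0 / alpha)
```

For example, with |Z| = 2 and |Zn| = 1 the amplitude subtraction gain is 0.5, whereas the power subtraction gain is the square root of 0.75, i.e. power subtraction attenuates the same tile less.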
The amplitude spectrum of the noise, |Zn(tk,ωl)|, is in general not known. Therefore, an estimate |Zn(l)| has to be used instead. Since that estimate is not always accurate, an oversubtraction factor γn is applied to the noise estimate (i.e. the noise estimate is scaled by a factor greater than one). However, this may lead to a negative value of |Z(tk,ωl)|−γn|Zn(l)|, which is undesired. For that reason, the gain function is limited from below to zero or to a certain small positive value θ.
For the gain function, this results in:
$$G(t_k,\omega_l) = \mathrm{MAX}\left( \frac{|Z(t_k,\omega_l)| - \gamma_n\,|Z_n(l)|}{|Z(t_k,\omega_l)|},\ \theta \right), \qquad 0 \le \theta.$$
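The limited gain function with oversubtraction can be illustrated by the following minimal sketch. The function names `floored_gain` and `enhance`, the default values γn = 1.5 and θ = 0.1, and the small constant `eps` that avoids division by zero are all assumptions made for the example rather than values given in the description.

```python
import numpy as np

def floored_gain(z_mag, noise_est_mag, gamma_n=1.5, theta=0.1, eps=1e-12):
    """Amplitude subtraction gain with oversubtraction and a lower limit.

    gamma_n: oversubtraction factor (> 1), example value only.
    theta:   lower limit of the gain (>= 0), example value only.
    """
    z_mag = np.asarray(z_mag, dtype=float)
    noise_est_mag = np.asarray(noise_est_mag, dtype=float)
    # eps (assumption of this sketch) avoids division by zero in empty tiles.
    raw = (z_mag - gamma_n * noise_est_mag) / np.maximum(z_mag, eps)
    # Limit the gain so that it never falls below theta.
    return np.maximum(raw, theta)

def enhance(Z, noise_est_mag, gamma_n=1.5, theta=0.1):
    """Scale the complex spectrum Z by the real, non-negative gain.

    Only the magnitude is modified; the phase of Z is left unchanged.
    """
    return Z * floored_gain(np.abs(Z), noise_est_mag, gamma_n, theta)
```

Because the gain is real and, for θ > 0, strictly positive, multiplying Z by it changes only the spectral amplitude, which matches the statement that the phase of the captured signal is not modified.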
For stationary noise, |Zn(tk,ωl)| can be estimated by measuring and averaging the amplitude spectrum |Z(tk,ωl)| during silence.
However, for non-stationary noise, an estimate of |Zn(tk,ωl)| cannot be derived in this way since the noise characteristics change with time. This tends to prevent an accurate estimate from being generated from a single microphone signal. Instead, it has been proposed to use an extra microphone in order to estimate |Zn(tk,ωl)|. As a specific example, a scenario can be considered where there are two microphones in a room with one microphone being positioned close to the desired speaker (the primary microphone) and the other microphone being further away from the speaker (the reference microphone). In this scenario, it can often be assumed that the primary microphone signal contains the desired speech component as well as a noise component, whereas the reference microphone signal can be assumed to contain no speech but only a noise signal recorded at the position of the reference microphone. The microphone signals can be denoted by:

$$Z(t_k,\omega_l) = Z_s(t_k,\omega_l) + Z_n(t_k,\omega_l)$$

and

$$X(t_k,\omega_l) = X_n(t_k,\omega_l)$$

for the primary microphone and reference microphone respectively.
To relate the noise components in the microphone signals we define a so-called coherence term as:
$$C(t_k,\omega_l) = \frac{E\{|Z_n(t_k,\omega_l)|\}}{E\{|X_n(t_k,\omega_l)|\}},$$

where E{.} is the expectation operator. The coherence term is an indication of the average correlation between the amplitudes of the noise component in the primary microphone signal and the amplitudes of the reference microphone signal.
Since C(tk,ωl) is not dependent on the instantaneous audio at the microphones but instead depends on the spatial characteristics of the noise sound field, the variation of C(tk,ωl) as a function of time is much less than the time variations of Zn and Xn.
As a result C(tk,ωl) can be estimated relatively accurately by averaging |Zn(tk,ωl)| and |Xn(tk,ωl)| over time during the periods where no speech is present in z. An approach for doing so is disclosed in U.S. Pat. No. 7,602,926, which specifically describes a method where no explicit speech detection is needed for determining C(tk,ωl).
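The time averaging described above can be sketched as follows. Note that this sketch assumes that a boolean mask `noise_only` marking the frames without speech is already available, which is a simplification: it is not the method of the cited patent, which avoids explicit speech detection. The function name `estimate_coherence` and the constant `eps` are likewise hypothetical, and the sketch returns a single time-invariant estimate per frequency index.

```python
import numpy as np

def estimate_coherence(Z, X, noise_only, eps=1e-12):
    """Estimate the coherence term per frequency bin.

    Z, X:       complex STFTs of the primary and reference microphone
                signals, shape (num_frames, num_bins).
    noise_only: boolean array of shape (num_frames,), True for frames
                assumed to contain no speech (assumption of this sketch).
    """
    noise_only = np.asarray(noise_only, dtype=bool)
    # In noise-only frames Z = Zn and X = Xn, so the time averages of the
    # magnitudes approximate E{|Zn|} and E{|Xn|}.
    z_mean = np.abs(Z[noise_only]).mean(axis=0)
    x_mean = np.abs(X[noise_only]).mean(axis=0)
    return z_mean / np.maximum(x_mean, eps)
```

The sketch requires at least one frame to be flagged as noise-only. A practical system would typically update the estimate recursively so that slow changes in the noise sound field can be tracked.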
Similarly to the case for stationary noise, an equation for the gain function for two microphones can then be derived as:
$$G(t_k,\omega_l) = \mathrm{MAX}\left( \frac{|Z(t_k,\omega_l)| - \gamma_n\,C(t_k,\omega_l)\,|X(t_k,\omega_l)|}{|Z(t_k,\omega_l)|},\ \theta \right), \qquad 0 \le \theta.$$
Since X does not contain speech, the magnitude of X multiplied by the coherence term C(tk,ωl) can be considered to provide an estimate of the noise component in the primary microphone signal. Consequently, the provided equation may be used to shape the spectrum of the primary microphone signal to correspond to the (estimated) speech component by scaling the frequency domain signal, i.e. by:

$$Q(t_k,\omega_l) = Z(t_k,\omega_l) \cdot G(t_k,\omega_l).$$
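The two-microphone scaling can be illustrated by the following minimal sketch, in which C·|X| takes the place of the noise magnitude estimate. The function name `two_mic_enhance`, the default values of γn and θ, and the constant `eps` are assumptions made for the example. The coherence term `C` is assumed to have been estimated beforehand and to be broadcastable against the spectra.

```python
import numpy as np

def two_mic_enhance(Z, X, C, gamma_n=1.5, theta=0.1, eps=1e-12):
    """Two-microphone spectral amplitude subtraction.

    Z: complex STFT of the primary microphone signal.
    X: complex STFT of the reference microphone signal (noise only).
    C: coherence term, e.g. one value per frequency bin.
    Returns the enhanced spectrum Q = Z * G and the gain G.
    """
    z_mag = np.abs(Z)
    # Estimate of the noise magnitude in the primary microphone signal.
    noise_est = C * np.abs(X)
    # eps (assumption of this sketch) avoids division by zero.
    gain = (z_mag - gamma_n * noise_est) / np.maximum(z_mag, eps)
    # Limit the gain from below to theta.
    gain = np.maximum(gain, theta)
    return Z * gain, gain
```

The returned spectrum would subsequently be converted back to the time domain using the overlap-and-add procedure mentioned earlier.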
However, although the described approach may provide advantageous performance in many scenarios, it may in some scenarios provide less than optimum noise suppression. In particular, for diffuse noise the improvement in the Signal-to-Noise Ratio (SNR) may be limited, and the so-called SNR Improvement (SNRI) is in practice often found to be limited to around 6-9 dB. Although this may be acceptable in some applications, it will in many scenarios tend to result in a significant remaining noise component degrading the perceived speech quality. Furthermore, although other noise suppression techniques can be used, these also tend to be suboptimal, e.g. they tend to be complex, inflexible, impractical, or computationally demanding, to require complex hardware (e.g. a high number of microphones), and/or to provide suboptimal noise suppression.
Hence, an improved noise suppression would be advantageous, and in particular a noise suppression allowing reduced complexity, increased flexibility, facilitated implementation, reduced cost (e.g. not requiring a large number of microphones), improved noise suppression and/or improved performance would be advantageous.