The present invention relates to a method and apparatus for enhancing acoustic signals, and more particularly, to a method and apparatus that adaptively reduce noise contaminating acoustic signals.
In recent years, applications of acoustic signal processing have developed rapidly. These applications include hearing aids, speech encoding, and speech recognition. A major challenge for such applications is that they usually have to deal with acoustic signals that are already contaminated by background noise, which degrades their performance. To solve this problem, a great amount of work has been done in the field of noise suppression, and the following papers are incorporated herein by reference:
[1] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, 1984.
[2] P. J. Wolfe and S. J. Godsill, “Efficient alternatives to the Ephraim and Malah suppression rule for audio signal enhancement,” EURASIP Journal on Applied Signal Processing, 2003. To appear. Special Issue: Audio for Multimedia Communications.
[3] I. Cohen and B. Berdugo, “Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement,” IEEE Sig. Proc. Let., vol. 9, pp. 12-15, January 2002.
[4] D. E. Tsoukalas, J. N. Mourjopoulos, and G. Kokkinakis, “Speech enhancement based on audible noise suppression,” IEEE Trans. Speech and Audio Processing, vol. 88, pp. 497-514, November 1997.
Many of the proposed noise suppression algorithms are based on the manipulation of the short-time spectral amplitude (STSA) of the contaminated acoustic signal. Such STSA manipulation schemes are widely used for their computational advantage. Among them, the MMSE (Minimum Mean Square Error) STSA estimator proposed by Ephraim and Malah (reference [1]) is the most popular STSA-based algorithm. FIG. 1 shows an acoustic signal enhancement apparatus 100 according to the MMSE STSA algorithm proposed by Ephraim and Malah. The acoustic signal enhancement apparatus 100 comprises a frame decomposition & windowing unit 110, a Fourier transform unit 120, a noise estimation unit 130, an a posteriori SNR (signal-to-noise ratio) estimation unit 140, an a priori SNR estimation unit 150, a spectral gain calculation unit 160, a multiplication unit 170, an inverse Fourier transform unit 180, and a frame synthesis unit 190.
Assume that a clean speech s(t) is contaminated by a background noise d(t). The noisy speech x(t) received by the acoustic signal enhancement apparatus 100 is then given by

$$x(t) = s(t) + d(t), \qquad (1)$$
where t represents a time index. The frame decomposition & windowing unit 110 segments the noisy speech x(t) into frames of M samples. The frame decomposition & windowing unit 110 further applies an analysis window h(t) of size 2M with a 50% overlap to the segmented noisy speech xn(t) in frame n so as to generate a windowed frame xn′(t) with 2M samples as follows:
$$x_n'(t) = \begin{cases} h(t)\,x_{n-1}(t), & 1 \le t \le M \\ h(t)\,x_n(t-M), & M < t \le 2M \end{cases} \qquad (2)$$
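The framing step of equation (2) can be sketched as follows. This is a minimal illustration rather than the implementation of unit 110: the function names, the 0-based indexing, the choice of a Hann window for h(t), and the treatment of the frame preceding the first one as silence are all assumptions made for the example.

```python
import math


def hann_window(length):
    """Hypothetical analysis window h(t) (periodic Hann), 0-based index."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * t / length)
            for t in range(length)]


def windowed_frames(x, M, h=None):
    """Build windowed frames of 2M samples with 50% overlap, per equation (2).

    x is a list of samples and frame n covers x[n*M:(n+1)*M]. The first half
    of each windowed frame holds the previous frame x_{n-1}, the second half
    holds the current frame x_n. The frame before the first one is assumed
    to be all zeros.
    """
    if h is None:
        h = hann_window(2 * M)
    frames = []
    prev = [0.0] * M
    for n in range(len(x) // M):
        cur = x[n * M:(n + 1) * M]
        block = prev + cur  # 2M samples: x_{n-1} followed by x_n
        frames.append([h[i] * block[i] for i in range(2 * M)])
        prev = cur
    return frames
```

Passing a rectangular window, for example `h=[1.0] * (2 * M)`, makes the overlap structure easy to inspect, since each output frame is then simply the previous and current M-sample frames placed side by side.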
The Fourier transform unit 120 applies a discrete Fourier transform to the windowed frame xn′(t) to generate Xn(k), which can be thought of as a spectral representation of xn′(t). Herein n and k refer to the analyzed frame index and the frequency bin index, respectively. In this example, the acoustic signal enhancement apparatus 100 applies noise suppression only to the spectral amplitude amp[Xn(k)] of the noisy speech. The phase pha[Xn(k)] of the noisy speech is used directly for the enhanced speech without being altered, since the phase has little effect on speech quality and speech intelligibility. Herein the term amp[ . . . ] stands for an amplitude operator and the term pha[ . . . ] stands for a phase operator.
The noise estimation unit 130 estimates a noise spectrum λn(k) for each spectral representation Xn(k). Many algorithms can be applied by the noise estimation unit 130 to estimate the noise spectrum λn(k). For example, the noise estimation unit 130 can obtain the noise spectrum λn(k) by averaging the power spectrum of the noisy speech over intervals in which the noisy speech contains only noise. Reference [3] teaches another method for the noise estimation unit 130 to obtain the noise spectrum λn(k).
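The averaging approach mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the input is assumed to be a list of power spectra amp[Xn(k)]^2 taken from frames already known to contain only noise, and how such frames are detected is outside the scope of the sketch.

```python
def average_noise_spectrum(noise_only_power_frames):
    """Estimate the noise spectrum by averaging per-bin power over frames.

    Each element of noise_only_power_frames is a list of power values
    amp[X_n(k)]^2, one per frequency bin k, from a frame assumed to
    contain only noise.
    """
    num_frames = len(noise_only_power_frames)
    if num_frames == 0:
        raise ValueError("need at least one noise-only frame")
    num_bins = len(noise_only_power_frames[0])
    return [
        sum(frame[k] for frame in noise_only_power_frames) / num_frames
        for k in range(num_bins)
    ]
```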
Theoretically, the a posteriori SNR γn(k) and the a priori SNR ξn(k) are calculated by
$$\gamma_n(k) = \operatorname{amp}[X_n(k)]^2 \,/\, E\{\operatorname{amp}[D_n(k)]^2\} \qquad (3)$$

$$\xi_n(k) = \operatorname{amp}[S_n(k)]^2 \,/\, E\{\operatorname{amp}[D_n(k)]^2\} \qquad (4)$$
where Dn(k) and Sn(k) are the discrete Fourier transforms of d(t) and s(t), respectively, and E{ . . . } stands for an expectation operator. Since E{amp[Dn(k)]^2} is not available, the estimated noise spectrum λn(k) is used to approximate it. Therefore, the a posteriori SNR estimation unit 140 can approximate the a posteriori SNR γn(k) by γn′(k) as

$$\gamma_n'(k) = \operatorname{amp}[X_n(k)]^2 \,/\, \lambda_n(k) \qquad (5)$$
Having γn′(k) for the current frame and γn-1′(k) for the previous frame, the a priori SNR estimation unit 150 approximates the a priori SNR ξn(k) by ξn′(k) as

$$\xi_n'(k) = \alpha\,\gamma_{n-1}'(k)\,G_{n-1}(k)^2 + (1-\alpha)\,P[\gamma_n'(k) - 1] \qquad (6)$$
where α is a forgetting factor satisfying 0<α<1, P[ . . . ] is a rectifying function, and Gn-1(k) is the spectral gain determined for the previous frame.
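The two SNR estimates of equations (5) and (6) can be sketched per frequency bin as follows. The function names are hypothetical, the default value of α and the small floor guarding against division by zero are illustrative assumptions, and P[ . . . ] is assumed here to be half-wave rectification (negative arguments are replaced by zero), which the text does not spell out.

```python
def a_posteriori_snr(amp_x, noise_power, floor=1e-12):
    """Equation (5): amp[X_n(k)]^2 divided by the noise estimate lambda_n(k).

    The floor is an assumption added to avoid division by zero.
    """
    return (amp_x * amp_x) / max(noise_power, floor)


def a_priori_snr(gamma_prev, gain_prev, gamma_cur, alpha=0.98):
    """Equation (6): recursive estimate of the a priori SNR for one bin.

    gamma_prev and gain_prev are the a posteriori SNR estimate and the
    spectral gain of the previous frame; gamma_cur is the a posteriori
    SNR estimate of the current frame. alpha=0.98 is an illustrative
    choice only.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    rectified = max(gamma_cur - 1.0, 0.0)  # assumed form of P[...]
    return alpha * gamma_prev * gain_prev ** 2 + (1.0 - alpha) * rectified
```

A value of α close to 1 weights the estimate toward the previous frame, while a smaller α lets the current frame dominate.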
With γn′(k) and ξn′(k) already determined, the spectral gain calculation unit 160 can obtain the spectral gain for the current frame by

$$G_n(k) = \frac{\xi_n'(k) + \operatorname{sqrt}\!\left[\xi_n'(k)^2 + 2\,(1+\xi_n'(k))\,\bigl(\xi_n'(k)/\gamma_n'(k)\bigr)\right]}{2\,(1+\xi_n'(k))} \qquad (7)$$
where sqrt[ . . . ] is a square root operator.
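Equation (7) translates directly into a small function. The sketch below is for illustration only; the function name is hypothetical, and the requirement that γn′(k) be strictly positive and ξn′(k) non-negative is an assumption added so that the division and the square root are well defined.

```python
import math


def spectral_gain(xi, gamma):
    """Equation (7): spectral gain from the estimated a priori SNR (xi)
    and the estimated a posteriori SNR (gamma) of one frequency bin."""
    if gamma <= 0.0 or xi < 0.0:
        raise ValueError("expected gamma > 0 and xi >= 0")
    root = math.sqrt(xi * xi + 2.0 * (1.0 + xi) * (xi / gamma))
    return (xi + root) / (2.0 * (1.0 + xi))
```

For instance, with ξn′(k)=1 and γn′(k)=1 the expression evaluates to (1+sqrt[5])/4, and the gain falls to zero as ξn′(k) approaches zero.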
Next, the multiplication unit 170 multiplies the original spectral amplitude amp[Xn(k)] by the spectral gain Gn(k) to get the enhanced spectral amplitude Gn(k)amp[Xn(k)]. The enhanced spectral representation Yn(k) of the frame xn′(t) is constructed from the enhanced spectral amplitude Gn(k)amp[Xn(k)] and the original phase pha[Xn(k)] as:
$$Y_n(k) = \operatorname{amp}[Y_n(k)] \times \exp\{\,j \times \operatorname{pha}[Y_n(k)]\,\} = G_n(k) \times \operatorname{amp}[X_n(k)] \times \exp\{\,j \times \operatorname{pha}[X_n(k)]\,\} \qquad (8)$$
where j=sqrt(−1). Then, the inverse Fourier transform unit 180 applies a discrete inverse Fourier transform to the enhanced spectral representation Yn(k) to get yn′(t). Finally, the frame synthesis unit 190 obtains the enhanced speech yn(t) by performing overlap-add processing as follows:

$$y_n(t) = y_{n-1}'(t+M) + y_n'(t), \quad 1 \le t \le M \qquad (9)$$
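The last two steps, rebuilding a spectral bin from the scaled amplitude and the unchanged phase (equation (8)) and overlap-adding consecutive output frames (equation (9)), can be sketched as follows. The function names and the 0-based indexing are assumptions for illustration, and the inverse Fourier transform between the two steps is omitted.

```python
import cmath


def enhance_bin(X, gain):
    """Equation (8): scale the amplitude of X_n(k) by G_n(k), keep its phase."""
    amplitude = abs(X)        # amp[X_n(k)]
    phase = cmath.phase(X)    # pha[X_n(k)]
    return cmath.rect(gain * amplitude, phase)


def overlap_add(prev_frame, cur_frame, M):
    """Equation (9): add the second half of the previous 2M-sample output
    frame to the first half of the current one, giving M output samples.

    Indices are 0-based here, so t runs over 0..M-1 instead of 1..M.
    """
    return [prev_frame[t + M] + cur_frame[t] for t in range(M)]
```

Because only the amplitude is scaled, `enhance_bin(X, gain)` is numerically the same as `gain * X`. The polar form is written out here to mirror the amplitude and phase operators used in the text.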
The acoustic signal enhancement apparatus 100 works well only when the SNR of the noisy speech x(t) is sufficiently high. When the SNR of the noisy speech x(t) is poor, however, the acoustic signal enhancement apparatus 100 overly suppresses the actual speech information included in the noisy speech x(t). In addition, musical noise that deteriorates the quality of the enhanced speech yn(t) is likely to be generated as a side effect. In other words, the performance of the acoustic signal enhancement apparatus 100 of the related art is not sufficiently good over a wide range of SNR.