In a speech enhancement system, a digital signal processor (DSP) receives an input signal including samples of an analog audio signal. The analog audio signal may be a speech signal. The input signal includes noise and thus is referred to as a “noisy speech” signal with noisy speech samples. The DSP processes the noisy speech signal to attenuate the noise and outputs a “cleaned” speech signal with a reduced amount of noise as compared to the input signal. Attenuation of the noise is a challenging problem because the input signal includes no side information defining the speech and/or the noise. The only available information is the received noisy speech samples.
Traditional methods exist for attenuating the noise in a noisy speech signal. These methods, however, introduce and/or result in output of “music noise”. Music noise does not necessarily refer to noise of a music signal, but rather refers to a “music-like” sounding noise that is within a narrow frequency band. The music noise is included in cleaned speech signals that are output as a result of performing these traditional methods. The music noise can be heard by a listener and may annoy the listener.
As an example, samples of an input signal can be divided into overlapping frames, and an a priori signal-to-noise ratio (SNR) ξ(k,l) and an a posteriori SNR γ(k,l) may be determined, where: ξ(k,l) is the a priori SNR of the input signal; γ(k,l) is the a posteriori (or instantaneous) SNR of the input signal; l is a frame index that identifies a particular one of the frames; and k is a frequency bin (or range) index that identifies a frequency range of a short time Fourier transform (STFT) of the input signal. The a priori SNR ξ(k,l) is a ratio of a power level (or frequency amplitude of speech) of a clean speech signal to a power level of noise (or frequency amplitude of noise). The a posteriori SNR γ(k,l) is a ratio of a squared magnitude of an observed noisy speech signal to a power level of the noise. Both the a priori SNR ξ(k,l) and the a posteriori SNR γ(k,l) may be computed for each frequency bin of the input signal. The a priori SNR ξ(k,l) may be determined using equation 1, where λX(k,l) is an estimated a priori variance of amplitude of speech of the STFT of the input signal and λN(k,l) is an estimated a priori variance of noise of the STFT of the input signal.
$$\xi(k,l) = \frac{\lambda_X(k,l)}{\lambda_N(k,l)} \tag{1}$$
The a posteriori SNR γ(k,l) may be determined using equation 2, where R(k,l) is an amplitude of noisy speech of the STFT of the input signal.
$$\gamma(k,l) = \frac{R(k,l)^2}{\lambda_N(k,l)} \tag{2}$$
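Equations 1 and 2 can be sketched for a single frequency bin of a single frame as follows. This is a minimal illustration only: the function names and the numeric values are hypothetical, and λN(k,l) is assumed to be strictly positive.

```python
def a_priori_snr(lambda_x, lambda_n):
    """Equation 1: xi(k, l) = lambda_X(k, l) / lambda_N(k, l)."""
    return lambda_x / lambda_n


def a_posteriori_snr(r, lambda_n):
    """Equation 2: gamma(k, l) = R(k, l)^2 / lambda_N(k, l)."""
    return (r * r) / lambda_n


# Hypothetical values for one bin (k, l); lambda_n is assumed to be > 0.
xi = a_priori_snr(lambda_x=4.0, lambda_n=2.0)
gamma = a_posteriori_snr(r=3.0, lambda_n=2.0)
```

In a full system these two functions would be evaluated for every frequency bin k of every frame l.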
For each k and l, a gain G is calculated as a function of ξ(k,l) and γ(k,l). The gain G is multiplied by R(k,l) to provide an estimate of an amplitude of clean speech Â(k,l). Each gain value may be greater than or equal to 0 and less than or equal to 1. Values of the gain G are calculated based on ξ(k,l) and γ(k,l), such that frequency bands (or bins) of speech are kept and frequency bands (or bins) of noise are attenuated. An inverse fast Fourier transform (IFFT) of the amplitude of clean speech Â(k,l) is performed to provide time domain samples of the cleaned speech. The cleaned speech refers to the noisy speech portion of the STFT of the input signal that is cleaned (i.e. the noise has been attenuated).
For example, when ξ(k,l) is high, amplitude of speech for the corresponding frequency is high and little noise exists (i.e. amplitude of noise is low). For this condition, the gain G is set close to 1 (or 0 dB) to maintain amplitude of the speech. As a result, the amplitude of clean speech Â(k,l) is set approximately equal to R(k,l). As another example, when ξ(k,l) is low, amplitude of speech for the corresponding frequency is low and strong noise exists (i.e. amplitude of noise is high). For this condition, the gain G is set close to 0 to attenuate the noise. As a result, the amplitude of the clean speech Â(k,l) is set close to 0.
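The two limiting cases above can be illustrated with a short sketch. The text does not specify the gain function, so the Wiener gain G = ξ/(1+ξ) is used here purely as an assumption. It depends only on ξ(k,l), whereas the gain described above may also depend on γ(k,l). All function names and values are hypothetical.

```python
def wiener_gain(xi):
    """Assumed gain rule G = xi / (1 + xi); lies in [0, 1) for xi >= 0."""
    return xi / (1.0 + xi)


def estimate_clean_amplitude(r, xi):
    """A_hat(k, l) = G * R(k, l) for one frequency bin."""
    return wiener_gain(xi) * r


# High a priori SNR: gain is close to 1, so A_hat is close to R.
a_hat_speech = estimate_clean_amplitude(r=1.0, xi=100.0)

# Low a priori SNR: gain is close to 0, so A_hat is close to 0.
a_hat_noise = estimate_clean_amplitude(r=1.0, xi=0.01)
```

Any other gain rule that increases monotonically from 0 to 1 with ξ(k,l) would show the same qualitative behavior.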
The a priori signal-to-noise ratio (SNR) ξ(k,l) may be estimated using equation 3, where α is a constant between 0 and 1 and P(k,l) is an operator, which may be expressed by equation 4.
$$\xi(k,l) = \alpha\,\frac{\hat{A}(k,l-1)}{\lambda_N(k,l-1)} + (1-\alpha)\,P(k,l) \tag{3}$$
$$P(k,l) = \begin{cases} \gamma(k,l) - 1 = \dfrac{R(k,l)^2 - \lambda_N(k,l)}{\lambda_N(k,l)}, & R(k,l)^2 > \lambda_N(k,l) \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
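Equations 3 and 4 can be sketched for one frequency bin as shown below. The sketch follows the equations as written above. The function names, the default value of α, and the numeric inputs are hypothetical and chosen only for illustration.

```python
def p_operator(r, lambda_n):
    """Equation 4: gamma - 1 when R^2 > lambda_N, otherwise 0."""
    r_sq = r * r
    if r_sq > lambda_n:
        return (r_sq - lambda_n) / lambda_n
    return 0.0


def a_priori_snr_estimate(a_hat_prev, lambda_n_prev, r, lambda_n, alpha=0.5):
    """Equation 3 for one bin k, combining frame l-1 with frame l.

    a_hat_prev and lambda_n_prev come from frame l-1; r and lambda_n
    come from frame l. alpha=0.5 is a hypothetical default in (0, 1).
    """
    previous_term = alpha * a_hat_prev / lambda_n_prev
    current_term = (1.0 - alpha) * p_operator(r, lambda_n)
    return previous_term + current_term


# Hypothetical inputs for a single bin.
xi_est = a_priori_snr_estimate(
    a_hat_prev=1.0, lambda_n_prev=2.0, r=2.0, lambda_n=1.0, alpha=0.5
)
```

A larger α gives more weight to the previous frame and therefore scales down the contribution of P(k,l) to ξ(k,l).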
FIG. 1 shows a noisy speech signal 10 and a clean speech signal 12. The noisy speech signal 10 includes speech (or speech samples) and noise. The clean speech signal 12 is the speech without the noise. An example frame of the noisy speech signal 10 is within box 14. The frame designated by box 14 has little speech (i.e. amplitude of speech is near zero) and a lot of noise (i.e. amplitude of the noise is high compared to the speech for this frame and/or SNR is low).
FIGS. 2A and 2B show plots that illustrate how music noise is produced. FIG. 2A shows examples of amplitudes of true speech, amplitudes of noisy speech R(k,l), and estimated speech amplitudes Â(k,l). The values of FIG. 2B correspond to the values of FIG. 2A. FIG. 2B shows examples of values of the variables in equation 4.
As illustrated in FIG. 2B, R(k,l)² and λN(k,l) are both randomly “zigzag-shaped” and are at about the same average level (i.e. have similar amplitudes). At some frequency bins, R(k,l)² < λN(k,l) and the values of P(k,l) are zero according to equation 4. At other frequency bins, R(k,l)² > λN(k,l) and the values of P(k,l) are non-zero according to equation 4. Because R(k,l)² and λN(k,l) are randomly zigzag-shaped, the values of P(k,l) are non-zero at some frequency bins but zero at the adjacent frequency bins. Therefore, P(k,l) shows isolated peaks at some frequency bins and, according to equation 3, the a priori SNR ξ(k,l) also has isolated peaks at the same frequency bins. Amplitudes of the isolated peaks of the a priori SNR ξ(k,l) may be smaller than the amplitudes of the corresponding peaks of P(k,l), depending on the value of the constant α.
A low value of the a priori SNR ξ(k,l) can lead to a gain that is much smaller than 1 (e.g., close to 0 and greater than or equal to 0). A high value of the a priori SNR ξ(k,l) leads to a gain close to 1 and less than or equal to 1. As a result, the estimated speech amplitude Â(k,l), which is the gain multiplied by the amplitude of noisy speech R(k,l), has isolated peaks at the frequency bins where P(k,l) has isolated peaks. This is shown in FIG. 2A. The isolated peaks of the estimated speech amplitude Â(k,l) are music noise.
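The mechanism described above can be reproduced with a small simulation of a noise-only frame. Everything in this sketch is an assumption made for illustration: R(k,l)² is drawn from an exponential distribution whose mean equals a flat λN(k,l), the previous-frame term of equation 3 is taken as zero, and the Wiener gain G = ξ/(1+ξ) stands in for the unspecified gain function.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

NUM_BINS = 32
LAMBDA_N = 1.0  # hypothetical flat estimate of the noise variance
ALPHA = 0.5     # hypothetical smoothing constant

# Assumption: in a noise-only frame, R^2 fluctuates randomly around lambda_N.
r_squared = [random.expovariate(1.0 / LAMBDA_N) for _ in range(NUM_BINS)]

# Equation 4 per bin: non-zero only where R^2 exceeds lambda_N.
p = [(rs - LAMBDA_N) / LAMBDA_N if rs > LAMBDA_N else 0.0 for rs in r_squared]

# Equation 3 with the previous-frame term assumed to be zero.
xi = [(1.0 - ALPHA) * pk for pk in p]

# Assumed Wiener gain applied to the noisy amplitude R = sqrt(R^2).
a_hat = [(x / (1.0 + x)) * (rs ** 0.5) for x, rs in zip(xi, r_squared)]

# Bins where the estimated speech amplitude is non-zero despite no speech.
peak_bins = [k for k, a in enumerate(a_hat) if a > 0.0]
```

The bins listed in `peak_bins` carry residual energy while the remaining bins are fully suppressed, which corresponds to the isolated peaks identified above as music noise.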
R(k,l)² and λN(k,l) are at a similar average level for the above-stated frame designated by box 14. This is because the content of the frame designated by box 14 is mostly noise. For this reason, R(k,l)² is the instantaneous noise level. λN(k,l) is an estimated smoothed noise level or, as stated above, the estimated a priori variance of noise. The fact that R(k,l)² has a similar average level to λN(k,l) indicates that λN(k,l) is estimated correctly.