When a communications terminal is used to record or transmit a speech signal, it is inevitable that its microphone will pick up background noise from the environment in which the speaking person is located. The background noise reduces the ability of a listener to hear or understand the speech and in some cases, if the noise level is sufficiently high, prevents the listener from hearing anything other than the background noise. In addition, such background noise may have a negative effect on the performance of digital signal processing systems in the communications terminal or in an associated communications network, such as speech coding or speech recognition. Typically, noise suppression systems are incorporated in communications terminals and communications networks to limit the effect of background noise.
Noise suppression has been well known for a number of years. Many different approaches and methods have been proposed to achieve three main ends:
(i) suppressing the noise significantly while preserving good speech quality;
(ii) rapid convergence to the optimal solution independent of the nature of the processed noise; and
(iii) improving speech intelligibility for very low speech-to-noise ratios (SNRs).
One noise suppression method, based on the linear Minimum Mean Squared Error (MMSE) criterion, will be described with reference to FIG. 1. The method operates on a noisy speech signal x(t) containing a speech signal s(t) and a noise signal n(t) such that x(t)=s(t)+n(t). The noisy speech signal x(t) is in the time domain. It is converted into a sequence of frames having consecutive frame numbers k using a windowing function. Each frame is then transformed into the frequency domain using a Fast Fourier Transform (FFT) in block 10, producing a sequence of noisy speech frames in which the frequency-domain noisy speech signal X(f,k) contains a speech signal S(f,k) and a noise signal N(f,k) such that X(f,k)=S(f,k)+N(f,k). Each frame in the frequency domain comprises a number of frequency bins f. In the frequency domain, the MMSE approach involves minimising the following error function:
ε²(f,k)=E{(S(f,k)−Ŝ(f,k))·(S(f,k)−Ŝ(f,k))*}  (1)
where E{•} is the expectation operator, (*) denotes complex conjugation and Ŝ(f,k) represents a linear estimate of the input speech signal. The error ε²(f,k) defined by Equation 1 represents the mean squared difference between the true speech component contained within the noisy speech signal and the estimate of that speech component, Ŝ(f,k), i.e. the estimate of the noise-free speech component. Thus, minimisation of ε²(f,k) is equivalent to obtaining the best possible estimate of the speech component. Ŝ(f,k) is given by:
Ŝ(f,k)=G(f,k)·X(f,k)  (2)
where G(f,k) is a gain coefficient. The solution of the minimisation of ε²(f,k) for each frame takes the form of a gain coefficient G(f,k), which is multiplied by the associated input frequency bin of that frame to produce the estimated noise-free speech component Ŝ(f,k). This gain coefficient, known as the frequency domain Wiener filter, is given by the ratio below:
G(f,k)=E{S(f,k)·X*(f,k)}/E{X(f,k)·X*(f,k)}  (3)
The Wiener filter G(f,k) is generated for each frequency bin f of each frame.
The noise-suppressed frames are then transformed back into the time domain in block 14 and then combined together to provide a noise suppressed speech signal ŝ(t). Ideally, ŝ(t)=s(t).
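The processing chain described above (windowing into frames, FFT, per-bin gain, inverse FFT, recombination) can be illustrated with a minimal Python sketch. It is not the method of FIG. 1 itself: the function names, the frame length, the 50% overlap, the Hann window and the overlap-add recombination are all assumptions made for illustration. The gain is also a crude stand-in for Equation 3, in that it assumes speech and noise are uncorrelated and replaces the expectations with a single-frame power and a noise power measured from a separate noise-only recording.

```python
import numpy as np

def periodic_hann(frame_len):
    # Periodic Hann window; copies shifted by frame_len/2 sum to exactly 1,
    # so plain overlap-add reconstructs the signal when the gain is 1.
    return np.hanning(frame_len + 1)[:-1]

def estimate_noise_psd(noise, frame_len=256, hop=128):
    """Average |FFT|^2 over windowed frames of a noise-only recording (assumed available)."""
    window = periodic_hann(frame_len)
    starts = range(0, len(noise) - frame_len + 1, hop)
    frames = np.array([noise[i:i + frame_len] * window for i in starts])
    return np.mean(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=0)

def wiener_suppress(x, noise_psd, frame_len=256, hop=128):
    """Frame, FFT, apply a per-bin gain, inverse FFT and overlap-add."""
    window = periodic_hann(frame_len)
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        X = np.fft.rfft(x[start:start + frame_len] * window)  # X(f,k)
        p_xx = np.abs(X) ** 2                        # single-frame stand-in for PXX
        # ASSUMPTION: S and N uncorrelated, so PSX is approximated by PXX - PNN.
        p_sx = np.maximum(p_xx - noise_psd, 0.0)
        G = p_sx / np.maximum(p_xx, 1e-12)           # gain per frequency bin
        out[start:start + frame_len] += np.fft.irfft(G * X, n=frame_len)
    return out
```

Calling `wiener_suppress(noisy, estimate_noise_psd(noise_only))` returns a time-domain signal of the same length as `noisy`. The first and last half frames are covered by only one window, so they are attenuated by the window taper.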
When deriving the Wiener filter, the MMSE approach is equivalent to the orthogonality principle. This principle stipulates that, for each frequency, the input signal X(f,k) is orthogonal to the error S(f,k)−Ŝ(f,k). This means that:
E{(S(f,k)−Ŝ(f,k))·X*(f,k)}=0  (4)
Because the estimation process is linear, by estimating the signal component of a noisy signal that contains a signal component and a noise component, an estimate of the noise N̂(f,k) is also effectively obtained. Furthermore, the following orthogonality relationship will also be true:
E{(N(f,k)−N̂(f,k))·X*(f,k)}=0  (5)
where N̂(f,k) indicates the noise estimate. It also follows that for every frequency, the following equality applies:
S(f,k)−Ŝ(f,k)=N̂(f,k)−N(f,k)  (6)
that is, the error associated with the estimate of the noise component N̂(f,k) is the same as the error associated with the estimated noise-free speech component Ŝ(f,k).
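Equations 3 to 6 can be checked numerically for a single frequency bin. The sketch below is an illustration only: it assumes synthetic complex Gaussian speech and noise components with arbitrarily chosen powers, replaces the expectation operator with a sample mean over many independent draws, and takes the noise estimate as N̂ = X − Ŝ. The function and variable names are hypothetical.

```python
import numpy as np

def check_orthogonality(p_ss=4.0, p_nn=1.0, trials=200_000, seed=0):
    rng = np.random.default_rng(seed)

    def complex_gaussian(power):
        # Zero-mean complex Gaussian samples with E{|U|^2} = power.
        re, im = rng.standard_normal((2, trials))
        return np.sqrt(power / 2) * (re + 1j * im)

    S = complex_gaussian(p_ss)      # speech component of one bin (synthetic)
    N = complex_gaussian(p_nn)      # noise component of the same bin (synthetic)
    X = S + N

    # Equation 3, with sample means standing in for E{.}
    G = np.mean(S * np.conj(X)) / np.mean(X * np.conj(X))
    S_hat = G * X                   # Equation 2
    N_hat = X - S_hat               # ASSUMPTION: noise estimate is the residual

    orth_s = np.mean((S - S_hat) * np.conj(X))   # left-hand side of Equation 4
    orth_n = np.mean((N - N_hat) * np.conj(X))   # left-hand side of Equation 5
    same_error = np.allclose(S - S_hat, N_hat - N)  # Equation 6
    return G, orth_s, orth_n, same_error

G, orth_s, orth_n, same_error = check_orthogonality()
print(abs(G), abs(orth_s), abs(orth_n), same_error)
```

Because the gain is computed from the same samples that are used in the check, both orthogonality terms vanish to within floating-point rounding, and the two error terms of Equation 6 coincide sample by sample.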
In the remainder of this document, the following notation will be adopted: PUV(f,k) is the cross power spectral density between U(f,k) and V(f,k) (PUV(f,k)=E{U(f,k)·V*(f,k)}). PUU(f,k) is the power spectral density (psd) of U(f,k) (PUU(f,k)=E{U(f,k)·U*(f,k)}).
As a consequence of the above-mentioned orthogonality principle, it is possible to derive an expression for the cross psd PSX(f,k), required in order to compute the Wiener filter described by Equation 3:
PSX(f,k)=E{(X(f,k)−N̂(f,k))·X*(f,k)}  (7)
Moreover, the cross psd PNX(f,k) is given by:
PNX(f,k)=E{(X(f,k)−Ŝ(f,k))·X*(f,k)}  (8)
Bearing in mind the trivial equality PXX(f,k)=PSX(f,k)+PNX(f,k), Equations 3, 6, 7 and 8 introduce and illustrate the idea of an adaptive calculation, since the Wiener filter PSX(f,k)/PXX(f,k) of Equation 3 depends on the estimated signal Ŝ(f,k) through Equations 6, 7 and 8.
When a minimum is reached, the expression describing the error in Equation 1 takes the following form:
ε²min(f,k)=(PSS(f,k)·PXX(f,k)−|PSX(f,k)|²)/PXX(f,k)  (9)
It is evident that the minimum error ε²min(f,k) is equal to zero only if the desired signal S(f,k) is completely coherent with the input signal X(f,k) (that is, PNN(f,k) tends to zero). This is desirable. Otherwise, there is an error when applying the Wiener filter. The upper limit of this error is PSS(f,k). This is undesirable. In other words, an error-free result can only be obtained if there is actually no noise in the input signal X(f,k). For any finite noise level, a finite error is obtained. It follows that the worst case error occurs when there is no speech signal S(f,k) in X(f,k).
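The two limits of Equation 9 can be illustrated with a short Python sketch. It relies on an assumption that is not stated in the text above, namely that speech and noise are uncorrelated, so that PSX(f,k)=PSS(f,k) and PXX(f,k)=PSS(f,k)+PNN(f,k). The function names and the power values are hypothetical.

```python
import numpy as np

def min_error(p_ss, p_sx, p_xx):
    """Equation 9: minimum mean squared error of the Wiener estimate for one bin."""
    return (p_ss * p_xx - np.abs(p_sx) ** 2) / p_xx

def min_error_uncorrelated(p_ss, p_nn):
    # ASSUMPTION: S and N are uncorrelated, so PSX = PSS and PXX = PSS + PNN.
    return min_error(p_ss, p_ss, p_ss + p_nn)

p_ss = 4.0  # arbitrary speech power for one frequency bin
for p_nn in (0.0, 0.1, 1.0, 10.0, 1e6):
    # The error is zero for PNN = 0 and grows towards PSS as PNN increases.
    print(p_nn, min_error_uncorrelated(p_ss, p_nn))
```

Under this assumption Equation 9 reduces to PSS(f,k)·PNN(f,k)/(PSS(f,k)+PNN(f,k)), which is zero when PNN(f,k) is zero and approaches, but never exceeds, PSS(f,k) as the noise power grows.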