The present invention relates to signal processing techniques used to reduce the noise level present in an input signal.
An important field of application is that of audio signal processing (speech or music), including in a nonlimiting way:
- teleconferencing and videoconferencing in a noisy environment (in a dedicated room or even from multimedia computers, etc.);
- telephony: processing at terminals, fixed or portable and/or in the transport networks;
- hands-free terminals, in particular office, vehicle or portable terminals;
- sound pick-up in public places (station, airport, etc.);
- hands-free sound pick-up in vehicles;
- robust speech recognition in an acoustic environment;
- sound pick-up for cinema and the media (radio, television, for example for sports journalism or concerts, etc.).
The invention can also be applied to any field in which useful information needs to be extracted from a noisy observation. In particular, the following fields can be cited: submarine imaging, submarine remote sensing, biomedical signal processing (EEG, ECG, biomedical imaging, etc.).
A characteristic problem of sound pick-up concerns the acoustic environment in which the sound pick-up microphone is placed and more specifically the fact that, because it is impossible to fully control this environment, an interfering signal (referred to as noise) is also present within the observation signal.
To improve the quality of the signal, noise reduction systems are developed with the aim of extracting the useful information by performing processing on the noisy observation signal. When the audio signal is a speech signal transmitted from a long distance away, these systems can be used to increase its intelligibility and to reduce the strain on the correspondent. In addition to these applications of spoken communication, improvement in speech signal quality also turns out to be useful for voice recognition, the performance of which is greatly impaired when the user is in a noisy environment.
The choice of a signal processing technique for carrying out the noise reduction operation depends first on the number of observations available at the input of the process. In the present description, we will consider the case in which only one observation signal is available. The noise reduction methods suited to this single-observation problem rely mainly on signal processing techniques such as adaptive filtering with time advance/delay, parametric Kalman filtering, or filtering by short-time spectral modification.
The latter family (filtering by short-time spectral modification) combines practically all the solutions used in industrial equipment due to the simplicity of concepts involved and the wide availability of basic tools (for example the discrete Fourier transform) required to program them. However, the rapid advance of these noise reduction techniques relies heavily on the possibility of easily performing these processing operations in real time on a signal processing processor, without introducing major distortions on the signal available at the output of the processing operation. In the methods of this family, the processing most often only consists in estimating a transfer function of a noise-reducing filter, then in performing the filtering based on a multiplication in the spectral domain, which enables the noise reduction by short-time spectral attenuation to be carried out, with processing by blocks.
The noisy observation signal, arising from the mixing of the desired signal s(n) and the interfering noise b(n), is denoted x(n), where n denotes the time index in discrete time. The choice of a representation in discrete time is related to an implementation directed toward the digital processing of the signal, but it will be noted that the methods described above apply also to continuous time signals. The signal is analyzed in successive segments or frames of index k of constant length. Notations currently used for representations in the discrete time and frequency domains are:
- X(k,f): Fourier transform (f is the frequency index) of the k-th frame (k is the frame index) of the analyzed signal x(n);
- S(k,f): Fourier transform of the k-th frame of the desired signal s(n);
- ν̂: estimation of a quantity ν (in the time or frequency domain); for example Ŝ(k,f) is the estimation of the Fourier transform of the desired signal;
- γuu(f): power spectral density (PSD) of a signal u(n).
In most noise reduction techniques, the noisy signal x(n) undergoes filtering in the frequency domain to produce a useful estimated signal ŝ(n) which is as close as possible to the original signal s(n) free from any interference. As indicated previously, this filtering operation consists in reducing each frequency component f of the noisy signal given the estimated signal-to-noise ratio (SNR) in this component. This SNR, dependent on the frequency f, is denoted here as η(k,f) for the frame k.
For each of the frames, the signal is first multiplied by a weighting window for improving the later estimation of the spectral quantities required to calculate the noise-reducing filter. Each frame thus windowed is then analyzed in the spectral domain (generally using the discrete Fourier transform in its fast version). This operation is called short-time Fourier transform (STFT). This frequency-domain representation X(k,f) of the observed signal can be used to simultaneously estimate the transfer function H(k,f) of the noise-reducing filter, and to apply this filter in the spectral domain by simple multiplication of this transfer function by the short-time spectrum of the noisy signal, that is:

$$\hat{S}(k,f) = H(k,f) \cdot X(k,f) \qquad (1)$$
The signal thus obtained is then returned to the time domain by simple inverse spectral transform. The denoised signal is generally synthesized by a technique of overlapping and adding of blocks (OLA, “overlap-add”) or a technique of saving of blocks (OLS, “overlap-save”). This operation for reconstructing the signal in the time domain is called inverse short-time Fourier transform (ISTFT).
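The analysis, filtering and synthesis chain described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a periodic Hann window with 50% overlap (chosen so that shifted copies of the window sum to one), a naive O(N²) DFT in place of a fast transform to stay dependency-free, and a hypothetical `gain` callback standing in for whichever rule supplies H(k,f).

```python
import cmath
import math


def dft(frame, inverse=False):
    """Naive O(N^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(frame)
    sign = 1 if inverse else -1
    out = []
    for f in range(n):
        acc = sum(frame[t] * cmath.exp(sign * 2j * math.pi * f * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def denoise(x, gain, frame_len=8):
    """Apply S_hat(k,f) = H(k,f) * X(k,f) frame by frame, then overlap-add.

    `gain(k, X)` is a hypothetical callback returning the list H(k, f) for frame k.
    """
    hop = frame_len // 2
    # Periodic Hann window: copies shifted by frame_len / 2 sum to one.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    y = [0.0] * len(x)
    for k, start in enumerate(range(0, len(x) - frame_len + 1, hop)):
        frame = [x[start + n] * window[n] for n in range(frame_len)]
        X = dft(frame)                          # short-time spectrum X(k, f)
        H = gain(k, X)                          # transfer function H(k, f)
        S_hat = [h * v for h, v in zip(H, X)]   # equation (1)
        s_hat = dft(S_hat, inverse=True)        # back to the time domain
        for n in range(frame_len):
            y[start + n] += s_hat[n].real       # overlap-add synthesis
    return y


# With an all-pass filter (H = 1) the fully overlapped samples are reconstructed.
x = [math.sin(0.3 * n) for n in range(32)]
y = denoise(x, lambda k, X: [1.0] * len(X))
max_err = max(abs(a - b) for a, b in zip(x[4:28], y[4:28]))
```

Only the samples covered by two frames (indices 4 to 27 here) are compared, since the first and last half-frames receive a single windowed contribution. Taking the real part at synthesis presumes that the gain is conjugate-symmetric in f, which holds for the real-valued gains considered in this description.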
A detailed description of short-time spectral attenuation methods will be found in the following references: J. S. Lim, A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech”, Proceedings of the IEEE, vol. 67, pages 1586-1604, 1979; and R. E. Crochiere, L. R. Rabiner, “Multirate digital signal processing”, Prentice Hall, 1983.
The main tasks performed by such a noise reduction system are:
- voice activity detection (VAD);
- estimation of the power spectral density (PSD) of noise during instants of voice inactivity;
- application of a short-time spectral attenuation evaluated based on a rule for suppressing spectral components of noise;
- synthesis of the processed signal based on an OLS or OLA type technique.
The choice of the rule for suppressing noise components is important since it determines the quality of the transmitted signal. These suppression rules modify in general only the amplitude |X(k,f)| of the spectral components of the noisy signal, and not their phase. In general, the following assumptions are made:
- the noise and useful signal are statistically decorrelated;
- the useful signal is intermittent (presence of periods of silence in which the noise can be estimated);
- the human ear is not sensitive to the phase of the signal (see D. L. Wang, J. S. Lim, “The unimportance of phase in speech enhancement”, IEEE Trans. on ASSP, vol. 30, No. 4, pp. 679-681, 1982).
The short-time spectral attenuation H(k,f) applied to the observation signal X(k,f) on the frame of index k at the frequency-domain component f, is generally determined based on the estimation of the local signal-to-noise ratio η(k,f). A characteristic common to all suppression rules is their asymptotic behavior, given by:

$$H(k,f) \approx 1 \ \text{for} \ \eta(k,f) \gg 1, \qquad H(k,f) \approx 0 \ \text{for} \ \eta(k,f) \ll 1 \qquad (2)$$
The suppression rules currently employed are:
- power spectral subtraction (see the above-mentioned article by J. S. Lim and A. V. Oppenheim), for which the transfer function H(k,f) of the noise-reducing filter is expressed as:

$$H(k,f) = \sqrt{\frac{\gamma_{ss}(k,f)}{\gamma_{bb}(k,f) + \gamma_{ss}(k,f)}} \qquad (3)$$

- amplitude spectral subtraction (see S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction”, IEEE Trans. on Audio, Speech and Signal Processing, vol. 27, No. 2, pp. 113-120, April 1979), for which the transfer function H(k,f) is expressed as:

$$H(k,f) = 1 - \sqrt{\frac{\gamma_{bb}(k,f)}{\gamma_{bb}(k,f) + \gamma_{ss}(k,f)}} \qquad (4)$$

- direct application of the Wiener filter (see the above-mentioned article by J. S. Lim and A. V. Oppenheim), for which the transfer function H(k,f) is expressed as:

$$H(k,f) = \frac{\gamma_{ss}(k,f)}{\gamma_{bb}(k,f) + \gamma_{ss}(k,f)} \qquad (5)$$
In these expressions, γss(k,f) and γbb(k,f) represent the power spectral densities, respectively, of the useful signal and of the noise present within the frequency-domain component f of the observation signal X(k,f) on the frame of index k.
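To compare the three rules numerically, they can be rewritten as functions of the local signal-to-noise ratio η = γss/γbb. The sketch below assumes the commonly cited forms of these rules, namely √(η/(1+η)) for power subtraction, 1 − √(1/(1+η)) for amplitude subtraction and η/(1+η) for the Wiener filter; the function names and the SNR grid are illustrative choices only.

```python
import math


def power_subtraction(eta):
    """Power spectral subtraction gain, assumed form sqrt(eta / (1 + eta))."""
    return math.sqrt(eta / (1.0 + eta))


def amplitude_subtraction(eta):
    """Amplitude spectral subtraction gain, assumed form 1 - sqrt(1 / (1 + eta))."""
    return 1.0 - math.sqrt(1.0 / (1.0 + eta))


def wiener(eta):
    """Wiener filter gain eta / (1 + eta)."""
    return eta / (1.0 + eta)


# Gains of the three rules on a small grid of local SNR values (in dB).
table = {}
for snr_db in (-20, -10, 0, 10, 20):
    eta = 10.0 ** (snr_db / 10.0)          # dB to linear power ratio
    table[snr_db] = (
        power_subtraction(eta),
        wiener(eta),
        amplitude_subtraction(eta),
    )

for snr_db, gains in table.items():
    print(snr_db, ["%.3f" % g for g in gains])
```

Under these assumed forms, all three gains tend to 1 at high SNR and to 0 at low SNR, and for any given η the power subtraction gain is the largest and the amplitude subtraction gain the smallest.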
From expressions (3)-(5), according to the local signal-to-noise ratio measured on a given frequency-domain component f, it is possible to study the behavior of the spectral attenuation applied to the noisy signal. It is noted that all the rules give rise to an identical attenuation when the local signal-to-noise ratio is high. The power subtraction rule is optimal in the sense of maximum likelihood for Gaussian models (see O. Cappé, “Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor”, IEEE Trans. on Speech and Audio Processing, vol. 2, No. 2, pp 345-349, April 1994). But it is the one for which the noise power remains the greatest at the output of the processing. For all the suppression rules, it is noted that a small variation in the local signal-to-noise ratio around the cut-off value is sufficient to bring about a change from the case of total attenuation (H(k,f)≈0) to the case of a negligible spectral modification (H(k,f)≈1).
The latter property constitutes one of the causes of the phenomenon known as “musical noise”. Indeed, ambient noise, which comprises both deterministic and random components, can be characterized only during periods of voice inactivity. Because of the presence of these random components, there are very marked variations between the real contribution of a frequency-domain component f of noise during periods of voice activity and its average estimation carried out over several frames during instants of voice inactivity. Because of this difference, the estimation of the local signal-to-noise ratio can fluctuate around the cut-off level and can therefore produce, at the output of the processing, spectral components which appear then disappear, and whose average lifetime does not statistically exceed the order of magnitude of the analysis window considered. Generalization of this behavior over the whole passband introduces a residual noise that is audible and irritating, known as “musical noise”.
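The mechanism can be illustrated with a small simulation under simplifying assumptions that are not taken from the text: the input contains noise only, each noise bin is modeled as complex Gaussian (so that |X(k,f)|² divided by γbb follows a unit-mean exponential law), the long-term noise PSD is assumed perfectly known, and the local SNR is estimated crudely from the instantaneous periodogram.

```python
import math
import random

random.seed(0)
n_frames, n_bins = 200, 100
gamma_bb = 1.0  # long-term noise PSD, assumed perfectly estimated during silence

active = 0
for k in range(n_frames):
    for f in range(n_bins):
        # |X(k,f)|^2 for a noise-only frame (exponential law, modeling assumption).
        periodogram = gamma_bb * random.expovariate(1.0)
        # Crude instantaneous estimate of the local SNR, floored at zero.
        eta = max(periodogram / gamma_bb - 1.0, 0.0)
        # Power-subtraction gain in its commonly cited form (assumption).
        gain = math.sqrt(eta / (1.0 + eta))
        if gain > 0.0:
            active += 1

fraction_active = active / (n_frames * n_bins)
print("fraction of noise-only bins left un-nulled: %.3f" % fraction_active)
```

Under this model a noise-only bin escapes total attenuation whenever its periodogram exceeds the average, which happens with probability exp(−1), about 37%. Because the draws are independent from frame to frame, these surviving components appear and disappear at random, which is the behavior perceived as musical noise.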
There are many studies devoted to reducing the effect of this noise. The recommended solutions are developed along various lines:
- averaging of short-time estimations (see above-mentioned article by S. F. Boll);
- overestimation of the noise power spectrum (see M. Berouti et al, “Enhancement of speech corrupted by acoustic noise”, Int. Conf. on Speech, Signal Processing, pp. 208-211, 1979; and P. Lockwood, J. Boudy, “Experiments with a non-linear spectral subtractor, hidden Markov models and the projection for robust speech recognition in cars”, Proc. of EUSIPCO'91, pp. 79-82, 1991);
- tracking the minima of the noise spectral density (see R. Martin, “Spectral subtraction based on minimum statistics”, in Signal Processing VII: Theories and Applications, EUSIPCO'94, pp. 1182-1185, September 1994).
There have also been many studies on establishing new suppression rules based on statistical models of speech signals and of additive noise. These studies have led to the introduction of new algorithms, referred to as “soft decision” algorithms since they have an additional degree of freedom compared to conventional methods (see R. J. McAulay, M. L. Malpass, “Speech enhancement using a soft-decision noise suppression filter”, IEEE Trans. on Audio, Speech and Signal Processing, vol. 28, No. 2, pp. 138-145, April 1980; Y. Ephraim, D. Malah, “Speech enhancement using optimal non-linear spectral amplitude estimation”, Int. Conf. on Speech, Signal Processing, pp. 1118-1121, 1983; and Y. Ephraim, D. Malah, “Speech enhancement using a minimum mean square error short-time spectral amplitude estimator”, IEEE Trans. on ASSP, vol. 32, No. 6, pp. 1109-1121, 1984).
The abovementioned short-time spectral modification rules have the following characteristics:
- The calculation of short-time spectral attenuation relies on the estimation of the signal-to-noise ratio on each of the spectral components, equations (3)-(5) each including the quantity:

$$\eta(k,f) = \frac{\gamma_{ss}(k,f)}{\gamma_{bb}(k,f)} \qquad (6)$$

  Thus, the performance of the noise reduction technique (distortions, effective reduction in noise level) is governed by the pertinence of this estimator of the signal-to-noise ratio.
- These techniques are based on blockwise processing (with the possibility of overlapping between the successive blocks) which consists in filtering all the samples of a given frame, present at the input of the noise reduction device, by a single spectral attenuation. This property stems from the fact that the filter is applied by a multiplication in the spectral domain. This is particularly restricting when the signal present on the current frame does not comply with the second order stationarity assumptions, for example in the case of a start or end of a word, or even in the case of a mixed voiced/unvoiced frame.
- The multiplication carried out in the spectral domain corresponds in reality to a cyclic convolution operation. In practice, to avoid distortions, the operation attempted is a linear convolution, which requires both adding a certain number of zero samples to each input frame (technique referred to as “zero padding”) and performing additional processing aimed at limiting the time-domain support of the impulse response of the noise-reducing filter. Satisfying the time-domain convolution constraint thus necessarily increases the order of the spectral transform and, consequently, the arithmetic complexity of the noise-reducing processing. The technique used most to limit the time-domain support of the impulse response of the noise-reducing filter consists in introducing a constraint in the time domain, which requires (i) a first “inverse” spectral transformation for obtaining the impulse response h(k,n) based on the knowledge of the transfer function of the filter H(k,f), (ii) a limitation of the number of points of this impulse response, leading to a truncated time-domain filter h′(k,n), then (iii) a second “direct” spectral transformation for obtaining the modified transfer function H′(k,f) based on the truncated impulse response h′(k,n).
- In practice, each analysis frame is multiplied by an analysis window w(n) before performing the spectral transform operation. When the noise-reducing filter is of all-pass type (that is H(k,f)≈1, ∀f), the analysis window must satisfy the following condition

$$\sum_{k} w(n - k \cdot D) = 1 \qquad (7)$$

  if it is desired that the condition of perfect reconstruction is satisfied. In this equation, the parameter D represents the shift (in number of samples) between two successive analysis frames. On the other hand, the choice of the weighting window w(n) (typically of Hanning, Hamming, Blackman, etc. type) determines the width of the main lobe of W(f) and the amplitude of the secondary lobes (relative to that of the main lobe). If the main lobe is broad, the fast transitions of the transform of the original signal are very badly approximated. If the relative amplitude of the secondary lobes is large, the approximation obtained has irritating oscillations, especially around the discontinuities. It is therefore difficult to satisfy both the pertinent spectral analysis requirement (choice of the width of the main lobe, and of the amplitude of the side lobes) and the requirement of small delay introduced by the noise reduction filtering process (time shift between the signal at the input and at the output of the processing). Satisfying the second requirement leads to using successive frames without any overlap and therefore a rectangular-type analysis window, which does not result in performing a pertinent spectral analysis. The only way to satisfy both these requirements at the same time is to perform a spectral analysis based on a first spectral transformation carried out on a frame weighted by an appropriate analysis window (to perform a good spectral estimation), and in parallel to perform a second spectral transformation on unwindowed data (in order to carry out the convolution operation by spectral multiplication). In practice, such a technique proves to be far too costly in terms of arithmetic complexity.
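The time-domain constraint of steps (i) to (iii), together with zero padding, can be sketched as follows. The frame length, number of retained taps, transform size and the shape of H are arbitrary illustrative values, the truncation simply keeps the first points of the impulse response (practical systems may shift or taper it instead), and a naive DFT replaces the fast transform to keep the example dependency-free.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(N^2) DFT; a fast transform would be used in practice."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * math.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out


def constrain(H, taps):
    """Steps (i)-(iii): H(k,f) -> h(k,n) -> truncated h'(k,n) -> H'(k,f)."""
    h = [c.real for c in dft(H, inverse=True)]      # (i) impulse response
    h_trunc = h[:taps] + [0.0] * (len(H) - taps)    # (ii) keep the first `taps` points
    return h_trunc, dft(h_trunc)                    # (iii) modified transfer function


frame_len, taps = 8, 4
n_fft = 16  # >= frame_len + taps - 1, so the cyclic convolution does not wrap around
# Arbitrary real, symmetric gain standing in for an estimated H(k, f).
H = [1.0 / (1.0 + 0.5 * min(f, n_fft - f)) for f in range(n_fft)]
h_trunc, H_mod = constrain(H, taps)

frame = [math.cos(0.7 * n) for n in range(frame_len)]
padded = frame + [0.0] * (n_fft - frame_len)        # zero padding
product = [a * b for a, b in zip(H_mod, dft(padded))]
via_spectrum = [c.real for c in dft(product, inverse=True)]

# Reference: direct linear convolution of the frame with the truncated response.
direct = [sum(h_trunc[m] * frame[n - m] for m in range(taps) if 0 <= n - m < frame_len)
          for n in range(frame_len + taps - 1)]
```

With these sizes the spectral product reproduces the linear convolution sample for sample, at the price of a transform twice as long as the frame, which is the increase in arithmetic complexity noted above.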
EP-A-0 710 947 discloses a noise reduction device coupled to an echo canceler. The noise reduction is carried out by blockwise filtering in the time domain, by means of an impulse response obtained by inverse Fourier transformation of the transfer function H(k,f) estimated according to the signal-to-noise ratio during the spectral analysis.
A primary object of the present invention is to improve the performance of the noise reduction methods.