The present invention relates to the processing of noisy sound signals. It relates in particular to the reduction of the noise present in such signals.
Techniques for reducing noise, that is to say a disturbing signal, within a sound signal are known. They aim to take account of the acoustic environment in which the sound signal appears so as to improve the quality and the intelligibility of this signal. These techniques consist in extracting the useful information from the sound signal considered by processing this noisy signal. Such techniques apply for example to spoken communications, in applications such as telephony, teleconferencing and videoconferencing, where the sound signal is transmitted between several talkers. They also apply to sound pick-up in noisy surroundings, and to voice recognition, the performance of which is greatly degraded when the voice signal is uttered in a noisy environment.
These techniques usually consist in estimating a transfer function of a noise reduction filter, then in carrying out a filtering processing on the basis of a multiplication in the spectral domain. They come within approaches termed “noise reduction by short-term spectral attenuation”.
According to these techniques, the sound signal considered x(n) comprises a useful signal component s(n) and a noise component b(n), n representing a temporal index in discrete time. It will however be noted that a representation of the signal in continuous time could also be adopted. The signal x(n) is organized as successive frames x(n,k) of constant length and of index k. Each of these frames is firstly multiplied by a weighting window making it possible to improve the later estimation of the spectral quantities necessary for the calculation of the noise reduction filter. Each frame thus windowed is then analyzed in the spectral domain, for example with the aid of a discrete or fast Fourier transformation. This operation is called short-term Fourier transformation (STFT).
The frequency representation X(k,f) thus obtained of the observed signal, where f is a frequency index, makes it possible both to estimate the transfer function H(k,f) of the noise reduction filter and to apply this filter in the spectral domain by simple multiplication of this transfer function with the short-term spectrum of the noisy signal. The result of the filtering may thus be written:
Ŝ(k,f) = H(k,f) X(k,f).
A return to the time domain of the signal obtained is then performed by an inverse spectral transform. The corresponding temporal signal is finally synthesized by a block overlap and add technique (OLA standing for “overlap add”) or else by a block save technique (OLS standing for “overlap save”). This operation of reconstructing the signal in the time domain is called inverse short-term Fourier transformation (ISTFT).
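The analysis, filtering and synthesis chain described above can be sketched as follows. This is a minimal illustration and not the implementation of any cited reference: the frame length, the 50% overlap, the square-root Hann window and the names `stft`, `istft` and `denoise` are all assumptions chosen so that overlap-add reconstructs the signal exactly when H(k,f)=1.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Cut x into overlapping frames x(n,k), window each frame, take its FFT."""
    # Periodic square-root Hann window (assumed choice): applied at analysis
    # and synthesis, its square sums to 1 across frames at 50% overlap.
    window = np.sqrt(np.hanning(frame_len + 1)[:-1])
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[k * hop:k * hop + frame_len] for k in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1), window

def istft(S, window, hop=128):
    """Inverse FFT of each frame followed by overlap-add (OLA) synthesis."""
    frame_len = len(window)
    n_frames = S.shape[0]
    out = np.zeros((n_frames - 1) * hop + frame_len)
    frames = np.fft.irfft(S, n=frame_len, axis=1) * window
    for k in range(n_frames):
        out[k * hop:k * hop + frame_len] += frames[k]
    return out

def denoise(x, gain_fn, frame_len=256, hop=128):
    """Apply S_hat(k,f) = H(k,f) X(k,f), with H supplied by gain_fn(X)."""
    X, window = stft(x, frame_len, hop)
    H = gain_fn(X)  # real-valued gains, same shape as X
    return istft(H * X, window, hop)

# Sanity check with the identity filter H(k,f) = 1 on a random test signal.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
y = denoise(x, lambda X: np.ones(X.shape))
```

With the identity filter, the output matches the input everywhere except in the first and last half-frame, which are covered by a single window only.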
A detailed description of the methods of short-term spectral attenuation will be found in the references: J. S. Lim, A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech”, Proceedings of the IEEE, Vol. 67, pp. 1586-1604, 1979; and R. E. Crochiere, L. R. Rabiner, “Multirate digital signal processing”, Prentice Hall, 1983.
The short-term spectral attenuation H(k,f) applied to the observation signal X(k,f) over the temporal segment of index k and with the frequency component f, is generally determined on the basis of the estimate of the local signal-to-noise ratio SNR(k,f). A characteristic common to all suppression rules resides in their asymptotic behavior, given by:
H(k,f) ≈ 1 for SNR(k,f) >> 1,
H(k,f) ≈ 0 for SNR(k,f) << 1.
In most techniques, the following assumptions are made: the noise and the useful signal are statistically uncorrelated, the useful signal is intermittent (presence of periods of silence) and the human ear is not sensitive to the signal phase (which is not in general modified by the processing).
Among the suppression rules commonly employed may be cited by way of example: power spectral subtraction, amplitude spectral subtraction and direct implementation of the Wiener filter. For these rules, the short-term estimate of the frequency component f of the useful speech signal may be written respectively:
$$\hat{S}_{SSP}(k,f) = \sqrt{\frac{\gamma_{ss}(k,f)}{\gamma_{ss}(k,f)+\gamma_{bb}(k,f)}}\; X(k,f) \qquad (1)$$
for the power spectral subtraction (see the aforesaid article by J. S. Lim and A. V. Oppenheim);
$$\hat{S}_{SSA}(k,f) = \left[1 - \sqrt{\frac{\gamma_{bb}(k,f)}{\gamma_{ss}(k,f)+\gamma_{bb}(k,f)}}\,\right] X(k,f) \qquad (2)$$
for amplitude spectral subtraction (see S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction”, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 27, No. 2, pp. 113-120, April 1979); and
$$\hat{S}_{Wiener}(k,f) = \frac{\gamma_{ss}(k,f)}{\gamma_{ss}(k,f)+\gamma_{bb}(k,f)}\; X(k,f) \qquad (3)$$
for Wiener filtering (cf. aforesaid article by J. S. Lim and A. V. Oppenheim).
In these expressions, γss(k,f) and γbb(k,f) respectively represent the power spectral densities of the useful signal and of the noise that are present within the frequency component f of the observation signal X(k,f) over the time window of index k.
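For illustration, the three suppression rules can be evaluated numerically from the two power spectral densities. The sketch below assumes the usual forms of the rules, with a square root over the ratio in rule (1) and over the noise ratio in rule (2); the function name `suppression_gains` and the sample values are hypothetical.

```python
import numpy as np

def suppression_gains(gamma_ss, gamma_bb):
    """Return the gains H(k,f) of rules (1), (2) and (3).

    gamma_ss, gamma_bb: power spectral densities of the useful signal and of
    the noise (arrays of identical shape, gamma_ss + gamma_bb > 0).
    """
    gamma_ss = np.asarray(gamma_ss, dtype=float)
    gamma_bb = np.asarray(gamma_bb, dtype=float)
    total = gamma_ss + gamma_bb
    power_sub = np.sqrt(gamma_ss / total)            # rule (1), assumed sqrt form
    amplitude_sub = 1.0 - np.sqrt(gamma_bb / total)  # rule (2), assumed sqrt form
    wiener = gamma_ss / total                        # rule (3)
    return power_sub, amplitude_sub, wiener

# Hypothetical example: unit noise power, useful signal at -20, 0 and +20 dB.
gamma_bb = np.ones(3)
gamma_ss = 10.0 ** (np.array([-20.0, 0.0, 20.0]) / 10.0)
for name, h in zip(("power subtraction", "amplitude subtraction", "Wiener"),
                   suppression_gains(gamma_ss, gamma_bb)):
    print(name, np.round(h, 3))
```

For any input, the gain of rule (1) is at least that of rule (3), which is itself at least that of rule (2), so power subtraction attenuates the least of the three.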
On the basis of the above expressions, it is possible to study the behavior of the spectral attenuation applied to the noisy signal as a function of the local signal-to-noise ratio measured on a given frequency component f. The corresponding curves are plotted in FIG. 1 for the three abovementioned short-term suppression rules. It may be noted that all three rules give a substantially identical attenuation when the local signal-to-noise ratio is high (right-hand part of FIG. 1). The power subtraction rule, which is optimal in the maximum-likelihood sense for Gaussian models (see O. Cappé, “Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor”, IEEE Trans. on Speech and Audio Processing, Vol. 2, No. 2, pp. 345-349, April 1994), remains the one for which the power of the noise is the most significant at the output of the processing. For all three suppression rules, a small variation in the local signal-to-noise ratio about a cutoff value suffices to switch from total attenuation (H(k,f)≈0) to negligible spectral modification (H(k,f)≈1).
This latter property is one of the causes of the phenomenon dubbed “musical noise”. Specifically, the ambient noise, which comprises both deterministic and random components, can be characterized only during periods of vocal non-activity. Because of the random components, there are very strong variations between the actual contribution of a frequency component f of the noise during periods of vocal activity and its average estimate made over several frames during periods of vocal non-activity. As a result, the estimate of the local signal-to-noise ratio may fluctuate about the cutoff level and give rise, at the output of the processing, to spectral components which appear and then disappear, and whose average lifetime does not statistically exceed the order of magnitude of the analysis window considered. The generalization of this behavior over the whole of the passband introduces an audible and annoying residual noise.
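The mechanism can be reproduced on synthetic data. The sketch below is an illustration under simplifying assumptions that are not taken from the references: the noise is white and Gaussian with unit power in every bin, the local signal-to-noise ratio is estimated from the instantaneous power of each frame, and a Wiener-type gain is used. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins = 200, 129

def noise_spectra(n_frames, n_bins):
    """Complex Gaussian short-term spectra with unit power per bin (assumption)."""
    re = rng.standard_normal((n_frames, n_bins))
    im = rng.standard_normal((n_frames, n_bins))
    return (re + 1j * im) / np.sqrt(2.0)

# Long-term noise PSD, averaged over frames recorded without speech.
gamma_bb = np.mean(np.abs(noise_spectra(n_frames, n_bins)) ** 2, axis=0)

# New noise-only frames, standing for the noise actually present later on.
X = noise_spectra(n_frames, n_bins)

# Instantaneous local SNR estimate per frame and bin, floored at zero,
# followed by a Wiener-type gain.
snr = np.maximum(np.abs(X) ** 2 / gamma_bb - 1.0, 0.0)
H = snr / (1.0 + snr)

# Components passed almost unchanged although no speech is present.
leaked = H > 0.5
fraction_leaked = leaked.mean()

# A leaked component is isolated in time if the same frequency is attenuated
# in both the previous and the next frame.
inner = leaked[1:-1]
isolated = inner & ~leaked[:-2] & ~leaked[2:]
fraction_isolated = isolated.sum() / inner.sum()

print(f"leaked: {fraction_leaked:.1%}, isolated in time: {fraction_isolated:.1%}")
```

In this setting a minority of time-frequency components cross the cutoff, and most of those that do last a single frame, which corresponds to the short-lived spectral peaks described above.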
Several studies have endeavored to reduce the influence of this residual noise. The solutions advocated follow several avenues: averaging of the short-term estimates (cf. aforesaid article by S. F. Boll), overestimation of the noise power spectrum (see M. Berouti et al., “Enhancement of speech corrupted by acoustic noise”, Int. Conf. on Speech, Signal Processing, pp. 208-211, 1979; and P. Lockwood, J. Boudy, “Experiments with a non-linear spectral subtractor, hidden Markov models and the projection for robust speech recognition in cars”, Proc. of EUSIPCO'91, pp. 79-82, 1991), or else tracking of the minima of the noise spectral density (see R. Martin, “Spectral subtraction based on minimum statistics”, in Signal Processing VII: Theories and Applications, EUSIPCO'94, pp. 1182-1185, September 1994).
A relatively effective solution for suppressing musical noise consists of an estimator of the power spectral density of the useful signal termed “decision-directed” (see Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean square error short-time spectral amplitude estimator”, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 32, No. 6, pp. 1109-1121, 1984, and the aforesaid article by O. Cappé). This estimator effects a compromise between the instantaneous and the long-term power spectral density of the useful signal, thereby making it possible to eliminate the musical noise effectively. It is moreover known to improve this solution by compensating for the delay inherent in this estimator (see FR2820227 and C. Plapous, C. Marro, L. Mauuary, P. Scalart, “A Two-Step Noise Reduction Technique”, ICASSP, May 2004).
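A minimal sketch of such an estimator is given below. It follows the commonly used recursive form, in which the signal-to-noise ratio of the current frame is a weighted sum of the estimate derived from the previous output frame and of the instantaneous estimate. Several points are assumptions made for illustration: the weighting factor `beta` and its default value, the use of a Wiener gain in place of the amplitude estimator of the cited article, and the function name itself.

```python
import numpy as np

def decision_directed_gain(X, gamma_bb, beta=0.98):
    """Wiener gains driven by a recursively smoothed SNR estimate.

    X        : complex short-term spectra, shape (n_frames, n_bins)
    gamma_bb : noise power spectral density estimate, shape (n_bins,)
    beta     : weight of the previous frame (hypothetical default);
               beta = 0 falls back to the purely instantaneous estimate
    """
    n_frames, n_bins = X.shape
    H = np.empty((n_frames, n_bins))
    prev_clean_power = np.zeros(n_bins)  # |S_hat(k-1,f)|^2, zero before frame 0
    for k in range(n_frames):
        # Instantaneous SNR estimate from the current noisy frame.
        instantaneous = np.maximum(np.abs(X[k]) ** 2 / gamma_bb - 1.0, 0.0)
        # Compromise between the previous output and the current observation.
        snr = beta * prev_clean_power / gamma_bb + (1.0 - beta) * instantaneous
        H[k] = snr / (1.0 + snr)
        prev_clean_power = np.abs(H[k] * X[k]) ** 2
    return H
```

On noise-only input, the gains obtained with `beta` close to 1 fluctuate far less from frame to frame than those obtained with `beta=0`, which is the property that removes the isolated spectral peaks. The price is the one-frame delay mentioned in the text, since the estimate relies on the previous output frame.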
Several studies have also addressed the establishment of new suppression rules based on statistical models of the speech and additive noise signals. These studies have made it possible to introduce new algorithms dubbed “soft-decision” algorithms, since they possess an additional degree of freedom with respect to the conventional methods (see R. J. McAulay, M. L. Malpass, “Speech enhancement using a soft-decision noise suppression filter”, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 28, No. 2, pp. 137-145, April 1980; Y. Ephraim, D. Malah, “Speech enhancement using optimal non-linear spectral amplitude estimation”, Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1118-1121, 1983; and the article by Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean square error short-time spectral amplitude estimator”, cited above).
As was mentioned above, the calculation of the short-term spectral attenuation relies on the estimation of the signal-to-noise ratio on each of the spectral components. By way of example, the equations given above each involve the following quantity:
$$\mathrm{SNR}(k,f) = \frac{\gamma_{ss}(k,f)}{\gamma_{bb}(k,f)}.$$
Thus, the performance of the noise reduction technique, especially in terms of distortion and of effective reduction of the noise level, is governed by the relevance of this estimator of the signal-to-noise ratio.
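The dependence can be made explicit by dividing the numerator and the denominator of each suppression rule by γbb(k,f). Assuming the usual square-root forms of rules (1) and (2), and writing H_SSP, H_SSA and H_Wiener for the corresponding gains (a notation introduced here for illustration only), this gives:

$$H_{SSP}(k,f) = \sqrt{\frac{\mathrm{SNR}(k,f)}{1+\mathrm{SNR}(k,f)}}, \qquad H_{SSA}(k,f) = 1 - \sqrt{\frac{1}{1+\mathrm{SNR}(k,f)}}, \qquad H_{Wiener}(k,f) = \frac{\mathrm{SNR}(k,f)}{1+\mathrm{SNR}(k,f)}.$$

Each gain is thus a function of the estimated signal-to-noise ratio alone, and each tends to 1 for SNR(k,f) >> 1 and to 0 for SNR(k,f) << 1.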
The dependence on this estimator constitutes the major limitation of known speech denoising systems. Specifically, current denoising systems are incapable of denoising harmonics characterized by too low a signal-to-noise ratio. In practice, denoising algorithms use the SNR to detect the presence or absence of a speech component at each frequency. If the estimated SNR is too unfavorable, the algorithm considers that there is no signal component and suppresses it. Harmonics may thus be destroyed by known denoising systems, although it is known a priori that such harmonics must exist. It should be noted that, in the majority of languages, voiced sounds (harmonics) represent a very large part of the sounds uttered.
An object of the present invention is to overcome the limitation of the known denoising systems.
Another object of the invention is to improve the performance of noise reduction methods.
Another object of the invention is to propose a sound signal processing which does not distort the signal excessively. In particular, the processing of the signal performed makes it possible to preserve all or part of the harmonics included in this signal.
Another object of the invention is to limit the appearance of musical noise on completion of the sound signal processing.
Another object of the invention is to obtain a good estimate of the harmonic comb of a useful signal.