The present invention relates to digital techniques for suppressing noise in speech signals. It relates more particularly to noise suppression by non-linear spectral subtraction.
Because of the widespread adoption of new forms of communication, in particular mobile telephones, communications are increasingly made in very noisy environments. The noise, added to the speech, then tends to interfere with the communication by preventing optimum compression of the speech signal and creating unnatural background noise. The noise makes understanding the spoken message difficult and tiring.
Many algorithms have been investigated in attempts to reduce the effects of noise in a communication. S. F. Boll (“Suppression of acoustic noise in speech using spectral subtraction”, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-27, No. 2, April 1979) has proposed an algorithm based on spectral subtraction. This technique consists of estimating the spectrum of the noise during phases of silence and subtracting it from the received signal. It reduces the received noise level. Its main defect is that it creates musical noise which is particularly bothersome because it is unnatural.
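By way of illustration only, the basic spectral subtraction described above can be sketched as follows. The sketch assumes that magnitude spectra have already been computed for each frame and that silence frames have been identified by some other means; the function names and the zero floor are illustrative choices, not taken from the cited work.

```python
def estimate_noise_spectrum(silence_frames):
    """Average the magnitude spectra of frames judged to be silence."""
    n_bins = len(silence_frames[0])
    return [
        sum(frame[i] for frame in silence_frames) / len(silence_frames)
        for i in range(n_bins)
    ]


def spectral_subtract(frame_mag, noise_mag, floor=0.0):
    """Subtract the noise estimate bin by bin, clamping at a floor."""
    return [max(s - n, floor) for s, n in zip(frame_mag, noise_mag)]


# Two silence frames with three spectral bins each (made-up values).
silence = [[1.0, 2.0, 1.0], [3.0, 2.0, 1.0]]
noise = estimate_noise_spectrum(silence)

# One noisy speech frame; the second bin falls below the noise estimate
# and is clamped to the floor.
cleaned = spectral_subtract([5.0, 1.0, 4.0], noise)
print(noise, cleaned)
```

Clamping negative results to the floor leaves isolated residual peaks from frame to frame, which is commonly given as the origin of the musical noise mentioned above.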
This work was taken up and improved on by D. B. Paul (“The spectral envelope estimation vocoder”, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-29, No. 4, August 1981) and by P. Lockwood and J. Boudy (“Experiments with a nonlinear spectral subtractor (NSS), Hidden Markov Models and the projection, for robust speech recognition in cars”, Speech Communication, Vol. 11, June 1992, pages 215-228, and EP-A-0 534 837) and has significantly reduced the level of the noise whilst preserving its natural character. Moreover, this contribution had the merit of incorporating the principle of masking into the computation of the noise suppression filter for the first time. Based on this idea, a first attempt was made by S. Nandkumar and J. H. L. Hansen (“Speech enhancement on a new set of auditory constrained parameters”, Proc. ICASSP 94, pages I.1-I.4) to use explicitly computed masking curves in the spectral subtraction. Despite the disappointing results of the above technique, this contribution had the merit of emphasizing the importance of not degrading the speech signal during noise suppression.
Other methods based on breaking the speech signal down into singular values, and thus on projecting the speech signal into a smaller space, were investigated by Bart De Moore (“The singular value decomposition and long and short spaces of noisy matrices”, IEEE Trans. on Signal Processing, Vol. 41, No. 9, September 1993, pages 2826-2838) and by S. H. Jensen et al. (“Reduction of broad-band noise in speech by truncated QSVD”, IEEE Trans. on Speech and Audio Processing, Vol. 3, No. 6, November 1995). The principle of the above technique is to consider the speech signal and the noise signal as totally decorrelated and to consider the speech signal to have sufficient predictability to be predicted on the basis of a restricted set of parameters. This technique produces acceptable noise suppression for highly voiced signals, but totally alters the nature of the speech signal. Faced with relatively coherent noise, such as vehicle tire or engine noise, the noise can be more easily predicted than the unvoiced speech signal. There is then a tendency to project the speech signal into part of the vector space of the noise. The method does not take the speech signal into account, in particular unvoiced speech areas where the predictability is low. Moreover, predicting the speech signal on the basis of a small set of parameters prevents all of the intrinsic richness of speech from being taken into account. The limitations of techniques based only on mathematical considerations and overlooking the particular nature of speech are clear.
Finally, other techniques are based on criteria of coherence. The coherence function is particularly well developed by J. A. Cadzow and O. M. Solomon (“Linear modeling and the coherence function”, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-35, No. 1, January 1987, pages 19-28), and its application to noise suppression has been investigated by R. Le Bouquin (“Enhancement of noisy speech signals: application to mobile radio communications”, Speech Communication, Vol. 18, pages 3-19). This method is based on the fact that the speech signal is significantly more coherent than the noise if a plurality of independent channels is used. The results obtained appear to be fairly encouraging. However, this technique unfortunately requires a plurality of sound pick-up points, which are not always available.
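For illustration, the coherence between two pick-up channels can be estimated per frequency bin with the usual magnitude-squared coherence formula, averaging over several frames. The sketch below assumes that complex spectra are already available for both channels; it is not the specific estimator used in the cited works, and the function name and sample values are hypothetical.

```python
def magnitude_squared_coherence(x_frames, y_frames):
    """Per-bin coherence between two channels, accumulated over frames.

    x_frames and y_frames are lists of frames, each frame being a list
    of complex spectral bins. The result lies between 0 and 1 per bin.
    """
    n_bins = len(x_frames[0])
    coherence = []
    for i in range(n_bins):
        # Cross-spectrum and auto-spectra accumulated over all frames.
        cross = sum(x[i] * y[i].conjugate() for x, y in zip(x_frames, y_frames))
        pxx = sum(abs(x[i]) ** 2 for x in x_frames)
        pyy = sum(abs(y[i]) ** 2 for y in y_frames)
        denom = pxx * pyy
        coherence.append(abs(cross) ** 2 / denom if denom > 0 else 0.0)
    return coherence


# Bin 0 is identical on both channels (coherent, as speech would be);
# bin 1 flips sign between frames on one channel only (incoherent).
x_frames = [[1 + 0j, 1 + 0j], [0 + 1j, 1 + 0j]]
y_frames = [[1 + 0j, 1 + 0j], [0 + 1j, -1 + 0j]]
msc = magnitude_squared_coherence(x_frames, y_frames)
print(msc)
```

A bin whose coherence is close to 1 would be attributed to speech and a bin close to 0 to noise, which is why the approach depends on having at least two independent channels.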
A main object of the present invention is to propose a new noise suppression technique which takes account of the characteristics of perception of speech by the human ear, so enabling efficient noise suppression without deteriorating the perception of the speech.
The invention therefore proposes a method of suppressing noise in a digital speech signal processed by successive frames, comprising the steps of:
computing spectral components of the speech signal of each frame;
computing, for each frame, overestimates of spectral components of the noise included in the speech signal;
performing a spectral subtraction including at least a first subtraction step in which a respective first quantity dependent on parameters including the overestimate of the corresponding spectral component of the noise for said frame is subtracted from each spectral component of the speech signal of the frame, to obtain spectral components of a first noise-suppressed signal; and
subjecting the result of the spectral subtraction to a transformation into the time domain to construct a noise-suppressed speech signal.
According to the invention, the spectral subtraction further includes the following steps:
computing a masking curve by applying an auditory perception model on the basis of spectral components of the first noise-suppressed signal;
comparing overestimates of the spectral components of the noise for the frame to the computed masking curve; and
a second subtraction step in which a respective second quantity depending on parameters including a difference between the overestimate of the corresponding spectral component of the noise and the computed masking curve is subtracted from each spectral component of the speech signal of the frame.
The second quantity subtracted can in particular be limited to the fraction of the overestimate of the corresponding spectral component of the noise which is above the masking curve. This approach is based on the observation that it is sufficient to suppress audible noise frequencies. In contrast, there is no benefit in eliminating noise masked by speech.
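The two-step subtraction set out above can be sketched as follows, purely by way of example. The sketch assumes magnitude spectra and a noise overestimate are already available for the frame. The masking curve used here is a hypothetical stand-in (each bin is masked by the strongest of itself and its attenuated neighbours) and is not an auditory perception model; the parameter names `floor_ratio`, `spread` and `offset` are likewise illustrative assumptions.

```python
def first_subtraction(speech_mag, noise_over, floor_ratio=0.1):
    """First step: subtract the noise overestimate from each component.

    Assumption: the first quantity is the overestimate itself, and the
    result is floored at a fraction of the input component.
    """
    return [max(s - b, floor_ratio * s) for s, b in zip(speech_mag, noise_over)]


def toy_masking_curve(first_pass, spread=0.5, offset=0.5):
    """Hypothetical placeholder for an auditory perception model."""
    n = len(first_pass)
    curve = []
    for i in range(n):
        candidates = [first_pass[i]]
        if i > 0:
            candidates.append(spread * first_pass[i - 1])
        if i < n - 1:
            candidates.append(spread * first_pass[i + 1])
        curve.append(offset * max(candidates))
    return curve


def second_subtraction(speech_mag, noise_over, mask):
    """Second step: subtract only the part of the noise overestimate
    that lies above the masking curve, from the original components."""
    result = []
    for s, b, m in zip(speech_mag, noise_over, mask):
        audible_noise = max(b - m, 0.0)
        result.append(max(s - audible_noise, 0.0))
    return result


speech = [10.0, 2.0, 1.0]      # made-up spectral magnitudes of one frame
noise_over = [1.0, 1.0, 1.0]   # made-up noise overestimate

first_pass = first_subtraction(speech, noise_over)
mask = toy_masking_curve(first_pass)
denoised = second_subtraction(speech, noise_over, mask)
print(first_pass, mask, denoised)
```

With these values, the noise in the first two bins lies below the masking curve and nothing is subtracted there, whereas in the third bin only the part of the overestimate exceeding the curve is removed.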
It is generally desirable to overestimate the spectral envelope of the noise so that the overestimate thereby obtained is robust to sudden variations of the noise. However, excessive overestimation usually has the drawback of distorting the speech signal. This affects the voiced character of the speech signal, eliminating some of its predictability. This drawback is very bothersome in telephony, since it is in the voiced areas that the speech signal then has the most energy. The invention greatly attenuates this drawback by limiting the subtracted quantity if the whole or part of a frequency component of the overestimated noise proves to be masked by the speech.