Reduction of acoustic noise is important in different fields, in particular for speech communication. For example, noise suppression in telephonic communications can be very beneficial if the telephony system is used in a noisy environment such as a car cabin or in the street. Noise reduction is crucial in hands-free telephony systems, where the noise level is usually higher because of the distance between the microphone(s) and the speaker(s). Furthermore, speech recognition systems, in which a device or a service is controlled by vocal commands, suffer a decrease of recognition rate when operated in noisy environments. Hence, the reduction of the noise level is also useful in order to improve the reliability of such systems.
Noise suppression in spoken communication, also called “speech enhancement”, has attracted considerable interest for more than three decades, and many methods have been proposed to reduce the noise level in speech recordings. Most of these systems rely on the on-line estimation of a “background noise” which is assumed to be stationary, i.e. to change slowly over time. However, this assumption does not always hold in a real noisy environment. Indeed, a passing truck, the closing of a door or the operation of some kinds of machines, such as a printer, are examples of non-stationary noises which can frequently occur.
Another technique, called Non-negative Matrix Factorization (NMF), has recently been applied to this problem. This method is based on a decomposition of the power spectrogram of the mixture into a non-negative combination of several spectral bases, each belonging to either the speech or the interfering noise. NMF methods have been used in that context with relatively good results. The basic principle of NMF-based audio processing 100, as schematically illustrated in FIG. 1, is to find a locally optimal factorization of a short-time magnitude spectrogram V 103 of an audio signal 101 into two factors W and H, of which the first one, W, represents the spectra of the events occurring in the signal 101 and the second one, H, their activation over time. The first factor W describes the component spectra of the source model 109. The second factor H describes the activations 107 of the signal spectrogram 103 of the audio signal 101. The product of the first factor W and the second factor H is matched to the short-time magnitude spectrogram V 103 of the audio signal 101 by an optimization procedure. In supervised NMF, the source model 109 is pre-defined; in unsupervised NMF, the source model 109 is estimated jointly with the activations. The source signal or signals 113 can be derived from the source spectrogram 111. This approach has the advantage of making no stationarity assumption and gives good results in general.
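The supervised case described above, in which the source model W is pre-defined and only the activations H are estimated, can be sketched as follows. This is an illustrative sketch only: the text does not specify the optimization procedure, so the multiplicative update rule for a Euclidean cost is an assumption, and the function name `activations_for_fixed_model` and its parameters are hypothetical.

```python
import numpy as np

def activations_for_fixed_model(V, W, n_iter=200, eps=1e-12):
    """Estimate non-negative activations H such that W @ H approximates V.

    V: non-negative spectrogram, shape (bins, frames)
    W: pre-defined non-negative component spectra, shape (bins, components)
    Assumption: Euclidean cost with multiplicative updates; W is kept fixed.
    """
    V = np.asarray(V, dtype=float)
    W = np.asarray(W, dtype=float)
    # Start from strictly positive activations so the updates can move them.
    H = np.ones((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        # Multiplicative step: keeps H non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H
```

Because W is not updated, each iteration costs only the products involving H, which is one reason a pre-defined source model can be cheaper than joint estimation.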
However, the estimation of the noise components from the signal can be computationally intensive with the NMF technique. Furthermore, systems based on NMF do not take into account the fact that the noise, or a part of it, can be stationary. Hence, conventional noise estimators are often superior to NMF for capturing the stationary component of the background noise, while being less complex.
Common methods for noise reduction, often denoted as “speech enhancement”, include for example spectral subtraction as described by M. Berouti, R. Schwartz and J. Makhoul: “Enhancement of Speech Corrupted by Acoustic Noise”, Proc. IEEE ICASSP 1979, vol. 4, pp. 208-211, Wiener filtering as described by E. Hänsler, G. Schmidt, “Acoustic Echo and Noise Control”, Wiley, Hoboken, N.J., USA, 2004, or the so-called Minimum Mean-Square Error Log-Spectral Amplitude estimator as described by Y. Ephraim, D. Malah: “Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator”, IEEE Trans. Acoust., Speech and Signal Process., vol. 33, pp. 443-445, 1985. These techniques are all based on a prior estimation of the background noise power spectrum, which is then “removed” from the original signal. However, they also assume that the background noise can be reliably predicted from the recent past of the signal. Hence, these approaches do not handle highly non-stationary noise types well.
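The idea of “removing” a previously estimated noise power spectrum from the signal can be illustrated with a generic power spectral subtraction. The sketch below is not the exact procedure of any of the cited references; the function name `spectral_subtraction`, the over-subtraction factor `alpha` and the spectral floor `floor` (with their default values) are assumptions chosen for illustration.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, alpha=1.0, floor=0.01):
    """Subtract a stationary noise power estimate from every frame.

    noisy_power: power spectrogram of the noisy signal, shape (bins, frames)
    noise_power: noise power estimate per frequency bin, shape (bins,)
    alpha: over-subtraction factor (hypothetical default)
    floor: spectral floor as a fraction of the noise estimate (hypothetical default)
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)[:, None]
    cleaned = noisy_power - alpha * noise_power
    # A plain subtraction can go negative; clamp to a small floor instead.
    return np.maximum(cleaned, floor * noise_power)
```

The same noise estimate is applied to every frame, which makes the stationarity assumption explicit: a noise burst that is absent from the estimate passes through unchanged.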
Noise power spectrum estimation methods involve, for example, the averaging of the short-time power spectrum in time frames where speech is absent according to a voice activity detector as shown by M. Berouti, R. Schwartz and J. Makhoul: “Enhancement of Speech Corrupted by Acoustic Noise”, Proc. IEEE ICASSP 1979, vol. 4, pp. 208-211, or the smoothing of the minimum value in each considered spectral band as shown by R. Martin, “Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics”, IEEE Trans. On Speech and Audio Process., vol. 9, n. 5, Jul. 2001. Other methods include the so-called minima-controlled recursive averaging as described by N. Fan, J. Rosca, R. Balan, “Speech Noise Estimation Using Enhanced Minima Controlled Recursive Averaging”, Proc. IEEE ICASSP 2007, vol. 4, pp. 581-584 or NMF as described by N. Mohammadiha, T. Gerkmann, A. Leijon, “A New Linear MMSE Filter for Single Channel Speech Enhancement Based on Nonnegative Matrix Factorization”, Proc. of the 2011 IEEE Workshop on Application of Signal Process. to Audio and Acoustics, pp. 45-48.
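The first two estimation strategies mentioned above can be sketched in simplified form. Both functions are illustrative assumptions rather than the cited algorithms: `noise_from_speech_pauses` takes the voice activity decisions as a given boolean mask, and `noise_from_minimum` is a crude minimum tracker that omits the optimal smoothing and bias compensation of the minimum-statistics method. The parameter names and default values are hypothetical.

```python
import numpy as np

def noise_from_speech_pauses(power_spec, speech_active):
    """Average the power spectrum over frames flagged as speech-free.

    power_spec: shape (bins, frames)
    speech_active: boolean per frame, e.g. from a voice activity detector
    """
    power_spec = np.asarray(power_spec, dtype=float)
    pauses = ~np.asarray(speech_active, dtype=bool)
    if not pauses.any():
        raise ValueError("no speech-free frames available")
    return power_spec[:, pauses].mean(axis=1)

def noise_from_minimum(power_spec, smoothing=0.8, window=5):
    """Track the minimum of a recursively smoothed power spectrum per bin."""
    power_spec = np.asarray(power_spec, dtype=float)
    n_frames = power_spec.shape[1]
    smoothed = np.empty_like(power_spec)
    smoothed[:, 0] = power_spec[:, 0]
    for t in range(1, n_frames):
        # First-order recursive smoothing along time.
        smoothed[:, t] = (smoothing * smoothed[:, t - 1]
                          + (1.0 - smoothing) * power_spec[:, t])
    estimate = np.empty_like(smoothed)
    for t in range(n_frames):
        # Minimum over the most recent `window` smoothed frames.
        start = max(0, t - window + 1)
        estimate[:, t] = smoothed[:, start:t + 1].min(axis=1)
    return estimate
```

Both estimators look only at the recent past of the signal, which is why they follow slowly varying noise well but react late, or not at all, to abrupt events.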
Recently, the NMF technique has been introduced for the direct reduction of noise in speech recordings from single-channel input. The conventional formulation of NMF is defined as follows. V is defined as an m×n matrix of non-negative real values. The goal is to approximate this matrix by the product of two other non-negative matrices $W \in \mathbb{R}_{+}^{m \times r}$ and $H \in \mathbb{R}_{+}^{r \times n}$, where $r \ll m, n$. In mathematical terms, a cost function, measuring the “reconstruction error” between V and W·H, is minimized.
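A minimal sketch of this formulation is given below. The text does not state which cost function or optimization procedure is used, so the squared Euclidean distance between V and W·H and the well-known multiplicative update rules are assumptions; the function name `nmf`, the random initialization and the default iteration count are likewise hypothetical.

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-12, seed=0):
    """Factorize a non-negative (m x n) matrix V into W (m x r) and H (r x n).

    Assumption: squared Euclidean reconstruction error, minimized with
    multiplicative updates that keep both factors non-negative.
    """
    V = np.asarray(V, dtype=float)
    m, n = V.shape
    rng = np.random.default_rng(seed)
    # Random strictly positive starting point; the result is a local optimum.
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update spectral bases
    return W, H
```

Each iteration consists of several matrix products over the full spectrogram, and the result depends on the initialization, so only a locally optimal factorization is obtained.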
When processing sounds, the input matrix V is given by the succession of short-time magnitude (or power) spectra of the input signal, each column of the matrix containing the values of the spectrum computed at a specific instant in time. These features are given by a short-time Fourier transform (STFT) of the input signal, computed after a window function has been applied to each frame. This matrix contains only non-negative values, because magnitudes and powers are non-negative by definition.
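The construction of V from an input signal can be sketched as follows. The frame length, hop size and Hann window are assumptions made for illustration, since the text only requires “some window function”; the function name `magnitude_spectrogram` is hypothetical.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=512, hop=256):
    """Return the short-time magnitude spectrogram, shape (bins, frames).

    Each column holds the magnitude spectrum of one windowed frame.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    # Real-input FFT along the frame axis; magnitudes are non-negative.
    return np.abs(np.fft.rfft(frames, axis=0))
```

Squaring the returned values gives the power spectrogram variant mentioned above.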
The NMF decomposition is illustrated in FIG. 2 by a simple example. The figure shows a spectrogram 201 represented by the matrix V, a matrix of two spectral bases 202 represented by the matrix W and the corresponding temporal weights 203 represented by the matrix H. The greyscale of the spectrogram 201 represents the amplitude of the Fourier coefficients. The spectrogram defines an acoustic scene which can be described as the superposition of two so-called “atomic sounds”. By applying a two-component NMF to this spectrogram, the matrices W and H as defined in FIG. 2 can be obtained. Each column of W can be interpreted as a basis function for the spectra contained in V, when weighted with the corresponding values of H.
Since all of these bases and weights are non-negative, they can be used to build two different spectrograms, each of them describing one of the “atomic sounds”. Thus these sounds can be separated from the mixture, even though they sometimes appear at the same time in the original signal. The example of FIG. 2 is simplistic; however, the NMF method can provide satisfactory results in separating different sound sources from realistic recordings. In these cases, a larger value of the order of decomposition r is used. Then, each “component”, i.e. the product of one spectral basis with the corresponding temporal weights, is assigned to a specific source. The estimated spectrogram of each source is finally obtained by the sum of all the components attributed to the source.
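The final summation step can be sketched as below. How each component is assigned to a source is not specified in the text, so the assignment is taken here as a given list of labels, one per component; the function name `source_spectrograms` and the label values are hypothetical.

```python
import numpy as np

def source_spectrograms(W, H, assignment):
    """Sum the NMF components attributed to each source.

    W: spectral bases, shape (bins, r)
    H: temporal weights, shape (r, frames)
    assignment: one source label per component (assumed to be known)
    Returns a dict mapping each label to its estimated spectrogram.
    """
    W = np.asarray(W, dtype=float)
    H = np.asarray(H, dtype=float)
    labels = list(assignment)
    if W.shape[1] != H.shape[0] or len(labels) != W.shape[1]:
        raise ValueError("W, H and assignment disagree on the number of components")
    sources = {}
    for k, label in enumerate(labels):
        # One component: a spectral basis times its temporal weights.
        component = np.outer(W[:, k], H[k, :])
        sources[label] = sources.get(label, 0.0) + component
    return sources
```

For example, `source_spectrograms(W, H, ["speech", "noise", "noise"])` returns one spectrogram per label, and the returned spectrograms always add up to the full approximation W·H.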
The above-described method has been applied to the separation of speech from noise as shown by K. W. Wilson, B. Raj, P. Smaragdis and A. Divakaran: “Speech Denoising using non-negative matrix factorization with priors” in IEEE Intern. Conf. on Acoustics, Speech and Signal Process., pp. 4029-4032, 2008. One of the advantages of this approach is that it can theoretically cope with any type of environment, including non-stationary noise. However, NMF can be computationally expensive, since it involves matrix multiplications. Furthermore, in the case of stationary noise, the conventional methods for noise spectral power estimation can outperform NMF, often at a very low computational cost.