In many applications, it is desirable to remove noise from a signal so that the signal is easier to recognize. For speech signals, such denoising can be used to enhance the speech signal so that it is easier for users to perceive. Alternatively, the denoising can be used to provide a cleaner signal to a speech recognizer.
In some systems, such denoising is performed in cepstral space. Cepstral space is defined by a set of cepstral coefficients that describe the spectral content of a frame of a signal. To generate a cepstral representation of a frame, the signal is sampled at several points within the frame. These samples are then converted to the frequency domain using a Fourier Transform, which produces a set of frequency-domain values. Each cepstral coefficient is then calculated as:

$$c_i = C\left[\ln \sum_k w_{ik} S_k\right] \qquad \text{EQ. 1}$$

where $c_i$ is the ith cepstral coefficient, $C$ is a transform, $w_{ik}$ is a filter associated with the ith coefficient and the kth frequency, and $S_k$ is the spectrum for the kth frequency, which is defined as:

$$S_k = |\hat{x}_k|^2 \qquad \text{EQ. 2}$$

where $\hat{x}_k$ is an average sample value for the kth frequency.
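The computation in EQ. 1 and EQ. 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular system: the filters $w_{ik}$ are taken to be evenly spaced triangles, the transform $C$ is taken to be a type-II discrete cosine transform, and a small floor is added before the logarithm to avoid taking the log of zero. The function names and the floor value are hypothetical.

```python
import cmath
import math

def power_spectrum(frame):
    """S_k = |x_hat_k|^2 for k = 0 .. N/2, using a direct DFT (EQ. 2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x_hat = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(x_hat) ** 2)
    return spectrum

def triangular_filters(num_filters, num_bins):
    """Assumed w_ik: evenly spaced, overlapping triangular filters."""
    step = (num_bins - 1) / (num_filters + 1)
    filters = []
    for i in range(num_filters):
        center = (i + 1) * step
        filters.append([max(0.0, 1.0 - abs(k - center) / step)
                        for k in range(num_bins)])
    return filters

def log_filter_energies(frame, num_filters=4, floor=1e-12):
    """ln(sum_k w_ik S_k) for each filter i; the floor is an assumption."""
    spectrum = power_spectrum(frame)
    filters = triangular_filters(num_filters, len(spectrum))
    return [math.log(sum(w * s for w, s in zip(row, spectrum)) + floor)
            for row in filters]

def cepstrum(frame, num_filters=4):
    """c_i from EQ. 1, with C assumed to be a type-II DCT."""
    energies = log_filter_energies(frame, num_filters)
    return [sum(e * math.cos(math.pi * i * (j + 0.5) / num_filters)
                for j, e in enumerate(energies))
            for i in range(num_filters)]
```

For example, `cepstrum([math.cos(2 * math.pi * 2 * t / 16) for t in range(16)])` returns four coefficients for a 16-sample frame containing a single sinusoid.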
To perform the denoising in cepstral space, models of clean speech and noise are built in cepstral space by converting clean speech training signals and noise training signals into sets of cepstral coefficient vectors. The vectors are then grouped together to form mixture components. Often, the distribution of vectors in each component is described using a Gaussian distribution that has a mean and a variance.
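As a rough illustration of how such a mixture model scores a cepstral vector, the sketch below evaluates the log-likelihood of a vector under a mixture of Gaussians. It assumes each component has an independent variance per dimension (a diagonal covariance) and a mixture weight; the text above specifies only a mean and a variance, so these details and all function names are assumptions.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of x under a Gaussian with per-dimension variances."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def mixture_log_likelihood(x, weights, means, variances):
    """Log of sum_s weight_s * N(x; mean_s, var_s), computed stably."""
    logs = [math.log(w) + log_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)  # subtract the peak before exponentiating
    return peak + math.log(sum(math.exp(l - peak) for l in logs))
```

A vector that lies close to a component mean receives a higher score than one that lies far from every component, which is what allows the model to treat such outlying regions as unlikely to be clean speech.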
The resulting mixture of Gaussians for the clean speech signal represents a strong model of clean speech because it limits clean speech to particular values represented by the mixture components. Such strong models are thought to improve the denoising process because they allow more noise to be removed from a noisy speech signal in areas of cepstral space where clean speech is unlikely to have a value.
Although removing noise in the cepstral domain has proven effective, it is limiting in that only the resulting denoised cepstral vectors can be applied directly to a speech recognition system. As such, removing noise in the cepstral domain does not facilitate providing anything other than the denoised cepstral vectors to the recognizer.
In addition, denoising in the cepstral domain is more difficult than removing noise in the time domain or frequency domain. In the time or frequency domains, noise is additive, so noisy speech equals clean speech plus noise. In the cepstral domain, noisy speech is a complicated nonlinear function of clean speech and noise, and the required math becomes intractable and needs to be approximated. This is a separate complication that is independent of the complexity of the models used. Hence, time or frequency domain methods may in theory be able to provide a more accurate denoising since they would not require the approximation found in the cepstral domain.
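The contrast between the two domains can be checked numerically. The sketch below is illustrative only: it uses the log power spectrum as a stand-in for cepstral space (cepstral coefficients are computed from logarithms of spectral energies), and the function names and the small floor inside the logarithm are assumptions. It measures how far the noisy signal deviates from "clean plus noise" in the frequency domain and in the log domain.

```python
import cmath
import math

def dft(frame):
    """Direct discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def log_power(spectrum, floor=1e-12):
    """Log power per frequency; the floor avoids log(0) (assumption)."""
    return [math.log(abs(v) ** 2 + floor) for v in spectrum]

def additivity_gaps(clean, noise):
    """Return (frequency-domain gap, log-domain gap) for noisy = clean + noise."""
    noisy = [c + d for c, d in zip(clean, noise)]
    X, N, Y = dft(clean), dft(noise), dft(noisy)
    # Frequency domain: Y_k should equal X_k + N_k up to rounding error.
    linear_gap = max(abs(y - (x + d)) for x, d, y in zip(X, N, Y))
    # Log domain: ln|Y_k|^2 is generally not ln|X_k|^2 + ln|N_k|^2.
    lx, ld, ly = log_power(X), log_power(N), log_power(Y)
    log_gap = max(abs(y - (x + d)) for x, d, y in zip(lx, ld, ly))
    return linear_gap, log_gap
```

With a sinusoidal "clean" frame and a second sinusoid standing in for noise, the first gap is at the level of floating-point rounding, while the second is large, reflecting the nonlinear relationship described above.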
To overcome these limitations, some systems have attempted to denoise speech signals in the time domain or the frequency domain. However, such denoising systems typically use simple models for the clean speech signal that do not incorporate much information on the structure of speech. As a result, it is difficult to discern noise from clean speech since the clean speech is allowed to take nearly any value.
One common model of clean speech is an auto-regression model that models a next point in a speech signal based on past points in the speech signal. In terms of an equation:

$$x_n = \sum_{m=1}^{p} a_m x_{n-m} + v_n \qquad \text{EQ. 3}$$

where $x_n$ is the nth sample in the speech signal, $x_{n-m}$ is the (n-m)th sample in the speech signal, $a_m$ are auto-regression parameters based on a physical shape of a "lossless tube" model of a vocal tract, and $v_n$ is a combination of an input excitation and a fitting error.
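A minimal sketch of EQ. 3 follows. It generates a signal from given auto-regression parameters and an excitation sequence, and recovers the excitation from a signal by subtracting the prediction. Treating samples before the start of the signal as zero is an assumption made here for simplicity, and the function names are hypothetical.

```python
def ar_synthesize(a, excitation):
    """Generate x_n = sum_{m=1..p} a_m * x_{n-m} + v_n (EQ. 3).

    Samples before the start of the signal are treated as zero (assumption).
    """
    x = []
    for n, v in enumerate(excitation):
        predicted = sum(a[m - 1] * x[n - m]
                        for m in range(1, len(a) + 1) if n - m >= 0)
        x.append(predicted + v)
    return x

def ar_residual(a, x):
    """Recover v_n = x_n - sum_{m=1..p} a_m * x_{n-m} from a signal."""
    return [x[n] - sum(a[m - 1] * x[n - m]
                       for m in range(1, len(a) + 1) if n - m >= 0)
            for n in range(len(x))]
```

For instance, a first-order model with a single parameter of 0.5 turns a unit impulse into a decaying sequence in which each sample is half the previous one, and `ar_residual` applied to that sequence returns the original impulse.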
Because the auto-regression model parameters are based on a physical model rather than a statistical model, they lack a great deal of information concerning the actual content of speech. In particular, the physical model allows for a large number of sounds that simply are not heard in certain languages. Because of this, it is difficult to separate noise from clean speech using such a physical model.
Some prior art systems have generated statistical descriptions of speech that are based on AR parameters. Under these systems, frames of training speech are grouped into mixture components based on some criteria. AR parameters are then selected for each component so that the parameters properly describe the mean and variance of the speech frames associated with the respective mixture component.
Under many such systems, the coefficients of the AR model are selected during training and are not modified while the system is being used. In other words, the model coefficients are not adjusted based on the noisy signal received by the system. In addition, because the AR coefficients are fixed, they are treated as point values that are known with absolute certainty.
In another prior art system described in J. Lim, "All-Pole Modeling of Degraded Speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-26, No. 3, June 1978, a time domain/frequency domain system is shown in which the AR coefficients are not fixed but instead are modified based on the noisy signal. Under the Lim system, an iteration is performed to alternately update the AR coefficients and then update the denoised signal values. However, even under Lim, the updates to the denoised signal values are based on point values for the AR coefficients that are assumed to be known with certainty.
In reality, the best AR coefficients are never known with certainty. As such, the prior art systems that determine the denoised signal values by using point values for the AR coefficients are less than ideal since they rely on an assumption that is not true.
Thus, a denoising system is needed that operates in the time domain or frequency domain, and that recognizes that parameters of a model description of speech can only be known with a limited amount of certainty. In addition, such a system needs to be computationally efficient.