The present invention relates to noise reduction. In particular, the present invention relates to reducing noise in signals used in pattern recognition.
A pattern recognition system, such as a speech recognition system, takes an input signal and attempts to decode the signal to find a pattern represented by the signal. For example, in a speech recognition system, a speech signal is received by the recognition system and is decoded to identify a string of words represented by the speech signal.
However, input signals are typically corrupted by some form of additive noise. Therefore, to improve the performance of the pattern recognition system, it is often desirable to estimate the additive noise and use the estimate to provide a cleaner signal.
Spectral subtraction has been used in the past for noise removal, particularly in automatic speech recognition systems. Conventional wisdom holds that when perfect noise estimates are available, basic spectral subtraction should do a good job of removing the noise; however, this has been found not to be the case.
Standard spectral subtraction is motivated by the observation that noise and speech mix linearly, and therefore their power spectra should mix according to
$$|Y[k]|^2 = |X[k]|^2 + |N[k]|^2$$
Typically, this equation is solved for $|X[k]|^2$, and a maximum attenuation floor F is introduced to avoid producing negative power spectral densities.
$$|\hat{X}[k]|^2 = |Y[k]|^2 \max\left(\frac{|Y[k]|^2 - |N[k]|^2}{|Y[k]|^2},\; F\right) \qquad \text{(EQ. 1)}$$
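As an illustration only, floored spectral subtraction of this kind can be sketched in a few lines of Python with NumPy. The function name `spectral_subtract` and its argument names are hypothetical and chosen for this example. The sketch multiplies the |Y[k]|^2 factor through the max, which gives the same result whenever |Y[k]|^2 is nonnegative and avoids dividing by zero in empty frequency bins.

```python
import numpy as np


def spectral_subtract(noisy_power, noise_power, floor):
    """Estimate clean power |X[k]|^2 by floored spectral subtraction.

    noisy_power: |Y[k]|^2 per frequency bin (nonnegative)
    noise_power: |N[k]|^2 per frequency bin (nonnegative)
    floor:       maximum attenuation floor F, e.g. np.exp(-3)
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    # |Y|^2 * max((|Y|^2 - |N|^2) / |Y|^2, F) == max(|Y|^2 - |N|^2, F * |Y|^2)
    # for |Y|^2 >= 0, so the division can be skipped entirely.
    return np.maximum(noisy_power - noise_power, floor * noisy_power)


# Hypothetical three-bin example (values are illustrative, not measured data).
noisy = [4.0, 1.0, 2.0]
noise = [1.0, 2.0, 2.0]
estimate = spectral_subtract(noisy, noise, floor=np.exp(-3))
```

In the first bin the noise is smaller than the noisy power, so the estimate is the plain difference. In the other two bins the difference is zero or negative, so the floor takes over and the estimate is F times the noisy power.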
Several experiments were run to examine the performance of Equation 1 using the true spectra of the noise n, and floors F from $e^{-20}$ to $e^{-2}$. The true noise spectra were computed from the true additive noise time series for each utterance. All experiments were conducted using the data, code and training scripts provided within the Aurora 2 evaluation framework described by H. G. Hirsch and D. Pearce in “The Aurora Experimental Framework for the Performance Evaluations of Speech Recognition Systems Under Noisy Conditions,” ISCA ITRW ASR 2000 “Automatic Speech Recognition: Challenges for the Next Millennium”, Paris, France, September 2000. The following digit error rates were found for various floors:
FLOOR               $e^{-20}$   $e^{-10}$   $e^{-5}$   $e^{-3}$   $e^{-2}$
Digit error rate    87.50       56.00       34.54      11.31      15.56
From the foregoing, it is clear that even when the noise spectra are known exactly, spectral subtraction does not perform perfectly and improvements can be made. In light of this, a noise removal technique is needed that is more effective at estimating the clean speech spectral features.