While speech recognition by humans is very robust against stationary distortions of the speech signal introduced by the speech pickup and reproduction equipment and by the telephone channel, these distortions, which effectively filter the speech signal, may degrade the performance of automatic speech recognition systems. In order for speech to be recognized automatically, a parametric representation of the incoming speech is produced which is, to the degree possible, independent of the enumerated noise sources.
The effect of noise sources such as those enumerated is convolutional rather than additive, and thus appears as an additive disturbance in the log-power domain, in which each frequency band is characterized by the logarithm of an estimate of the signal power in that band. Signal analysis in log-spectral and cepstral domains is discussed in Rabiner and Juang, Fundamentals of Speech Recognition (Prentice Hall, 1993), which is incorporated herein by reference. Convolutional noise is typically constant or slowly varying. A known technique for removal of convolutional noise, otherwise known as "channel normalization," is the removal of a mean in either the log-power domain or the cepstral domain, the latter corresponding to a further transform of the logarithm of the Fourier transform of the time-domain signal.
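By way of illustration only, the following minimal Python sketch shows why a fixed channel becomes an additive offset in the log-power domain and why subtracting a per-band mean cancels it. The band powers, the per-band channel gains, and the function names `log_power` and `remove_mean` are hypothetical values and names chosen for the example; they are not taken from any particular system.

```python
import math

def log_power(frames):
    """Take the natural log of every band power (frames: list of per-band lists)."""
    return [[math.log(p) for p in frame] for frame in frames]

def remove_mean(log_frames):
    """Subtract, band by band, the mean over all frames of the log power."""
    n_frames = len(log_frames)
    n_bands = len(log_frames[0])
    mean = [sum(f[b] for f in log_frames) / n_frames for b in range(n_bands)]
    return [[f[b] - mean[b] for b in range(n_bands)] for f in log_frames]

# Hypothetical "clean" band powers: 4 time frames x 3 frequency bands.
clean = [[1.0, 2.0, 4.0],
         [2.0, 1.0, 8.0],
         [4.0, 4.0, 2.0],
         [8.0, 2.0, 1.0]]

# Assumed stationary channel: one fixed power gain per band (a filter).
channel_gain = [0.5, 2.0, 0.25]
filtered = [[p * g for p, g in zip(frame, channel_gain)] for frame in clean]

# In the log-power domain the channel is the same additive offset in every frame,
# so removing the per-band mean yields the same result with or without the channel.
normalized_clean = remove_mean(log_power(clean))
normalized_filtered = remove_mean(log_power(filtered))
```

In this sketch the mean is taken over all frames, so every frame is treated as speech; a practical system would first have to select the speech-bearing portions of the signal.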
Typical convolutional noise elimination based on mean removal entails three steps:
a. selecting signal portions containing speech to be used in calculating a mean;
b. computing the mean, averaged over a time duration typically on the order of seconds to tens of seconds, of the mean power in each log-power band;
c. subtracting the mean, on a band-by-band basis, from the signal in each band.
Since the mean computed for each band is a scalar, the ensemble of computed means may be viewed as a mean vector (i.e., a vector, each element of which is a mean). Mean removal of this sort may be applied in either the log-power or cepstral domains. The mean vector has a dimensionality equal to the total number of frequency bands. Thus, sufficient data must be collected to estimate a number of parameters (i.e., the mean vector elements) equal to the number of frequency bands. Consequently, several seconds of speech are typically required before techniques of this sort may be applied with success. Such techniques are, therefore, prone to the following difficulties:
a. insufficient data are available for the first few uttered words to compute the mean vector reliably;
b. if the running averaging accidentally incorporates a segment not containing speech data, the mean vector is incorrectly calculated, and recovery requires a long period to accumulate a meaningful new average.
Another technique applied for convolutional noise elimination is the RASTA technique, in which linear filtering with a high-pass component is performed, corresponding to subtraction of the mean cepstrum over the preceding 200 milliseconds. A disadvantage of this technique is the introduction of a context dependence, since the subtracted component depends strongly on the phonemes uttered in the immediate past.
Note that additive noise is not addressed by the foregoing techniques.
a. characterizing the signal with respect to a plurality of frequency bands, where the signal has a power in each frequency band;
b. computing a logarithm of a quantity characterizing the power in each frequency band over a specified interval of time for deriving a transform of the signal in a log-spectral domain;
c. fitting a smoothed log-power spectrum to the logarithm of the transform of the signal in the log-spectral domain for deriving a fitted log-power spectrum corresponding to the effect of convolutional noise in the log-spectral domain; and
d. removing a function of the fitted log-power spectrum from the transform of the signal in the log-spectral domain.
a. characterizing the signal with respect to a plurality of frequency bands, the signal having a power in each frequency band;
b. computing a function of a quantity characterizing the power of the signal in each frequency band over a specified interval of time for deriving a transform of the signal in a transform domain;
c. fitting a smoothed transform domain spectrum to the transform of the signal in the transform domain for deriving a fitted transform domain spectrum corresponding to the effect of convolutional noise in the transform domain; and
d. removing a function of the fitted transform domain spectrum from the transform of the signal in the transform domain.
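As a rough illustration of the steps of characterizing the signal by band powers, taking the logarithm, fitting a smoothed log-power spectrum, and removing it, the following Python sketch may be considered. The text does not specify the form of the smoothed spectrum in this section; a least-squares straight line across the band index is assumed here purely for the example, and the names `fit_line` and `remove_smoothed_spectrum`, the band powers, and the tilt coefficients are likewise hypothetical.

```python
import math

def fit_line(y):
    """Least-squares straight line through y[k] versus band index k (needs len(y) >= 2)."""
    n = len(y)
    x_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((k - x_mean) ** 2 for k in range(n))
    sxy = sum((k - x_mean) * (y[k] - y_mean) for k in range(n))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [intercept + slope * k for k in range(n)]

def remove_smoothed_spectrum(band_powers):
    """Log the band powers of one time interval, fit a smooth curve, subtract it."""
    log_spectrum = [math.log(p) for p in band_powers]   # log-spectral domain
    fitted = fit_line(log_spectrum)                      # assumed smooth model: a line
    residual = [v - f for v, f in zip(log_spectrum, fitted)]
    return residual, fitted

# Hypothetical band powers for one interval (5 bands).
clean = [1.0, 3.0, 2.0, 5.0, 4.0]

# Assumed channel: a spectral tilt, i.e. a log-gain that is linear in the band index.
tilted = [p * math.exp(0.3 - 0.2 * k) for k, p in enumerate(clean)]

residual_clean, _ = remove_smoothed_spectrum(clean)
residual_tilted, _ = remove_smoothed_spectrum(tilted)
```

Under these assumptions the two residuals agree, because a channel whose log-gain lies within the fitted family (here, any straight line) is absorbed entirely by the fit. The assumed line has only two parameters regardless of the number of bands, so it can be estimated from a single short interval; a channel with a more complicated shape would be removed only in part by so simple a model.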