Speech and speaker recognition devices must often operate on speech signals corrupted by noise and channel distortions. This is the case, for example, when using "far-field" microphones placed on a desktop near computers or other office equipment. Noise, such as noise originating from disk drives or cooling fans can be transmitted both mechanically, by direct contact of the microphone to the computer equipment or through the furniture it rests on, and by acoustic transmission through the air. Noise can also be picked up through electrical or magnetic coupling as in the case of power line "hum".
The "channel" through which speech is measured includes the processes of acoustic propagation from the speaker's mouth, transduction by the microphone, analog signal processing, and analog-to-digital conversion. The distortion introduced by this composite channel may be modeled as a linear process and characterized by its frequency response. Factors affecting the channel frequency response include microphone type, distance and off-axis angle of the speaker relative to the microphone, room acoustics, and the characteristics of the analog electronic circuits and anti-aliasing filter.
Speech and speaker recognition systems operate by comparing the input speech with acoustic models derived from prior "training" speech material. Loss of accuracy occurs when the input speech is corrupted by noise or channel frequency response that differ significantly from those affecting the training speech. The present invention addresses this problem by suppressing noise and equalizing channel distortions in an input speech signal.
Certain methods for noise suppression are well known. One method used for noise suppression is known as spectral subtraction (SS). SS requires an estimate of the noise magnitude spectrum, which is assumed to be stationary over time. This estimate is subtracted from the measured magnitude spectrum of a noisy speech input at each time interval or "frame" to obtain an estimate of the magnitude spectrum of the speech in the absence of noise. Further details regarding noise suppression may be obtained from the publication entitled "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, no. 2, pp. 113-120, IEEE, New York, N.Y., 1979, and incorporated herein by reference.
Certain methods which operate to perform channel equalization are also known. One method used for channel equalization, known as blind deconvolution (BD), estimates the spectrum of the input signal over its whole duration and applies a linear filter designed to make the spectrum of the signal equal to the long term spectrum of speech. This method effectively compensates for the channel when the input speech material is of sufficient length that its spectrum approximates the long-term spectrum of speech. Further details regarding Blind Deconvolution will be obtained from the publication by T. G. Stockham, T. M. Cannon, and R. B. Ingebretsen, entitled "Blind deconvolution through digital signal processing," Proceedings of the IEEE, vol. 63, No. 4 pp. 678-692, 1975, incorporated herein by reference.
In addition, a publication by D. Hardt and K. Fellbaum, entitled "Spectral Subtraction and RASTA Filtering in Text-Dependent HMM-Based Speaker Verification", IEEE Doc. No. 0-8186-7919-0/97, p ICASSP 97, Munich, Germany, April, 1997 and incorporated by reference herein describes a comparison of speaker verification performance using "internal" versus "external" spectral subtraction. Internal SS, integrated with an existing verifier front end system, was found to be inferior to external SS, which was implemented as an independent processing step, prior to input to the verifier. Using external SS, verification accuracy was found to improve with increasing spectral analysis window size up to 128 milliseconds. Such findings were confirmed in a set of experiments involving the SpeakerKey voice verifier system described in commonly assigned copending patent application Ser. No. 08/960,509 entitled "VOICE AUTHENTICATION SYSTEM" filed on Oct. 29, 1997 to Blais et al, and incorporated herein by reference, and a specially-collected database using far-field microphones. In our experiments, the improvement with increasing window size was found to be related to the nature of the noise. The loudest noise components in the data are stationary, narrow bandwidth spectral lines, for which estimation accuracy increases with window length. High spectral resolution is therefore needed to reject this type of noise. Analysis windows of 128 ms length are sufficient to provide the needed resolution.
In another publication by C. Avendano and H. Hermansky entitled "On the Effects of Short-Term Spectrum Smoothing in Channel Normalization", 5, p. 372, IEEE Transactions on Speech and Audio Processing, vol. 5, No. 4, July, 1997, an improvement to the performance of blind deconvolution was reported in the context of a speech recognition system. The system used measurements of the power spectrum in critical bands, where each such measurement was derived by integrating the fast Fourier transform (FFT) power spectrum over frequencies within the critical band. BD was reported to perform better when applied prior to critical-band integration (i.e., to the FFT power spectrum) than after (to the critical band measurements). The disparity of performance was greatest for channels whose magnitude response varies for channels whose magnitude response varies within the frequency limits of the individual critical band filters. In the present invention, it was found that increasing the window size from 20 ms (typically used in speech and speaker recognition systems) to 128 ms led to additional performance improvements. The reason for this improvement is similar to that offered above in connection with narrow bandwidth noise. It is known that reverberant environments can introduce sharp spectral nulls (as narrow as 10 Hz in width) in the frequency response of acoustic transmission from the talker to the microphone caused by interference between direct and reflected signal paths. These effects cannot be adequately compensated if BD is applied to critical bands, whose bandwidths greatly exceed 10 Hz. When applied before critical band integration, spectral nulls present in the channel can be resolved if sufficiently long analysis windows are used. Windows of at least 100 ms length are required to provide the needed 10 Hz frequency resolution.
However, none of the prior art applications combines noise suppression with channel equalization, including channel frequency response normalization and signal level normalization to a signal preprocessor apparatus which accepts as input a noisy speech signal such as that introduced from a microphone and which produces an enhanced output speech signal for subsequent processing.