1. Field of the Invention
The invention relates to a method of processing speech and noise that are detected concurrently by a microphone in a noisy environment. In many situations where communication with machines by voice using automatic speech recognition would be desirable, the technology fails because the background noise interferes with the operation of the speech recognition system. Examples of such environments are helicopters, airplanes, battle tanks, automobiles, factories, postal centers and baggage handling centers. This invention also has potential application to a class of devices known as "channel vocoders", which are used for human-to-human communications and which often need to operate in noisy conditions.
2. Description of the Prior Art
Almost all speech recognition systems carry out an acoustic analysis to derive (typically every 10 ms) a "frame" consisting of an estimate of the smoothed short-term power spectrum of the input signal. Such frames are almost always computed using either linear prediction or a bank of band-pass filters. The noise reduction technique described in this invention applies primarily to the latter kind of analysis.
One method of reducing the background noise added to a speech signal in a noisy environment is to use a noise-cancelling microphone. Such an approach, while a useful contribution, is often not enough in itself. It is complementary to the techniques described in this invention, and can be used freely in combination with them.
The remaining methods involve processing the signal, usually in digitized form. These methods can be classified by two criteria: whether they use a single or multiple microphones, and whether they operate on the acoustic waveform or on the short-term power spectrum. This classification results in four possible combinations, and all four have been tried.
Single-microphone waveform-based methods have been tried. They are effective at removing steady or slowly-changing tones, but they are much less effective at removing rapidly changing tones or atonal interference such as helicopter rotor noise.
Single-microphone spectrum-based methods have also been tried. They assume that the noise spectrum is stationary over periods when speech may be present. In one method, the noise spectrum is estimated over a period when there is no speech and then subtracted from the speech spectrum. In another method, the noise spectrum is used to identify frequency bands which will be ignored because they contain a noise level higher than the speech level in the incoming speech or in the particular frame of reference speech against which the incoming speech is being compared.
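The two single-microphone spectrum-based approaches described above can be sketched as follows. This is a minimal illustration, not any particular published implementation. Frames are assumed to be lists of band powers, and the function names, the zero floor applied after subtraction, and the rule for keeping a band (the noise estimate exceeds neither the incoming nor the template band power) are assumptions made for the example.

```python
def estimate_noise_spectrum(noise_only_frames):
    """Average each band's power over frames known to contain no speech."""
    num_frames = len(noise_only_frames)
    num_bands = len(noise_only_frames[0])
    return [
        sum(frame[k] for frame in noise_only_frames) / num_frames
        for k in range(num_bands)
    ]


def subtract_noise(frame, noise_spectrum, floor=0.0):
    """First method: subtract the noise estimate band by band, clamping at `floor`."""
    return [max(power - noise, floor) for power, noise in zip(frame, noise_spectrum)]


def usable_bands(incoming_frame, template_frame, noise_spectrum):
    """Second method: keep only bands where the noise estimate does not exceed
    the band power of either the incoming frame or the reference-speech template."""
    return [
        k
        for k, noise in enumerate(noise_spectrum)
        if noise <= incoming_frame[k] and noise <= template_frame[k]
    ]


# Hypothetical three-band example data (assumed values, for illustration only).
noise = estimate_noise_spectrum([[2.0, 4.0, 1.0], [4.0, 4.0, 3.0]])
cleaned = subtract_noise([10.0, 3.0, 2.5], noise)
kept = usable_bands([10.0, 3.0, 2.5], [8.0, 6.0, 1.0], noise)
print(noise, cleaned, kept)
```

Both functions rely on the stationarity assumption stated above: the noise estimate is computed once from speech-free frames and then reused unchanged while speech may be present.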
Multiple-microphone waveform-based methods have also been tried, in two variations. In the first method, the microphones are used as a phased array to give enhanced response in the direction of the speaker. This, like the use of a noise-cancelling microphone, is an approach that can be combined with the invention described here.
In the second multiple-microphone waveform-based method, which is closely related to the present invention, one microphone (the "speech microphone") collects the speech plus the noise and the other (the "reference microphone") aims to collect only the noise. The noise waveform at the two microphones will, in general, be different, but it is assumed that an appropriate filter (one example being a finite-impulse-response ("FIR") filter) can be used to predict the noise waveform at the speech microphone from the noise waveform at the reference microphone. That is, $s_i$, the i'th sample of the noise waveform at the speech microphone, is approximated by
$$\hat{s}_i = \sum_{j} w_j \, r_{i-j}$$
where $r_i$ is the i'th sample of the noise waveform at the reference microphone, $w_j$ is the j'th coefficient of the FIR filter of length $L$, and the sum runs over the $L$ filter taps. Adaptive two-channel filtering methods can then be used to design the FIR filter, provided that its characteristics are changing only slowly. The method requires adaptively determining the values of the coefficients in the FIR filter that will minimize the mean-square error between the actual and predicted values of the noise waveform at the speech microphone; that is, the method requires minimizing $\langle e_i^2 \rangle$, where
$$e_i = s_i - \hat{s}_i$$
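The FIR prediction and mean-square-error criterion described above can be illustrated with a short sketch. The text does not name a specific adaptation algorithm. The least-mean-squares (LMS) update used here is one common choice and is an assumption of this example, as are the function name, the tap indexing from zero, the step size, and the synthetic two-tap noise path.

```python
import random


def lms_noise_canceller(speech_mic, reference_mic, num_taps=4, step=0.05):
    """Adapt FIR weights so the filtered reference predicts the speech-microphone noise.

    Returns the residual e_i = s_i - s_hat_i for every sample and the final weights.
    """
    weights = [0.0] * num_taps
    residual = []
    for i in range(len(speech_mic)):
        # Reference samples r[i], r[i-1], ..., zero-padded before the start.
        taps = [reference_mic[i - j] if i - j >= 0 else 0.0 for j in range(num_taps)]
        predicted = sum(w * r for w, r in zip(weights, taps))
        error = speech_mic[i] - predicted
        # Stochastic-gradient step that reduces the squared error (LMS, assumed).
        weights = [w + 2.0 * step * error * r for w, r in zip(weights, taps)]
        residual.append(error)
    return residual, weights


# Hypothetical test signal: noise reaches the speech microphone through a 2-tap path.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
speech_channel = [
    0.5 * reference[i] + 0.3 * (reference[i - 1] if i > 0 else 0.0)
    for i in range(len(reference))
]
residual, weights = lms_noise_canceller(speech_channel, reference)
print([round(w, 3) for w in weights])
```

In this idealized setting, with a single fixed noise path and no speech, the weights settle close to the path coefficients and the residual shrinks toward zero. That matches the single-source case in which the method is reported to work well.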
This second multiple-microphone waveform-based method works well with single sources of noise, such as a single loudspeaker, but has not been found to be effective with multiple, distributed time-varying noise sources of the kind occurring in aircraft and in many other noisy environments. As an example of the problem faced by this method, consider the situation where the waveform sampling rate is 10 kHz so that the separation in time between adjacent taps in the filter is 0.1 ms. In this time a sound wave in air travels a little over one inch (about 3.4 cm), so that if the relative distance between the source and the two microphones changes by even that small distance the filter coefficients will be out by one position. If the filter was accurately cancelling a component in the noise at 5 kHz before the source moved, it will quadruple the interfering noise power at that frequency after the source has moved by that distance, because a one-tap shift at 5 kHz is half a period and inverts the predicted waveform.
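The quadrupling of noise power from a one-tap misalignment can be checked numerically. The sketch below assumes a unit-amplitude 5 kHz tone sampled at 10 kHz and a canceller reduced to a single unit tap tuned for zero delay. The function names and the 1000-sample window are illustrative assumptions.

```python
import math

SAMPLE_RATE_HZ = 10_000   # waveform sampling rate from the example
TONE_HZ = 5_000           # noise component being cancelled


def noise_sample(i, delay_samples=0):
    """Sample i of a unit-amplitude tone, optionally delayed by whole samples."""
    return math.cos(2.0 * math.pi * TONE_HZ * (i - delay_samples) / SAMPLE_RATE_HZ)


def mean_residual_power(reference_delay, num_samples=1000):
    """Mean power left after subtracting the predicted noise from the speech channel.

    The canceller is a single unit tap tuned for reference_delay == 0.
    """
    total = 0.0
    for i in range(num_samples):
        at_speech_mic = noise_sample(i)
        predicted = noise_sample(i, reference_delay)
        total += (at_speech_mic - predicted) ** 2
    return total / num_samples


noise_power = sum(noise_sample(i) ** 2 for i in range(1000)) / 1000
aligned = mean_residual_power(0)   # filter still matches the acoustic path
shifted = mean_residual_power(1)   # path has changed by one tap
print(aligned / noise_power, shifted / noise_power)
```

With the taps aligned the residual is zero. With a one-sample shift the prediction is the negative of the actual noise, so the subtraction doubles the amplitude and the residual power is four times the original noise power.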
Two-microphone spectrum-based methods have also been tried, although not widely reported. If the relationship between the power spectrum at the speech microphone and the power spectrum at the reference microphone can be described by a single linear filter whose characteristics change only slowly, then the noise spectrum at the speech microphone can be predicted from the noise spectrum at the reference microphone as
$$S_{ik} = \alpha_k R_{ik}$$
where $S_{ik}$ and $R_{ik}$ represent the noise power in the i'th frame and the k'th frequency band for the speech and reference signals respectively. That predicted value of the noise power in the speech channel can be exploited as in the single-microphone spectrum-based method. The advantage of the two-microphone method is that the noise intensity and the shape of the noise spectrum can change during the speech. However, the relationship between the two noise spectra would be determined during a period when there is no speech and must remain constant during the speech.
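A minimal sketch of this two-microphone spectrum-based prediction follows, reading the relation as a per-band scaling of the reference noise power by a factor alpha_k. The text does not say how the factors are obtained. Estimating each one as the ratio of total speech-channel power to total reference power over speech-free frames is an assumption of this example, as are the function names and the data values.

```python
def estimate_band_gains(speech_noise_frames, reference_noise_frames):
    """Estimate alpha_k per band from frames recorded while no speech is present.

    Assumes the reference power in every band is nonzero over the calibration frames.
    """
    num_bands = len(speech_noise_frames[0])
    gains = []
    for k in range(num_bands):
        speech_total = sum(frame[k] for frame in speech_noise_frames)
        reference_total = sum(frame[k] for frame in reference_noise_frames)
        gains.append(speech_total / reference_total)
    return gains


def predict_speech_channel_noise(reference_frame, gains):
    """Scale one reference-microphone frame band by band: alpha_k * R_ik."""
    return [alpha * power for alpha, power in zip(gains, reference_frame)]


# Hypothetical two-band calibration data (speech absent).
gains = estimate_band_gains([[2.0, 6.0], [4.0, 10.0]], [[1.0, 2.0], [2.0, 2.0]])
predicted = predict_speech_channel_noise([3.0, 0.5], gains)
print(gains, predicted)
```

Because the gains are fixed after calibration, the prediction follows changes in noise intensity at the reference microphone but cannot follow a change in the relationship between the two microphones. This is the limitation noted above.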
The limitations of the present art can be summarized as follows. Single-microphone methods operating on either the waveform or the spectrum cannot deal effectively with rapidly time-varying noise. Multiple-microphone methods operating on the waveform cannot deal effectively with moving noise sources. Current dual-microphone methods operating on the spectrum cannot deal effectively with multiple noise sources whose effect at the two microphones is different.
The present invention discloses a variation of the two-microphone method operating on the spectrum. It differs from previous methods in using an adaptive least-squares method to estimate the noise power spectrum in the signal from the speech microphone from a time-sequence of values of noise power spectrum in the signal from the reference microphone. Such adaptive least squares methods have previously been applied only to waveforms, not to power spectra.
Previous methods for estimating a noise power spectrum directly have either assumed it to be constant and taken an average from the speech microphone over a period when speech is absent, or have used single noise values from a reference microphone rather than taking linear combinations of sequences of such values.
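The idea of predicting the speech-channel noise power from a linear combination of a time-sequence of reference-channel noise powers can be sketched for a single frequency band as below. This section does not specify the adaptation rule, the number of past frames used, or when the weights are updated. The normalized LMS update, the two-frame history, the update on every frame, and all names and data values are assumptions of the example, not the method claimed by the invention.

```python
import random


def track_band_noise(speech_power, reference_power, num_lags=2, step=0.5, eps=1e-9):
    """Adapt weights so that sum_j w_j * R[i-j] approximates S[i] in one band.

    Returns the per-frame predictions (made before each update) and the final weights.
    """
    weights = [0.0] * num_lags
    predictions = []
    for i in range(len(speech_power)):
        # Reference powers for frames i, i-1, ..., zero-padded before the start.
        lags = [reference_power[i - j] if i - j >= 0 else 0.0 for j in range(num_lags)]
        predicted = sum(w * r for w, r in zip(weights, lags))
        predictions.append(predicted)
        error = speech_power[i] - predicted
        # Normalized LMS step (assumed): scale the update by the input energy.
        norm = eps + sum(r * r for r in lags)
        weights = [w + step * error * r / norm for w, r in zip(weights, lags)]
    return predictions, weights


# Hypothetical noise-only data for one band: speech-channel noise power depends
# on the current and the previous reference frame.
rng = random.Random(1)
ref_band = [rng.uniform(1.0, 3.0) for _ in range(3000)]
speech_band = [
    0.6 * ref_band[i] + 0.2 * (ref_band[i - 1] if i > 0 else 0.0)
    for i in range(len(ref_band))
]
predictions, band_weights = track_band_noise(speech_band, ref_band)
print([round(w, 3) for w in band_weights])
```

Unlike a single fixed gain per band, the weights here span several frames of the reference channel and keep adapting, which is the distinction drawn above between this approach and earlier spectrum-based methods.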