Background noise during speech can degrade voice communications. The listener might not be able to understand what is being transmitted, and is aggravated by trying to identify and interpret speech while noise is present. Also, in speech recognition systems, errors occur more frequently as the level of background (or ambient) noise increases.
Substantial efforts have been made to reduce the level of ambient noise in communications systems on a real-time basis. One is to filter out the low and high bands at the extremes of the voice band. The problem with this is that much noise is located in the same frequencies as usable speech.
Another is to actively estimate the noise and filter it out of the associated speech. This is generally done by quantifying the signal when speech is not present (presumed to be representative of ambient noise), and subtracting out that signal during speech. If the ambient noise is consistent between periods of speech and periods of non-speech, then such cancellation techniques can be very effective.
A typical state-of-the-art noise cancellation (speech enhancement) system generally has three components:
Speech/Noise Detector PA1 Noise Estimator PA1 Noise Canceller
A standard speech enhancement system might typically operate as follows:
The input signal is sampled and converted to digital values, called "samples". These samples are grouped into "frames" whose duration is typically in the range of 10 to 30 milliseconds each. An energy value is then computed for each such frame of the input signal.
A typical state-of-the-art Speech/Noise Detector is often accomplished via a software implementation on a general purpose computer. The system can be implemented to operate on incoming frames of data by classifying each input frame as ambient noise if the frame energy is below an energy threshold, or as speech if the frame energy is above the threshold. An alternative would be to analyze the individual frequency components of the signal in relation to a template of noise components. Other variations of the above scheme are also known, and may be implemented.
The Speech/Noise Detector is initialized by setting the threshold to some pre-set value (usually based on a history of empirically observed energy levels of representative speech and ambient noise). During operation, as the frames are classified, the threshold can be adjusted to reflect the incoming frames, thereby creating a better discrimination between speech and noise.
A typical state-of-the-art Noise Estimator is then utilized to form a quantitative estimate of the signal characteristics of the frame (typically described by its frequency components). This noise estimate is also initialized at the beginning of the input signal and then updated continuously during operation, as more noise signals are received. If a frame is classified as noise by the Speech/Noise Detector, that frame is used to update the running estimate of noise. Typically, the more recent frames of noise received are given greater weight in the computation of the noise estimate.
The Noise Canceller component of the system takes the estimate of the noise from the Noise Estimator, and subtracts it from the signal. A state-of-the-art cancellation method is that of "spectral subtraction", where the subtraction is performed on the frequency components of the signal. This may be accomplished using both linear and non-linear means.
Effectiveness of the overall noise-cancellation system in enhancing the signal, i.e. enhancing the speech, is critically dependent on the noise estimate; a poor or inappropriate estimate will result in the benign error of negligible enhancement, or the malign error of degradation of the speech.
One of the problems with existing speech enhancement systems utilizing noise cancellation relates to the "hands-free" telephony environment.
Typically, in the case of hands-free telephony, such as speaker phones and "hands-free" mobile phones, squelch is incorporated into the telephone when no speech is being input into the microphone, for purposes of reducing echo. This is typically accomplished by attenuating the microphone signal until a pre-determined level of energy is detected at the microphone. The use of squelch results in a very low-level, uniform noise signal at the far end, generally representative of noise on the line, rather than ambient noise near the microphone. Consequently, the noise estimate obtained from a Noise Estimator prior to speech onset (when squelch is present) will not describe target noise, since squelch is not active during speech. A different ambient noise will be present during speech (target noise), and shortly thereafter, until the squelch is re-introduced.
Current noise-cancellation systems will therefore utilize the "squelch" noise sample to subtract from the first speech utterance, which will not be representative of the target noise (noise during speech). Further, the noise following speech, which is representative of the target noise, will be "averaged" with the noise prior to speech in order to arrive at the noise estimate used for cancellation purposes to be applied to subsequent speech utterances. This averaging will once again not be representative of the target noise.
This problem is exacerbated by the fact that "hands-free" operations have a great deal of ambient noise, since the microphone is not next to the speaking person's lips. The signal-to-noise ratio (SNR) for hands-free environments is poor, and can even be less than one. Therefore, in an existing system, the noise estimate used for cancellation will be based on capturing a very low, uniform (squelched) noise signal, and applying that estimate to speech which has different frequency component characteristics and high energy-level (non-squelched) ambient noise, thereby rendering the system's cancellation process ineffective when it is needed most (in a noisy hands-free environment).
Also, if there is enough silence between speech utterances, the squelch will kick back in, rendering the subsequent series of noise samples likewise unrepresentative of target noise.
Additionally, in the case of isolated utterance speech recognition systems (where the input is typically a single utterance followed by silence), combined with a hands-free environment, a typical existing system would not reduce ambient noise (and might indeed introduce additional noise or degrade the speech). Where there is a single utterance, existing systems would use the pre-speech noise (squelched) and apply it to the utterance (where squelch would not be present). There would be no opportunity to measure and apply post-speech noise samples to the single utterance.
Another drawback to existing noise-reduction systems occurs in situations involving dynamically directional microphones and voice-activated microphones. In each case, the ambient noise during speech will more closely approximate the noise immediately following speech than the noise immediately preceding speech. This is due to the fact that the environment picked up by microphones for input into the system changes radically once speech begins, but does not return to the initial state until some period of time following speech. Therefore, current systems would use the unrepresentative noise prior to speech to enhance the speech, resulting in poor performance.