The present invention relates generally to speech signal processing and speech recognition. More particularly, the invention relates to signal equalization for normalizing a time domain source signal to a target environment, the environment being defined by channel characteristics, background noise and speaker loudness (SNR).
Many consumer and industrial applications of speech technology encounter adverse conditions that can reduce reliability. Background noise can easily degrade a weak signal, making automatic speech recognition very difficult. Other sources of signal variability also contribute to the problem. For example, variability in microphone response characteristics, room reverberation, speaker variability, distance from microphone, and the like, can easily confound a speech recognizer that has been trained under more favorable conditions. In designing speech-enabled consumer and industrial applications, it is often difficult to take all of these sources of variability into account.
To appreciate the difficulty, consider the significant difference in signal quality between the audio output of a telephone handset and the audio output of a speakerphone. Electro-acoustic differences in the microphone capsules of these respective devices account for some of the difference in quality. However, environmental factors and speaker variability can account for even more. The typical speakerphone will pick up a higher level of background noise as well as reverberation of sounds within the room. Speakerphone users rarely maintain a fixed distance to the microphone, and some users even vary their speaking patterns when using a speakerphone. A system designed to work with both handset and speakerphone will need to address this wide mismatch between these input sources.
While some sources of variability can be mitigated by careful hardware design, many cannot. In many applications the environment simply cannot be controlled and users' speaking patterns cannot be fully anticipated. A good example is the cellular telephone. Consumers use cellular telephones in a wide range of different environments, including inside moving vehicles where road noise and wind noise are significant problems. It is very difficult to implement robust speech technology features, such as telephone voice dialing features, in cellular telephone equipment. Other types of mobile systems, such as voice-assisted vehicle navigation systems, experience similar problems.
The difficulty of implementing speech technology features in such consumer products is increased by the fact that these products have limited computational resources. Often, there is precious little memory available for storing the complex templates or models needed for robust speech recognition over a wide range of conditions.
To make matters worse, background noise, speaker variability, unstable channel effects and environmental variability degrade many other aspects of the speech system, not just the speech recognizer. For example, many systems employ some form of endpoint detection mechanism to ascertain when the user has stopped speaking. End-of-speech serves as the "command" to begin processing the input speech. Endpoint detection is difficult; for example, a pause in mid-sentence, or even a weakly articulated word, can be mistaken for the end of speech.
Prior attempts to address the problems attendant to background noise and other sources of variability have sought to compensate by manipulating the speech data in the frequency domain, by developing large-footprint speech models that are trained under a variety of adverse conditions, or by compensating model parameters at runtime. Such solutions have not proven effective for most consumer applications because they do not adequately address all aspects of speech processing (speech recognition, speech endpoint detection, and the like) and because they often require large amounts of memory or computation.
The present invention provides a preprocessing system and method that normalizes an audio source to one or more predetermined targets, resulting in a robust normalized audio signal in which background noise and other sources of variability are minimized. The system will minimize the mismatch that otherwise occurs between system training and system use. The system and method of the invention operate in the time domain, and in real time, while the system is being used. They may be deployed in the signal processing path upstream of speech recognition and speech detection (endpoint detection) mechanisms. Thus the normalizing effect of the invention can readily benefit all aspects of the speech processing problem.
According to one aspect of the invention, a three-phase or three-component normalization procedure is performed on the audio source. First, the audio source is filtered to spectrally shape the time domain signal to match a predefined target channel. This may be accomplished by configuring a filter based on channel parameters selected so that the spectral shape of the channel (including the microphone and its attendant acoustic environment where applicable) matches a predefined standard or target channel. The effect of this filtering is to equalize the microphone and its acoustic environment to an appropriately selected standard microphone.
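The filtering step can be illustrated with a minimal sketch, assuming the equalizer is realized as a causal FIR filter applied directly to time domain samples. The text does not specify the filter structure or how the channel parameters are derived, so the function name `fir_filter` and the example taps are hypothetical.

```python
def fir_filter(samples, taps):
    """Apply a causal FIR filter in the time domain.

    y[n] = sum over k of taps[k] * x[n - k], with x[n] taken as 0 for n < 0.
    `taps` stands in for the channel parameters that shape the source
    channel toward the target channel (hypothetical values, not from the text).
    """
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


# Example: a two-tap averaging filter, which attenuates high frequencies.
equalized = fir_filter([1.0, 1.0, 1.0], [0.5, 0.5])
```

In a real-time setting the same computation would run frame by frame with the filter state carried across frames; that bookkeeping is omitted here.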
After equalization, the signal level of the audio source is adjusted prior to the onset of speech, to establish a background noise level or noise floor that matches or approaches a predetermined target background noise level. The target background noise level is selected with consideration given to the worst case scenario. In other words, the gain is adjusted so that the noise level prior to the onset of speech approaches the noise level of the worst case expected under noisy conditions.
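As a rough sketch of this gain adjustment, assume the noise level is measured as the RMS of the pre-speech samples and the gain is the ratio of the target level to the measured level. The RMS measure, the names `rms`, `noise_floor_gain` and `apply_gain`, and the handling of digital silence are assumptions made for illustration, not details given in the text.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def noise_floor_gain(pre_speech, target_noise_rms, eps=1e-12):
    """Gain that moves the measured pre-speech noise level to the target level.

    Assumption: if the pre-speech interval is effectively silent, the gain is
    left at 1.0 rather than amplifying by an unbounded factor.
    """
    measured = rms(pre_speech)
    if measured < eps:
        return 1.0
    return target_noise_rms / measured


def apply_gain(samples, gain):
    """Scale every sample by the same gain."""
    return [gain * s for s in samples]


# Example: pre-speech noise at RMS 0.1 is raised toward a target of 0.2.
pre_speech = [0.1, -0.1, 0.1, -0.1]
gain = noise_floor_gain(pre_speech, 0.2)
adjusted = apply_gain(pre_speech, gain)
```

Because the gain is fixed before the onset of speech, the speech that follows is scaled by the same factor as the background noise.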
Having established the background noise compensation value during the pre-speech interval, the system then calculates the noise level required to achieve a target signal-to-noise ratio. In the preferred embodiment this calculation is performed over the interval from the onset of speech up to the current frame being processed, and continues until the endpoint is detected. The system mixes noise with the audio source to approach the target SNR by selecting the greater of the noise compensation value determined during pre-speech and the noise compensation value determined during speech. If the average signal-to-noise ratio is higher than the target value, additional noise is added or mixed with the audio source. If the average signal-to-noise ratio is below the target value, no additional noise is added.
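The selection logic might be sketched as follows, assuming levels are expressed as RMS amplitudes, SNR is 20·log10(speech RMS / noise RMS), and the added noise is uncorrelated with the existing noise so that powers add. None of these conventions is stated in the text, and all function names are hypothetical.

```python
import math


def required_noise_rms(speech_rms, target_snr_db):
    """Noise RMS at which the given speech level sits exactly at the target SNR."""
    return speech_rms / (10 ** (target_snr_db / 20))


def select_noise_rms(pre_speech_noise_rms, speech_rms, target_snr_db):
    """Take the greater of the pre-speech value and the during-speech value."""
    during_speech = required_noise_rms(speech_rms, target_snr_db)
    return max(pre_speech_noise_rms, during_speech)


def additional_noise_rms(current_noise_rms, desired_noise_rms):
    """RMS of noise to mix in so the total reaches the desired level.

    Assumes uncorrelated noise (powers add). Returns 0.0 when the current
    noise already meets or exceeds the desired level, i.e. when the SNR is
    already at or below the target and no noise should be added.
    """
    extra_power = desired_noise_rms ** 2 - current_noise_rms ** 2
    return math.sqrt(extra_power) if extra_power > 0 else 0.0


# Example: speech at RMS 1.0 with a 20 dB target calls for noise at RMS 0.1,
# which exceeds a pre-speech value of 0.05, so the during-speech value wins.
chosen = select_noise_rms(0.05, 1.0, 20.0)
extra = additional_noise_rms(0.05, chosen)
```

In a running system `speech_rms` would be an average updated frame by frame from the onset of speech, so the selected value can change as more speech is observed.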
The foregoing processing steps may be performed in real time. The result is a normalized audio signal that has been manipulated to have a spectral shape, background noise level and signal-to-noise ratio that match or at least approach the predetermined target conditions. By performing the same normalization process on both training speech (for recognizer training) and on test speech (for recognizer use), the mismatch between training and testing is greatly reduced, resulting in a far more robust product.
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.