1. Field of the Invention
The present invention relates to speech recognition systems. More particularly, the invention is directed to a system and method for normalizing a voice signal in a speech pre-processor for input to a speech recognition system.
2. Discussion of the Prior Art
A well-recognized goal of speech recognition systems (hereinafter SR systems) is to normalize the voice signal to be processed, including its energy. Normalizing a voice signal enables successful comparison of the unknown spoken information with stored patterns or models. The process of energy normalization generally involves removing the long-term variations and bias in the energy of the voice signal while retaining the short-term variations that represent the phonetic information. Energy normalization enhances the accuracy of the SR system in proportion to the degree of normalization applied.
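By way of illustration only, the general idea of removing a long-term level while retaining short-term variation can be sketched as follows. This is a generic running-mean subtraction on frame log energy, not the method of the present invention; the function names, the non-overlapping framing, and the smoothing constant alpha are assumptions introduced for this example.

```python
import math

def frame_log_energy(samples, frame_len):
    """Log energy in dB of consecutive non-overlapping frames (illustrative framing)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # A small floor avoids log10(0) on all-zero frames.
        energies.append(10.0 * math.log10(mean_square + 1e-10))
    return energies

def remove_long_term_level(log_energies, alpha=0.99):
    """Subtract a slowly adapting running mean from each frame's log energy.

    alpha close to 1 makes the running mean follow only long-term changes
    (e.g. gain), so short-term frame-to-frame variation is retained.
    The value 0.99 is an assumed constant chosen for illustration.
    """
    normalized = []
    level = log_energies[0] if log_energies else 0.0
    for e in log_energies:
        level = alpha * level + (1.0 - alpha) * e
        normalized.append(e - level)
    return normalized
```

Under these assumptions, scaling the input by a constant gain shifts every frame's log energy and the running level by the same number of decibels, so the normalized values are essentially unchanged.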
The undesirable long-term variations in the energy of a voice signal typically arise from multiple sources. A common source of energy variation is variation in microphone gain and placement. Current SR systems are very sensitive to variations in both the microphone gain and placement. Improper gain and/or placement result in higher error rates. At present, the only way to accommodate the SR system is to use an offline microphone setup to set the gain. This, however, presents several disadvantages. First, it is an added burden on the user. Second, it does not measure the audio quality on-line, and so does not detect changes that have occurred since the setup. Third, it does not measure the feature that is most relevant to the SR system: the instantaneous signal-to-noise ratio.
Additional contributing factors to energy variation, which leads to higher error rates, include the intensity of a speaker's voice, which will typically exhibit a large dynamic range. A further general problem is that different speakers will have different volume levels. Thus, the variations in amplitude or energy that occur between different utterances of the same word or sentence by different speakers, or even by the same speaker at different times, must be eliminated or at least reduced.
In the prior art, hardware solutions in the form of automatic gain controls have been used on sound cards to achieve energy normalization of raw signals. However, the degree of normalization provided by such cards has proven to be inadequate for the purposes of speech recognition.
An unbiased mean value has also been used as a norm in the prior art. However, since the relative amounts of speech, silence, and noise contained within the signal are not known in advance, an unbiased mean value is not a reliable norm. The peak value of the energy provides a more reliable norm; however, tracking peak energy has an associated drawback in that the system may be too sensitive to instantaneous variations in energy. It is therefore desirable to have a reliable indicator of peak energy without being overly sensitive to peak energy variations.
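The trade-off described above can be illustrated with a small sketch. The frame energies (60 dB for speech, 20 dB for silence), the function names, and the rise and decay constants are hypothetical values chosen for this example; the smoothed tracker shown is one generic way to temper peak sensitivity and is not asserted to be the approach of the present invention.

```python
def mean_norm(energies):
    """Mean energy over all frames; depends on the speech/silence mix."""
    return sum(energies) / len(energies)

def smoothed_peak_track(energies, rise=0.1, decay=0.995):
    """Follow the peak gradually instead of jumping to each new maximum.

    When a frame exceeds the current estimate, move only a fraction `rise`
    of the way toward it; otherwise relax slowly toward the frame value.
    Both constants are assumptions made for illustration.
    """
    peak = energies[0]
    track = []
    for e in energies:
        if e > peak:
            peak += rise * (e - peak)
        else:
            peak = decay * peak + (1.0 - decay) * e
        track.append(peak)
    return track

# Same speaker level, different proportions of speech and silence.
mostly_speech = [60.0] * 80 + [20.0] * 20
mostly_silence = [60.0] * 20 + [20.0] * 80

# A single loud transient in otherwise steady speech.
with_spike = [60.0] * 50 + [90.0] + [60.0] * 50
```

In this example the mean differs by 24 dB between the two recordings although the speech level is identical, whereas the raw maximum is the same for both. The raw maximum of the third sequence, however, jumps to the 90 dB transient, while the smoothed track rises only a few decibels above the 60 dB speech level.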
A further general problem associated with energy normalization is that of silence detection. The signal energy is not a good indicator of silent periods because of background static. Static on one system could be at the level of speech on another system. Because there is no control over the sound cards and microphones that are used, it is desirable to have some alternate measure of the silence level.