In speech recognition systems it is important that the speech signals are normalized to enable successful comparison of the unknown spoken information with stored patterns or models. Thus, the variations in amplitude or energy that occur between different utterances of the same word or sentence by different speakers, or even by the same speaker at different times, must be eliminated or at least reduced.
A common source of variation, both between speakers and for a single speaker over time, is change in the glottal waveform of vowels and in the energy of high-frequency fricatives. To normalize this variation, appropriate filtering may be employed.
In an article by H. F. Silverman et al. entitled "A Parametrically Controlled Spectral Analysis System for Speech", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-22, No. 5, October 1974, pp. 362-381, amplitude calibration or normalization is described. To normalize the above-mentioned variations, the article suggests filtering the input speech with a linear filter matched to the long-term speech spectrum. The problem with this solution is that such a filter, while normalizing the vowel and fricative spectra, will distort the spectrum of silence. Although the article briefly addresses the handling of silence in speech signal processing, it discloses no solution for the distortion of the silence spectrum that occurs during normalization.
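The distortion of silence described above can be illustrated with a minimal sketch. It is not taken from the cited article. It assumes that the fixed linear filter acts as an equalizer that flattens the long-term average speech spectrum, and it models that filter in the log domain as a per-band subtraction. The four-band energies and the helper names (long_term_spectrum, equalize, spectral_tilt) are invented for illustration only.

```python
def long_term_spectrum(frames):
    """Average log energy (dB) per band over all frames (hypothetical helper)."""
    n_bands = len(frames[0])
    return [sum(f[b] for f in frames) / len(frames) for b in range(n_bands)]


def equalize(frame, lts):
    """Model a fixed linear filter in the log domain: subtract the
    long-term spectrum from the frame, band by band."""
    return [x - m for x, m in zip(frame, lts)]


def spectral_tilt(frame):
    """Highest band minus lowest band, in dB; 0 means a flat spectrum."""
    return frame[-1] - frame[0]


# Invented log energies (dB) in four bands, ordered low to high frequency.
speech_frames = [
    [60.0, 54.0, 45.0, 36.0],  # vowel-like: energy falls with frequency
    [63.0, 54.0, 48.0, 33.0],  # vowel-like
    [42.0, 42.0, 48.0, 51.0],  # fricative-like: energy rises with frequency
]
silence_frame = [20.0, 20.0, 20.0, 20.0]  # assumed flat background noise

lts = long_term_spectrum(speech_frames)   # [55.0, 50.0, 47.0, 40.0]
eq_silence = equalize(silence_frame, lts)

print(spectral_tilt(silence_frame))  # 0.0: silence is flat before filtering
print(spectral_tilt(eq_silence))     # 15.0: the same filter tilts silence
```

Under these assumptions the equalized speech frames average to a flat spectrum, while the originally flat silence frame acquires a 15 dB upward tilt, because the filter is applied regardless of whether the frame contains speech.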
U.S. Pat. No. 4,060,694 to Suzuki et al. entitled "Speech Recognition Method and Apparatus Adapted to a Plurality of Different Speakers" also deals with the normalization of the sound pressure level of an input speech signal. However, the problems caused by the existence of intervals of silence in the speech signals are not addressed.