Human speech consists of a stream of acoustic signals with frequencies ranging up to roughly 20 kHz; however, the band from about 100 Hz to 5 kHz contains the bulk of the acoustic energy. Telephone transmission of human speech originally consisted of conversion of the analog acoustic signal stream into an analog voltage signal stream (e.g., by using a microphone) for transmission and reconversion back to an acoustic signal stream (e.g., by using a loudspeaker). The electrical signals would be bandpass filtered to retain only the 300 Hz to 4 kHz frequency band, limiting bandwidth and avoiding low frequency problems. However, the advantages of digital electrical signal transmission have inspired a conversion to digital telephone transmission beginning in the 1960s. Digital telephone signals are typically derived by sampling analog signals at 8 kHz and nonlinearly quantizing the samples with 8-bit codes according to the μ-law (pulse code modulation, or PCM). A clocked digital-to-analog converter and companding amplifier reconstruct an analog electrical signal stream from the stream of 8-bit samples. Such signals require transmission rates of 64 kbps (kilobits per second), which exceeds the bandwidth of the original analog transmission.
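The μ-law companding described above can be sketched as follows. This is a minimal sketch using the continuous μ-law formula with μ = 255; a deployed G.711-style PCM codec uses segmented lookup tables instead, and the function names here are hypothetical:

```python
import math

MU = 255  # mu-law compression parameter used in North American PCM

def mulaw_encode(x):
    """Compress a sample x in [-1, 1] to an 8-bit code (0..255)."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    # Map the companded value in [-1, 1] onto integer codes 0..255
    return int(round((y + 1.0) * 127.5))

def mulaw_decode(code):
    """Expand an 8-bit code back to an approximate sample in [-1, 1]."""
    y = code / 127.5 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

The nonlinear curve allocates more codes to low-amplitude samples, so quiet speech is reproduced with proportionally finer resolution than a uniform 8-bit quantizer would allow.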
The storage of speech information in analog format (for example, on magnetic tape in a telephone answering machine) can likewise be replaced with digital storage. However, the memory demands can become overwhelming: 10 minutes of 8-bit PCM sampled at 8 kHz would require about 5 MB (megabytes) of storage.
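The storage figure follows directly from the sampling parameters; a quick check (variable names are illustrative):

```python
sample_rate_hz = 8000   # 8 kHz sampling
bytes_per_sample = 1    # 8-bit PCM codes
seconds = 10 * 60       # 10 minutes

storage_bytes = sample_rate_hz * bytes_per_sample * seconds
# 4,800,000 bytes, i.e. about 5 MB as stated above
```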
The demand for lower transmission rates and storage requirements has led to the development of compression for speech signals. One approach to speech compression models the physiological generation of speech and thereby reduces the information that must be transmitted or stored. In particular, the linear speech production model presumes excitation of a variable filter (which roughly represents the vocal tract) by either a pulse train with pitch period P (for voiced sounds) or white noise (for unvoiced sounds), followed by amplification to adjust the loudness. The filter's transfer function is traditionally denoted 1/A(z) in the z-transform domain. The model produces a stream of sounds simply by periodically making a voiced/unvoiced decision, adjusting the filter coefficients, and adjusting the gain. Generally, see Markel and Gray, Linear Prediction of Speech (Springer-Verlag 1976).
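The two-state source/filter model above can be sketched in a few lines. This is a toy synthesizer under simplifying assumptions (fixed frame length, no parameter interpolation, hypothetical function name); each frame supplies the voiced/unvoiced decision, pitch period, gain, and the coefficients of A(z):

```python
import random

def synthesize(frames, frame_len=80):
    """Toy LPC synthesizer. Each frame is (voiced, pitch, gain, a),
    where a = [a1, ..., ap] are the coefficients of
    A(z) = 1 - sum_k a_k z^-k. Illustrative sketch only."""
    out, state, phase = [], [], 0
    for voiced, pitch, gain, a in frames:
        state = state[-len(a):] if state else [0.0] * len(a)
        for _ in range(frame_len):
            if voiced:
                e = 1.0 if phase % pitch == 0 else 0.0  # pulse train
                phase += 1
            else:
                e = random.gauss(0.0, 1.0)              # white noise
            # All-pole filter 1/A(z): s[n] = g*e[n] + sum_k a_k*s[n-k]
            s = gain * e + sum(ak * sk for ak, sk in zip(a, reversed(state)))
            state = state[1:] + [s]
            out.append(s)
    return out
```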
To reduce the bit rate, the coefficients for successive frames may be interpolated. However, to improve the sound quality, further information may be extracted from the speech, compressed, and transmitted or stored. For example, the code-excited linear prediction (CELP) method first analyzes a speech frame to find A(z) and filters the speech with A(z) to obtain a residual. Next, a pitch period determination is made, and a comb filter removes this periodicity to yield a noise-like excitation signal. Then the excitation signal is encoded with a codebook. Thus CELP transmits the LPC filter coefficients, the pitch, and the codebook index of the excitation.
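The pitch-removal step above can be sketched as a one-tap long-term (comb) filter that subtracts a scaled copy of the residual one pitch period earlier; the fixed gain and the function name are illustrative assumptions, since real coders estimate the pitch-prediction gain per frame:

```python
def remove_pitch(residual, pitch, gain=0.8):
    """Long-term (comb) filter: e[n] = r[n] - gain * r[n - pitch].
    Subtracting the signal one pitch period earlier removes the
    periodicity, leaving a noise-like excitation to be codebook-coded."""
    return [r - gain * (residual[n - pitch] if n >= pitch else 0.0)
            for n, r in enumerate(residual)]
```

On a perfectly periodic residual with the correct pitch lag and unit gain, everything after the first period cancels exactly.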
Another approach is to mix voiced and unvoiced excitations for the LPC filter. For example, McCree, A New LPC Vocoder Model for Low Bit Rate Speech Coding, Ph.D. thesis, Georgia Institute of Technology, August 1992, divides the excitation frequency range into bands, makes the voiced/unvoiced mixture decision in each band separately, and combines the results for the total excitation. A mixed excitation linear prediction (MELP) vocoder is described in an article by A. McCree et al. entitled "A Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding," in IEEE Trans. on Speech and Audio Proc., Vol. 3, No. 4, July 1995. The above-cited applications Ser. Nos. 08/218,003 and 08/336,593 describe a mixed excitation linear prediction speech coder. These references are incorporated herein by reference.
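The per-band mixing idea can be sketched as follows. This is a simplified illustration, not the MELP reference coder: the bandpass filter bank is assumed to be given as callables, the per-band voicing strengths are taken as inputs rather than estimated from the speech, and the function name is hypothetical:

```python
import random

def mixed_excitation(n, pitch, band_filters, voicing):
    """Sketch of a per-band mixed excitation: in each frequency band,
    blend a pulse train and white noise according to a voicing
    strength in [0, 1], then sum the bands to form the excitation.
    band_filters: callables applying each bandpass filter (assumed
    given; real coders use a fixed FIR filter bank)."""
    pulses = [1.0 if i % pitch == 0 else 0.0 for i in range(n)]
    noise = [random.gauss(0.0, 1.0) for _ in range(n)]
    out = [0.0] * n
    for bp, v in zip(band_filters, voicing):
        p, w = bp(pulses), bp(noise)
        for i in range(n):
            out[i] += v * p[i] + (1.0 - v) * w[i]  # voiced/unvoiced blend
    return out
```

A fully voiced band passes only the pulse train; a fully unvoiced band passes only noise; intermediate strengths give the mixture decision its per-band granularity.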
Most low bit rate speech coders employ some form of adaptive spectral enhancement filter, or postfilter, to improve the perceived quality of the processed speech signal. For example, the Mixed Excitation Linear Predictive (MELP) speech coder of McCree et al. uses an adaptive pole/zero enhancement filter based on the LPC spectrum. The adaptive spectral enhancement filter helps the bandpass filtered speech to match natural speech waveforms in the formant regions. This adaptive filter improves the speech quality for clean input signals, but in the presence of acoustic noise it may actually degrade performance. The enhancement filter tends to increase the fluctuations in the power spectrum of the acoustic background noise, causing an unnatural "swirling" effect that can be very annoying to listeners. A similar effect takes place in the postfilter of the CELP speech coder.
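A common form of such a pole/zero enhancement filter is H(z) = A(z/β)/A(z/α), built from the frame's LPC polynomial with bandwidth-expansion constants α and β. The sketch below implements that difference equation directly; the constants shown and the function name are assumptions for illustration, not the values of any particular coder:

```python
def enhance(x, a, alpha=0.8, beta=0.5):
    """Pole/zero enhancement filter H(z) = A(z/beta) / A(z/alpha),
    where A(z) = 1 - sum_k a_k z^-k from the frame's LPC analysis.
    Equivalent difference equation:
      y[n] = x[n] - sum_k a_k*beta^k*x[n-k] + sum_k a_k*alpha^k*y[n-k]
    alpha > beta sharpens the formant peaks relative to the valleys."""
    num = [-ak * beta ** (k + 1) for k, ak in enumerate(a)]   # zeros
    den = [ak * alpha ** (k + 1) for k, ak in enumerate(a)]   # poles
    xs, ys, out = [0.0] * len(a), [0.0] * len(a), []
    for xn in x:
        yn = (xn + sum(c * xk for c, xk in zip(num, xs))
                 + sum(c * yk for c, yk in zip(den, ys)))
        xs = [xn] + xs[:-1]   # shift in x[n-1], x[n-2], ...
        ys = [yn] + ys[:-1]   # shift in y[n-1], y[n-2], ...
        out.append(yn)
    return out
```

Because both numerator and denominator track the LPC spectrum, the filter adapts frame by frame; the "swirling" artifact noted above arises when that adaptation chases the fluctuating spectrum of background noise instead of speech formants.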
In accordance with one object of the present invention, an improvement is provided to this adaptive spectral enhancement filter or postfilter in CELP, which results in better performance in the presence of acoustic noise while maintaining the quality improvement of the existing method for clean speech signals.