1. Field of the Invention
This invention relates generally to digital communications, and more particularly, to digital coding (or compression) of speech and/or audio signals.
2. Related Art
In speech or audio coding, the coder encodes the input speech or audio signal into a digital bit stream for transmission or storage, and the decoder decodes the bit stream into an output speech or audio signal. The combination of the coder and the decoder is called a codec.
In the field of speech coding, the most popular encoding method is predictive coding. Rather than directly encoding the speech signal samples into a bit stream, a predictive encoder predicts the current input speech sample from previous speech samples, subtracts the predicted value from the input sample value, and then encodes the difference, or prediction residual, into a bit stream. The decoder decodes the bit stream into a quantized version of the prediction residual, and then adds the predicted value back to the residual to reconstruct the speech signal. This encoding principle is called Differential Pulse Code Modulation, or DPCM. In conventional DPCM codecs, the coding noise, or the difference between the input signal and the reconstructed signal at the output of the decoder, is white. In other words, the coding noise has a flat spectrum. Since the spectral envelope of voiced speech slopes down with increasing frequency, such a flat noise spectrum means the coding noise power often exceeds the speech power at high frequencies. When this happens, the coding distortion is perceived as a hissing noise, and the decoder output speech sounds noisy. Thus, white coding noise is not optimal in terms of perceptual quality of output speech.
The perceptual quality of coded speech can be improved by adaptive noise spectral shaping, where the spectrum of the coding noise is adaptively shaped so that it follows the input speech spectrum to some extent. In effect, this makes the coding noise more speech-like. Due to the noise masking effect of human hearing, such shaped noise is less audible to human ears. Therefore, codecs employing adaptive noise spectral shaping gives better output quality than codecs giving white coding noise.
In recent and popular predictive speech coding techniques such as Multi-Pulse Linear Predictive Coding (MPLPC) or Code-Excited Linear Prediction (CELP), adaptive noise spectral shaping is achieved by using a perceptual weighting filter to filter the coding noise and then calculating the mean-squared error (MSE) of the filter output in a closed-loop codebook search. However, an alternative method for adaptive noise spectral shaping, known as Noise Feedback Coding (NFC), had been proposed more than two decades before MPLPC or CELP came into existence.
The basic ideas of NFC date back to C. C. Cutler in a U.S. Patent entitled “Transmission Systems Employing Quantization,” U.S. Pat. No. 2,927,962, issued Mar. 8, 1960. Based on Cutler's ideas, E. G. Kimme and F. F. Kuo proposed a noise feedback coding system for television signals in their paper “Synthesis of Optimal Filters for a Feedback Quantization System,” IEEE Transactions on Circuit Theory, pp. 405-413, September 1963. Enhanced versions of NFC, applied to Adaptive Predictive Coding (APC) of speech, were later proposed by J. D. Makhoul and M. Berouti in “Adaptive Noise Spectral Shaping and Entropy Coding in Predictive Coding of Speech,” IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 63-73, February 1979, and by B. S. Atal and M. R. Schroeder in “Predictive Coding of Speech Signals and Subjective Error Criteria,” IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 247-254, June 1979. Such codecs are sometimes referred to as APC-NFC. More recently, NFC has also been used to enhance the output quality of Adaptive Differential Pulse Code Modulation (ADPCM) codecs, as proposed by C. C. Lee in “An enhanced ADPCM Coder for Voice Over Packet Networks,” International Journal of Speech Technology, pp. 343-357, May 1999.
In noise feedback coding, the difference signal between the quantizer input and output is passed through a filter, whose output is then added to the prediction residual to form the quantizer input signal. By carefully choosing the filter in the noise feedback path (called the noise feedback filter), the spectrum of the overall coding noise can be shaped to make the coding noise less audible to human ears. Initially, NFC was used in codecs with only a short-term predictor that predicts the current input signal samples based on the adjacent samples in the immediate past. Examples of such codecs include the systems proposed by Makhoul and Berouti in their 1979 paper. The noise feedback filters used in such early systems are short-term filters. As a result, the corresponding adaptive noise shaping only affects the spectral envelope of the noise spectrum. (For convenience, we will use the terms “short-term noise spectral shaping” and “envelope noise spectral shaping” interchangeably to describe this kind of noise spectral shaping.)
In addition to the short-term predictor, Atal and Schroeder added a three-tap long-term predictor in the APC-NFC codecs proposed in their 1979 paper cited above. Such a long-term predictor predicts the current sample from samples that are roughly one pitch period earlier. For this reason, it is sometimes referred to as the pitch predictor in the speech coding literature. (Again, the terms “long-term predictor” and “pitch predictor” will be used interchangeably.) While the short-term predictor removes the signal redundancy between adjacent samples, the pitch predictor removes the signal redundancy between distant samples due to the pitch periodicity in voiced speech. Thus, the addition of the pitch predictor further enhances the overall coding efficiency of the APC systems. However, the APC-NFC codec proposed by Atal and Schroeder still uses only a short-term noise feedback filter. Thus, the noise spectral shaping is still limited to shaping the spectral envelope only.
In their paper entitled “Techniques for Improving the Performance of CELP-Type Speech Coders,” IEEE Journal on Selected Areas in Communications, pp. 858-865, June 1992, I. A. Gerson and M. A. Jasiuk reported that the output speech quality of CELP codecs could be enhanced by shaping the coding noise spectrum to follow the harmonic fine structure of the voiced speech spectrum. (We will use the terms “harmonic noise shaping” or “long-term noise shaping” interchangeably to describe this kind of noise spectral shaping.) They achieved this goal by using a harmonic weighting filter derived from a three-tap pitch predictor. The effect of such harmonic noise spectral shaping is to make the noise intensity lower in the spectral valleys between pitch harmonic peaks, at the expense of higher noise intensity around the frequencies of pitch harmonic peaks. The noise components around the frequencies of pitch harmonic peaks are better masked by the voiced speech signal than the noise components in the spectral valleys between harmonics. Therefore, harmonic noise spectral shaping further reduces the perceived noise loudness, in addition to the reduction already provided by the shaping of the noise spectral envelope alone.
In Lee's May 1999 paper cited earlier, harmonic noise spectral shaping was used in addition to the usual envelope noise spectral shaping. This is achieved with a noise feedback coding structure in an ADPCM codec. However, due to ADPCM backward compatibility constraint, no pitch predictor was used in that ADPCM-NFC codec.
As discussed above, both harmonic noise spectral shaping and the pitch predictor are desirable features of predictive speech codecs that can make the output speech less noisy. Atal and Schroeder used the pitch predictor but not harmonic noise spectral shaping. Lee used harmonic noise spectral shaping but not the pitch predictor. Gerson and Jasiuk used both the pitch predictor and harmonic noise spectral shaping, but in a CELP codec rather than an NFC codec. Because of the Vector Quantization (VQ) codebook search used in quantizing the prediction residual (often called the excitation signal in CELP literature), CELP codecs normally have much higher complexity than conventional predictive noise feedback codecs based on scalar quantization, such as APC-NFC. For speech coding applications that require low codec complexity and high quality output speech, it is desirable to improve the scalar-quantization-based APC-NFC so it incorporates both the pitch predictor and harmonic noise spectral shaping.
The conventional NFC codec structure was developed for use with single-stage short-term prediction. It is not obvious how the original NFC codec structure should be changed to get a coding system with two stages of prediction (short-term prediction and pitch prediction) and two stages of noise spectral shaping (envelope shaping and harmonic shaping).
Even if a suitable codec structure can be found for two-stage APC-NFC, another problem is that the conventional APC-NFC is restricted to scalar quantization of the prediction residual. Although this allows the APC-NFC codecs to have a relatively low complexity when compared with CELP and MPLPC codecs, it has two drawbacks. First, scalar quantization limits the encoding bit rate for the prediction residual to integer number of bits per sample (unless complicated entropy coding and rate control iteration loop are used). Second, scalar quantization of prediction residual gives a codec performance inferior to vector quantization of the excitation signal, as is done in most modern codecs such as CELP. All these problems are addressed by the present invention.