As taught in the literature of signal compression, speech and music waveforms are coded by very different techniques. Speech coding, such as telephone-bandwidth (3.4 kHz) speech coding at or below 16 kb/s, has been dominated by time-domain predictive coders. These coders use speech production models to predict the speech waveforms to be coded. The predicted waveforms are then subtracted from the original waveforms to reduce redundancy in the signal, and this reduction in redundancy provides coding gain. Examples of such predictive speech coders include Adaptive Predictive Coding, Multi-Pulse Linear Predictive Coding, and Code-Excited Linear Prediction (CELP) Coding, all well known in the art of speech signal compression.
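The predict-and-subtract step described above can be illustrated with a minimal sketch. It assumes a plain short-term linear predictor fitted by the autocorrelation method, and it uses a synthetic second-order autoregressive signal as a stand-in for speech. All function names, the frame length, and the generating coefficients are hypothetical choices made for illustration, not details taken from any of the coders named above.

```python
import math
import random

def autocorrelation(x, max_lag):
    """Return r[0..max_lag], the autocorrelation of the frame x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients c[0..order-1],
    where the prediction is x_hat[n] = sum_k c[k] * x[n - 1 - k]."""
    c = [0.0] * (order + 1)          # c[0] is unused padding
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(c[j] * r[i - j] for j in range(1, i))) / err
        prev = c[:]
        c[i] = k
        for j in range(1, i):
            c[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return c[1:]

def residual(x, c):
    """Subtract the predicted waveform from the original waveform."""
    return [x[n] - sum(c[k] * x[n - 1 - k]
                       for k in range(len(c)) if n - 1 - k >= 0)
            for n in range(len(x))]

def prediction_gain_db(x, e):
    """Ratio of original power to residual power, in dB."""
    power = lambda s: sum(v * v for v in s) / len(s)
    return 10.0 * math.log10(power(x) / power(e))

def synthetic_frame(length=2000, seed=0):
    """Hypothetical test signal: x[n] = 1.3 x[n-1] - 0.6 x[n-2] + noise."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(length - 2):
        x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))
    return x

frame = synthetic_frame()
coeffs = levinson_durbin(autocorrelation(frame, 2), 2)
gain_db = prediction_gain_db(frame, residual(frame, coeffs))
print(coeffs, gain_db)
```

On this synthetic signal the fitted coefficients should land close to the 1.3 and -0.6 used to generate it, and the residual carries noticeably less power than the original frame. That power difference is the redundancy the predictor has removed.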
On the other hand, wideband (0-20 kHz) music coding at or above 64 kb/s has been dominated by frequency-domain transform or sub-band coders. These music coders are fundamentally different from the speech coders discussed above, because the sources of music, unlike those of speech, are too varied to allow ready prediction. Consequently, models of music sources are generally not used in music coding. Instead, music coders use elaborate human hearing models to code only those parts of the signal that are perceptually relevant. That is, unlike speech coders, which commonly use speech production models, music coders employ models of hearing (sound reception) to obtain coding gain.
In music coders, hearing models are used to determine a noise masking capability of the music to be coded. The term "noise masking capability" refers to how much quantization noise can be introduced into a music signal without a listener noticing the noise. This noise masking capability is then used to set quantizer resolution (e.g., quantizer stepsize). Generally, the more "tonelike" music is, the poorer the music will be at masking quantization noise and, therefore, the smaller the required quantizer stepsize will be, and vice versa. Smaller stepsizes correspond to smaller coding gains, and vice versa. Examples of such music coders include AT&T's Perceptual Audio Coder (PAC) and the ISO MPEG audio coding standard.
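The relationship between tonality and quantizer stepsize can be sketched roughly as follows. This is not the hearing model of PAC or MPEG. It assumes a single whole-frame tonality estimate taken from the spectral flatness measure, a -60 dB flatness reference, and hypothetical signal-to-noise offsets of 18 dB for tone-like and 6 dB for noise-like frames. The only non-illustrative ingredient is the standard result that a uniform quantizer with stepsize D produces noise power of about D*D/12.

```python
import cmath
import math
import random

def power_spectrum(frame):
    """Power spectrum of a Hann-windowed frame via a direct DFT
    (slow, but adequate for a short illustrative frame)."""
    n = len(frame)
    xw = [v * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
          for i, v in enumerate(frame)]
    return [abs(sum(xw[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(1, n // 2)]

def tonality(frame):
    """Tonality index in [0, 1]: 0 for a flat spectrum, 1 for a peaky one."""
    spectrum = [p + 1e-12 for p in power_spectrum(frame)]  # avoid log(0)
    geometric = math.exp(sum(math.log(p) for p in spectrum) / len(spectrum))
    arithmetic = sum(spectrum) / len(spectrum)
    sfm_db = 10.0 * math.log10(geometric / arithmetic)
    return max(0.0, min(sfm_db / -60.0, 1.0))   # -60 dB reference is assumed

def quantizer_stepsize(frame, tone_offset_db=18.0, noise_offset_db=6.0):
    """Stepsize whose quantization noise sits the chosen offset below the signal."""
    alpha = tonality(frame)
    offset_db = alpha * tone_offset_db + (1.0 - alpha) * noise_offset_db
    signal_power = sum(v * v for v in frame) / len(frame)
    allowed_noise_power = signal_power / 10.0 ** (offset_db / 10.0)
    return math.sqrt(12.0 * allowed_noise_power)  # noise power = step**2 / 12

n = 128
tone = [math.sqrt(2.0) * math.sin(2.0 * math.pi * 1000.0 * t / 16000.0)
        for t in range(n)]
rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
scale = math.sqrt(sum(v * v for v in noise) / n)
noise = [v / scale for v in noise]               # normalize to unit power

tone_step = quantizer_stepsize(tone)
noise_step = quantizer_stepsize(noise)
print(tone_step, noise_step)
```

Both test frames have the same power, yet the tone-like frame receives the smaller stepsize, which matches the trend described above. A practical coder would make this decision per frequency band rather than once per frame.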
Between telephone-bandwidth speech coding and wideband music coding lies wideband speech coding, where the speech signal is sampled at 16 kHz and has a bandwidth of 7 kHz. The advantage of 7 kHz wideband speech is that its quality is much better than that of telephone-bandwidth speech, and yet it requires a much lower bit-rate to code than a 20 kHz audio signal. Among previously proposed wideband speech coders, some use time-domain predictive coding, some use frequency-domain transform or sub-band coding, and some use a mixture of time-domain and frequency-domain techniques.
The inclusion of perceptual criteria in predictive speech coding, wideband or otherwise, has been limited to the use of a perceptual weighting filter in the context of selecting the best synthesized speech signal from among a plurality of candidate synthesized speech signals. See, e.g., U.S. Pat. No. Re. 32,580 to Atal et al. Such filters accomplish a type of noise shaping which is useful in reducing the perceived noise introduced by the coding process. One known coder attempts to improve upon this technique by employing a perceptual model in the formation of that perceptual weighting filter. See W. W. Chang et al., "Audio Coding Using Masking-Threshold Adapted Perceptual Filter," Proc. IEEE Workshop Speech Coding for Telecomm., pp. 9-10, October 1993.
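Selecting a candidate with a perceptual weighting filter can be sketched as follows. The filter form W(z) = A(z) / A(z/gamma), with A(z) = 1 - sum_k c[k] z^-(k+1), is one commonly cited form and is assumed here for illustration. It is not asserted to be the exact filter of either reference cited above. The function names, the default gamma of 0.8, and the example values are likewise hypothetical.

```python
def bandwidth_expand(c, gamma):
    """Coefficients of A(z/gamma): predictor tap k+1 is scaled by gamma**(k+1)."""
    return [ck * gamma ** (k + 1) for k, ck in enumerate(c)]

def weighting_filter(x, c, gamma=0.8):
    """Filter x through W(z) = A(z) / A(z/gamma)."""
    d = bandwidth_expand(c, gamma)
    y = []
    for n in range(len(x)):
        # Numerator A(z): the prediction-error (FIR) part.
        v = x[n] - sum(c[k] * x[n - 1 - k]
                       for k in range(len(c)) if n - 1 - k >= 0)
        # Denominator 1 / A(z/gamma): the all-pole (IIR) part.
        v += sum(d[k] * y[n - 1 - k]
                 for k in range(len(d)) if n - 1 - k >= 0)
        y.append(v)
    return y

def best_candidate(target, candidates, c, gamma=0.8):
    """Index of the candidate with the least perceptually weighted error energy."""
    def weighted_error(candidate):
        err = [t - s for t, s in zip(target, candidate)]
        return sum(v * v for v in weighting_filter(err, c, gamma))
    return min(range(len(candidates)),
               key=lambda i: weighted_error(candidates[i]))

# Hypothetical usage: one original frame and three candidate syntheses.
predictor = [1.3, -0.6]
original = [0.0, 1.0, 1.5, 0.9, -0.2, -0.8]
candidates = [
    [0.0] * 6,
    [0.0, 0.9, 1.4, 1.0, -0.1, -0.7],
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
]
chosen = best_candidate(original, candidates, predictor)
print(chosen)
```

With gamma equal to 1 the filter reduces to the identity and the selection becomes an ordinary squared-error comparison. Values of gamma below 1 de-emphasize the error near spectral peaks of the signal, where the signal itself is better able to mask it.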