Advances in digital networks like ISDN (Integrated Services Digital Network) have rekindled interest in teleconferencing and in the transmission of high quality image and sound. In an age of compact discs and high-definition television, the trend toward higher and higher fidelity has come to include the telephone as well.
Aside from pure listening pleasure, there is a need for better sounding telephones, especially in the business world. Traditional telephony, with its limited bandwidth of 300-3400 Hz for transmission of narrowband speech, tends to strain the listeners over the length of a telephone conversation. Wideband speech in the 50-7000 Hz range, on the other hand, offers the listener more presence (by reason of transmission and reception of signals in the 50-300 Hz range) and more intelligibility (by reason of transmission and reception of signals in the 3000-7000 Hz range) and is easily tolerated over long periods. Thus, wideband speech is a natural choice for improving the quality of telephone service.
In order to transmit speech (either wideband or narrowband) over the telephone network, an input speech signal, which can be characterized as a continuous function of a continuous time variable, must be converted to a digital signal--a signal that is discrete in both time and amplitude. The conversion is a two-step process. First, the input speech signal is sampled periodically in time (i.e., at a particular rate) to produce a sequence of samples where the samples take on a continuum of values. Then the values are quantized to a finite set of values, represented by binary digits (bits), to yield the digital signal. The digital signal is characterized by a bit rate, i.e., a specified number of bits per second that reflects how often the input signal was sampled and how many bits were used to quantize the sampled values.
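The two-step conversion and the resulting bit rate can be illustrated with a minimal sketch. The function names, the uniform quantizer, and the 8 kHz/8-bit and 16 kHz/14-bit settings are assumptions chosen for illustration; they are not taken from the text above.

```python
import math

def sample(signal, rate_hz, duration_s):
    """Step 1: evaluate a continuous-time function at evenly spaced instants."""
    n = int(round(rate_hz * duration_s))
    return [signal(k / rate_hz) for k in range(n)]

def quantize(samples, bits, full_scale=1.0):
    """Step 2: map each sample to one of 2**bits uniformly spaced levels."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    codes = []
    for x in samples:
        index = int(math.floor((x + full_scale) / step))
        codes.append(min(max(index, 0), levels - 1))  # clip to the valid range
    return codes

def bit_rate(rate_hz, bits):
    """Bits per second = samples per second * bits per sample."""
    return rate_hz * bits

# Hypothetical 1 kHz test tone standing in for the input speech signal.
tone = lambda t: 0.5 * math.sin(2 * math.pi * 1000 * t)

narrowband = quantize(sample(tone, 8000, 0.01), 8)    # assumed 8 kHz, 8 bits
wideband = quantize(sample(tone, 16000, 0.01), 14)    # assumed 16 kHz, 14 bits

print(bit_rate(8000, 8))    # 64000
print(bit_rate(16000, 14))  # 224000
```

Under these assumed settings, the uncompressed wideband signal needs several times the bit rate of the narrowband one, which is why coding efficiency matters for wideband speech.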
The improved quality of telephone service made possible through transmission of wideband speech, unfortunately, typically requires higher bit rate transmission unless the wideband signal is properly coded, i.e., coded such that it can be significantly compressed into a representation using fewer bits without introducing obvious distortion due to quantization errors. Recently some coders of high-fidelity speech and audio have relied on the notion that mean-squared-error measures of distortion (e.g., measures of the energy difference between a signal and the signal after coding and decoding) do not necessarily describe the perceived quality of the coded waveform--in short, not all kinds of distortion are equally perceptible. M. R. Schroeder, B. S. Atal and J. L. Hall, "Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear," J. Acous. Soc. Am., vol. 66, 1647-1652, 1979. For example, the signal-to-noise ratio between s(t) and -s(t) is -6 dB, and yet the ear cannot distinguish the two signals. Thus, given some knowledge of how the auditory system tolerates different kinds of noise, it has been possible to design coders that minimize the audibility--though not necessarily the energy--of quantization errors. More specifically, these recent coders exploit a phenomenon of the human auditory system known as masking.
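The -6 dB figure follows from treating the difference s(t) - (-s(t)) = 2s(t) as the "noise": its power is four times the signal power, and 10 log10(1/4) is about -6.02 dB. The short sketch below checks this numerically; the test tone and the helper name snr_db are assumptions made for illustration.

```python
import math

def snr_db(reference, coded):
    """Signal-to-noise ratio in dB, with noise defined as reference - coded."""
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, coded))
    return 10.0 * math.log10(signal_power / noise_power)

# Hypothetical 440 Hz tone sampled at 16 kHz, and its polarity-inverted copy.
s = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
inverted = [-x for x in s]

print(round(snr_db(s, inverted), 2))  # -6.02
```

The result does not depend on the waveform chosen, since the error is always exactly twice the signal.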
Auditory masking is a term describing the phenomenon of human hearing whereby one sound obscures or drowns out another. A common example is where the sound of a car engine is drowned out if the volume of the car radio is high enough. Similarly, if one is in the shower and misses a telephone call, it is because the sound of the shower masked the sound of the telephone ring; if the shower had not been running, the ring would have been heard. In the case of a coder, noise introduced by the coder ("coder" or "quantization" noise) is masked by the original signal, and thus perceptually lossless (or transparent) compression results when the quantization noise is shaped by the coder so as to be completely masked by the original signal at all times. Typically, this requires that the coding noise have approximately the same spectral shape as the signal since the amount of masking in a given frequency band depends roughly on the amount of signal energy in that band. P. Kroon and B. S. Atal, "Predictive Coding of Speech Using Analysis-by-Synthesis Techniques," in Advances in Speech Signal Processing (S. Furui and M. M. Sondhi, eds.) Marcel Dekker, Inc., New York, 1992.
Until now there have been two distinct approaches to perceptually lossless compression, corresponding respectively to two commercially significant audio sources and their different characteristics--compact disc/high-fidelity music and wideband (50-7000 Hz) speech. High-fidelity music, because of its greater spectral complexity, has lent itself well to a first approach using transform coding strategies. J. D. Johnston, "Transform Coding of Audio Signals Using Perceptual Criteria," IEEE J. Sel. Areas in Comm., 314-323, June 1988; B. S. Atal and M. R. Schroeder, "Predictive Coding of Speech Signals and Subjective Error Criteria," IEEE Trans. ASSP, 247-254, June 1979. In the speech processing arena, by contrast, a second approach using time-based masking schemes, e.g., code-excited linear predictive coding (CELP) and low-delay CELP (LD-CELP), has proved successful. E. Ordentlich and Y. Shoham, "Low Delay Code-Excited Linear Predictive Coding of Wideband Speech at 32 Kbps," Proc. ICASSP, 1991; J. H. Chen, "A Robust, Low-Delay CELP Speech Coder at 16 Kb/s," GLOBECOM 89, vol. 2, 1237-1240, 1989.
The two approaches rely on different techniques for shaping quantization noise to exploit masking effects. Transform coders use a technique in which, for every frame of an audio signal, a coder attempts to compute a priori the perceptual threshold of noise. This threshold is typically characterized as a signal-to-noise ratio where, for a given signal power, the ratio is determined by the level of noise power added to the signal that meets the threshold. One commonly used perceptual threshold, measured as a power spectrum, is known as the just-noticeable difference (JND) since it represents the most noise that can be added to a given frame of audio without introducing noticeable distortion. The perceptual threshold calculation, described in detail in Johnston, supra, relies on noise masking models developed by Schroeder, supra, by way of psychoacoustic experiments. Thus, the quantization noise in JND-based systems is closely matched to known properties of the ear. Frequency domain or transform coders can use JND spectra as a measure of the minimum fidelity--and therefore the minimum number of bits--required to represent each spectral component so that the coded result cannot be distinguished from the original.
Time-based masking schemes involving linear predictive coding have used different techniques. The quantization noise introduced by linear predictive speech coders is approximately white, provided that the predictor is of sufficiently high order and includes a pitch loop. B. Scharf, "Complex Sounds and Critical Bands," Psychol. Bull., vol. 58, 205-217, 1961; N. S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall, Englewood Cliffs, N.J., 1984. Because speech spectra are usually not flat, however, this distortion can become quite audible in inter-formant regions or at high frequencies, where the noise power may be greater than the speech power. In the case of wideband speech, with its extreme spectral dynamic range (up to 100 dB), the mismatch between noise and signal leads to severe audible defects.
One solution to the problems of time-based masking schemes is to filter the signal through a noise weighting (or perceptual whitening) filter designed to match the spectrum of the JND. In current CELP systems, the noise weighting filter is derived mathematically from the system's linear predictive coding (LPC) inverse filter in such a way as to concentrate coding distortions in the formant regions where the speech power is greater. This solution, although leading to improvements in actual systems, suffers from two important inadequacies. First, because the noise weighting filter depends directly on the LPC filter, it can only be as accurate as the LPC analysis itself. Second, the spectral shape of the noise weighting filter is only a crude approximation to the actual JND spectrum and is not grounded in relevant knowledge such as psychoacoustic models or experiments.
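For illustration, one commonly used form of such a filter is W(z) = A(z)/A(z/gamma), where A(z) is the LPC inverse filter and gamma is a constant between 0 and 1. The text above does not specify the form, so this choice, the value gamma = 0.8, and the toy second-order A(z) below are all assumptions. The sketch evaluates the weighting gain at a formant and away from it.

```python
import cmath
import math

def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): the k-th coefficient is scaled by gamma**k."""
    return [c * gamma ** k for k, c in enumerate(a)]

def poly_response(coeffs, omega):
    """Evaluate sum_k coeffs[k] * exp(-j*omega*k) at radian frequency omega."""
    return sum(c * cmath.exp(-1j * omega * k) for k, c in enumerate(coeffs))

def weighting_gain_db(a, omega, gamma=0.8):
    """Magnitude in dB of W(z) = A(z) / A(z/gamma) on the unit circle."""
    num = poly_response(a, omega)
    den = poly_response(bandwidth_expand(a, gamma), omega)
    return 20.0 * math.log10(abs(num) / abs(den))

# Toy LPC inverse filter whose synthesis filter 1/A(z) has one formant at pi/4.
r, theta = 0.95, math.pi / 4
a = [1.0, -2.0 * r * math.cos(theta), r * r]

at_formant = weighting_gain_db(a, theta)
off_formant = weighting_gain_db(a, 3 * math.pi / 4)
print(at_formant < off_formant)  # True
```

Because the error is weighted less at the formant than elsewhere, a coder minimizing the weighted error tolerates more noise there. The shape of W(z) is fixed entirely by A(z) and gamma, which reflects both inadequacies noted above.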