Transmission of voice by digital techniques has become widespread, particularly in long distance telephony, packet-switched telephony such as Voice over IP (VoIP), and digital radio telephony such as cellular telephony. Such proliferation has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of sixty-four kilobits per second (kbps) may be required to achieve a speech quality comparable to that of a conventional analog wireline telephone. However, through the use of speech analysis, followed by the appropriate coding, transmission, and resynthesis at the receiver, a significant reduction in the data rate can be achieved.
Devices that are configured to compress speech by extracting parameters that relate to a model of human speech generation are called “speech coders.” A speech coder typically includes an encoder and a decoder. The encoder divides the incoming speech signal into blocks of time (or “frames”), analyzes each frame to extract certain relevant parameters, and quantizes the parameters into a binary representation, such as a set of bits or a binary data packet. The data packets are transmitted over the communication channel (i.e., a wired or wireless network connection) to a receiver including a decoder. The decoder receives and processes data packets, unquantizes them to produce the parameters, and recreates speech frames using the unquantized parameters.
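The encoder and decoder roles described above can be illustrated with a deliberately small sketch. The function names, the 160-sample frame length, the 6-bit uniform quantizer, and the choice of root-mean-square (RMS) energy as the only extracted parameter are all assumptions made for illustration; a practical speech coder extracts many more parameters and also resynthesizes a waveform, which is omitted here.

```python
def split_into_frames(samples, frame_len):
    """Split a sample sequence into consecutive full frames (any tail is dropped)."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def quantize(value, lo, hi, bits):
    """Map a value in [lo, hi] to an integer index in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def unquantize(index, lo, hi, bits):
    """Map an integer index back to a representative value in [lo, hi]."""
    levels = (1 << bits) - 1
    return lo + index / levels * (hi - lo)


def encode_frame(frame, bits=6):
    """Toy encoder: the only extracted parameter is the frame's RMS energy."""
    rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
    return quantize(rms, 0.0, 1.0, bits)


def decode_frame(index, bits=6):
    """Toy decoder: recover the (approximate) parameter from its index."""
    return unquantize(index, 0.0, 1.0, bits)


# Hypothetical input: 320 samples of constant amplitude 0.5, 160 samples per frame.
frames = split_into_frames([0.5] * 320, 160)
packets = [encode_frame(f) for f in frames]      # one 6-bit index per frame
recovered = [decode_frame(p) for p in packets]   # close to 0.5, not exactly equal
```

The recovered parameter differs from the original by at most half a quantizer step, which is the price paid for representing it with a small, fixed number of bits.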
The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing natural redundancies that are inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits Ni, and the corresponding data packet produced by the speech coder has a number of bits No, the compression factor achieved by the speech coder is Cr = Ni/No. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis processes described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of No bits per frame. The goal of the speech model is thus to capture enough of the information content of the speech signal to provide a target voice quality, using a small set of parameters for each frame.
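As a worked instance of the compression factor, the following sketch computes Cr = Ni/No for one frame. The sampling rate, sample width, frame duration, and coded bit rate are assumed values, chosen so that the uncoded rate matches the sixty-four kbps figure mentioned earlier; they are not taken from any particular coder.

```python
def compression_factor(bits_in, bits_out):
    """Return Cr = Ni / No for one frame."""
    if bits_out <= 0:
        raise ValueError("bits_out must be positive")
    return bits_in / bits_out


# Assumed values for illustration only.
SAMPLE_RATE_HZ = 8000    # 8000 samples/s * 8 bits/sample = 64 kbps uncoded
BITS_PER_SAMPLE = 8
FRAME_MS = 20            # assumed frame duration
CODED_RATE_BPS = 8000    # assumed target bit rate of the coder

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 160 samples
n_i = samples_per_frame * BITS_PER_SAMPLE                # 1280 bits in
n_o = CODED_RATE_BPS * FRAME_MS // 1000                  # 160 bits out
c_r = compression_factor(n_i, n_o)                       # 8.0
```

Under these assumptions each 1280-bit input frame is represented by a 160-bit packet, a compression factor of 8.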
Speech coders may be implemented as time-domain coders, which attempt to capture the time-domain speech waveform by employing high-time-resolution processing to encode small segments of speech (typically five-millisecond (ms) subframes) at a time. For each subframe, a high-precision representative from a codebook space is found by means of various search algorithms known in the art. Alternatively, speech coders may be implemented as frequency-domain coders, which perform an analysis process to capture the short-term speech spectrum of the input speech frame with a set of parameters and employ a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors in accordance with known quantization techniques, such as those described in A. Gersho & R. M. Gray, Vector Quantization and Signal Compression (1992).
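Representing a parameter vector by a stored code vector can be sketched as a nearest-neighbor search over a codebook. The exhaustive search, the squared-error distance, and the tiny four-entry codebook below are illustrative assumptions; practical quantizers use trained codebooks and often structured or multi-stage searches.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vq_encode(vector, codebook):
    """Return the index of the code vector closest to the input vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def vq_decode(index, codebook):
    """Return the stored code vector for a transmitted index."""
    return codebook[index]


# Hypothetical codebook: 4 entries, so each index costs 2 bits.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

index = vq_encode((0.9, 0.2), CODEBOOK)     # closest entry is (1.0, 0.0)
reconstructed = vq_decode(index, CODEBOOK)
```

Only the index is transmitted; the decoder holds the same codebook and looks up the corresponding code vector.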
A well-known time-domain speech coder is the Code Excited Linear Predictive (CELP) coder. One example of such a coder is described in L. B. Rabiner & R. W. Schafer, Digital Processing of Speech Signals 396-453 (1978). In a CELP coder, the short-term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residue signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residue. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits No for each frame) or at a variable rate (in which different bit rates are used for different types of frame contents). Variable-rate coders attempt to use only the number of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable-rate CELP coder is described in U.S. Pat. No. 5,414,796 (Jacobs et al., issued May 9, 1995).
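The short-term LP analysis and the resulting LP residue can be sketched as follows. This uses the autocorrelation method with the Levinson-Durbin recursion, which is one common way to find the predictor coefficients and is an assumption here, not necessarily the method of the coders cited above. Windowing, the long-term predictor, and the stochastic codebook are omitted, and the test signal is synthetic.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (coeffs, error) where x_hat[n] = sum(coeffs[k-1] * x[n-k])."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate frame (e.g. all zeros); stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)               # remaining prediction error energy
    return a[1:], err


def lp_residual(frame, coeffs):
    """Subtract the short-term prediction from each sample."""
    residual = []
    for n in range(len(frame)):
        pred = sum(c * frame[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(frame[n] - pred)
    return residual


# Synthetic frame: impulse response of a one-pole filter with pole at 0.9.
frame = [0.9 ** n for n in range(200)]
coeffs, err = levinson_durbin(autocorrelation(frame, 2), 2)
residual = lp_residual(frame, coeffs)
```

For this synthetic frame the first coefficient comes out close to 0.9 and the second close to zero, and the residue carries far less energy than the frame itself, which is what makes it cheaper to encode.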
Time-domain coders such as the CELP coder typically rely upon a high number of bits No per frame to preserve the accuracy of the time-domain speech waveform. Such coders typically deliver excellent voice quality, provided the number of bits No per frame is relatively large (e.g., at bit rates of 8 kbps or above), and are successfully deployed in higher-rate commercial applications. However, at low bit rates (4 kbps and below), a time-domain coder may fail to retain high quality and robust performance due to the limited number of available bits. For example, the limited codebook space available at a low bit rate may clip the waveform-matching capability of a conventional time-domain coder.
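To make the bit budget concrete, the sketch below converts a bit rate into bits per frame and into an upper bound on bits per subframe. The 20 ms frame duration is an assumption (the text gives only bit rates), as is the division into four 5 ms subframes; in a real coder part of each frame's budget also goes to the filter coefficients, so the per-subframe figure is an upper bound only.

```python
def bits_per_frame(bit_rate_bps, frame_ms):
    """Number of bits available for one frame at the given bit rate."""
    return bit_rate_bps * frame_ms // 1000


FRAME_MS = 20             # assumed frame duration
SUBFRAMES_PER_FRAME = 4   # four 5 ms subframes per assumed 20 ms frame

# Map each bit rate to (bits per frame, upper bound on bits per subframe).
budget = {}
for rate_bps in (8000, 4000, 2000):
    n_o = bits_per_frame(rate_bps, FRAME_MS)
    budget[rate_bps] = (n_o, n_o // SUBFRAMES_PER_FRAME)
```

Under these assumptions an 8 kbps coder has 160 bits per frame while a 4 kbps coder has 80, and since a codebook indexed with b bits has at most 2**b entries, halving the bits available to a codebook shrinks the searchable codebook space far more than by half.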
A speech coder may be configured to select a particular coding mode and/or rate according to one or more qualities of the signal to be encoded. For example, a speech coder may be configured to distinguish frames containing speech from frames containing non-speech signals, such as signaling tones, and to use different coding modes to encode the speech and non-speech frames.
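Mode selection of this kind can be sketched as a small decision function applied to each frame. Everything specific below is hypothetical: frame energy is used as the single signal quality, and the thresholds, mode names, and bits-per-frame figures are invented for illustration. Distinguishing speech from signaling tones would in practice require additional features beyond energy.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


# Hypothetical mode table: (minimum energy, mode name, bits per frame),
# ordered from the most demanding mode to the least.
MODES = [
    (1e-2, "full_rate", 160),
    (1e-4, "half_rate", 80),
    (0.0, "eighth_rate", 20),
]


def select_mode(frame):
    """Pick the first mode whose energy threshold the frame meets."""
    energy = frame_energy(frame)
    for threshold, name, bits in MODES:
        if energy >= threshold:
            return name, bits
    return MODES[-1][1], MODES[-1][2]


loud = select_mode([0.5] * 160)     # high-energy frame
quiet = select_mode([0.05] * 160)   # low-energy frame
silent = select_mode([0.0] * 160)   # all-zero frame
```

A variable-rate coder built this way spends the larger bit budgets only on frames whose content appears to need them.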