Applications for communication and/or storage of a voice signal typically use a microphone to capture an audio signal that includes the sound of a primary speaker's voice. The part of the audio signal that represents the voice is called the speech or speech component. The captured audio signal will usually also include other sound from the microphone's ambient acoustic environment, such as background sounds. This part of the audio signal is called the context or context component.
Transmission of audio information, such as speech and music, by digital techniques has become widespread, particularly in long distance telephony, packet-switched telephony such as Voice over IP (also called VoIP, where IP denotes Internet Protocol), and digital radio telephony such as cellular telephony. Such proliferation has created interest in reducing the amount of information used to transfer a voice communication over a transmission channel while maintaining the perceived quality of the reconstructed speech. For example, it is desirable to make the best use of available wireless system bandwidth. One way to use system bandwidth efficiently is to employ signal compression techniques. For wireless systems which carry speech signals, speech compression (or “speech coding”) techniques are commonly employed for this purpose.
Devices that are configured to compress speech by extracting parameters that relate to a model of human speech generation are often called voice coders, codecs, vocoders, “audio coders,” or “speech coders,” and the description that follows uses these terms interchangeably. A speech coder generally includes a speech encoder and a speech decoder. The encoder typically receives a digital audio signal as a series of blocks of samples called “frames,” analyzes each frame to extract certain relevant parameters, and quantizes the parameters into an encoded frame. The encoded frames are transmitted over a transmission channel (i.e., a wired or wireless network connection) to a receiver that includes a decoder. Alternatively, the encoded audio signal may be stored for retrieval and decoding at a later time. The decoder receives and processes encoded frames, dequantizes them to produce the parameters, and recreates speech frames using the dequantized parameters.
In a typical conversation, each speaker is silent for about sixty percent of the time. Speech encoders are usually configured to distinguish frames of the audio signal that contain speech (“active frames”) from frames of the audio signal that contain only context or silence (“inactive frames”). Such an encoder may be configured to use different coding modes and/or rates to encode active and inactive frames. For example, inactive frames are typically perceived as carrying little or no information, and speech encoders are usually configured to use fewer bits (i.e., a lower bit rate) to encode an inactive frame than to encode an active frame.
Examples of bit rates used to encode active frames include 171 bits per frame, eighty bits per frame, and forty bits per frame. Examples of bit rates used to encode inactive frames include sixteen bits per frame. In the context of cellular telephony systems (especially systems that are compliant with Interim Standard (IS)-95 as promulgated by the Telecommunications Industry Association, Arlington, Va., or a similar industry standard), these four bit rates are also referred to as “full rate,” “half rate,” “quarter rate,” and “eighth rate,” respectively.