Transmission of voice by digital techniques has become widespread, particularly in long-distance telephony, packet-switched telephony such as Voice over IP (VoIP), and digital radio telephony such as cellular telephony. This proliferation has created interest in reducing the amount of information used to transfer a voice communication over a transmission channel while maintaining the perceived quality of the reconstructed speech.
Devices that are configured to compress speech by extracting parameters that relate to a model of human speech generation are called “speech coders.” A speech coder generally includes an encoder and a decoder. The encoder typically divides the incoming speech signal (a digital signal representing audio information) into segments of time called “frames,” analyzes each frame to extract certain relevant parameters, and quantizes the parameters into a binary representation, such as a set of bits or a binary data packet. The data packets are transmitted over a transmission channel (i.e., a wired or wireless network connection) to a receiver that includes a decoder. The decoder receives and processes data packets, dequantizes them to produce the parameters, and recreates speech frames using the dequantized parameters.
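The encoder and decoder steps described above (framing, parameter extraction, quantization, dequantization) can be illustrated with a minimal sketch. It is not any particular codec: the frame length of 160 samples, the use of frame energy in dB as the only "parameter," the 6-bit uniform quantizer, and all function names are assumptions chosen for illustration. Practical speech coders extract richer model parameters, such as spectral envelope and pitch information.

```python
import math

FRAME_LEN = 160  # assumption: 20 ms frames at an 8 kHz sampling rate


def split_into_frames(samples, frame_len=FRAME_LEN):
    """Split a sample sequence into consecutive full frames, dropping any tail."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def frame_energy_db(frame):
    """Illustrative parameter: mean-square energy in dB, floored to avoid log(0)."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(mean_sq, 1e-10))


def quantize(value, lo=-100.0, hi=0.0, bits=6):
    """Uniform scalar quantizer: clip value to [lo, hi] and map it to an integer index."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(index, lo=-100.0, hi=0.0, bits=6):
    """Inverse mapping used by the decoder to recover an approximate parameter value."""
    levels = (1 << bits) - 1
    return lo + index / levels * (hi - lo)


def encode(samples):
    """Encoder side: one quantized index per frame (a stand-in for a data packet)."""
    return [quantize(frame_energy_db(f)) for f in split_into_frames(samples)]
```

In this sketch, a decoder would call `dequantize` on each received index to recover an approximation of the frame energy. The reconstruction error is bounded by half of one quantizer step.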
In a typical conversation, each speaker is silent for about sixty percent of the time. Speech encoders are usually configured to distinguish frames of the speech signal that contain speech (“active frames”) from frames of the speech signal that contain only silence or background noise (“inactive frames”). Such an encoder may be configured to use different coding modes and/or rates to encode active and inactive frames. For example, speech encoders are typically configured to transmit encoded inactive frames (also called “silence descriptors,” “silence descriptions,” or SIDs) at a lower bit rate than encoded active frames.
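As a rough illustration of how an encoder might distinguish active from inactive frames and select a rate accordingly, the following sketch uses a simple energy threshold. The threshold, the bit counts per frame, and the function names are hypothetical values chosen for the example. Practical voice activity detectors use more robust features than energy alone.

```python
ENERGY_THRESHOLD = 1e-4  # hypothetical mean-square energy threshold
ACTIVE_BITS = 171        # hypothetical number of bits for an encoded active frame
INACTIVE_BITS = 16       # hypothetical number of bits for a silence descriptor (SID)


def is_active(frame, threshold=ENERGY_THRESHOLD):
    """Classify a frame as active when its mean-square energy exceeds the threshold."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return mean_sq > threshold


def bits_for_frame(frame):
    """Select the per-frame bit budget according to the active/inactive decision."""
    return ACTIVE_BITS if is_active(frame) else INACTIVE_BITS


def total_bits(frames):
    """Total bits needed to encode a sequence of frames under this two-rate scheme."""
    return sum(bits_for_frame(f) for f in frames)
```

Under these assumed numbers, a sequence in which most frames are inactive costs far fewer bits than one encoded entirely at the active rate, which is the motivation for coding inactive frames at a lower bit rate.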
At any time during a full-duplex telephonic communication, it may be expected that the input to at least one of the speech encoders will be an inactive frame. It may be desirable for an encoder to transmit SIDs for fewer than all of the inactive frames. Such operation is called discontinuous transmission (DTX). In one example, a speech encoder performs DTX by transmitting one SID for each string of 32 consecutive inactive frames. The corresponding decoder applies information in the SID to update a noise generation model that is used by a comfort noise generation algorithm to synthesize inactive frames.
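The DTX example above can be sketched as follows. This is one possible reading of "one SID for each string of 32 consecutive inactive frames": the SID is sent on the first frame of each such string and nothing is sent for the remaining 31. The decision labels, the `ComfortNoiseDecoder` class, and the use of a single gain value as the entire noise generation model are assumptions made for illustration only.

```python
import random

SID_INTERVAL = 32  # from the example in the text: one SID per 32 inactive frames


def dtx_schedule(activity, sid_interval=SID_INTERVAL):
    """Map per-frame activity flags to 'ACTIVE', 'SID', or 'NO_TX' decisions."""
    decisions = []
    inactive_run = 0  # number of inactive frames seen since the last active frame
    for active in activity:
        if active:
            inactive_run = 0
            decisions.append("ACTIVE")
        else:
            # Send a SID at the start of every block of sid_interval inactive frames.
            decisions.append("SID" if inactive_run % sid_interval == 0 else "NO_TX")
            inactive_run += 1
    return decisions


class ComfortNoiseDecoder:
    """Toy decoder: the noise model is a single gain updated from each SID."""

    def __init__(self, frame_len=160, seed=0):
        self.frame_len = frame_len
        self.gain = 0.0
        self._rng = random.Random(seed)

    def update(self, sid_gain):
        """Apply the information carried by a received SID to the noise model."""
        self.gain = sid_gain

    def synthesize(self):
        """Generate one inactive frame of Gaussian noise scaled by the current gain."""
        return [self.gain * self._rng.gauss(0.0, 1.0) for _ in range(self.frame_len)]
```

For instance, with two active frames followed by 70 inactive frames and one more active frame, `dtx_schedule` marks three frames as `SID` (the 1st, 33rd, and 65th inactive frames) and the other 67 inactive frames as `NO_TX`. The decoder would call `update` for each received SID and `synthesize` for every inactive frame, whether or not a SID arrived for it.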