Transmission of voice by digital techniques has become widespread, particularly in long distance telephony, packet-switched telephony such as Voice over IP (also called VoIP, where IP denotes Internet Protocol), and digital radio telephony such as cellular telephony. Such proliferation has created interest in reducing the amount of information used to transfer a voice communication over a transmission channel while maintaining the perceived quality of the reconstructed speech.
Devices that are configured to compress speech by extracting parameters that relate to a model of human speech generation are called “speech coders.” A speech coder generally includes an encoder and a decoder. The encoder typically divides the incoming speech signal (a digital signal representing audio information) into segments of time called “frames,” analyzes each frame to extract certain relevant parameters, and quantizes the parameters into an encoded frame. The encoded frames are transmitted over a transmission channel (i.e., a wired or wireless network connection) to a receiver that includes a decoder. The decoder receives and processes encoded frames, dequantizes them to produce the parameters, and recreates speech frames using the dequantized parameters.
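The encoder and decoder steps described above can be illustrated with a minimal sketch. It is not an actual speech coder: the frame length of 160 samples, the single root-mean-square "gain" parameter, and the uniform quantizer step and level count are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def split_into_frames(samples, frame_len=160):
    """Split a list of samples into consecutive full frames (any tail is dropped).

    frame_len=160 is an assumed value, not taken from the text.
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def extract_gain(frame):
    """Toy stand-in for parameter extraction: the RMS level of one frame."""
    return (sum(x * x for x in frame) / len(frame)) ** 0.5


def quantize(value, step=0.05, levels=16):
    """Uniform scalar quantizer: map a parameter value to an integer index."""
    return max(0, min(levels - 1, round(value / step)))


def dequantize(index, step=0.05):
    """Inverse of quantize, up to the quantization error."""
    return index * step


def encode(samples):
    """Encoder side: frame the signal, extract a parameter, quantize it."""
    return [quantize(extract_gain(f)) for f in split_into_frames(samples)]


def decode(encoded_frames):
    """Decoder side: recover the (approximate) parameter for each frame."""
    return [dequantize(i) for i in encoded_frames]
```

A practical coder extracts many parameters per frame (for example, spectral and excitation parameters) and synthesizes a waveform from them; the sketch only shows the frame, analyze, quantize, and dequantize flow.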
In a typical conversation, each speaker is silent for about sixty percent of the time. Speech encoders are usually configured to distinguish frames of the speech signal that contain speech (“active frames”) from frames of the speech signal that contain only silence or background noise (“inactive frames”). Such an encoder may be configured to use different coding modes and/or rates to encode active and inactive frames. For example, speech encoders are typically configured to use fewer bits to encode an inactive frame than to encode an active frame. A speech coder may use a lower bit rate for inactive frames to support transfer of the speech signal at a lower average bit rate with little to no perceived loss of quality.
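As a rough illustration of how frames might be sorted into active and inactive, the sketch below compares the mean energy of each frame with a fixed threshold. This is an assumption made for illustration only: the text does not say how activity is detected, practical detectors are considerably more elaborate, and the threshold value and function names here are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return sum(x * x for x in frame) / len(frame)


def is_active(frame, threshold=1e-3):
    """Classify a frame as active when its energy exceeds an assumed threshold."""
    return frame_energy(frame) > threshold


def classify_frames(frames, threshold=1e-3):
    """Return one True (active) or False (inactive) flag per frame."""
    return [is_active(f, threshold) for f in frames]
```

An encoder could then use the resulting flags to choose a coding mode or rate for each frame.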
FIG. 1 illustrates a result of encoding a region of a speech signal that includes transitions between active frames and inactive frames. Each bar in the figure indicates a corresponding frame, with the height of the bar indicating the bit rate at which the frame is encoded, and the horizontal axis indicates time. In this case, the active frames are encoded at a higher bit rate rH and the inactive frames are encoded at a lower bit rate rL.
Examples of bit rate rH include 171 bits per frame, 80 bits per frame, and 40 bits per frame; and examples of bit rate rL include 16 bits per frame. In the context of cellular telephony systems (especially systems that are compliant with Interim Standard (IS)-95 as promulgated by the Telecommunications Industry Association, Arlington, Va., or a similar industry standard), these four bit rates are also referred to as “full rate,” “half rate,” “quarter rate,” and “eighth rate,” respectively. In one particular example of the result shown in FIG. 1, rate rH is full rate and rate rL is eighth rate.
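The effect of these rates on the average bit rate can be worked through with a short calculation. The sketch below uses the four example rates given above; the 20-millisecond frame duration used to convert to bits per second is an assumption not stated in the text, and the function names are hypothetical.

```python
# Example bits per frame for each named rate, as listed in the text.
BITS_PER_FRAME = {"full": 171, "half": 80, "quarter": 40, "eighth": 16}


def average_bits_per_frame(active_fraction,
                           active_rate="full",
                           inactive_rate="eighth"):
    """Weighted average of the active and inactive rates, in bits per frame."""
    r_high = BITS_PER_FRAME[active_rate]
    r_low = BITS_PER_FRAME[inactive_rate]
    return active_fraction * r_high + (1.0 - active_fraction) * r_low


def bits_per_second(bits_per_frame, frame_ms=20):
    """Convert bits per frame to bits per second (frame_ms is an assumed value)."""
    return bits_per_frame * 1000 / frame_ms
```

For a speaker who is active forty percent of the time, with rH at full rate and rL at eighth rate, `average_bits_per_frame(0.4)` gives about 78 bits per frame, compared with 171 bits per frame if every frame were encoded at full rate.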
Voice communications over the public switched telephone network (PSTN) have traditionally been limited in bandwidth to the frequency range of 300-3400 hertz (Hz). More recent networks for voice communications, such as networks that use cellular telephony and/or VoIP, may not have the same bandwidth limits, and it may be desirable for apparatus using such networks to have the ability to transmit and receive voice communications that include a wideband frequency range. For example, it may be desirable for such apparatus to support an audio frequency range that extends down to 50 Hz and/or up to 7 or 8 kilohertz (kHz). It may also be desirable for such apparatus to support other applications, such as high-quality audio or audio/video conferencing, delivery of multimedia services such as music and/or television, etc., that may have audio speech content in ranges outside the traditional PSTN limits.
Extension of the range supported by a speech coder into higher frequencies may improve intelligibility. For example, the information in a speech signal that differentiates fricatives such as ‘s’ and ‘f’ is largely in the high frequencies. Highband extension may also improve other qualities of the decoded speech signal, such as presence. For example, even a voiced vowel may have spectral energy far above the PSTN frequency range.
While it may be desirable for a speech coder to support a wideband frequency range, it is also desirable to limit the amount of information used to transfer a voice communication over the transmission channel. A speech coder may be configured to perform discontinuous transmission (DTX), for example, such that descriptions are transmitted for fewer than all of the inactive frames of a speech signal.
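One way to picture discontinuous transmission is as a per-frame decision about whether a description is sent at all. The sketch below is a simplified illustration under stated assumptions: every active frame is transmitted, the first frame of each inactive run is transmitted, and thereafter only every `update_interval`-th inactive frame is transmitted. The schedule, the default interval of eight frames, and the names used are hypothetical and are not taken from the text or from any particular standard.

```python
def dtx_schedule(activity, update_interval=8):
    """Return one transmit decision (True or False) per frame.

    activity: iterable of booleans, True for an active frame.
    update_interval: assumed spacing between transmitted inactive frames.
    """
    decisions = []
    inactive_run = 0  # inactive frames seen since the last active frame
    for active in activity:
        if active:
            inactive_run = 0
            decisions.append(True)  # active frames are always sent
        else:
            # Send the first inactive frame of a run, then every
            # update_interval-th inactive frame after it.
            decisions.append(inactive_run % update_interval == 0)
            inactive_run += 1
    return decisions
```

With `update_interval=3`, a run of four inactive frames yields descriptions for only the first and fourth of them, so fewer than all of the inactive frames are described.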