With the emergence of digital wireless telephone networks, streaming audio over the Internet, and Internet telephony, digital processing and delivery of speech has become commonplace. Engineers use a variety of techniques to process speech efficiently while still maintaining quality. To understand these techniques, it helps to understand how audio information is represented and processed in a computer.
I. Representation of Audio Information in a Computer
A computer processes audio information as a series of numbers representing the audio. A single number can represent an audio sample, which is an amplitude value (i.e., loudness) at a particular time. Several factors affect the quality of the audio, including sample depth and sampling rate.
Sample depth (or precision) indicates the range of numbers used to represent a sample. The more values possible for the sample, the higher the quality because the number can capture more subtle variations in amplitude. An 8-bit sample has 256 possible values, while a 16-bit sample has 65,536 possible values. A 24-bit sample can capture normal loudness variations very finely, and can also capture unusually high loudness.
The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality because more frequencies of sound can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second. Table 1 shows several formats of audio with different quality levels, along with corresponding raw bitrate costs.
TABLE 1. Bitrates for different quality audio

Sample Depth    Sampling Rate      Channel   Raw Bitrate
(bits/sample)   (samples/second)   mode      (bits/second)
8               8,000              mono      64,000
8               11,025             mono      88,200
16              44,100             stereo    1,411,200
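The raw bitrates in Table 1 follow from multiplying sample depth, sampling rate, and channel count. The short sketch below reproduces that arithmetic; the function name `raw_bitrate` and the constant `TABLE_1_FORMATS` are illustrative names, not part of any codec.

```python
def raw_bitrate(bits_per_sample: int, samples_per_second: int, channels: int) -> int:
    """Return the raw (uncompressed) bitrate in bits per second."""
    return bits_per_sample * samples_per_second * channels


# The three formats listed in Table 1 (mono = 1 channel, stereo = 2 channels).
TABLE_1_FORMATS = [
    (8, 8_000, 1),
    (8, 11_025, 1),
    (16, 44_100, 2),
]

if __name__ == "__main__":
    for depth, rate, channels in TABLE_1_FORMATS:
        print(depth, rate, channels, raw_bitrate(depth, rate, channels))
```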
As Table 1 shows, the cost of high quality audio is high bitrate. High quality audio information consumes large amounts of computer storage and transmission capacity. Many computers and computer networks lack the resources to process raw digital audio. Compression (also called encoding or coding) decreases the cost of storing and transmitting audio information by converting the information into a lower bitrate form. Compression can be lossless (in which quality does not suffer) or lossy (in which quality suffers but bitrate reduction from subsequent lossless compression is more dramatic). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form. A codec is an encoder/decoder system.
II. Speech Encoders and Decoders
The primary goal of audio compression is to digitally represent audio signals so as to provide maximum signal quality with the fewest possible bits. Different kinds of audio signals have different characteristics. Music is characterized by large ranges of frequencies and amplitudes, and often includes two or more channels. On the other hand, speech is characterized by smaller ranges of frequencies and amplitudes, and is commonly represented in a single channel. Certain codecs and processing techniques are adapted for music and general audio; other codecs and processing techniques are adapted for speech.
A conventional speech codec uses linear prediction to achieve compression. The speech encoding includes several stages. The encoder finds and quantizes coefficients for a linear prediction filter, which is used to predict sample values as linear combinations of preceding sample values. A residual signal (represented as an “excitation” signal) indicates parts of the original signal not accurately predicted by the filtering. At some stages, the speech codec uses different compression techniques for voiced segments (characterized by vocal cord vibration), unvoiced segments, and silent segments, since different kinds of speech have different characteristics. Voiced segments typically exhibit highly repeating voicing patterns, even in the residual domain. For voiced segments, the encoder achieves further compression by comparing the current residual signal to previous residual cycles and encoding the current residual signal in terms of delay or lag information relative to the previous cycles. The encoder handles other discrepancies between the original signal and the predicted, encoded representation using specially designed codebooks.
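The prediction and residual steps described above can be sketched as follows. This is a minimal illustration, assuming unquantized coefficients and zero signal history before the first sample; the function name `prediction_residual` is hypothetical, and a real encoder would also estimate and quantize the coefficients.

```python
def prediction_residual(samples, coeffs):
    """Compute e[n] = s[n] - sum_k coeffs[k] * s[n-1-k].

    Samples before n = 0 are treated as zero (an assumption of this sketch).
    """
    residual = []
    for n, s in enumerate(samples):
        predicted = 0.0
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                predicted += a * samples[n - 1 - k]
        residual.append(s - predicted)
    return residual
```

For a signal that the filter predicts exactly, such as `[8, 4, 2, 1, 0.5]` with the single coefficient `0.5`, every residual value after the first is zero, which is why a well-matched filter leaves little information to encode.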
International Telecommunication Union [“ITU”] Recommendation G.729 is a standard for coding speech at 8 kilobits per second using conjugate-structure algebraic-code-excited linear prediction [“CS-ACELP”]. The codec operates on speech frames of 10 ms, which correspond to 80 samples at a sampling rate of 8000 samples per second. For every 10 ms frame, the encoder analyzes the speech signal to extract the parameters of the CELP model. The parameters include linear prediction filter coefficients per frame and various excitation parameters per 5 ms sub-frame of the frame. The excitation parameters represent the excitation signal, which is used in the encoder and decoder as input to the LPC synthesis filter. The excitation parameters include pitch (to represent the excitation signal with reference to previous excitation cycles), remainder indices (to represent remaining parts of the excitation signal), and gains (to scale the contributions from the pitch and/or remainder indices). The parameters are encoded and transmitted.
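The frame sizes quoted above follow directly from the sampling rate, frame duration, and bitrate. The sketch below works through that arithmetic; the constant and function names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000   # samples per second
FRAME_MS = 10           # frame duration
SUBFRAME_MS = 5         # sub-frame duration
BITRATE_BPS = 8000      # 8 kilobits per second


def samples_in(duration_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples covered by a span of duration_ms milliseconds."""
    return sample_rate_hz * duration_ms // 1000


def bits_per_frame(bitrate_bps: int = BITRATE_BPS, frame_ms: int = FRAME_MS) -> int:
    """Bit budget available for one frame at a fixed bitrate."""
    return bitrate_bps * frame_ms // 1000
```

With these values, a frame holds 80 samples, each sub-frame holds 40 samples, and the encoder has 80 bits to spend on each frame.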
At the decoder, the excitation parameters are decoded and used to reconstruct the excitation signal. The linear prediction filter coefficients are decoded and used in the synthesis filter, which is sometimes called the “short-term prediction” filter. The excitation signal is fed to the synthesis filter, which predicts samples as linear combinations of previously reconstructed samples and adjusts the synthesis filter output (linear predicted values) by adding values from the excitation signal. For more details, see ITU-T Recommendation G.729.
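The synthesis step at the decoder can be sketched as below: each output sample is the linear prediction from previously reconstructed samples plus the corresponding excitation value. This is an illustration under simplifying assumptions (unquantized coefficients, zero initial filter state, one fixed filter), and the function name `synthesize` is hypothetical.

```python
def synthesize(excitation, coeffs):
    """Compute s[n] = e[n] + sum_k coeffs[k] * s[n-1-k].

    Reconstructed samples before n = 0 are treated as zero
    (an assumption of this sketch).
    """
    output = []
    for n, e in enumerate(excitation):
        predicted = 0.0
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                predicted += a * output[n - 1 - k]
        output.append(predicted + e)
    return output
```

Because each output sample feeds later predictions, an error in one reconstructed sample carries forward into the samples that follow it.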
Aside from G.729, various other standards have specified speech encoders and/or decoders, and various companies and researchers have produced speech encoders and/or decoders. For example, whereas G.729 describes a fixed bitrate encoder (8 kb/s), the Adaptive Multirate [“AMR”] codec operates adaptively at various bitrates. For more details about the AMR codec, see the articles by (1) Salami et al., entitled “The Adaptive Multi-Rate Wideband Codec: History and Performance,” Proc. IEEE Workshop on Speech Coding, 2002, pp. 144-146 (2002); (2) Lakaniemi et al., entitled “AMR and AMR-WB RTP Payload Usage in Packet Switched Conversational Multimedia Services,” Proc. IEEE Workshop on Speech Coding, 2002, pp. 147-149 (2002); (3) Johansson et al., entitled “Bandwidth Efficient AMR Operation for VoIP,” Proc. IEEE Workshop on Speech Coding, 2002, pp. 150-152 (2002); and (4) Makinen et al., entitled “The Effect of Source Based Rate Adaptation Extension in AMR-WB Speech Codec,” Proc. IEEE Workshop on Speech Coding, 2002, pp. 153-155 (2002).
Many speech codecs exploit temporal redundancy in a signal in some way. One common way uses long-term prediction of pitch parameters to predict a current excitation signal in terms of delay or lag relative to previous excitation cycles. Delay values in the range of 30-120 samples or even more samples are common. Exploiting temporal redundancy can greatly improve compression efficiency, but at the cost of introducing memory dependency into the codec—a decoder relies on one part of the signal to correctly decode another part of the signal. In general, the most efficient speech codecs have significant memory dependence.
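A lag search of the kind described above can be sketched as follows. This is a simplified illustration, not the search used by any particular codec: it scores each candidate lag by squared correlation divided by the energy of the delayed segment, and it assumes the current block is no longer than the minimum lag so that the delayed segment lies entirely in past samples. The name `best_lag` and its parameters are hypothetical.

```python
def best_lag(history, frame, min_lag=30, max_lag=120):
    """Return the lag in [min_lag, max_lag] whose delayed segment best matches frame.

    history: previously reconstructed excitation samples, most recent last.
    frame:   current excitation samples to be predicted.
    Returns None if no candidate lag has positive correlation with frame.
    """
    if len(history) < max_lag or len(frame) > min_lag:
        raise ValueError("need len(history) >= max_lag and len(frame) <= min_lag")
    best, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - lag
        delayed = history[start:start + len(frame)]
        corr = sum(x * y for x, y in zip(frame, delayed))
        energy = sum(y * y for y in delayed)
        if corr <= 0 or energy == 0:
            continue
        score = corr * corr / energy
        if score > best_score:  # strict comparison keeps the shortest of tied lags
            best, best_score = lag, score
    return best
```

The returned lag is only meaningful if the decoder holds the same `history` as the encoder, which is the memory dependence discussed in this section.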
Although speech codecs as described above have good overall performance for many applications, they have several drawbacks. In particular, several drawbacks surface when the speech codecs are used in conjunction with dynamic network resources. In such scenarios, encoded speech may be lost because of a temporary bandwidth shortage or another adverse network condition.
A. Inefficient Memory Dependence in Dynamic Network Conditions
When encoded speech is lost, performance of speech codecs can suffer due to memory dependence upon the lost information. Loss of information for an excitation signal hampers later reconstruction that depends on the excitation signal. If previous cycles are lost, lag information is not useful, as it points to information the decoder does not have. Another example of memory dependence is filter coefficient interpolation (used to smooth the transitions between different synthesis filters, especially for voiced signals). If filter coefficients for a frame are lost, the filter coefficients for subsequent frames may have incorrect values.
Decoders use various techniques to conceal errors due to packet losses and other information loss, but these concealment techniques rarely conceal the errors fully. For example, the decoder repeats previous parameters or estimates parameters based upon correctly decoded information. Lag information is very sensitive, however, and such techniques are not particularly effective for concealment.
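The simplest concealment approach mentioned above, repeating previous parameters, can be sketched as follows. This is a toy illustration: the per-frame parameters are modeled as plain dictionaries, lost frames as `None`, and the function name `conceal_by_repetition` is hypothetical.

```python
def conceal_by_repetition(received):
    """Replace each lost frame (None) with the most recent correctly received frame.

    received: list of per-frame parameter dicts, with None marking a lost frame.
    Frames lost before any frame has been received remain None.
    """
    concealed = []
    last_good = None
    for params in received:
        if params is not None:
            last_good = params
        concealed.append(last_good)
    return concealed
```

If the true lag changes while frames are being lost, the repeated lag points at the wrong part of the excitation history, which is one reason such concealment is of limited effectiveness.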
In most cases, decoders eventually recover from errors due to lost information. As packets are received and decoded, parameters are gradually adjusted toward their correct values. Quality is likely to be degraded until the decoder can recover the correct internal state, however. In many of the most efficient speech codecs, playback quality is degraded for an extended period of time (e.g., up to a second), causing high distortion and often rendering the speech unintelligible. Recovery times are faster when a significant change occurs, such as a silent frame, as this provides a natural reset point for many parameters.
This memory dependence problem is described in the article by Andersen et al., entitled “ILBC—a Linear Predictive Coder with Robustness to Packet Losses,” Proc. IEEE Workshop on Speech Coding, 2002, pp. 23-25 (2002) [“Andersen article”]. The Andersen article suggests remedying the memory dependence problem by using “frame-independent long-term prediction.” The codec operates on 240-sample frames. For every frame, the encoder computes LPC filter coefficients and uses interpolation for the filter coefficients. For each frame, a residual signal is computed and split into six 40-sample sub-frames. Of the two consecutive sub-frames with the highest residual energy, 57 samples are encoded sample-by-sample as a “start state vector” at the frame level. The remaining samples of the frame are encoded at the sub-frame level with reference to the start state vector (and potentially other previously decoded samples) in the same frame. In this way, the codec avoids dependencies across frame boundaries from delay-type prediction of residual signals. On the other hand, by forcing every frame to include a start state vector and have no cross-frame long-term prediction, the codec gives up much of the compression efficiency of long-term prediction. Moreover, the codec is inflexible in that every frame includes a frame-level start state vector and predicted sub-frames without cross-frame prediction, even when network conditions do not warrant such cautious encoding measures. Further, while addressing memory dependencies due to cross-frame prediction of residual signals, the codec still interpolates filter coefficients for every frame, which can lead to problems when the information for a given frame is lost.
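The selection of the two consecutive sub-frames with the highest residual energy can be sketched as below. This illustrates only the selection step, under the frame and sub-frame sizes stated above; it does not show which 57 samples are then chosen or how they are encoded. The name `highest_energy_pair` is hypothetical.

```python
FRAME_SAMPLES = 240
SUBFRAME_SAMPLES = 40


def highest_energy_pair(residual):
    """Return index i such that sub-frames i and i+1 have the highest combined energy.

    Ties are resolved in favor of the earliest pair.
    """
    if len(residual) != FRAME_SAMPLES:
        raise ValueError("expected a 240-sample residual frame")
    energies = [
        sum(x * x for x in residual[start:start + SUBFRAME_SAMPLES])
        for start in range(0, FRAME_SAMPLES, SUBFRAME_SAMPLES)
    ]
    pair_energy = [energies[i] + energies[i + 1] for i in range(len(energies) - 1)]
    return max(range(len(pair_energy)), key=pair_energy.__getitem__)
```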
Memory dependence problems for line spectrum frequency [“LSF”] parameters in speech codecs are described in the article by Wang et al., entitled “Performance Comparison of Intraframe and Interframe LSF Quantization in Packet Networks,” Proc. IEEE Workshop on Speech Coding, 2000, pp. 126-128 (2000). This article does not address the more general problem of memory dependence for packets with information such as excitation signal parameters.
Outside of the area of speech compression, various video codec standards and products use a mixture of intra frames and predicted frames to code and decode video.
B. Inefficient FEC in Dynamic Network Conditions
Various speech codecs use forward error correction [“FEC”] to address loss of encoded information. In general, the term FEC refers to a class of techniques for controlling errors in a system. FEC involves sending extra information along with primary information. The extra information can be used by the receiver, if necessary, to correct or replace corresponding primary information if the primary information is lost.
Some speech codecs have implemented FEC by re-encoding speech information with new parameters. Re-encoding involves encoding with the same or different codecs, and sending the speech multiple times for different quality levels/bitrates. If the highest rate copy is received, then it is used for decoding. Otherwise, the decoder utilizes a lower rate copy it receives. This FEC technique consumes extra encoder-side resources and can lead to problems in switching between the different sets of content. Moreover, it does not adapt fast enough for many real-time applications, nor does it use codec-dependent knowledge or information about the dynamic state of the encoder to regulate FEC. One multiple-codec recovery technique is described in the article by Morinaga et al., entitled “The Forward-Backward Recovery Sub-Codec (FB-RSC) Method: A Robust Form of Packet-Loss Concealment for Use in Broadband IP Networks,” Proc. IEEE Workshop on Speech Coding, 2002, pp. 62-64 (2002).
Other speech codecs repeat encoded frames in different packets such that any received packet can be used to decode the frame. The Lakaniemi and Johansson articles describe speech codecs that have implemented FEC by repetition of packets of previously encoded information. Packet repetition is simple and does not consume many additional processing resources, but it doubles transmission rate. If information is lost because of a temporary network bandwidth shortage or another adverse network condition, sending the same packet multiple times can exacerbate the problem and hurt overall quality.
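Repetition-based FEC can be sketched with a toy packetization in which each packet carries its own frame plus a repeated copy of the preceding frame. This layout is an assumption made for illustration, not the format used in the cited articles, and the names `build_packets` and `recover_frames` are hypothetical.

```python
def build_packets(frames):
    """Packet i carries (frame i, repeated copy of frame i-1), or None for the first copy."""
    return [(frames[i], frames[i - 1] if i > 0 else None) for i in range(len(frames))]


def recover_frames(packets, lost):
    """Return the frames recoverable when the packet indices in `lost` never arrive.

    Unrecoverable frames are reported as None.
    """
    recovered = [None] * len(packets)
    for i, (primary, previous) in enumerate(packets):
        if i in lost:
            continue
        recovered[i] = primary
        # Use the repeated copy only if the preceding packet was lost.
        if i > 0 and recovered[i - 1] is None:
            recovered[i - 1] = previous
    return recovered
```

In this sketch a single lost packet is fully recovered from the packet that follows it, but every packet carries two frames of payload, and a loss of two consecutive packets still leaves one frame unrecoverable.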
The Johansson article also describes a “partial redundancy” FEC mode for repeating the most important coded speech bits, depending on channel quality and estimated improvement over default concealment methods. This partial redundancy mode does not adequately consider currently available bandwidth, and does not provide multiple sets of partially redundant information to account for loss of consecutive packets.
Some streaming audio applications and non-real-time audio applications use re-transmission or stream switching. Low latency is a criterion of real-time communication, however, and re-transmission and switching schemes are not feasible for that reason.
C. Inefficient Rate Control in Dynamic Network Conditions
Existing speech codecs are mainly fixed-rate and do not provide adequate adaptability. Some existing speech codecs choose bitrate dynamically according to the characteristics of the input signal to accommodate a fixed network bandwidth target.
Other speech codecs adapt the rate of encoded output. AMR is a variable rate codec, and can adapt rate to the complexity of the input signal, network noise conditions, and/or network bandwidth. See the Salami and Makinen articles. Various real-time voice codecs from Microsoft Corporation switch between different codec modes to change rate for different kinds of content. See U.S. Patent Application Publication No. 2003/0101050 to Khalil et al. and U.S. Pat. No. 6,658,383 to Koishida et al. The transition between frames coded at different qualities may not be smooth in some cases, however, and previous speech codecs do not adequately account for smoothness in transitions between quality levels.
As noted, various previous codecs react to network conditions by changing quality and bitrate, but still focus on primary encoding efficiency (reconstruction quality for a given bitrate, assuming no losses). These codecs do not adequately consider currently available bitrate and do not integrate FEC with rate control so as to allow adaptation of the emphasis given to FEC vs. primary encoding efficiency, for a given number of available bits for encoding. The Johansson article describes selecting between modes for frame redundancy, “selective redundancy” for sensitive frames, and “partial redundancy,” depending on decoder feedback regarding packet losses. These mode selection decisions do not, however, take into account the number of available bits given bandwidth estimates and the complexity and content of a current frame.