With the emergence of digital wireless telephone networks, streaming audio over the Internet, and Internet telephony, digital processing and delivery of speech has become commonplace. Engineers use a variety of techniques to process speech efficiently while still maintaining quality. To understand these techniques, it helps to understand how audio information is represented and processed in a computer.
I. Representation of Audio Information in a Computer
A computer processes audio information as a series of numbers representing the audio. A single number can represent an audio sample, which is an amplitude value at a particular time. Several factors affect the quality of the audio, including sample depth and sampling rate.
Sample depth (or precision) indicates the range of numbers used to represent a sample. More possible values for each sample typically yields higher quality output because more subtle variations in amplitude can be represented. An eight-bit sample has 256 possible values, while a 16-bit sample has 65,536 possible values.
The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality because more frequencies of sound can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second (Hz). Table 1 shows several formats of audio with different quality levels, along with corresponding raw bit rate costs.
TABLE 1Bit rates for different quality audioSample DepthSampling RateChannelRaw Bit Rate(bits/sample)(samples/second)Mode(bits/second)8 8,000mono  64,000811,025mono  88,2001644,100stereo1,411,200
As Table 1 shows, the cost of high quality audio is high bit rate. High quality audio information consumes large amounts of computer storage and transmission capacity. Many computers and computer networks lack the resources to process raw digital audio. Compression (also called encoding or coding) decreases the cost of storing and transmitting audio information by converting the information into a lower bit rate form. Compression can be lossless (in which quality does not suffer) or lossy (in which quality suffers but bit rate reduction from subsequent lossless compression is more dramatic). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form. A codec is an encoder/decoder system.
II. Speech Encoders and Decoders
One goal of audio compression is to digitally represent audio signals to provide maximum signal quality for a given amount of bits. Stated differently, this goal is to represent the audio signals with the least bits for a given level of quality. Other goals such as resiliency to transmission errors and limiting the overall delay due to encoding/transmission/decoding apply in some scenarios.
Different kinds of audio signals have different characteristics. Music is characterized by large ranges of frequencies and amplitudes, and often includes two or more channels. On the other hand, speech is characterized by smaller ranges of frequencies and amplitudes, and is commonly represented in a single channel. Certain codecs and processing techniques are adapted for music and general audio; other codecs and processing techniques are adapted for speech.
One type of conventional speech codec uses linear prediction to achieve compression. The speech encoding includes several stages. The encoder finds and quantizes coefficients for a linear prediction filter, which is used to predict sample values as linear combinations of preceding sample values. A residual signal (represented as an “excitation” signal) indicates parts of the original signal not accurately predicted by the filtering. At some stages, the speech codec uses different compression techniques for voiced segments (characterized by vocal chord vibration), unvoiced segments, and silent segments, since different kinds of speech have different characteristics. Voiced segments typically exhibit highly repeating voicing patterns, even in the residual domain. For voiced segments, the encoder achieves further compression by comparing the current residual signal to previous residual cycles and encoding the current residual signal in terms of delay or lag information relative to the previous cycles. The encoder handles other discrepancies between the original signal and the predicted, encoded representation using specially designed codebooks.
Many speech codecs exploit temporal redundancy in a signal in some way. As mentioned above, one common way uses long-term prediction of pitch parameters to predict a current excitation signal in terms of delay or lag relative to previous excitation cycles. Exploiting temporal redundancy can greatly improve compression efficiency in terms of quality and bit rate, but at the cost of introducing memory dependency into the codec—a decoder relies on one, previously decoded part of the signal to correctly decode another part of the signal. Many efficient speech codecs have significant memory dependence.
Although speech codecs as described above have good overall performance for many applications, they have several drawbacks. In particular, several drawbacks surface when the speech codecs are used in conjunction with dynamic network resources. In such scenarios, encoded speech may be lost because of a temporary bandwidth shortage or other problems.
A. Narrowband and Wideband Codecs
Many standard speech codecs were designed for narrowband signals with an eight kHz sampling rate. While the eight kHz sampling rate is adequate in many situations, higher sampling rates may be desirable in other situations, such as to represent higher frequencies.
Speech signals with at least sixteen kHz sampling rates are typically called wideband speech. While these wideband codecs may be desirable to represent high frequency speech patterns, they typically require higher bit rates than narrowband codecs. Such higher bit rates may not be feasible in some types of networks or under some network conditions.
B. Inefficient Memory Dependence in Dynamic Network Conditions
When encoded speech is missing, such as by being lost, delayed, corrupted or otherwise made unusable in transit or elsewhere, performance of speech codecs can suffer due to memory dependence upon the lost information. Loss of information for an excitation signal hampers later reconstruction that depends on the lost signal. If previous cycles are lost, lag information may not be useful, as it points to information the decoder does not have. Another example of memory dependence is filter coefficient interpolation (used to smooth the transitions between different synthesis filters, especially for voiced signals). If filter coefficients for a frame are lost, the filter coefficients for subsequent frames may have incorrect values.
Decoders use various techniques to conceal errors due to packet losses and other information loss, but these concealment techniques rarely conceal the errors fully. For example, the decoder repeats previous parameters or estimates parameters based upon correctly decoded information. Lag information can be very sensitive, however, and prior techniques are not particularly effective for concealment.
In most cases, decoders eventually recover from errors due to lost information. As packets are received and decoded, parameters are gradually adjusted toward their correct values. Quality is likely to be degraded until the decoder can recover the correct internal state, however. In many of the most efficient speech codecs, playback quality is degraded for an extended period of time (e.g., up to a second), causing high distortion and often rendering the speech unintelligible. Recovery times are faster when a significant change occurs, such as a silent frame, as this provides a natural reset point for many parameters. Some codecs are more robust to packet losses because they remove inter-frame dependencies. However, such codecs require significantly higher bit rates to achieve the same voice quality as a traditional CELP codec with inter-frame dependencies.
Given the importance of compression and decompression to representing speech signals in computer systems, it is not surprising that compression and decompression of speech have attracted research and standardization activity. Whatever the advantages of prior techniques and tools, however, they do not have the advantages of the techniques and tools described herein.