With the emergence of digital wireless telephone networks, streaming audio over the Internet, and Internet telephony, digital processing and delivery of speech has become commonplace. Engineers use a variety of techniques to process speech efficiently while still maintaining quality. To understand these techniques, it helps to understand how audio information is represented and processed in a computer.
I. Representation of Audio Information in a Computer
A computer processes audio information as a series of numbers representing the audio. A single number can represent an audio sample, which is an amplitude value at a particular time. Several factors affect the quality of the audio, including sample depth and sampling rate.
Sample depth (or precision) indicates the range of numbers used to represent a sample. More possible values for each sample typically yield higher quality output because more subtle variations in amplitude can be represented. An eight-bit sample has 256 possible values, while a sixteen-bit sample has 65,536 possible values.
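As a rough illustration of how sample depth limits amplitude resolution, the following sketch rounds an amplitude to the nearest of 2^bits uniformly spaced levels. The function name `quantize` and the assumption that amplitudes lie in the range -1.0 to 1.0 are illustrative choices, not details taken from the description above.

```python
def quantize(value, bits):
    """Round an amplitude in [-1.0, 1.0] to the nearest of 2**bits
    uniformly spaced levels and return the reconstructed amplitude.
    Uniform quantization is an assumption made for this sketch."""
    levels = 2 ** bits                 # 256 for 8 bits, 65,536 for 16 bits
    step = 2.0 / (levels - 1)          # spacing between adjacent levels
    index = round((value + 1.0) / step)
    index = min(max(index, 0), levels - 1)   # clamp out-of-range input
    return -1.0 + index * step


amplitude = 0.3
error_8_bit = abs(quantize(amplitude, 8) - amplitude)
error_16_bit = abs(quantize(amplitude, 16) - amplitude)
print(f"8-bit error:  {error_8_bit:.6f}")
print(f"16-bit error: {error_16_bit:.6f}")
```

The rounding error is bounded by half the level spacing, so each additional bit of sample depth roughly halves the worst-case error.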
The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality, because a wider range of sound frequencies can be represented (up to half the sampling rate). Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second (Hz). Table 1 shows several formats of audio with different quality levels, along with corresponding raw bit rate costs.
TABLE 1
Bit rates for different quality audio

Sample Depth    Sampling Rate      Channel   Raw Bit Rate
(bits/sample)   (samples/second)   Mode      (bits/second)
8               8,000              mono      64,000
8               11,025             mono      88,200
16              44,100             stereo    1,411,200
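The raw bit rates in Table 1 follow from multiplying sample depth, sampling rate, and the number of channels. The short sketch below reproduces the three rows; the function name `raw_bit_rate` and the `CHANNELS` mapping are illustrative names introduced here.

```python
def raw_bit_rate(sample_depth_bits, sampling_rate_hz, channels):
    """Raw (uncompressed) bit rate in bits/second."""
    return sample_depth_bits * sampling_rate_hz * channels


# Number of channels for each channel mode listed in Table 1.
CHANNELS = {"mono": 1, "stereo": 2}

formats = [
    (8, 8_000, "mono"),
    (8, 11_025, "mono"),
    (16, 44_100, "stereo"),
]

rates = [raw_bit_rate(depth, rate, CHANNELS[mode]) for depth, rate, mode in formats]
for (depth, rate, mode), bits_per_second in zip(formats, rates):
    print(f"{depth}-bit, {rate:,} Hz, {mode}: {bits_per_second:,} bits/second")
```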
As Table 1 shows, the cost of high quality audio is high bit rate. High quality audio information consumes large amounts of computer storage and transmission capacity. Many computers and computer networks lack the resources to process raw digital audio. Compression (also called encoding or coding) decreases the cost of storing and transmitting audio information by converting the information into a lower bit rate form. Compression can be lossless (in which quality does not suffer) or lossy (in which quality suffers but bit rate reduction from subsequent lossless compression is more dramatic). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form. A codec is an encoder/decoder system.
II. Speech Encoders and Decoders
One goal of audio compression is to digitally represent audio signals to provide maximum signal quality for a given number of bits. Stated differently, this goal is to represent the audio signals with the fewest bits for a given level of quality. Other goals such as resiliency to transmission errors and limiting the overall delay due to encoding/transmission/decoding apply in some scenarios.
Different kinds of audio signals have different characteristics. Music is characterized by large ranges of frequencies and amplitudes, and often includes two or more channels. On the other hand, speech is characterized by smaller ranges of frequencies and amplitudes, and is commonly represented in a single channel. Certain codecs and processing techniques are adapted for music and general audio; other codecs and processing techniques are adapted for speech.
One type of conventional speech codec uses linear prediction to achieve compression. The speech encoding includes several stages. The encoder finds and quantizes coefficients for a linear prediction filter, which is used to predict sample values as linear combinations of preceding sample values. A residual signal (represented as an “excitation” signal) indicates parts of the original signal not accurately predicted by the filtering. At some stages, the speech codec uses different compression techniques for voiced segments (characterized by vocal cord vibration), unvoiced segments, and silent segments, since different kinds of speech have different characteristics. Voiced segments typically exhibit highly repeating voicing patterns, even in the residual domain. For voiced segments, the encoder achieves further compression by comparing the current residual signal to previous residual cycles and encoding the current residual signal in terms of delay or lag information relative to the previous cycles. The encoder handles other discrepancies between the original signal and the predicted, encoded representation (from the linear prediction and delay information) using specially designed codebooks.
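The linear prediction and residual stages can be sketched as follows. This is a minimal illustration, assuming the autocorrelation method with the Levinson-Durbin recursion, which is one common way to find prediction coefficients; the description above does not name a particular method. Coefficient quantization, voiced/unvoiced handling, lag coding, and codebooks are omitted, and all function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Return r[0..max_lag], where r[lag] is the sum of x[n] * x[n - lag]."""
    return [
        sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(x, order):
    """Levinson-Durbin recursion. Returns a[1..order] for the predictor
    x_hat[n] = a[1]*x[n-1] + ... + a[order]*x[n-order]."""
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)            # a[0] is unused
    error = r[0]                       # prediction error energy so far
    for i in range(1, order + 1):
        if error <= 0.0:               # silent or perfectly predicted input
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        previous = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = previous[j] - k * previous[i - j]
        error *= 1.0 - k * k
    return a[1:]


def residual(x, coeffs):
    """Residual (excitation): the part of x the predictor does not explain.
    Samples before the start of x are treated as zero."""
    out = []
    for n in range(len(x)):
        prediction = sum(
            c * x[n - k] for k, c in enumerate(coeffs, start=1) if n - k >= 0
        )
        out.append(x[n] - prediction)
    return out


# Synthetic test signal in which each sample is 0.9 times the previous one.
signal = [0.9 ** n for n in range(50)]
coeffs = lpc_coefficients(signal, order=1)
excitation = residual(signal, coeffs)
```

For this synthetic signal the single coefficient comes out close to 0.9, and the residual is nearly zero after the first sample, which is why the residual can be cheaper to encode than the signal itself.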
Although some speech codecs described above have good overall performance for many applications, they have several drawbacks. In particular, several drawbacks surface when the speech codecs are used in conjunction with dynamic network resources. In such scenarios, encoded speech may be lost because of a temporary bandwidth shortage or other problem.
A. Ineffective Concealment Techniques
When one or more packets of encoded speech are missing, for example because they are lost, delayed, corrupted, or otherwise made unusable in transit or elsewhere, decoders often attempt to conceal the missing packets in some manner. For example, some decoders simply repeat packets that have already been received. If there are significant losses of packets, however, this technique can quickly result in degraded quality of the decoded speech output.
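The packet repetition approach can be sketched as follows. The representation of a missing packet as `None`, the use of silence before the first packet arrives, and the function name `conceal_by_repetition` are assumptions made for this illustration.

```python
def conceal_by_repetition(received, frame_len):
    """Replace each missing frame with a copy of the last received frame.

    received: list of frames (lists of samples), with None marking a
    missing packet. Silence is used until the first frame arrives."""
    output = []
    last_good = [0] * frame_len
    for frame in received:
        if frame is not None:
            last_good = frame
        output.append(list(last_good))   # copy, so frames stay independent
    return output


frames = [[1, 2], None, None, None, [5, 6]]
concealed = conceal_by_repetition(frames, frame_len=2)
print(concealed)
```

With three consecutive losses, the same frame is played four times in a row, which suggests why quality degrades quickly when losses are significant.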
Some codecs use more sophisticated concealment techniques, such as the waveform similarity overlap-add method (“WSOLA”). This technique extends the decoded audio signal to conceal missing packets by generating new pitch cycles through weighted averages of existing pitch cycles. This method can be more effective in concealing missing packets than merely repeating earlier packets. However, it may not be ideal for all situations. Moreover, it can produce undesirable sound effects (such as a mechanical or ringing sound), if it is used to extend a signal for too long.
Additionally, many frames depend on memory of decoded characteristics of previous frames (such as excitation signal history) for decoding. When such memory does not exist (for example, when the packets that would have been used to produce the memory are lost or delayed), the signal quality may be degraded even for the received frames that follow missing frames.
B. Inefficient or Ineffective Desired Packet Delay Calculations
As packets of encoded audio information are being transported to a decoder application, each packet may experience a different delay due to, for example, network variations. This can also result in packets arriving in a different order than they were sent. An application or decoder may calculate delay statistics to determine a desired decoder buffer delay that is expected to be long enough to allow a sufficient number of packets to arrive at the decoder in time to be decoded and used. Of course, a countervailing concern may be overall delay in the system, especially for real-time, interactive applications such as telephony.
One approach to calculating the optimal delay is to look at the maximum delay of previous packets and use that delay value as a guide. The delay of a packet is typically determined by calculating the difference between a sent time stamp applied on the encoder side when the packet is sent and a received time stamp applied on the decoder side when the packet is received. However, sometimes outliers may exist, causing the system to adapt to unrepresentative packets. In addition, it is sometimes better to let a few packets arrive too late (and be missed) than to impose a delay long enough to receive those outlier, late packets.
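The time stamp difference and the maximum-delay approach can be sketched as follows. The millisecond units, the window of recent packets, and the function names are assumptions made for illustration.

```python
def packet_delay(sent_ts_ms, received_ts_ms):
    """Delay of one packet: received time stamp minus sent time stamp."""
    return received_ts_ms - sent_ts_ms


def desired_delay_from_max(delays_ms, window=100):
    """Use the maximum delay among the most recent packets as the desired
    buffer delay. Assumes at least one delay has been recorded."""
    recent = delays_ms[-window:]
    return max(recent)


# (sent, received) time stamps in milliseconds; the fourth packet is an outlier.
timestamps = [(0, 40), (20, 62), (40, 81), (60, 360), (80, 123)]
delays = [packet_delay(sent, received) for sent, received in timestamps]
print(delays)
print(desired_delay_from_max(delays))
```

A single late packet raises the desired delay from roughly 43 ms to 300 ms in this example, which illustrates how an outlier can cause the system to adapt to an unrepresentative packet.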
One alternative is to calculate the desired delay from formulas such as running averages and running variances. However, many parameters need to be optimized in such calculations, and it is difficult to find the right tradeoff between calculation and response speed on the one hand, and basing the calculations on a representative population of history values on the other hand.
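One possible form of such a calculation is sketched below, assuming an exponentially weighted running mean and variance, with the desired delay set to the mean plus a multiple of the standard deviation. The smoothing factor `alpha` and multiplier `k` are hypothetical tuning parameters of the kind the text says must be optimized; the specific formula is an assumption, not taken from the description above.

```python
class RunningDelayEstimator:
    """Exponentially weighted running mean and variance of packet delay."""

    def __init__(self, alpha=0.1, k=4.0):
        self.alpha = alpha    # larger alpha reacts faster, with less history
        self.k = k            # safety margin in standard deviations
        self.mean = None
        self.var = 0.0

    def update(self, delay):
        """Fold in one packet delay and return the new desired delay."""
        if self.mean is None:
            self.mean = float(delay)
        else:
            diff = delay - self.mean
            self.mean += self.alpha * diff
            self.var = (1.0 - self.alpha) * (self.var + self.alpha * diff * diff)
        return self.desired_delay()

    def desired_delay(self):
        return self.mean + self.k * self.var ** 0.5


estimator = RunningDelayEstimator()
for delay_ms in [40, 40, 40, 100]:
    desired = estimator.update(delay_ms)
print(f"mean = {estimator.mean:.1f} ms, desired delay = {desired:.1f} ms")
```

A small `alpha` bases the estimate on a longer history but responds slowly to changes, while a large `alpha` responds quickly but rests on fewer packets, which is the tradeoff described above.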
Another approach is to directly analyze the packet delay distribution. For example, a histogram of packet delays may be maintained. The width of a bin in the delay time histogram represents the desired accuracy with which the optimal delay will be calculated. Decreasing the bin size improves the accuracy. The shape of the histogram approximately mirrors the underlying packet delay distribution.
When a new packet arrives, the packet delay is mapped to the corresponding bin and the count of the packets that fall into that bin is incremented. To reflect the age of older packets, counts in all other bins are scaled down in a process called “aging.” To find the new desired delay, the decoder sets a desired loss rate. Typical values range between one percent and five percent. The histogram is analyzed to determine the value of the desired delay that is needed to achieve the desired loss. One problem with this approach is that some parameters need to be tuned, such as bin width and aging factors. In addition, all old packets are treated similarly in the aging process, and the aging approach itself plays an overly significant role in the overall performance of the technique. Furthermore, a clock-drift situation may occur. Clock drift occurs when the clock rates of different devices are not the same. If clock drift occurs between encoder-side devices that apply sent time stamps and decoder-side devices that apply received time stamps, the overall delay has either a positive or a negative trend. This can cause the histogram to drift along the delay timeline even when the histogram should be static.
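The histogram approach described above can be sketched as follows. The class name, the default bin width, number of bins, and aging factor are hypothetical values chosen for illustration, and clamping out-of-range delays to the first or last bin is an assumption of this sketch.

```python
class DelayHistogram:
    """Histogram of packet delays with aging of older counts."""

    def __init__(self, bin_width_ms=10, num_bins=50, aging=0.99):
        self.bin_width = bin_width_ms
        self.aging = aging                  # scale factor applied to other bins
        self.counts = [0.0] * num_bins

    def add(self, delay_ms):
        """Map a delay to its bin, age all other bins, increment this bin."""
        index = int(delay_ms // self.bin_width)
        index = min(max(index, 0), len(self.counts) - 1)
        for i in range(len(self.counts)):
            if i != index:
                self.counts[i] *= self.aging
        self.counts[index] += 1.0

    def desired_delay(self, loss_rate=0.02):
        """Smallest bin upper edge such that at most loss_rate of the
        histogram mass lies in later bins."""
        total = sum(self.counts)
        if total == 0:
            return 0
        cumulative = 0.0
        for index, count in enumerate(self.counts):
            cumulative += count
            if cumulative >= (1.0 - loss_rate) * total:
                return (index + 1) * self.bin_width
        return len(self.counts) * self.bin_width


histogram = DelayHistogram(aging=1.0)       # aging disabled for this example
for _ in range(99):
    histogram.add(25)
histogram.add(205)                          # one late outlier
print(histogram.desired_delay(loss_rate=0.02))
print(histogram.desired_delay(loss_rate=0.0))
```

With a two percent desired loss rate, the single late packet is allowed to miss its deadline and the desired delay stays at the upper edge of the 20 to 30 ms bin; with a zero percent loss rate the delay must cover the outlier.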