With the introduction of compact discs, digital wireless telephone networks, and audio delivery over the Internet, digital audio has become commonplace. Engineers use a variety of techniques to process digital audio efficiently while still maintaining its quality. To understand these techniques, it helps to understand how audio information is represented and processed in a computer.
I. Representation of Audio Information in a Computer
A computer processes audio information as a series of numbers representing the audio information. For example, a single number can represent an audio sample, which is an amplitude value (i.e., loudness) at a particular time. Several factors affect the quality of the audio information, including sample depth, sampling rate, and channel mode.
Sample depth (or precision) indicates the range of numbers used to represent a sample. The more values possible for the sample, the higher the quality because the number can capture more subtle variations in amplitude. For example, an 8-bit sample has 256 possible values, while a 16-bit sample has 65,536 possible values.
The sampling rate (usually measured as the number of samples per second) also affects quality. The higher the sampling rate, the higher the quality because more frequencies of sound can be represented. Some common sampling rates are 8,000, 11,025, 22,050, 32,000, 44,100, 48,000, and 96,000 samples/second.
Table 1 shows several formats of audio with different quality levels, along with corresponding raw bitrate costs.
TABLE 1: Bitrates for different quality audio information

Quality              Sample Depth    Sampling Rate      Mode     Raw Bitrate
                     (bits/sample)   (samples/second)            (bits/second)
Internet telephony   8               8,000              mono     64,000
Telephone            8               11,025             mono     88,200
CD audio             16              44,100             stereo   1,411,200
High quality audio   16              48,000             stereo   1,536,000
As Table 1 shows, the cost of high quality audio information such as CD audio is high bitrate. High quality audio information consumes large amounts of computer storage and transmission capacity. Companies and consumers increasingly depend on computers, however, to create, distribute, and play back high quality audio content.
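The raw bitrates in Table 1 follow from multiplying sample depth, sampling rate, and the number of channels (1 for mono, 2 for stereo). The short sketch below reproduces that arithmetic; the function name `raw_bitrate` and the `FORMATS` list are illustrative names, not part of any codec.

```python
def raw_bitrate(sample_depth_bits, sampling_rate, channels):
    """Raw (uncompressed) bitrate in bits/second."""
    return sample_depth_bits * sampling_rate * channels

# Rows of Table 1: (quality label, bits/sample, samples/second, channels)
FORMATS = [
    ("Internet telephony", 8, 8_000, 1),
    ("Telephone", 8, 11_025, 1),
    ("CD audio", 16, 44_100, 2),
    ("High quality audio", 16, 48_000, 2),
]

for label, depth, rate, channels in FORMATS:
    print(f"{label}: {raw_bitrate(depth, rate, channels):,} bits/second")
```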
II. Audio Compression and Decompression
Many computers and computer networks lack the resources to process raw digital audio. Compression (also called encoding or coding) decreases the cost of storing and transmitting audio information by converting the information into a lower bitrate form. Compression can be lossless (in which quality does not suffer) or lossy (in which quality suffers but bitrate reduction through subsequent lossless compression is more dramatic). Decompression (also called decoding) extracts a reconstructed version of the original information from the compressed form.
Generally, the goal of audio compression is to digitally represent audio signals so as to provide maximum signal quality with the fewest possible bits. A conventional audio encoder/decoder [“codec”] system uses subband/transform coding, quantization, rate control, and variable length coding to achieve its compression. The quantization and other lossy compression techniques introduce potentially audible noise into an audio signal. The audibility of the noise depends on how much noise there is and how much of the noise the listener perceives. The first factor relates mainly to objective quality, while the second factor depends on human perception of sound. The conventional audio encoder then losslessly compresses the quantized data using variable length coding to further reduce bitrate.
A. Lossy Compression and Decompression of Audio Data
Conventionally, an audio encoder uses a variety of different lossy compression techniques. These lossy compression techniques typically involve frequency transforms, perceptual modeling/weighting, and quantization. The corresponding decompression involves inverse quantization, inverse weighting, and inverse frequency transforms.
Frequency transform techniques convert data into a form that makes it easier to separate perceptually important information from perceptually unimportant information. The less important information can then be subjected to more lossy compression, while the more important information is preserved, so as to provide the best perceived quality for a given bitrate. A frequency transformer typically receives the audio samples and converts them into data in the frequency domain, sometimes called frequency coefficients or spectral coefficients.
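As a rough illustration of a frequency transform, the sketch below applies a naive DCT-II to one block of samples. The text above does not name a particular transform, so the choice of DCT-II, the block size of 16, and the function name `dct_ii` are assumptions made only for this example. The input block contains a single low-frequency cosine, and the transform concentrates all of its energy in one frequency coefficient.

```python
import math

def dct_ii(samples):
    """Naive O(N^2) DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)."""
    n_total = len(samples)
    return [
        sum(x * math.cos(math.pi / n_total * (n + 0.5) * k)
            for n, x in enumerate(samples))
        for k in range(n_total)
    ]

# Assumed input: 16 time-domain samples of one low-frequency cosine
# (chosen to match DCT-II basis function k = 2).
N = 16
block = [math.cos(math.pi / N * (n + 0.5) * 2) for n in range(N)]

coeffs = dct_ii(block)
for k, c in enumerate(coeffs):
    print(f"coefficient {k:2d}: {c: .6f}")
```

With this input, only coefficient 2 is significantly non-zero (it equals N/2); the rest are zero up to floating-point rounding.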
Most energy in natural sounds such as speech and music is concentrated in the low frequency range. This means that, statistically, higher frequency ranges will have more frequency coefficients that are zero or near zero, reflecting the lack of energy in the higher frequency ranges.
Perceptual modeling involves processing audio data according to a model of the human auditory system to improve the perceived quality of the reconstructed audio signal for a given bitrate. For example, an auditory model typically considers the range of human hearing and critical bands. Using the results of the perceptual modeling, an encoder shapes noise (e.g., quantization noise) in the audio data with the goal of minimizing the audibility of the noise for a given bitrate. While the encoder must at times introduce noise (e.g., quantization noise) to reduce bitrate, the weighting allows the encoder to put more noise in bands where it is less audible, and vice versa.
Quantization maps ranges of input values to single values, introducing irreversible loss of information or quantization noise, but also allowing an encoder to regulate the quality and bitrate of the output. Sometimes, the encoder performs quantization in conjunction with a rate controller that adjusts the quantization to regulate bitrate and/or quality. There are various kinds of quantization, including adaptive and non-adaptive, scalar and vector, uniform and non-uniform. Perceptual weighting can be considered a form of non-uniform quantization.
Inverse quantization and inverse weighting reconstruct the weighted, quantized frequency coefficient data to an approximation of the original frequency coefficient data. The inverse frequency transformer then converts the reconstructed frequency coefficient data into reconstructed time domain audio samples.
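The quantization and inverse quantization steps described above can be sketched with a uniform scalar quantizer, the simplest of the kinds listed. The step size, the sample coefficient values, and the function names `quantize` and `dequantize` are assumptions for illustration; perceptual weighting and rate control are omitted.

```python
def quantize(coeffs, step):
    """Map each coefficient to the integer index of the nearest multiple of step."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Reconstruct an approximation of each coefficient from its index."""
    return [q * step for q in levels]

# Hypothetical frequency coefficients and step size.
coeffs = [12.3, -7.8, 0.4, -0.2, 3.9]
step = 2.0

levels = quantize(coeffs, step)      # small integers, cheaper to entropy code
approx = dequantize(levels, step)    # lossy reconstruction

print(levels)   # [6, -4, 0, 0, 2]
print(approx)   # [12.0, -8.0, 0.0, 0.0, 4.0]
```

Each reconstructed value differs from the original by at most half the step size, which is the quantization noise. A larger step lowers bitrate (more values collapse to 0) at the cost of more noise.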
B. Lossless Compression and Decompression of Audio Data
Conventionally, an audio encoder uses one or more of a variety of different lossless compression techniques. In general, lossless compression techniques include run-length encoding, Huffman encoding, and arithmetic coding. The corresponding decompression techniques include run-length decoding, Huffman decoding, and arithmetic decoding.
Run-length encoding is a simple, well-known compression technique used for camera video, text, and other types of content. In general, run-length encoding replaces a sequence (i.e., run) of consecutive symbols having the same value with the value and the length of the sequence. In run-length decoding, the sequence of consecutive symbols is reconstructed from the run value and run length. Numerous variations of run-length encoding/decoding have been developed. For additional information about run-length encoding/decoding and some of its variations, see, e.g., Bell et al., Text Compression, Prentice Hall PTR, pages 105-107, 1990; Gibson et al., Digital Compression for Multimedia, Morgan Kaufmann, pages 17-62, 1998; U.S. Pat. No. 6,304,928 to Mairs et al.; U.S. Pat. No. 5,883,633 to Gill et al; and U.S. Pat. No. 6,233,017 to Chaddha.
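A minimal sketch of basic run-length encoding and decoding follows. It represents each run as a (value, length) pair, which is one of many possible conventions; the function names are illustrative.

```python
from itertools import groupby

def rle_encode(symbols):
    """Replace each run of equal consecutive symbols with (value, run length)."""
    return [(value, len(list(run))) for value, run in groupby(symbols)]

def rle_decode(pairs):
    """Rebuild the symbol sequence from (value, run length) pairs."""
    out = []
    for value, length in pairs:
        out.extend([value] * length)
    return out

data = [5, 5, 5, 0, 0, 0, 0, 7, 7]
pairs = rle_encode(data)
print(pairs)                      # [(5, 3), (0, 4), (7, 2)]
print(rle_decode(pairs) == data)  # True
```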
Run-level encoding is similar to run-length encoding in that runs of consecutive symbols having the same value are replaced with run lengths. The value for the runs is the predominant value (e.g., 0) in the data, and runs are separated by one or more levels having a different value (e.g., a non-zero value).
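Run-level encoding can be sketched as follows. The pairing convention used here, a run of the predominant value followed by the level that ends it, plus a separate count of trailing predominant values, is an assumption for illustration; actual encoders differ in how they pair runs with levels and how they signal the end of a block.

```python
def run_level_encode(symbols, predominant=0):
    """Return ([(run, level), ...], trailing_run) for the symbol sequence."""
    pairs = []
    run = 0
    for s in symbols:
        if s == predominant:
            run += 1                 # extend the current run
        else:
            pairs.append((run, s))   # run of predominant values, then a level
            run = 0
    return pairs, run                # leftover run has no level after it

def run_level_decode(pairs, trailing, predominant=0):
    """Rebuild the sequence from (run, level) pairs and the trailing run."""
    out = []
    for run, level in pairs:
        out.extend([predominant] * run)
        out.append(level)
    out.extend([predominant] * trailing)
    return out

# Hypothetical quantized coefficients in which zero predominates.
coeffs = [0, 0, 3, 0, 0, 0, -1, 2, 0, 0]
pairs, trailing = run_level_encode(coeffs)
print(pairs, trailing)   # [(2, 3), (3, -1), (0, 2)] 2
```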
The results of run-length encoding (e.g., the run values and run lengths) or run-level encoding can be Huffman encoded to further reduce bitrate. If so, the Huffman encoded data is Huffman decoded before run-length decoding.
Huffman encoding is another well-known compression technique used for camera video, text, and other types of content. In general, a Huffman code table associates variable-length Huffman codes with unique symbol values (or unique combinations of values). Shorter codes are assigned to more probable symbol values, and longer codes are assigned to less probable symbol values. The probabilities are computed for typical examples of some kind of content. Or, the probabilities are computed for data just encoded or data to be encoded, in which case the Huffman codes adapt to changing probabilities for the unique symbol values. Compared to static Huffman coding, adaptive Huffman coding usually reduces the bitrate of compressed data by incorporating more accurate probabilities for the data, but extra information specifying the Huffman codes may also need to be transmitted.
To encode symbols, the Huffman encoder replaces symbol values with the variable-length Huffman codes associated with the symbol values in the Huffman code table. To decode, the Huffman decoder replaces the Huffman codes with the symbol values associated with the Huffman codes.
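The table-driven encoding and decoding just described can be sketched as below. This example builds the code table from the frequencies of the message itself, which is only one of the options mentioned above, and represents codes as strings of "0" and "1" for readability rather than packed bits. All function names are illustrative.

```python
import heapq
from collections import Counter

def build_huffman_table(symbols):
    """Return {symbol: bit string}, with shorter codes for more frequent symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:                      # degenerate one-symbol alphabet
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, t0 = heapq.heappop(heap)       # two least probable subtrees
        w1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t0.items()}
        merged.update({s: "1" + c for s, c in t1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]

def huffman_encode(symbols, table):
    return "".join(table[s] for s in symbols)

def huffman_decode(bits, table):
    inverse = {code: s for s, code in table.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:                # codes are prefix-free
            out.append(inverse[current])
            current = ""
    return out

message = list("aaaaaaaabbbccd")
table = build_huffman_table(message)
bits = huffman_encode(message, table)
print(table, len(bits))
print(huffman_decode(bits, table) == message)
```

For this message the most frequent symbol receives a 1-bit code, and the encoded length is 23 bits, compared with 28 bits for a fixed 2-bit code per symbol.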
In scalar Huffman coding, a Huffman code table associates a single Huffman code with one value, for example, a direct level of a quantized data value. In vector Huffman coding, a Huffman code table associates a single Huffman code with a combination of values, for example, a group of direct levels of quantized data values in a particular order. Vector Huffman encoding can lead to better bitrate reduction than scalar Huffman encoding (e.g., by allowing the encoder to exploit probabilities fractionally in binary Huffman codes). On the other hand, the codebook for vector Huffman encoding can be extremely large when single codes represent large groups of symbols or symbols have large ranges of potential values (due to the large number of potential combinations). For example, if the alphabet size is 256 (for values 0 to 255 per symbol) and the number of symbols per vector is 4, the number of potential combinations is 256^4 = 4,294,967,296. This consumes memory and processing resources in computing the codebook and finding Huffman codes, and consumes transmission resources in transmitting the codebook.
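The tradeoff between scalar and vector Huffman coding can be illustrated numerically. The sketch below assumes a hypothetical two-symbol source with probabilities 0.9 and 0.1 and independent symbols; these values and the function names are chosen only for illustration. A scalar Huffman code must spend at least one whole bit per symbol, whereas a code over pairs of symbols can spend less than one bit per symbol on average.

```python
import heapq
from itertools import product

def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a {symbol: probability} dict."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    next_id = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)
        p1, _, group1 = heapq.heappop(heap)
        for s in group0 + group1:
            lengths[s] += 1          # each merge adds one bit to every member
        heapq.heappush(heap, (p0 + p1, next_id, group0 + group1))
        next_id += 1
    return lengths

def bits_per_symbol(probs, vector_size):
    """Expected Huffman bits per source symbol when coding vectors of symbols."""
    vectors = {}
    for combo in product(probs, repeat=vector_size):
        p = 1.0
        for s in combo:
            p *= probs[s]            # assumes symbols are independent
        vectors[combo] = p
    lengths = huffman_code_lengths(vectors)
    expected_bits = sum(vectors[v] * lengths[v] for v in vectors)
    return expected_bits / vector_size

source = {"zero": 0.9, "nonzero": 0.1}    # hypothetical probabilities
scalar = bits_per_symbol(source, 1)
pairs = bits_per_symbol(source, 2)
print(f"scalar: {scalar:.3f} bits/symbol, pairs: {pairs:.3f} bits/symbol")

# Codebook growth for the example in the text: 256 values, 4 symbols per vector.
codebook_entries = 256 ** 4
print(f"{codebook_entries:,} potential combinations")
```

Under these assumptions, the scalar code costs 1 bit per symbol and the pair code about 0.645 bits per symbol, while the codebook size grows exponentially with the vector dimension.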
Numerous variations of Huffman encoding/decoding have been developed. For additional information about Huffman encoding/decoding and some of its variations, see, e.g., Bell et al., Text Compression, Prentice Hall PTR, pages 105-107, 1990; Gibson et al., Digital Compression for Multimedia, Morgan Kaufmann, pages 17-62, 1998.
U.S. Pat. No. 6,223,162 to Chen et al. describes multi-level run-length coding of audio data. A frequency transformation produces a series of frequency coefficient values. For portions of a frequency spectrum in which the predominant value is zero, a multi-level run-length encoder statistically correlates runs of zero values with adjacent non-zero values and assigns variable length code words. An encoder uses a specialized codebook generated with respect to the probability of receiving an input run of zero-valued spectral coefficients followed by a non-zero coefficient. A corresponding decoder associates a variable length code word with a run of zero value coefficients and an adjacent non-zero value coefficient.
U.S. Pat. No. 6,377,930 to Chen et al. describes variable to variable length encoding of audio data. An encoder assigns a variable length code to a variable size group of frequency coefficient values.
U.S. Pat. No. 6,300,888 to Chen et al. describes entropy code mode switching for frequency domain audio coding. A frequency-domain audio encoder selects among different entropy coding modes according to the characteristics of an input stream. In particular, the input stream is partitioned into frequency ranges according to statistical criteria derived from statistical analysis of typical or actual input to be encoded. Each range is assigned an entropy encoder optimized to encode that range's type of data. During encoding and decoding, a mode selector applies the correct method to the different frequency ranges. Partition boundaries can be decided in advance, allowing the decoder to implicitly know which decoding method to apply to encoded data. Or, adaptive arrangements may be used, in which boundaries are flagged in the output stream to indicate a change in encoding mode for subsequent data. For example, a partition boundary separates primarily zero quantized frequency coefficients from primarily non-zero quantized coefficients, and the encoder then applies coders optimized for each kind of data.
For additional detail about the Chen patents, see the patents themselves.
Arithmetic coding is another well-known compression technique used for camera video and other types of content. Arithmetic coding is sometimes used in applications where the optimal number of bits to encode a given input symbol is a fractional number of bits, and in cases where a statistical correlation among certain individual input symbols exists. Arithmetic coding generally involves representing an input sequence as a single number within a given range. Typically, the number is a fractional number between 0 and 1. Symbols in the input sequence are associated with ranges occupying portions of the space between 0 and 1. The ranges are calculated based on the probability of the particular symbol occurring in the input sequence. The fractional number used to represent the input sequence is constructed with reference to the ranges. Therefore, probability distributions for input symbols are important in arithmetic coding schemes.
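The interval narrowing described above can be sketched with exact rational arithmetic. This is a simplified illustration under stated assumptions: the symbol probabilities are fixed and hypothetical, the message length is passed to the decoder separately, and Python's `Fraction` type stands in for the finite-precision integer arithmetic that practical arithmetic coders use. The function names are illustrative.

```python
from fractions import Fraction

def build_ranges(probs):
    """Give each symbol a sub-range [low, high) of [0, 1) sized by its probability."""
    ranges, low = {}, Fraction(0)
    for symbol, p in probs.items():
        ranges[symbol] = (low, low + p)
        low += p
    return ranges

def arithmetic_encode(symbols, ranges):
    """Narrow [0, 1) once per symbol; any number in the result identifies the input."""
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        s_low, s_high = ranges[s]
        low += width * s_low
        width *= s_high - s_low
    return low, low + width

def arithmetic_decode(number, length, ranges):
    """Recover `length` symbols from a number inside the encoded interval."""
    out = []
    for _ in range(length):
        for s, (s_low, s_high) in ranges.items():
            if s_low <= number < s_high:
                out.append(s)
                number = (number - s_low) / (s_high - s_low)  # rescale to [0, 1)
                break
    return out

probs = {"a": Fraction(3, 4), "b": Fraction(1, 4)}   # hypothetical probabilities
ranges = build_ranges(probs)
low, high = arithmetic_encode("aab", ranges)
print(low, high)                                     # 27/64 9/16
print(arithmetic_decode(Fraction(1, 2), 3, ranges))  # ['a', 'a', 'b']
```

The final interval has width 9/64, so identifying it takes roughly 2.83 bits for three symbols, which shows how the cost per symbol need not be a whole number of bits.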
In context-based arithmetic coding, different probability distributions for the input symbols are associated with different contexts. The probability distribution used to encode the input sequence changes when the context changes. The context can be calculated by measuring different factors that are expected to affect the probability of a particular input symbol appearing in an input sequence. For additional information about arithmetic encoding/decoding and some of its variations, see Nelson, The Data Compression Book, “Huffman One Better: Arithmetic Coding,” Chapter 5, pp. 123-65 (1992).
Various codec systems and standards use lossless compression and decompression, including versions of Microsoft Corporation's Windows Media Audio [“WMA”] encoder and decoder. Other codec systems are provided or specified by the Moving Picture Experts Group, Audio Layer 3 [“MP3”] standard, the Moving Picture Experts Group 2, Advanced Audio Coding [“AAC”] standard, and Dolby AC3. For additional information, see the respective standards or technical publications.
Whatever the advantages of prior techniques and systems for lossless compression of audio data, they do not have the advantages of the present invention.