1. Field of the Invention
The present invention generally relates to digital processing, specifically audio encoding and decoding, and more particularly to a method of encoding and decoding audio signals using psychoacoustic-based compression.
2. Description of the Related Art
Many audio encoding technologies use psychoacoustic methods to code audio signals in a perceptually transparent fashion. Due to the finite time-frequency resolution of the human auditory anatomy, the ear is able to perceive only a limited amount of information present in the stimulus. Accordingly, it is possible to compress or filter out portions of an audio signal, effectively discarding that information, without sacrificing the perceived quality of the reconstructed signal.
One audio encoder which uses psychoacoustic compression is MPEG-1 Layer 3 (also referred to as “MP3”). MPEG is an acronym for the Moving Picture Experts Group, an industry standards body created to develop comprehensive guidelines for the transmission of digitally encoded audio and video (moving pictures) data. MP3 encoding is described in detail in ISO/IEC 11172-3, Information Technology—Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s—which is incorporated by reference herein in its entirety. There are currently three “layers” of audio encoding in the MPEG-1 standard, offering increasing levels of compression at the cost of higher computational requirements. The standard supports three sampling rates of 32, 44.1 and 48 kHz, and output bit rates between 32 and 384 kbits/sec. The transmission can be mono, dual channel (e.g., bilingual), stereo, or joint stereo (where the redundancy or correlations between the left and right channels can be exploited).
MPEG Layer 1 has the lowest encoder complexity, using a 32-subband polyphase analysis filterbank and a 512-point fast Fourier transform (FFT) for the psychoacoustic model. The optimal bit rate per channel for MPEG Layer 1 is at least 192 kbits/sec. Typical data reduction rates (for stereo signals) are about 4 times. The most common application for MPEG Layer 1 is digital compact cassettes (DCCs).
MPEG Layer 2 has moderate encoder complexity using a 1024-point FFT for the psychoacoustic model and more efficient coding of side information. The optimal bit rate per channel for MPEG Layer 2 is at least 128 kbits/sec. Typical data reduction rates (for stereo signals) are about 6–8 times. Common applications for MPEG Layer 2 include video compact discs (V-CDs) and digital audio broadcast.
MPEG Layer 3 has the highest encoder complexity, applying a frequency transform to all subbands for increased resolution and allowing for a variable bit rate. Layer 3 (sometimes referred to as Layer III) combines attributes of both the MUSICAM and ASPEC coders. The coded bit stream can provide an embedded error-detection code by way of cyclic redundancy checks (CRC). The encoding and decoding algorithms are asymmetrical, that is, the encoder is more complicated and computationally expensive than the decoder. The optimal bit rate per channel for MPEG Layer 3 is at least 64 kbits/sec. Typical data reduction rates (for stereo signals) are about 10–12 times. One common application for MPEG Layer 3 is high-speed streaming using, for example, an integrated services digital network (ISDN).
The standard describing each of these MPEG-1 layers specifies the syntax of coded bit streams, defines decoding processes, and provides compliance tests for assessing the accuracy of the decoding processes. However, there are no MPEG-1 compliance requirements for the encoding process except that it should generate a valid bit stream that can be decoded by the specified decoding processes. System designers are free to add other features or implementations as long as they remain within the relatively broad bounds of the standard.
The MP3 algorithm has become the de facto standard for multimedia applications, storage applications, and transmission over the Internet. The MP3 algorithm is also used in popular portable digital players. MP3 takes advantage of the limitations of the human auditory system by removing parts of the audio signal that cannot be detected by the human ear. Specifically, MP3 takes advantage of the inability of the human ear to detect quantization noise in the presence of auditory masking. A very basic functional block diagram of an MP3 audio coder/decoder (codec) is illustrated in FIGS. 1A and 1B.
The algorithm operates on blocks of data. The input audio stream to the encoder 1 is typically a pulse-code modulated (PCM) signal which is sampled at a rate at least twice the highest frequency of the original analog source, as required by the Nyquist theorem. The PCM samples in a data block are fed to an analysis filterbank 2 and a perceptual model 3. Filterbank 2 divides the data into multiple frequency subbands (for MP3, there are 32 subbands which correspond in frequency to those used by Layer 2). The same data block of PCM samples is used by perceptual model 3 to determine a ratio of signal energy to a masking threshold for each scalefactor band (a scalefactor band is a grouping of transform coefficients which approximately represents a critical band of human hearing). The masking thresholds are set according to the particular psychoacoustic model employed. The perceptual model also determines whether the subsequent transform, such as a modified discrete cosine transform (MDCT), is applied using short or long time windows. Each subband can be further subdivided; MP3 subdivides each of the 32 subbands into 18 transform coefficients for a total of 576 transform coefficients using an MDCT. Based on the masking ratios provided by the perceptual model and the available bits (i.e., the target bit rate), bit/noise allocation, quantization and coding unit 4 iteratively allocates bits to the various transform coefficients so as to reduce the audibility of the quantization noise. These quantized subband samples and the side information are packed into a coded bit stream (frame) by bitpacker 5 which uses entropy coding. Ancillary data may also be inserted into the frame, but such data reduces the number of bits that can be devoted to the audio encoding. The frame may additionally include other bits, such as a header and CRC check bits.
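The subband subdivision described above can be illustrated with a toy transform. The following sketch applies a textbook MDCT (2N inputs mapped to N outputs) to 32 blocks of 36 placeholder samples; it omits the standard's polyphase filterbank, overlapping windows, and aliasing butterflies, and the sinusoidal test data is purely illustrative.

```python
import math

def mdct(x):
    """Modified discrete cosine transform: 2N time samples -> N coefficients."""
    N = len(x) // 2
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2.0) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

# Toy granule: 32 subbands of 36 subband samples each (placeholder sinusoids;
# a real encoder obtains these from the polyphase analysis filterbank and
# applies overlapping windows before the transform).
subbands = [[math.sin(2 * math.pi * (b + 1) * n / 36.0) for n in range(36)]
            for b in range(32)]

# Each 36-sample block yields 18 MDCT coefficients (the long-block case),
# giving 32 * 18 = 576 coefficients per granule, as stated above.
coeffs = [mdct(s) for s in subbands]
print(len(coeffs), len(coeffs[0]))  # prints "32 18"
```

The 576-coefficient layout produced here is the input to the bit/noise allocation stage described next.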
As seen in FIG. 1B, the encoded bit stream is transmitted to a decoder 6. The frame is received by a bit stream unpacker 7, which strips away any ancillary data and side information. The encoded audio bits are passed to a frequency sample reconstruction unit 8 which deciphers and extracts the quantized subband values. Synthesis filterbank 9 is then used to restore the values to a PCM signal.
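The work of frequency sample reconstruction unit 8 can be sketched for a single spectral value. The 4/3 power law and the 2^(gain/4) step size below follow ISO/IEC 11172-3, but this is a simplified illustration: the standard's fixed gain offset, preemphasis, and block-type scalefactor terms are omitted, and the way the scalefactor is folded into the exponent is an assumed convention.

```python
import math

def dequantize(ix, global_gain, scalefac):
    """Invert the Layer III power-law quantizer for one spectral value.

    Simplified sketch: the 4/3 exponent and 2**(gain/4) step follow the
    standard, but the fixed gain offset, preemphasis, and block-type
    scalefactor terms are omitted, and the scalefactor convention here
    (one scalefactor step halves the decoder's step size) is assumed.
    """
    step = 2.0 ** ((global_gain - 4 * scalefac) / 4.0)
    mag = abs(ix) ** (4.0 / 3.0)
    return math.copysign(mag * step, ix)

# Small quantized values expand nonuniformly: +/-1 maps back to +/-1,
# but +/-2 maps to +/-2**(4/3), i.e. about +/-2.52.
print([dequantize(ix, global_gain=0, scalefac=0) for ix in (-2, 0, 2)])
```

Because the quantizer is nonuniform, larger quantized magnitudes are spaced farther apart, which concentrates precision on the small spectral values where quantization noise is most audible.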
FIG. 2 further illustrates the manner in which the subband values are determined by bit/noise allocation, quantization and coding unit 4 as prescribed by ISO/IEC 11172-3. Initially, a scalefactor of unity (1.0) is set for each scalefactor band at block 10. Transform coefficients are provided by the frequency-domain transform of the PCM samples at block 11 using, for example, an MDCT. The initial scalefactors are then respectively applied at block 12 to the transform coefficients for each scalefactor band. A global gain factor is then set to its maximum possible value at block 13. The total gain for a particular scalefactor band is the global gain combined with the scalefactor for that particular scalefactor band. The global gain is applied in block 14 to each of the scalefactor bands, and the quantization process is then carried out for each scalefactor band at block 15. Quantization rounds each amplified transform coefficient to the nearest integer value. A calculation is performed in block 16 to determine the number of bits that are necessary to encode the quantized values, typically based on Huffman encoding. For example, with a target bit rate of 128 kbps and a sampling frequency of 44.1 kHz, a stereo MP3 frame has about 3344 bits available, of which 3056 can be used for audio signal encoding while the remainder are used for header and side information. If the number of bits required is greater than the number available as determined in block 17, the global gain is reduced in block 18. The process then repeats iteratively beginning with block 14. This first or “inner” loop repeats until an appropriate global gain factor is established which will comport with the number of available bits.
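The frame arithmetic quoted above can be checked directly. In this sketch, 1152 samples per frame is the MPEG-1 Layer III frame size, and the 288-bit overhead is assumed to break down as a 32-bit header plus 32 bytes of stereo side information, which matches the figures in the text:

```python
SAMPLES_PER_FRAME = 1152      # MPEG-1 Layer III samples per frame
bitrate = 128_000             # target bit rate, bits per second
fs = 44_100                   # sampling frequency, Hz

# A frame spans 1152/44100 seconds, which fixes the bit budget.
frame_bits = bitrate * SAMPLES_PER_FRAME / fs   # ~3343.7, i.e. about 3344

header_bits = 32              # frame header
side_info_bits = 256          # 32 bytes of side information (stereo, assumed)
audio_bits = round(frame_bits) - header_bits - side_info_bits

print(round(frame_bits), audio_bits)  # prints "3344 3056"
```

Note that the budget is not an integer number of bits per frame; real encoders absorb the fractional remainder with padding slots and the bit reservoir.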
Once an appropriate global gain factor is established by the inner loop, the distortion for each scalefactor band (sfb) is calculated at block 19. As seen in block 20, if the distortion values are less than the respective thresholds set by the mask of the perceptual model 3 being used, e.g., Psychoacoustic Model 2 as described in ISO/IEC 11172-3, then the quantization/allocation process is complete at block 22, and the bit stream can be packed for transmission. However, if any distortion value is greater than its respective threshold, the corresponding scalefactor is increased at block 21, and the entire process repeats iteratively beginning with block 12. This second or “outer” loop repeats until appropriate distortion values are calculated for all scalefactor bands. The re-execution of the outer loop necessarily results in the re-execution of the inner, nested loop as well. In other words, even though a global gain factor was already calculated by the inner loop in a previous iteration, that factor will be discarded when the outer loop repeats, and the global gain factor will be reset to the maximum at block 13. In this manner, the Layer III encoder 1 quantizes the spectral values by allocating just the right number of bits to each subband to maintain perceptual transparency at a given bit rate.
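The two nested loops of FIG. 2 can be sketched end to end. This is an illustrative reconstruction rather than the standard's algorithm: the quantizer omits the standard's small rounding bias, a crude bit-length estimate stands in for the Huffman tables, the gain and scalefactor exponent conventions are assumed, and a fixed round limit plays the role of the time-out used in real-time implementations.

```python
import math

def quantize(x, total_gain):
    """Simplified power-law quantizer (ISO/IEC 11172-3 also subtracts a
    small bias of ~0.0946 before rounding, omitted here)."""
    return int(round((abs(x) * 2.0 ** (total_gain / 4.0)) ** 0.75))

def dequantize(q, total_gain):
    """Inverse of quantize(), used to measure the quantization error."""
    return q ** (4.0 / 3.0) / 2.0 ** (total_gain / 4.0)

def bits_needed(values):
    """Crude stand-in for the Layer III Huffman tables: magnitude bits
    plus one sign bit per value."""
    return sum(v.bit_length() + 1 for v in values)

def allocate(bands, thresholds, available_bits, max_gain=64, max_rounds=30):
    """Nested allocation loops of FIG. 2. `bands` is a list of scalefactor
    bands (lists of spectral coefficients); `thresholds` are the allowed
    per-band distortions from the perceptual model (illustrative units)."""
    scalefacs = [0] * len(bands)                  # block 10: scalefactors at unity
    for _ in range(max_rounds):                   # outer: distortion control loop
        for gain in range(max_gain, -129, -1):    # inner: rate control loop;
            q = [[quantize(x, gain + 4 * sf)      # blocks 13-17: reset gain,
                  for x in band]                  # quantize, count bits
                 for band, sf in zip(bands, scalefacs)]
            if bits_needed([v for qb in q for v in qb]) <= available_bits:
                break                             # fits the budget: inner loop done
            # block 18: otherwise reduce the global gain and retry
        # blocks 19-20: per-band distortion vs. the masking thresholds
        dist = [sum((abs(x) - dequantize(v, gain + 4 * sf)) ** 2
                    for x, v in zip(band, qb))
                for band, qb, sf in zip(bands, q, scalefacs)]
        noisy = [i for i, (d, t) in enumerate(zip(dist, thresholds)) if d > t]
        if not noisy:
            return scalefacs, gain, q             # block 22: allocation complete
        for i in noisy:                           # block 21: amplify the noisy
            scalefacs[i] += 1                     # bands and repeat everything
    return scalefacs, gain, q                     # real-time encoders time out

# Illustrative run: 4 bands of 8 coefficients, generous thresholds.
bands = [[(i + 1) * math.sin(0.7 * n + i) for n in range(8)] for i in range(4)]
scalefacs, gain, q = allocate(bands, thresholds=[0.5] * 4, available_bits=200)
```

The sketch makes the cost structure visible: every pass of the outer loop restarts the inner loop from the maximum gain, so the total work grows with the product of the two loop counts, which is the inefficiency the following paragraph describes.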
The outer loop is known as the distortion control loop while the inner loop is known as the rate control loop. The distortion control loop shapes the quantization noise by applying the scalefactors in each scalefactor band while the inner loop adjusts the global gain so that the quantized values can be encoded using the available bits. This approach to bit/noise allocation in quantization leads to several problems. Foremost among these problems is the excessive processing power that is required to carry out the computations due to the iterative nature of the loops, particularly since the loops are nested. Moreover, increasing the scalefactors does not always reduce noise because of the rounding errors involved in the quantization process and also because a given scalefactor is applied to multiple transform coefficients in a single scalefactor band. Furthermore, although the process is iterative, it does not use a convergent solution. Thus, there is no limit to the number of iterations that may be required (for real-time implementations, the process is governed by a time-out). This computationally intensive approach has the further consequence of consuming more power in an electronic device. It would, therefore, be desirable to devise an improved method of quantizing frequency-domain values which does not require excessive iterations of scalefactor calculations. It would be further advantageous if the method could be easily implemented in either hardware or software.