To minimize the amount of data that must be stored and/or transmitted across a communication channel, content (e.g., audio and/or video information) is often compressed into a data stream with fewer bits than might otherwise be needed. Numerous methods for such compression have been developed. Some of those methods employ predictive coding techniques. For example, the Advanced Audio Coding (AAC) format specified by various Moving Picture Experts Group (MPEG) standards includes several sets of tools for coding (and subsequently decoding) audio content (e.g., music). Those tools, or profiles, include the Main, LC (Low Complexity), SSR (Scalable Sampling Rate) and LTP (Long-Term Prediction) profiles. LTP encoding can provide higher quality audio to the end-user, but at a price of increased computational requirements. This can result in a need for additional memory and processing hardware in a device such as a mobile phone or digital music player. Moreover, commercial necessity can require that devices intended to decode and play AAC audio data be able to accommodate multiple profiles. For example, users frequently wish to download music from a variety of sources. Some of those sources may encode music using the AAC-LC profile, while others may encode music using the AAC-LTP profile.
FIG. 1A is a block diagram showing a general structure for an AAC-LTP encoder. Although the operation of such encoders (and some corresponding decoders) is well known, the following overview is included to provide context for subsequent description. An incoming time domain audio signal is received by a long-term predictor 1, a modified discrete cosine transform (MDCT) 2, and by a psychoacoustic model 3. Long-term predictor 1 generates data (prediction coefficients and a pitch lag) that can be used to predict the currently input time-domain signal based on time domain signals for earlier portions of the audio stream. Time domain versions of those earlier portions are received as inputs from inverse modified discrete cosine transform (IMDCT) 4 and from a synthesis filter bank (not shown), and are stored by the long-term predictor in a buffer (also not shown in FIG. 1A). The prediction coefficients and pitch lag are provided by long-term predictor 1 to bit stream multiplexer 11. The predicted audio (i.e., the time domain audio signal that would result from the calculated prediction coefficients and pitch lag) is converted to the frequency domain by MDCT 5.
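As one illustration of how a pitch lag and a prediction coefficient might be derived, the following sketch searches candidate lags for the best normalized cross-correlation between the current frame and earlier samples, and then computes a single least-squares gain. The search method, the single-gain model, and the names `estimate_lag_and_gain`, `history`, `min_lag` and `max_lag` are assumptions made for illustration; they are not taken from FIG. 1A or from the applicable standards.

```python
def estimate_lag_and_gain(history, frame, min_lag, max_lag):
    """Return (lag, gain) for a hypothetical single-tap long-term predictor.

    history: earlier time domain samples, oldest first.
    frame:   current time domain samples to be predicted.
    Returns (None, 0.0) if no candidate lag correlates positively.
    """
    best_lag, best_gain, best_score = None, 0.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - lag
        if start < 0:
            break  # not enough history for this lag or any larger one
        candidate = list(history[start:start + len(frame)])
        # Pad with zeros if the lagged segment runs past the end of history.
        candidate += [0.0] * (len(frame) - len(candidate))
        corr = sum(x * c for x, c in zip(frame, candidate))
        energy = sum(c * c for c in candidate)
        if energy == 0.0 or corr <= 0.0:
            continue
        score = corr * corr / energy  # normalized cross-correlation measure
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain


# A signal with a period of 4 samples; the current frame is twice as loud.
history = [0.0, 1.0, 0.0, -1.0] * 2
frame = [0.0, 2.0, 0.0, -2.0]
lag, gain = estimate_lag_and_gain(history, frame, min_lag=4, max_lag=8)
```

In this toy case the search settles on a lag of 4 samples and a gain of 2.0, i.e., the current frame is modeled as a scaled copy of the samples one period earlier.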
The incoming time domain audio is also provided to a separate MDCT 2. Unlike MDCT 5, which transforms only the predicted version of that audio, MDCT 2 converts the original incoming audio signal to the frequency domain. The output from MDCT 2 is provided to a frequency selective switch (FSS) 7 (discussed below) and to a summer 6. Summer 6 computes a difference between the output of MDCT 5 (the frequency domain version of the predicted audio signal) and the output of MDCT 2 (the frequency domain version of the original audio signal). In effect, the output from summer 6 (or prediction error) is the difference between the actual audio signal and the predicted version of that same signal. The prediction error output from summer 6 is provided to FSS 7.
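The operation of summer 6 can be sketched as a coefficient-by-coefficient subtraction. The sign convention (original minus predicted) is chosen so that a decoder recovers the original by adding the error back to the prediction; the function name `prediction_error` and the four-coefficient example values are hypothetical.

```python
def prediction_error(original_coeffs, predicted_coeffs):
    """Difference between original and predicted frequency domain coefficients."""
    if len(original_coeffs) != len(predicted_coeffs):
        raise ValueError("coefficient arrays must have the same length")
    return [x - p for x, p in zip(original_coeffs, predicted_coeffs)]


original = [4.0, -2.0, 1.5, 0.0]     # stands in for the output of MDCT 2
predicted = [3.5, -2.5, 1.0, 0.25]   # stands in for the output of MDCT 5
error = prediction_error(original, predicted)
```

When the prediction is good, the error values are small relative to the original coefficients and can therefore be coded with fewer bits.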
FSS 7 receives control inputs from psychoacoustic model 3. Psychoacoustic model 3 contains experimentally-derived perceptual data regarding frequency ranges that are perceptible to human listeners. Psychoacoustic model 3 further contains data regarding certain types of audio patterns that are not well modeled using long-term prediction. For example, fast changing or transient signal segments can be difficult to model by prediction. Psychoacoustic model 3 examines the incoming audio signal in the time domain and evaluates which sub-bands should be represented by prediction error (from summer 6), prediction coefficients (from predictor 1) and pitch lag (also from predictor 1), as well as which sub-bands should be represented by MDCT coefficients of the original audio (from MDCT 2). Based on data from psychoacoustic model 3, FSS 7 selects data to be forwarded to block 8 for quantization and coding. For sub-bands where prediction is to be used, the prediction error coefficients from summer 6 are forwarded to quantizer/coder 8. For other sub-bands, the MDCT 2 output is forwarded to quantizer/coder 8. A control signal output from FSS 7 includes a flag for each sub-band indicating whether long-term prediction is enabled for that sub-band.
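The per-sub-band selection performed by FSS 7 can be sketched as follows. The sketch takes the sub-band decisions as an input, since how the psychoacoustic model reaches them is outside its scope; the function name, the `band_edges` layout and the toy values are assumptions made for illustration.

```python
def frequency_selective_switch(original, error, band_edges, ltp_flags):
    """Select, per sub-band, either prediction error or original coefficients.

    band_edges: hypothetical list of coefficient indices; sub-band b covers
                indices band_edges[b] up to (not including) band_edges[b + 1].
    ltp_flags:  one boolean per sub-band, True where prediction is enabled.
    """
    if len(band_edges) != len(ltp_flags) + 1:
        raise ValueError("need exactly one flag per sub-band")
    selected = []
    for band, use_prediction in enumerate(ltp_flags):
        start, stop = band_edges[band], band_edges[band + 1]
        source = error if use_prediction else original
        selected.extend(source[start:stop])
    return selected


original = [10, 11, 12, 13, 14, 15]   # stands in for the output of MDCT 2
error = [1, 2, 3, 4, 5, 6]            # stands in for the output of summer 6
flags = [True, False, True]           # prediction on for sub-bands 0 and 2
selected = frequency_selective_switch(original, error, [0, 2, 4, 6], flags)
```

Here the selected data is [1, 2, 12, 13, 5, 6]: error coefficients for the first and third sub-bands, and original coefficients for the second. The same flags must accompany the data so that a decoder can undo the selection.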
The signals from FSS 7 are then quantized in quantizer/coder 8 (e.g., using Huffman coding). Perceptual data from psychoacoustic model 3 is also used by quantizer/coder 8. The output from quantizer/coder 8 is then multiplexed in block 11 with control data from long-term predictor 1 (e.g., prediction coefficients and pitch lag) and FSS 7 (sub-band flags). From block 11 the multiplexed data is then provided to a communication channel (e.g., a radio or internet transmission) or storage medium. The output from quantizer/coder 8 is also provided to inverse quantizer 9. The output of inverse quantizer 9 is forwarded to inverse frequency selective switch (IFSS) 10, as is the output from MDCT 5 and control signals (sub-band flags) from FSS 7. IFSS 10 then provides, as to each sub-band for which quantized prediction error coefficients were transmitted on the bit stream, the sum of the de-quantized prediction error coefficients and the output from MDCT 5. As to each sub-band for which the quantized MDCT 2 output was transmitted on the bit stream, IFSS 10 provides the dequantized MDCT 2 output. The output from IFSS 10 is then converted back to the time domain by IMDCT 4. The time domain output from IMDCT 4 is then provided to long-term predictor 1. A portion of the IMDCT 4 output is stored directly in the prediction buffer described above; other portions of that buffer hold fully-reconstructed (time domain) audio data frames generated by overlap-add (in the synthesis filter bank) of output from IMDCT 4.
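The reason the encoder feeds its own quantized output back through inverse quantizer 9 can be illustrated with a quantization round trip. The sketch below uses a uniform scalar quantizer purely as a stand-in; the quantizer actually used in AAC is non-uniform, and the names `quantize`, `dequantize` and `step` are hypothetical.

```python
def quantize(coeffs, step):
    """Map each coefficient to the index of the nearest multiple of step."""
    return [round(c / step) for c in coeffs]


def dequantize(indices, step):
    """Recover approximate coefficient values from quantizer indices."""
    return [q * step for q in indices]


step = 0.5
selected = [0.9, -2.2, 3.6]            # stands in for the output of FSS 7
indices = quantize(selected, step)     # what is coded and transmitted
decoded = dequantize(indices, step)    # what a decoder will actually see
```

The dequantized values differ from the values that were quantized. By building its prediction buffer from the dequantized data rather than from the original signal, the encoder predicts from the same samples that a decoder will have available.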
FIG. 1B is a block diagram showing a general structure for an AAC-LTP decoder. The incoming bit stream is demultiplexed in block 15. The sub-band flags from FSS 7 (FIG. 1A) are provided to IFSS 17. The prediction coefficients and pitch lag from long-term predictor 1 in FIG. 1A are provided to pitch predictor 20. The quantized data from FSS 7 in FIG. 1A is dequantized in inverse quantizer 16, and then provided to IFSS 17. Based on the corresponding sub-band flag values, IFSS 17 determines whether long-term prediction was enabled for various sub-bands. For sub-bands where prediction was not enabled, IFSS 17 simply forwards the output of inverse quantizer 16 to IMDCT 18. For sub-bands where prediction was enabled, IFSS 17 adds the output of inverse quantizer 16 (i.e., the dequantized prediction error coefficients) to the output of MDCT 21 (discussed below), and forwards the result to IMDCT 18. IMDCT 18 then transforms the output of IFSS 17 back to the time domain. The output of IMDCT 18 is then used for overlap-add in a synthesis filter bank (not shown) to yield a fully-reconstructed time domain signal that is a close replica of the original audio signal input in FIG. 1A. This fully-reconstructed time domain signal can then be processed by a digital to analog converter (not shown in FIG. 1B) for playback on, e.g., one or more speakers.
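The behavior of IFSS 17 described above can be sketched as follows. The function name, the `band_edges` layout and the toy values are assumptions made for illustration, and quantization loss is ignored so that the arithmetic is easy to follow.

```python
def inverse_frequency_selective_switch(dequantized, predicted, band_edges, ltp_flags):
    """Rebuild frequency domain coefficients from dequantized data.

    For sub-bands with prediction enabled, the dequantized values are
    prediction errors and the predicted coefficients are added back.
    For other sub-bands, the dequantized values are forwarded unchanged.
    """
    if len(band_edges) != len(ltp_flags) + 1:
        raise ValueError("need exactly one flag per sub-band")
    rebuilt = list(dequantized)
    for band, use_prediction in enumerate(ltp_flags):
        if use_prediction:
            for k in range(band_edges[band], band_edges[band + 1]):
                rebuilt[k] += predicted[k]
    return rebuilt


dequantized = [1, 2, 12, 13, 5, 6]   # stands in for inverse quantizer 16 output
predicted = [9, 9, 0, 0, 9, 9]       # stands in for the output of MDCT 21
flags = [True, False, True]          # sub-band flags from the bit stream
rebuilt = inverse_frequency_selective_switch(dequantized, predicted, [0, 2, 4, 6], flags)
```

In this example the rebuilt coefficients are [10, 11, 12, 13, 14, 15]. The second sub-band passes through untouched, while the first and third are restored by adding the prediction to the transmitted error.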
Recent portions of the time domain output from IMDCT 18 and of the fully reconstructed time domain signal from the synthesis filter bank are also stored in long-term prediction (LTP) buffer 19. LTP buffer 19 has the same dimensions as, and is intended to replicate the contents of, the buffer within the long-term predictor 1 of FIG. 1A. Data from LTP buffer 19 is used by pitch predictor 20 (in conjunction with prediction coefficients and pitch lag values) to predict the incoming audio signal in the time domain. The output of pitch predictor 20 corresponds to the output of long-term predictor 1 provided to MDCT 5 in FIG. 1A. The output from pitch predictor 20 is then converted to the frequency domain in MDCT 21, with the output of MDCT 21 provided to IFSS 17.
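The bookkeeping needed to keep such a buffer current can be sketched as a small class. The sketch assumes that the buffer holds the two most recent fully reconstructed frames plus the latest aliased IMDCT output, each of the same length; the class name `LtpBuffer`, its method names and the two-sample frames in the example are hypothetical.

```python
class LtpBuffer:
    """Hypothetical three-frame prediction buffer, oldest samples first."""

    def __init__(self, frame_len):
        self.frame_len = frame_len
        self.older = [0.0] * frame_len     # older fully reconstructed frame
        self.recent = [0.0] * frame_len    # newest fully reconstructed frame
        self.aliased = [0.0] * frame_len   # latest aliased IMDCT output

    def update(self, reconstructed, aliased):
        """Shift in one newly reconstructed frame and one new aliased frame."""
        if len(reconstructed) != self.frame_len or len(aliased) != self.frame_len:
            raise ValueError("frames must contain frame_len samples")
        self.older = self.recent
        self.recent = list(reconstructed)
        self.aliased = list(aliased)

    def samples(self):
        """Return the buffer contents as one list, oldest samples first."""
        return self.older + self.recent + self.aliased


buf = LtpBuffer(frame_len=2)
buf.update(reconstructed=[1, 2], aliased=[3, 4])
buf.update(reconstructed=[5, 6], aliased=[7, 8])
contents = buf.samples()
```

After the second update the buffer holds [1, 2, 5, 6, 7, 8]. The previous aliased frame is discarded rather than shifted, because its content has been absorbed into the newly reconstructed frame by overlap-add.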
The conventional structure of LTP buffer 19 (as prescribed by the applicable MPEG-4 standards) is shown in FIG. 1C. Frame t−1 is the most recent fully-reconstructed time domain signal formed by overlap-add of time domain signals in the synthesis filter bank (not shown) of the decoder. Frame t is the time domain signal output from IMDCT 18, and is the aliased time domain signal to be used for overlap-add in the next frame to be output by the synthesis filter bank. Frame t−2 is the fully-reconstructed frame from a previous time period. The dimension (or length) N of each frame is 1024 samples. The broken line block on the right side of the LTP buffer represents a frame of 1024 zero-amplitude samples. This all-zero block is not an actual part of LTP buffer 19. Instead, it is used to conceptually indicate the location of the zero lag point. Specifically, when the value for pitch lag is at its maximum, 2048 time domain samples are predicted based on the 2048 samples in frames t−1 and t−2. When the pitch lag is between the minimum and maximum (e.g., at the point indicated as lag L), the 2048 samples prior to the pitch lag location (i.e., to the right of point L in FIG. 1C) are used to predict 2048 samples. When pitch lag is less than 1024, zeros are used for “samples” 1023 and below from the LTP buffer. For example, when the pitch lag is at its minimum (zero lag), the 1024 samples in the t frame and 1024 zero amplitude samples are used to predict 2048 samples. Although the use of the all-zero amplitudes results in less accurate sound reproduction, less memory is needed for the LTP buffer. Because zero or very low lag values occur relatively infrequently, overall sound quality is not seriously affected.
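The relationship between pitch lag and the samples used for prediction can be sketched as follows. The sketch assumes the buffer is stored oldest frame first (t−2, t−1, t), that the conceptual zero block follows frame t, and that a single prediction gain scales the selected samples; the function names and the two-sample frames in the example are hypothetical.

```python
def ltp_prediction_window(buffer, lag, frame_len):
    """Return the 2 * frame_len samples selected by a given pitch lag.

    buffer: 3 * frame_len samples ordered as frames t-2, t-1, t.
    lag:    0 (minimum) through 2 * frame_len (maximum).
    """
    if len(buffer) != 3 * frame_len:
        raise ValueError("buffer must hold exactly three frames")
    if not 0 <= lag <= 2 * frame_len:
        raise ValueError("lag out of range")
    # The zero frame is appended only conceptually; it is never stored.
    extended = list(buffer) + [0.0] * frame_len
    start = 2 * frame_len - lag
    return extended[start:start + 2 * frame_len]


def ltp_predict(buffer, lag, gain, frame_len):
    """Scale the selected window by a single (assumed) prediction gain."""
    return [gain * s for s in ltp_prediction_window(buffer, lag, frame_len)]


# Frames of 2 samples instead of 1024: t-2 = [1, 2], t-1 = [3, 4], t = [5, 6].
buffer = [1, 2, 3, 4, 5, 6]
at_max_lag = ltp_prediction_window(buffer, lag=4, frame_len=2)
at_zero_lag = ltp_prediction_window(buffer, lag=0, frame_len=2)
at_low_lag = ltp_prediction_window(buffer, lag=1, frame_len=2)
```

At maximum lag the window covers frames t−2 and t−1 only; at zero lag it covers frame t followed by a full frame of zeros; and at any lag below the frame length it runs past the end of the stored data, so its tail is filled with zeros.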
A decoder such as that of FIG. 1B and the associated LTP buffer of FIG. 1C are often used in a mobile device such as a portable music player or mobile terminal. Such devices frequently have limited computational and memory resources. Adding memory and processing capacity is often expensive, thereby increasing the overall cost of the device. Because a decoder and buffer use significant amounts of those resources, there may be limited excess capacity to accommodate additional features. For example, it is often desirable for audio playback devices to have a fast forward capability. If the output rate of the audio decoder is increased, numerous decoding operations must be performed at an even higher rate. As another example, a device that is decoding and playing an audio stream may need to briefly perform some other task (e.g., respond to an incoming telephone call or other communication). Unless processing and memory capacity are increased, or unless the processing and memory needed for audio decoding and playback can be reduced, the device may be unable to simultaneously perform multiple tasks.