In an audio encoding system, an incoming time-domain audio signal is compressed so that the bitrate needed to represent the signal is significantly reduced. Ideally, the bitrate of the encoded signal either fits within the constraints of the transmission channel or minimizes the size of the encoded file. The former is typical of real-time communication and streaming services, whereas the latter is increasingly used when storing audio content locally or downloading it at high audio quality.
Typically, the audio encoder aims to minimize the perceptual distortion at any given bitrate. However, the lower the bitrate, the more challenging it is for the encoder to meet the target bitrate while keeping the perceived distortion at zero. Another encoding scenario is minimization of the encoded file size while keeping the perceptual distortion inaudible.
In both cases advanced encoding models and techniques need to be applied to maximize the end user experience. Typically it is the (encoding) performance with the worst-case signals (i.e., signals that are difficult to encode) that ultimately defines the overall performance of any encoding system. Another factor in defining the overall performance of any encoding system is the encoding speed and the resources needed in order for the given bitrate or audio quality level to be achieved. For commercial use, and especially for mobile use, encoding speed and memory requirements commonly play a significant role.
In an attempt to achieve lower bitrates without increasing the perceptual distortion, new audio coding methods should be explored and fully utilized. One such method, used extensively in state-of-the-art audio coding, is efficient coding of stereo signals. Perceptual audio encoders encode the input signal in the frequency domain, as human auditory properties can be best described in the frequency domain. The spectral samples are typically quantized on a frequency band basis, and the quantizer shapes the quantization noise by either increasing or decreasing the corresponding quantizer step size until the noise is just below the auditory masking threshold.
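By way of illustration only, the per-band noise shaping described above may be sketched as follows. The sketch assumes a uniform quantizer, a squared-error noise measure, and a masking threshold supplied as an energy value by a psychoacoustic model that is not shown. The function names, the step ratio of 1.25, and the iteration limit are hypothetical and are not taken from any particular encoder.

```python
def quantize_band(samples, step):
    """Uniform quantizer: snap each spectral sample to a multiple of step."""
    return [round(x / step) * step for x in samples]


def noise_energy(samples, step):
    """Squared-error energy introduced by quantizing the band with this step."""
    quantized = quantize_band(samples, step)
    return sum((x - q) ** 2 for x, q in zip(samples, quantized))


def fit_step_to_mask(samples, mask_energy, step=1.0, ratio=1.25, max_iter=64):
    """Search for a coarse step whose noise stays below mask_energy.

    Assumption: mask_energy comes from a psychoacoustic model (not shown).
    """
    # Decrease the step until the noise drops below the masking threshold.
    for _ in range(max_iter):
        if noise_energy(samples, step) < mask_energy:
            break
        step /= ratio
    # Increase the step for as long as the next coarser step is still masked.
    for _ in range(max_iter):
        if noise_energy(samples, step * ratio) >= mask_energy:
            break
        step *= ratio
    return step


band = [0.9, -2.3, 4.1, 0.2]   # hypothetical spectral samples of one band
mask = 0.05                    # hypothetical masking threshold (energy)
chosen_step = fit_step_to_mask(band, mask)
```

A coarser step generally costs fewer bits, which is why the search in this sketch settles on a coarse step that still keeps the noise under the threshold.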
On one hand, the introduced perceptual distortion is inaudible to the human ear. On the other hand, this limits the lowest possible bitrate. It is known from literature that coding of stereo signals can be best described and implemented by means of Mid-Side (M/S) and Intensity Stereo (IS) coding. In M/S stereo coding, the left and right (L/R) input channels are transformed into sum and difference signals. (See J. D. Johnston and A. J. Ferreira, “Sum-difference stereo transform coding”, ICASSP-92 Conference Record, 1992, pp. 569-572 (hereinafter “Johnston”), the contents of which are hereby incorporated herein by reference in their entirety). In particular, the mid channel is the average of the left and right channels, while the side channel is the difference between the two channels divided by two. The channel combination (i.e., L/R vs. M/S) requiring the lowest number of bits to achieve zero perceived distortion is then selected. For maximum coding efficiency, this transformation is done in both a frequency-dependent and a time-dependent manner. M/S stereo coding is especially useful for high quality, high bitrate stereophonic coding.
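The M/S transform and the per-band channel selection described above may be illustrated by the following minimal sketch. The transform follows the definitions given in the text (mid is the average, side is half the difference). The bit estimate is a crude, hypothetical stand-in for a real entropy coder, and using the same quantizer step for both channel pairs is a simplifying assumption rather than the selection rule of any particular encoder.

```python
import math


def ms_transform(left, right):
    """Mid is the channel average; side is half the channel difference."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_inverse(mid, side):
    """Recover L/R: left = mid + side, right = mid - side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


def estimate_bits(band, step):
    """Hypothetical cost proxy: log2 of the index magnitude plus a sign bit."""
    bits = 0.0
    for x in band:
        index = abs(round(x / step))
        bits += math.log2(1 + index) + (1 if index else 0)
    return bits


def choose_stereo_mode(left, right, step):
    """Pick L/R or M/S for one band and one frame by the estimated bit cost."""
    mid, side = ms_transform(left, right)
    lr_bits = estimate_bits(left, step) + estimate_bits(right, step)
    ms_bits = estimate_bits(mid, step) + estimate_bits(side, step)
    if ms_bits < lr_bits:
        return "MS", mid, side
    return "LR", left, right


# Nearly identical channels: the side signal is close to zero.
mode_centered, _, _ = choose_stereo_mode(
    [4.0, -3.0, 2.0, 1.0], [4.1, -2.9, 2.0, 0.9], step=0.5)
# Content present in the left channel only.
mode_panned, _, _ = choose_stereo_mode(
    [4.0, -3.0, 2.0, 1.0], [0.0, 0.0, 0.0, 0.0], step=0.5)
```

Calling `choose_stereo_mode` separately for each frequency band of each frame corresponds to the frequency-dependent and time-dependent selection mentioned above.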
In the attempt to achieve lower stereo bitrates, IS stereo coding has typically been used in combination with M/S coding. In IS coding, a portion of the spectra is coded only in mono mode and the stereo image is reconstructed by transmitting different scaling factors for the left and right channels. (See U.S. Pat. No. 5,539,829, entitled “Subband coded digital transmission system using some composite signal” to U.S. Philips Corporation, issued July 1996 (hereinafter “the '829 patent”) and U.S. Pat. No. 5,606,618, entitled “Subband coded digital transmission system using some composite signals” to U.S. Philips Corporation, issued February 1997 (hereinafter “the '618 patent”), the contents of each of which are hereby incorporated herein by reference in their entirety). However, it is well known that IS stereo performs poorly at low frequencies, thus limiting the usable bitrate range.
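The principle of IS coding for a single frequency band may be sketched as follows. The text above specifies only that a mono signal and separate left and right scaling factors are transmitted; forming the mono signal as the channel average and deriving the scale factors from band energies are assumptions made here for illustration, and the function names are hypothetical.

```python
import math


def is_encode(left, right):
    """Encode one band as a mono carrier plus one scale factor per channel."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    mono_energy = sum(x * x for x in mono)
    if mono_energy == 0.0:
        # Channels cancel (or are silent): nothing to scale in this sketch.
        return mono, 0.0, 0.0
    # Assumption: scale factors match each channel's energy in the band.
    scale_left = math.sqrt(sum(x * x for x in left) / mono_energy)
    scale_right = math.sqrt(sum(x * x for x in right) / mono_energy)
    return mono, scale_left, scale_right


def is_decode(mono, scale_left, scale_right):
    """Rebuild both channels as scaled copies of the mono carrier."""
    left = [scale_left * x for x in mono]
    right = [scale_right * x for x in mono]
    return left, right


# Hypothetical band in which the right channel is a quieter copy of the left.
mono, g_left, g_right = is_encode([2.0, -4.0, 6.0], [1.0, -2.0, 3.0])
decoded_left, decoded_right = is_decode(mono, g_left, g_right)
```

Because both decoded channels are scaled copies of the same carrier, only level differences between the channels survive in this sketch; any waveform or phase difference within the band is lost.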
At low bitrates (e.g., below 1.5 bps), the use of M/S stereo coding is typically not able to preserve the full spatial image due to a shortage of available bits. Spectral leakage, also known as cross talk, from one channel to the other often occurs. This kind of degradation has a significant impact on output quality. The degradation is especially disturbing when the spatial image is not equally distributed between the left and right channels.
A need therefore exists for improving encoding across a range of bitrates.