The present invention relates to audio coding and particularly to high quality and low bitrate coding such as known from the so-called USAC coding (USAC=Unified Speech and Audio Coding).
The USAC coder is defined in ISO/IEC CD 23003-3. This standard, named “Information technology—MPEG audio technologies—Part 3: Unified speech and audio coding”, describes in detail the functional blocks of the reference model resulting from the call for proposals on unified speech and audio coding.
FIGS. 10a and 10b illustrate encoder and decoder block diagrams. The block diagrams of the USAC encoder and decoder reflect the structure of MPEG-D USAC coding. The general structure can be described as follows: First, there is a common pre/post-processing stage consisting of an MPEG Surround (MPEGS) functional unit, which handles stereo or multi-channel processing, and an enhanced SBR (eSBR) unit, which handles the parametric representation of the higher audio frequencies in the input signal. Then there are two branches, one consisting of a modified Advanced Audio Coding (AAC) tool path and the other consisting of a linear prediction coding (LP or LPC domain) based path, which in turn features either a frequency domain representation or a time domain representation of the LPC residual. All transmitted spectra, for both AAC and LPC, are represented in the MDCT domain following quantization and arithmetic coding. The time domain representation uses an ACELP excitation coding scheme.
The basic structure of the MPEG-D USAC is shown in FIG. 10a and FIG. 10b. The data flow in this diagram is from left to right, top to bottom. The functions of the decoder are to find the description of the quantized audio spectra or time domain representation in the bitstream payload and decode the quantized values and other reconstruction information.
In case of transmitted spectral information the decoder shall reconstruct the quantized spectra, process the reconstructed spectra through whatever tools are active in the bitstream payload in order to arrive at the actual signal spectra as described by the input bitstream payload, and finally convert the frequency domain spectra to the time domain. Following the initial reconstruction and scaling of the spectrum, there are optional tools that modify one or more of the spectra in order to provide more efficient coding.
In case of transmitted time domain signal representation, the decoder shall reconstruct the quantized time signal, process the reconstructed time signal through whatever tools are active in the bitstream payload in order to arrive at the actual time domain signal as described by the input bitstream payload.
For each of the optional tools that operate on the signal data, the option to “pass through” is retained, and in all cases where the processing is omitted, the spectra or time samples at its input are passed directly through the tool without modification.
In places where the bitstream changes its signal representation from time domain to frequency domain representation or from LP domain to non-LP domain or vice versa, the decoder shall facilitate the transition from one domain to the other by means of an appropriate transition overlap-add windowing.
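The transition handling described above can be sketched as a simple cross-fade between the outgoing and incoming coding paths. This is an illustrative sketch, not the normative USAC transition windows: the helper name and the raised-cosine window shape are assumptions.

```python
import numpy as np

def transition_overlap_add(tail, head):
    """Amplitude-complementary cross-fade between the last samples of the
    outgoing coding path (tail) and the first samples of the incoming
    path (head). Illustrative window shape: fade_in + fade_out == 1, so a
    signal that is identical in both domains passes through unchanged."""
    overlap = len(tail)
    n = np.arange(overlap)
    fade_in = np.sin(np.pi * (n + 0.5) / (2 * overlap)) ** 2
    fade_out = 1.0 - fade_in
    return tail * fade_out + head * fade_in
```

Because the two windows sum to one at every sample, a constant signal survives the transition exactly, which is the basic requirement for a smooth domain switch.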
eSBR and MPEGS processing is applied in the same manner to both coding paths after transition handling.
The input to the bitstream payload demultiplexer tool is the MPEG-D USAC bitstream payload. The demultiplexer separates the bitstream payload into the parts for each tool, and provides each of the tools with the bitstream payload information related to that tool.
The outputs from the bitstream payload demultiplexer tool are:
- Depending on the core coding type in the current frame, either:
  - the quantized and noiselessly coded spectra, represented by
    - scale factor information
    - arithmetically coded spectral lines
  - or: linear prediction (LP) parameters together with an excitation signal, represented by either:
    - quantized and arithmetically coded spectral lines (transform coded excitation, TCX), or
    - ACELP coded time domain excitation
- The spectral noise filling information (optional)
- The M/S decision information (optional)
- The temporal noise shaping (TNS) information (optional)
- The filterbank control information
- The time unwarping (TW) control information (optional)
- The enhanced spectral bandwidth replication (eSBR) control information (optional)
- The MPEG Surround (MPEGS) control information
The scale factor noiseless decoding tool takes information from the bitstream payload demultiplexer, parses that information, and decodes the Huffman and DPCM coded scale factors.
The input to the scale factor noiseless decoding tool is:
- The scale factor information for the noiselessly coded spectra

The output of the scale factor noiseless decoding tool is:
- The decoded integer representation of the scale factors
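The DPCM stage of this decoding can be sketched as follows. The Huffman stage is omitted, and the convention that the first delta is taken relative to the global gain is an assumption for illustration:

```python
def decode_scalefactors(global_gain, sf_deltas):
    """Undo the DPCM stage of scale factor coding: each Huffman-decoded
    symbol is a delta relative to the previous scale factor, with the
    first delta taken relative to the global gain (illustrative
    convention; Huffman decoding itself is not shown)."""
    scalefactors = []
    prev = global_gain
    for delta in sf_deltas:
        prev += delta
        scalefactors.append(prev)
    return scalefactors
```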
The spectral noiseless decoding tool takes information from the bitstream payload demultiplexer, parses that information, decodes the arithmetically coded data, and reconstructs the quantized spectra. The input to this noiseless decoding tool is:
- The noiselessly coded spectra

The output of this noiseless decoding tool is:
- The quantized values of the spectra
The inverse quantizer tool takes the quantized values for the spectra, and converts the integer values to the non-scaled, reconstructed spectra. This quantizer is a companding quantizer, whose companding factor depends on the chosen core coding mode.
The input to the inverse quantizer tool is:
- The quantized values for the spectra

The output of the inverse quantizer tool is:
- The un-scaled, inversely quantized spectra
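A minimal sketch of such a companding expansion, assuming the 4/3 power law familiar from the AAC-style path; the `alpha` parameter stands in for the mode-dependent companding factor mentioned above:

```python
def inverse_quantize(q_values, alpha=4.0 / 3.0):
    """Expand integer quantizer indices to un-scaled spectral values via
    a sign-preserving power law. The 4/3 exponent follows the AAC-style
    convention; other core coding modes may use a different companding
    factor, which is why it is a parameter here."""
    return [(-1.0 if q < 0 else 1.0) * abs(q) ** alpha for q in q_values]
```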
The noise filling tool is used to fill spectral gaps in the decoded spectra, which occur when spectral values are quantized to zero, e.g. due to a strong restriction on bit demand in the encoder. The use of the noise filling tool is optional.
The inputs to the noise filling tool are:
- The un-scaled, inversely quantized spectra
- Noise filling parameters
- The decoded integer representation of the scale factors

The outputs of the noise filling tool are:
- The un-scaled, inversely quantized spectral values for spectral lines which were previously quantized to zero
- Modified integer representation of the scale factors
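The gap-filling step can be sketched as follows. The uniform noise source and the single `noise_level` gain are illustrative simplifications of the transmitted noise filling parameters:

```python
import random

def noise_fill(spectrum, noise_level, seed=0):
    """Replace spectral lines that were quantized to zero with low-level
    random noise, leaving nonzero lines untouched. Illustrative sketch:
    the standard derives the noise level and affected lines from
    transmitted noise filling parameters rather than a single gain."""
    rng = random.Random(seed)
    return [rng.uniform(-noise_level, noise_level) if x == 0 else x
            for x in spectrum]
```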
The rescaling tool converts the integer representation of the scale factors to the actual values, and multiplies the un-scaled inversely quantized spectra by the relevant scale factors.
The inputs to the scale factors tool are:
- The decoded integer representation of the scale factors
- The un-scaled, inversely quantized spectra

The output from the scale factors tool is:
- The scaled, inversely quantized spectra
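A sketch of the rescaling, assuming the AAC-style convention of a gain step of 2^(1/4) per scale factor unit; the offset value is illustrative:

```python
def rescale(spectrum, scalefactor, sf_offset=100):
    """Convert an integer scale factor to a linear gain and apply it to
    the un-scaled spectrum. Each scale factor step corresponds to a
    factor of 2**0.25 (AAC-style convention); sf_offset is an assumed
    illustrative reference point."""
    gain = 2.0 ** (0.25 * (scalefactor - sf_offset))
    return [x * gain for x in spectrum]
```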
For an overview over the M/S tool, please refer to ISO/IEC 14496-3:2009, 4.1.1.2.
For an overview over the temporal noise shaping (TNS) tool, please refer to ISO/IEC 14496-3:2009, 4.1.1.2.
The filterbank/block switching tool applies the inverse of the frequency mapping that was carried out in the encoder. An inverse modified discrete cosine transform (IMDCT) is used for the filterbank tool. The IMDCT can be configured to support 120, 128, 240, 256, 480, 512, 960 or 1024 spectral coefficients.
The inputs to the filterbank tool are:
- The (inversely quantized) spectra
- The filterbank control information

The output(s) from the filterbank tool is (are):
- The time domain reconstructed audio signal(s)
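A naive O(N^2) IMDCT can be sketched as follows. The normalization is one of several conventions, and the subsequent windowing and overlap-add with the previous frame are omitted:

```python
import numpy as np

def imdct(X):
    """Naive inverse MDCT: N spectral coefficients yield 2N time
    samples. The characteristic phase term (n + 0.5 + N/2) produces the
    time-domain aliasing symmetries that overlap-add later cancels.
    Normalization (2/N) is illustrative; real decoders use a fast
    FFT-based implementation."""
    N = len(X)
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)[None, :]
    return (2.0 / N) * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) @ X
```

The output exhibits the well-known aliasing structure: the first half is antisymmetric about its center and the second half is symmetric, which is what allows perfect reconstruction after windowed overlap-add of consecutive frames.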
The time-warped filterbank/block switching tool replaces the normal filterbank/block switching tool when the time warping mode is enabled. The filterbank is the same (IMDCT) as for the normal filterbank; additionally, the windowed time domain samples are mapped from the warped time domain to the linear time domain by time-varying resampling.
The inputs to the time-warped filterbank tool are:
- The inversely quantized spectra
- The filterbank control information
- The time-warping control information

The output(s) from the filterbank tool is (are):
- The linear time domain reconstructed audio signal(s)
The enhanced SBR (eSBR) tool regenerates the highband of the audio signal. It is based on replication of the harmonic sequences that were truncated during encoding. It adjusts the spectral envelope of the generated highband, applies inverse filtering, and adds noise and sinusoidal components in order to recreate the spectral characteristics of the original signal.
The inputs to the eSBR tool are:
- The quantized envelope data
- Misc. control data
- a time domain signal from the frequency domain core decoder or the ACELP/TCX core decoder

The output of the eSBR tool is either:
- a time domain signal, or
- a QMF-domain representation of a signal, e.g. if the MPEG Surround tool is used
The MPEG Surround (MPEGS) tool produces multiple signals from one or more input signals by applying a sophisticated upmix procedure to the input signal(s) controlled by appropriate spatial parameters. In the USAC context MPEGS is used for coding a multi-channel signal, by transmitting parametric side information alongside a transmitted downmixed signal.
The input to the MPEGS tool is:
- a downmixed time domain signal, or
- a QMF-domain representation of a downmixed signal from the eSBR tool

The output of the MPEGS tool is:
- a multi-channel time domain signal
The Signal Classifier tool analyses the original input signal and generates from it control information which triggers the selection of the different coding modes. The analysis of the input signal is implementation dependent and will try to choose the optimal core coding mode for a given input signal frame. The output of the signal classifier can (optionally) also be used to influence the behavior of other tools, for example MPEG Surround, enhanced SBR, time-warped filterbank and others.
The inputs to the Signal Classifier tool are:
- the original unmodified input signal
- additional implementation dependent parameters

The output of the Signal Classifier tool is:
- a control signal to control the selection of the core codec (non-LP filtered frequency domain coding, LP filtered frequency domain coding, or LP filtered time domain coding)
The ACELP tool provides a way to efficiently represent a time domain excitation signal by combining a long term predictor (adaptive codeword) with a pulse-like sequence (innovation codeword). The reconstructed excitation is sent through an LP synthesis filter to form a time domain signal.
The inputs to the ACELP tool are:
- adaptive and innovation codebook indices
- adaptive and innovation codebook gain values
- other control data
- inversely quantized and interpolated LPC filter coefficients

The output of the ACELP tool is:
- The time domain reconstructed audio signal
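The excitation reconstruction and LP synthesis filtering described above can be sketched as follows; the names and the filter convention (A(z) with leading coefficient 1) are illustrative:

```python
import numpy as np

def acelp_synthesis(adaptive_cw, innovation_cw, g_pitch, g_code, lpc_a):
    """Form the excitation as the gain-scaled sum of the adaptive
    (long-term predictor) and innovation (pulse-like) codewords, then
    pass it through the all-pole LP synthesis filter 1/A(z), where
    A(z) = 1 + a_1*z^-1 + ... + a_p*z^-p (lpc_a[0] == 1)."""
    exc = (g_pitch * np.asarray(adaptive_cw, float)
           + g_code * np.asarray(innovation_cw, float))
    out = np.zeros(len(exc))
    for n in range(len(exc)):
        acc = exc[n]
        for i in range(1, len(lpc_a)):
            if n - i >= 0:
                acc -= lpc_a[i] * out[n - i]
        out[n] = acc
    return out
```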
The MDCT based TCX decoding tool is used to turn the weighted LP residual representation from the MDCT domain back into the time domain; it outputs a time domain signal that includes weighted LP synthesis filtering. The IMDCT can be configured to support 256, 512, or 1024 spectral coefficients.
The inputs to the TCX tool are:
- The (inversely quantized) MDCT spectra
- inversely quantized and interpolated LPC filter coefficients

The output of the TCX tool is:
- The time domain reconstructed audio signal
The technology disclosed in ISO/IEC CD 23003-3, which is incorporated herein by reference, allows the definition of channel elements, which are, for example, single channel elements containing payload for only a single channel, channel pair elements comprising payload for two channels, or LFE (Low-Frequency Enhancement) channel elements comprising payload for an LFE channel.
A five-channel multi-channel audio signal can, for example, be represented by a single channel element comprising the center channel, a first channel pair element comprising the left channel and the right channel, and a second channel pair element comprising the left surround channel (Ls) and the right surround channel (Rs). These different channel elements, which together represent the multi-channel audio signal, are fed into a decoder and are processed using the same decoder configuration. In accordance with conventional technology, the decoder configuration sent in the USAC specific config element was applied by the decoder to all channel elements. Therefore, those elements of the configuration that are valid for all channel elements could not be selected for an individual channel element in an optimum way, but had to be set for all channel elements simultaneously. On the other hand, it has been found that the channel elements describing a straightforward five-channel multi-channel signal are very different from each other. The center channel, being the single channel element, has significantly different characteristics from the channel pair elements describing the left/right channels and the left surround/right surround channels. Additionally, the characteristics of the two channel pair elements also differ significantly, because the surround channels comprise information that is substantially different from the information in the left and right channels.
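The element layout just described can be pictured with a small illustrative structure; the SCE/CPE labels follow the single channel element / channel pair element terminology used here, while the class itself is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ChannelElement:
    """Hypothetical sketch of a channel element: its type (SCE = single
    channel element, CPE = channel pair element, LFE = LFE channel
    element) and the channel labels its payload carries."""
    elem_type: str
    channels: tuple

# A five-channel signal as one SCE (center) plus two CPEs
# (left/right and left surround/right surround):
five_channel = [
    ChannelElement("SCE", ("C",)),
    ChannelElement("CPE", ("L", "R")),
    ChannelElement("CPE", ("Ls", "Rs")),
]
```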
Selecting configuration data for all channel elements together necessitated compromises: either a configuration had to be chosen that is non-optimum for every channel element but represents a compromise between them, or the configuration was chosen to be optimum for one channel element, which inevitably made it non-optimum for the other channel elements. This, however, results in an increased bitrate for the channel elements having the non-optimum configuration, or alternatively or additionally in a reduced audio quality for those channel elements which do not have the optimum configuration settings.