The present invention relates to audio coding, such as the so-called USAC codec (USAC=Unified Speech and Audio Coding) and, in particular, to the transmission of frame element length information.
In recent years, several audio codecs have been made available, each audio codec being specifically designed to fit to a dedicated application. Mostly, these audio codecs are able to code more than one audio channel or audio signal in parallel. Some audio codecs are even suitable for differently coding audio content by differently grouping audio channels or audio objects of the audio content and subjecting these groups to different audio coding principles. Even further, some of these audio codecs allow for the insertion of extension data into the bitstream so as to accommodate for future extensions/developments of the audio codec.
One example of such audio codecs is the USAC codec as defined in ISO/IEC CD 23003-3. This standard, named “Information Technology—MPEG Audio Technologies—Part 3: Unified Speech and Audio Coding”, describes in detail the functional blocks of a reference model of a call for proposals on unified speech and audio coding.
FIGS. 5a and 5b illustrate encoder and decoder block diagrams. In the following, the general functionality of the individual blocks is briefly explained. Thereupon, the problem of putting all of the resulting syntax portions together into a bitstream is explained with respect to FIG. 6.
The block diagrams of the USAC encoder and decoder reflect the structure of MPEG-D USAC coding. The general structure can be described as follows: First, there is a common pre/post-processing stage consisting of an MPEG Surround (MPEGS) functional unit, which handles stereo or multi-channel processing, and an enhanced SBR (eSBR) unit, which handles the parametric representation of the higher audio frequencies in the input signal. Then there are two branches, one consisting of a modified Advanced Audio Coding (AAC) tool path and the other consisting of a linear prediction coding (LP or LPC domain) based path, which in turn features either a frequency domain representation or a time domain representation of the LPC residual. All transmitted spectra for both AAC and LPC are represented in the MDCT domain following quantization and arithmetic coding. The time domain representation uses an ACELP excitation coding scheme.
The basic structure of the MPEG-D USAC is shown in FIG. 5a and FIG. 5b. The data flow in this diagram is from left to right, top to bottom. The functions of the decoder are to find the description of the quantized audio spectra or time domain representation in the bitstream payload and decode the quantized values and other reconstruction information.
In case of transmitted spectral information, the decoder shall reconstruct the quantized spectra, process the reconstructed spectra through whatever tools are active in the bitstream payload in order to arrive at the actual signal spectra as described by the input bitstream payload, and finally convert the frequency domain spectra to the time domain. Following the initial reconstruction and scaling of the spectrum, there are optional tools that modify one or more of the spectra in order to provide more efficient coding.
In case of a transmitted time domain signal representation, the decoder shall reconstruct the quantized time signal and process the reconstructed time signal through whatever tools are active in the bitstream payload in order to arrive at the actual time domain signal as described by the input bitstream payload.
For each of the optional tools that operate on the signal data, the option to “pass through” is retained, and in all cases where the processing is omitted, the spectra or time samples at its input are passed directly through the tool without modification.
In places where the bitstream changes its signal representation from time domain to frequency domain representation or from LP domain to non-LP domain or vice versa, the decoder shall facilitate the transition from one domain to the other by means of an appropriate transition overlap-add windowing.
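The idea of a transition overlap-add can be illustrated with a simple windowed crossfade. This is only a minimal sketch under stated assumptions: the function name `crossfade`, the squared-sine weighting and the equal-length overlap regions are hypothetical choices for illustration and are not taken from the USAC specification, which defines its own transition windows.

```python
import math

def crossfade(prev_tail, next_head):
    """Overlap-add the tail of the previous frame with the head of the next.

    Assumption: both overlap regions have the same length and the two
    weights (1 - w) and w sum to one at every sample.
    """
    n = len(prev_tail)
    if n != len(next_head):
        raise ValueError("overlap regions must have equal length")
    out = []
    for i in range(n):
        # Weight rises smoothly from close to 0 to close to 1 over the overlap.
        w = math.sin(math.pi * (i + 0.5) / (2 * n)) ** 2
        out.append((1.0 - w) * prev_tail[i] + w * next_head[i])
    return out
```

Because the two weights sum to one, a signal that is identical in both domains passes through the transition unchanged.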
eSBR and MPEGS processing is applied in the same manner to both coding paths after transition handling.
The input to the bitstream payload demultiplexer tool is the MPEG-D USAC bitstream payload. The demultiplexer separates the bitstream payload into the parts for each tool, and provides each of the tools with the bitstream payload information related to that tool.
The outputs from the bitstream payload demultiplexer tool are:
Depending on the core coding type in the current frame either:
    the quantized and noiselessly coded spectra represented by
        scale factor information
        arithmetically coded spectral lines
    or: linear prediction (LP) parameters together with an excitation signal represented by either:
        quantized and arithmetically coded spectral lines (transform coded excitation, TCX) or
        ACELP coded time domain excitation
The spectral noise filling information (optional)
The M/S decision information (optional)
The temporal noise shaping (TNS) information (optional)
The filterbank control information
The time unwarping (TW) control information (optional)
The enhanced spectral bandwidth replication (eSBR) control information (optional)
The MPEG Surround (MPEGS) control information
The scale factor noiseless decoding tool takes information from the bitstream payload demultiplexer, parses that information, and decodes the Huffman and DPCM coded scale factors.
The input to the scale factor noiseless decoding tool is:
The scale factor information for the noiselessly coded spectra
The output of the scale factor noiseless decoding tool is:
The decoded integer representation of the scale factors
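The DPCM part of this step can be sketched as follows. This is a minimal sketch, assuming the Huffman stage has already produced a list of signed differences and that each scale factor is coded relative to its predecessor, starting from a global starting value. The function name `decode_dpcm_scale_factors` and the parameter `start_value` are hypothetical and not taken from the standard.

```python
def decode_dpcm_scale_factors(start_value, deltas):
    """Rebuild integer scale factors from DPCM differences.

    Assumption: the first difference is applied to `start_value`, and
    every later difference is applied to the previous scale factor.
    """
    scale_factors = []
    current = start_value
    for delta in deltas:
        current += delta          # undo the differential coding
        scale_factors.append(current)
    return scale_factors
```

For example, a starting value of 100 and the differences 0, 2 and -3 would yield the scale factors 100, 102 and 99.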
The spectral noiseless decoding tool takes information from the bitstream payload demultiplexer, parses that information, decodes the arithmetically coded data, and reconstructs the quantized spectra. The input to this noiseless decoding tool is:
The noiselessly coded spectra
The output of this noiseless decoding tool is:
The quantized values of the spectra
The inverse quantizer tool takes the quantized values for the spectra, and converts the integer values to the non-scaled, reconstructed spectra. This quantizer is a companding quantizer, whose companding factor depends on the chosen core coding mode.
The input to the Inverse Quantizer tool is:
The quantized values for the spectra
The output of the inverse quantizer tool is:
The un-scaled, inversely quantized spectra
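A companding inverse quantizer can be sketched as a sign-preserving power law. This is a minimal sketch, assuming an AAC-style exponent of 4/3; since the companding factor depends on the chosen core coding mode, the exponent is exposed as a parameter. The function name `inverse_quantize` and the default value are illustrative assumptions.

```python
import math

def inverse_quantize(quantized, exponent=4.0 / 3.0):
    """Expand integer spectral values into un-scaled spectral values.

    Assumption: the expansion is sign(q) * |q| ** exponent, with an
    AAC-style default exponent of 4/3.
    """
    return [math.copysign(abs(q) ** exponent, q) for q in quantized]
```

Small integer values are left almost unchanged, while large values are expanded progressively, which is the inverse of the compression applied before quantization in the encoder.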
The noise filling tool is used to fill spectral gaps in the decoded spectra, which occur when spectral values are quantized to zero, e.g. due to a strong restriction on bit demand in the encoder. The use of the noise filling tool is optional.
The inputs to the noise filling tool are:
The un-scaled, inversely quantized spectra
Noise filling parameters
The decoded integer representation of the scale factors
The outputs of the noise filling tool are:
The un-scaled, inversely quantized spectral values for spectral lines which were previously quantized to zero
Modified integer representation of the scale factors
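The basic principle of noise filling can be sketched as replacing zero-valued spectral lines with low-level noise. This is a minimal sketch under stated assumptions: the function name `fill_noise`, the single `noise_level` parameter, the random-sign noise and the fixed seed are all hypothetical. The sketch also omits the modification of the scale factors listed among the tool's outputs.

```python
import random

def fill_noise(spectrum, noise_level, start_line=0, seed=0):
    """Replace spectral lines quantized to zero with random-sign noise.

    Assumptions: one noise level for all lines, noise of constant
    magnitude with random sign, and filling only from `start_line` on.
    """
    rng = random.Random(seed)     # fixed seed keeps the sketch reproducible
    filled = []
    for index, value in enumerate(spectrum):
        if index >= start_line and value == 0.0:
            filled.append(rng.choice((-1.0, 1.0)) * noise_level)
        else:
            filled.append(value)  # non-zero lines pass through unchanged
    return filled
```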
The rescaling tool converts the integer representation of the scale factors to the actual values, and multiplies the un-scaled inversely quantized spectra by the relevant scale factors.
The inputs to the scale factors tool are:
The decoded integer representation of the scale factors
The un-scaled, inversely quantized spectra
The output from the scale factors tool is:
The scaled, inversely quantized spectra
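The rescaling step can be sketched as follows. This is a minimal sketch, assuming an AAC-style mapping from an integer scale factor `sf` to a gain of 2 ** (0.25 * (sf - 100)) and one scale factor per band. The function name `rescale`, the `band_offsets` layout and the offset of 100 are assumptions made for illustration.

```python
def rescale(spectrum, scale_factors, band_offsets, sf_offset=100):
    """Multiply each scale factor band by the gain of its scale factor.

    Assumptions: band b covers lines band_offsets[b] up to (but not
    including) band_offsets[b + 1], and the gain is
    2 ** (0.25 * (sf - sf_offset)).
    """
    scaled = list(spectrum)
    for band, sf in enumerate(scale_factors):
        gain = 2.0 ** (0.25 * (sf - sf_offset))
        for line in range(band_offsets[band], band_offsets[band + 1]):
            scaled[line] = spectrum[line] * gain
    return scaled
```

Under this mapping, increasing a scale factor by four doubles the amplitude of every spectral line in the corresponding band.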
For an overview of the M/S tool, please refer to ISO/IEC 14496-3:2009, 4.1.1.2.
For an overview of the temporal noise shaping (TNS) tool, please refer to ISO/IEC 14496-3:2009, 4.1.1.2.
The filterbank/block switching tool applies the inverse of the frequency mapping that was carried out in the encoder. An inverse modified discrete cosine transform (IMDCT) is used for the filterbank tool. The IMDCT can be configured to support 120, 128, 240, 256, 480, 512, 960 or 1024 spectral coefficients.
The inputs to the filterbank tool are:
The (inversely quantized) spectra
The filterbank control information
The output(s) from the filterbank tool is (are):
The time domain reconstructed audio signal(s).
The time-warped filterbank/block switching tool replaces the normal filterbank/block switching tool when the time warping mode is enabled. The filterbank is the same (IMDCT) as the normal filterbank. Additionally, the windowed time domain samples are mapped from the warped time domain to the linear time domain by time-varying resampling.
The inputs to the time-warped filterbank tools are:
The inversely quantized spectra
The filterbank control information
The time-warping control information
The output(s) from the filterbank tool is (are):
The linear time domain reconstructed audio signal(s).
The enhanced SBR (eSBR) tool regenerates the highband of the audio signal. It is based on replication of the sequences of harmonics, truncated during encoding. It adjusts the spectral envelope of the generated highband and applies inverse filtering, and adds noise and sinusoidal components in order to recreate the spectral characteristics of the original signal.
The input to the eSBR tool is:
The quantized envelope data
Misc. control data
a time domain signal from the frequency domain core decoder or the ACELP/TCX core decoder
The output of the eSBR tool is either:
a time domain signal or
a QMF-domain representation of a signal, e.g. if the MPEG Surround tool is used.
The MPEG Surround (MPEGS) tool produces multiple signals from one or more input signals by applying a sophisticated upmix procedure to the input signal(s) controlled by appropriate spatial parameters. In the USAC context MPEGS is used for coding a multi-channel signal, by transmitting parametric side information alongside a transmitted downmixed signal.
The input to the MPEGS tool is:
a downmixed time domain signal or
a QMF-domain representation of a downmixed signal from the eSBR tool
The output of the MPEGS tool is:
a multi-channel time domain signal
The Signal Classifier tool analyses the original input signal and generates from it control information which triggers the selection of the different coding modes. The analysis of the input signal is implementation dependent and will try to choose the optimal core coding mode for a given input signal frame. The output of the signal classifier can (optionally) also be used to influence the behavior of other tools, for example MPEG Surround, enhanced SBR, time-warped filterbank and others.
The input to the Signal Classifier tool is:
the original unmodified input signal
additional implementation dependent parameters
The output of the Signal Classifier tool is:
a control signal to control the selection of the core codec (non-LP filtered frequency domain coding, LP filtered frequency domain or LP filtered time domain coding)
The ACELP tool provides a way to efficiently represent a time domain excitation signal by combining a long term predictor (adaptive codeword) with a pulse-like sequence (innovation codeword). The reconstructed excitation is sent through an LP synthesis filter to form a time domain signal.
The input to the ACELP tool is:
adaptive and innovation codebook indices
adaptive and innovation codes gain values
other control data
inversely quantized and interpolated LPC filter coefficients
The output of the ACELP tool is:
The time domain reconstructed audio signal
The MDCT based TCX decoding tool is used to turn the weighted LP residual representation from an MDCT-domain back into a time domain signal and outputs a time domain signal including weighted LP synthesis filtering. The IMDCT can be configured to support 256, 512, or 1024 spectral coefficients.
The input to the TCX tool is:
The (inversely quantized) MDCT spectra
inversely quantized and interpolated LPC filter coefficients
The output of the TCX tool is:
The time domain reconstructed audio signal
The technology disclosed in ISO/IEC CD 23003-3, which is incorporated herein by reference, allows the definition of channel elements which are, for example, single channel elements only containing payload for a single channel or channel pair elements comprising payload for two channels or LFE (Low-Frequency Enhancement) channel elements comprising payload for an LFE channel.
Naturally, the USAC codec is not the only codec which is able to code and transfer information on more complicated audio content of more than one or two audio channels or audio objects via one bitstream. Accordingly, the USAC codec merely serves as a concrete example.
FIG. 6 shows a more general example of an encoder and decoder, respectively, both depicted in one common scenario where the encoder encodes audio content 10 into a bitstream 12, with the decoder decoding the audio content, or at least a portion thereof, from the bitstream 12. The result of the decoding, i.e. the reconstruction, is indicated at 14. As illustrated in FIG. 6, the audio content 10 may be composed of a number of audio signals 16. For example, the audio content 10 may be a spatial audio scene composed of a number of audio channels 16. Alternatively, the audio content 10 may represent a conglomeration of audio signals 16 with the audio signals 16 representing, individually and/or in groups, individual audio objects which may be put together into an audio scene at the discretion of a decoder's user so as to obtain the reconstruction 14 of the audio content 10 in the form of, for example, a spatial audio scene for a specific loudspeaker configuration. The encoder encodes the audio content 10 in units of consecutive time periods. Such a time period is exemplarily shown at 18 in FIG. 6. The encoder encodes the consecutive periods 18 of the audio content 10 in the same manner: that is, the encoder inserts into the bitstream 12 one frame 20 per time period 18. In doing so, the encoder decomposes the audio content within the respective time period 18 into frame elements, the number and the meaning/type of which are the same for each time period 18 and frame 20, respectively. With respect to the USAC codec outlined above, for example, the encoder encodes the same pair of audio signals 16 in every time period 18 into a channel pair element of the elements 22 of the frames 20, while using another coding principle, such as single channel encoding, for another audio signal 16 so as to obtain a single channel element 22, and so forth.
Parametric side information for obtaining an upmix of audio signals out of a downmix audio signal as defined by one or more frame elements 22 is collected to form another frame element within frame 20. In that case, the frame element conveying this side information relates to, or forms a kind of extension data for, other frame elements. Naturally, such extensions are not restricted to multi-channel or multi-object side information.
One possibility is to indicate, within each frame element 22, the type of the respective frame element. Advantageously, such a procedure allows for coping with future extensions of the bitstream syntax. Decoders which are not able to deal with certain frame element types would simply skip the respective frame elements within the bitstream by exploiting respective length information within these frame elements. Moreover, it is possible to allow for standard-conformant decoders of different types: some are able to understand a first set of types, while others understand and can deal with another set of types; alternative element types would simply be disregarded by the respective decoders. Additionally, the encoder would be able to sort the frame elements at its discretion so that decoders which are able to process such additional frame elements may be fed with the frame elements within the frames 20 in an order which, for example, minimizes buffering needs within the decoder. Disadvantageously, however, the bitstream would have to convey frame element type information per frame element, the necessity of which, in turn, negatively affects the compression rate of the bitstream 12 on the one hand and the decoding complexity on the other hand, as the parsing overhead for inspecting the respective frame element type information occurs within each frame element.
Moreover, in order to allow frame elements to be skipped, the bitstream 12 has to convey the afore-mentioned length information concerning the frame elements potentially to be skipped. This transmission in turn reduces the compression efficiency.
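The skipping mechanism and its cost can be illustrated with a toy parser. This is only a sketch under stated assumptions: the byte-aligned layout with a one-byte type field followed by a one-byte length field is hypothetical and chosen for readability, and the function name `parse_frame` is not taken from any standard. Real bitstreams such as USAC operate at bit level with their own syntax.

```python
def parse_frame(frame, known_types):
    """Walk through a frame of [type][length][payload] elements.

    Assumption: each element starts with a one-byte type and a one-byte
    payload length. Elements of unknown type are skipped via the length.
    """
    decoded, skipped = [], []
    pos = 0
    while pos < len(frame):
        if pos + 2 > len(frame):
            raise ValueError("truncated element header")
        elem_type, length = frame[pos], frame[pos + 1]
        payload = frame[pos + 2 : pos + 2 + length]
        if len(payload) != length:
            raise ValueError("truncated element payload")
        if elem_type in known_types:
            decoded.append((elem_type, bytes(payload)))
        else:
            skipped.append(elem_type)   # jump over it using the length field
        pos += 2 + length
    return decoded, skipped


# Three elements: types 0 and 1 are known, type 7 is a future extension.
frame = bytes([0, 2, 0xAA, 0xBB, 7, 3, 1, 2, 3, 1, 1, 0xCC])
decoded, skipped = parse_frame(frame, known_types={0, 1})
header_overhead = 2 * (len(decoded) + len(skipped))  # bytes spent on type and length
```

In this toy frame, 6 of the 12 bytes are spent on type and length fields, which illustrates how per-element signaling affects the compression rate and adds parsing work for every element.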
Naturally, it would be possible to otherwise fix the order among the frame elements 22, such as per convention, but such a procedure prevents encoders from having the freedom to rearrange frame elements due to, for example, specific properties of future extension frame elements necessitating or suggesting, for example, a different order among the frame elements.
Further, it would be favorable if the transmission of the length information could be performed more effectively.
Accordingly, there is a need for another concept of a bitstream, encoder and decoder, respectively.