1. Field of the Invention
The present invention relates to multi-channel audio coding and in particular to a concept of generating and using a parametric representation of a multi-channel audio signal that is fully backward compatible with parametric stereo playback environments.
2. Description of the Related Art
The present invention relates to coding of multi-channel representations of audio signals using spatial audio parameters in a manner that is compatible with coding of 2-channel stereo signals using parametric stereo parameters. The present invention teaches new methods for efficient coding of both spatial audio parameters and parametric stereo parameters and for embedding the coded parameters in a bitstream in a backward compatible manner. In particular it aims at minimizing the overall bitrate for the parametric stereo and spatial audio parameters in the backward compatible bitstream without compromising the quality of the decoded stereo or multi-channel audio signal. When a slightly compromised quality of the decoded stereo signal is acceptable, the overall bitrate can be reduced even further.
Recently, multi-channel audio reproduction techniques have become increasingly important. To transmit multi-channel audio signals having 5 or more separate audio channels efficiently, several ways of compressing a stereo or multi-channel signal have been developed. Recent approaches for the parametric coding of multi-channel audio signals (parametric stereo (PS), Binaural Cue Coding (BCC), etc.) represent a multi-channel audio signal by means of a down-mix signal (which may be monophonic or comprise several channels) and parametric side information, also referred to as “spatial cues”, characterizing its perceived spatial sound stage.
A multi-channel encoding device generally receives at least two channels as input, and outputs one or more carrier channels and parametric data. The parametric data is derived such that, in a decoder, an approximation of the original multi-channel signal can be calculated. Normally, the carrier channel or channels will include subband samples, spectral coefficients, time domain samples, etc., which provide a comparatively fine representation of the underlying signal, while the parametric data does not include such samples or spectral coefficients but instead includes control parameters for controlling a certain reconstruction algorithm. Such a reconstruction could comprise weighting by multiplication, time shifting, frequency shifting, phase shifting, etc. Thus, the parametric data includes only a comparatively coarse representation of the signal or the associated channel.
The binaural cue coding (BCC) technique is described in a number of publications, such as “Binaural Cue Coding applied to Stereo and Multi-Channel Audio Compression”, C. Faller, F. Baumgarte, AES convention paper 5574, May 2002, Munich, and the two ICASSP publications “Estimation of auditory spatial cues for binaural cue coding” and “Binaural cue coding: a novel and efficient representation of spatial audio”, both authored by C. Faller and F. Baumgarte, Orlando, Fla., May 2002.
In BCC encoding, a number of audio input channels are converted to a spectral representation using a DFT (Discrete Fourier Transform) based transform with overlapping windows. The resulting uniform spectrum is then divided into non-overlapping partitions. Each partition has a bandwidth proportional to the equivalent rectangular bandwidth (ERB). Then, spatial parameters called ICLD (Inter-Channel Level Difference) and ICTD (Inter-Channel Time Difference) are estimated for each partition. The ICLD parameter describes a level difference between two channels and the ICTD parameter describes the time difference (phase shift) between two signals of different channels. The level differences and the time differences are normally given for each channel with respect to a reference channel. After the derivation of these parameters, the parameters are quantized and finally encoded for transmission.
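The ICLD estimation described above can be illustrated with a minimal sketch. It assumes a single short frame, omits the overlapping windows, and uses hypothetical partition edges that merely widen with frequency; the function names, the frame length and the edge values are illustrative assumptions and are not taken from the cited publications. ICTD estimation is not shown.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT of a real frame; returns bins 0..N/2."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def partition_powers(spectrum, edges):
    """Sum the bin powers inside each non-overlapping partition [lo, hi)."""
    return [
        sum(abs(spectrum[k]) ** 2 for k in range(lo, hi))
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


def icld_db(reference, other, edges, eps=1e-12):
    """Level of `other` relative to `reference`, in dB, for each partition."""
    p_ref = partition_powers(dft(reference), edges)
    p_oth = partition_powers(dft(other), edges)
    return [
        10.0 * math.log10((po + eps) / (pr + eps))
        for pr, po in zip(p_ref, p_oth)
    ]


# Hypothetical partition edges (in DFT bins), widening with frequency to
# loosely imitate an ERB-proportional layout. Not values from any standard.
EDGES = [0, 2, 4, 8, 16, 33]

n = 64
left = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]  # tone in bin 5
right = [0.5 * x for x in left]                               # half amplitude
icld = icld_db(left, right, EDGES)
```

In this sketch the tone falls into the third partition, where the estimated level difference is about -6 dB; the small `eps` term keeps the logarithm defined in partitions that carry no energy.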
Although ICLD and ICTD parameters represent the most important sound source localization parameters, a spatial representation using these parameters can be enhanced by introducing additional parameters.
A related technique, called “parametric stereo”, describes the parametric coding of a two-channel stereo signal based on a transmitted mono signal plus parameter side information. Three types of spatial parameters, referred to as inter-channel intensity differences (IIDs), inter-channel phase differences (IPDs), and inter-channel coherence (IC), are introduced. The extension of the spatial parameter set with a coherence parameter (correlation parameter) enables a parametrization of the perceived spatial “diffuseness” or spatial “compactness” of the sound stage. Parametric stereo is described in more detail in “Parametric Coding of stereo audio”, J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, EURASIP J. Applied Signal Proc. 9 (2005), pages 1305-1322, in “High-Quality Parametric Spatial Audio Coding at Low Bitrates”, J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, AES 116th Convention, Preprint 6072, Berlin, May 2004, and in “Low Complexity Parametric Stereo Coding”, E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegard, AES 116th Convention, Preprint 6073, Berlin, May 2004.
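As a rough illustration of the three parameter types, the following sketch computes an IID, an IPD and an IC value for one band from complex subband samples. The formulas are commonly used textbook definitions and are an assumption here, as are the function name, the sign convention (left relative to right) and the sample values; the cited publications may define and quantize the parameters differently.

```python
import cmath
import math


def ps_parameters(l_band, r_band, eps=1e-12):
    """Return (IID in dB, IPD in radians, IC in 0..1) for one band.

    l_band and r_band are equally long sequences of complex subband samples.
    """
    p_l = sum(abs(x) ** 2 for x in l_band)
    p_r = sum(abs(x) ** 2 for x in r_band)
    # Complex cross-correlation between the two channels.
    cross = sum(a * b.conjugate() for a, b in zip(l_band, r_band))
    iid = 10.0 * math.log10((p_l + eps) / (p_r + eps))
    ipd = cmath.phase(cross)
    ic = abs(cross) / math.sqrt((p_l + eps) * (p_r + eps))
    return iid, ipd, ic


# Illustrative band: right is a half-amplitude copy of left, delayed in
# phase by 45 degrees, so the two channels are fully coherent.
l_band = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
r_band = [0.5 * x * cmath.exp(-1j * math.pi / 4) for x in l_band]
iid, ipd, ic = ps_parameters(l_band, r_band)
```

For this input the sketch yields an IID of about 6 dB, an IPD of pi/4 and an IC close to 1. Replacing `r_band` with an unrelated signal would lower the IC, which is the property the coherence parameter is meant to capture.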
As mentioned above, systems for parametric stereo coding as well as for spatial audio coding have been developed recently. In parametric stereo, a two-channel stereo audio signal is represented by means of a mono downmix audio signal and additional side information that carries stereo parameters (see PCT/SE02/01372 “Efficient and scalable Parametric Stereo Coding for Low Bitrate Audio Coding Applications”). A legacy parametric stereo decoder therefore reconstructs a two-channel stereo signal from the mono signal and the side information.
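The reconstruction step can be sketched for the level difference alone. The example below assumes one commonly used energy-preserving gain rule, in which the squared gains sum to 2, and ignores phase and coherence parameters entirely; the function names and the sign convention (IID as left relative to right) are illustrative assumptions, not the method of the cited application.

```python
import math


def upmix_gains(iid_db):
    """Derive left/right gains from an IID given in dB (left over right)."""
    c = 10.0 ** (iid_db / 20.0)  # linear amplitude ratio left/right
    g_left = math.sqrt(2.0 * c * c / (1.0 + c * c))
    g_right = math.sqrt(2.0 / (1.0 + c * c))
    return g_left, g_right


def upmix(mono, iid_db):
    """Scale the mono downmix into a left and a right channel."""
    g_l, g_r = upmix_gains(iid_db)
    return [g_l * m for m in mono], [g_r * m for m in mono]


mono = [0.0, 0.5, 1.0, 0.5]
g_l, g_r = upmix_gains(6.0206)   # roughly a 2:1 amplitude ratio
left, right = upmix(mono, 6.0206)
```

With this rule the ratio of the two gains reproduces the transmitted level difference, while their squared sum stays constant so that the overall energy of the reconstruction does not depend on the IID.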
In spatial audio coding schemes, a multi-channel surround audio signal is represented by means of a mono or stereo downmix audio signal and additional side information that carries spatial audio parameters. A widely known example is the 5.1 channel configuration used for home entertainment systems.
A legacy spatial audio decoder reconstructs the 5.1 multi-channel signal based on the mono or stereo signal and the additional spatial audio parameters.
Typically, downmix signals employed in parametric stereo or spatial audio coding systems are additionally encoded using low bit rate perceptual audio coding techniques (like MPEG AAC) to further reduce the bandwidth required for transmission of the different signal types. Furthermore, the downmix signal is normally combined with the parametric stereo or spatial audio side information in a bitstream in a way that assures backward compatibility with legacy decoders, that is, with decoders that are not operative to process the parametric stereo or spatial audio parameters. In this way, a legacy audio decoder reconstructs only the transmitted mono or stereo downmix signal. When a decoder implementing parametric stereo or spatial audio coding is used, the decoder will also recover the side information embedded in the bitstream and reconstruct the full two-channel stereo or 5.1 channel surround signal.
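The principle of such backward compatible embedding can be sketched with a toy container in which each chunk carries a tag and a length, so that a decoder can skip chunks it does not understand. The chunk layout, the tag values and the payloads below are purely hypothetical and do not correspond to the syntax of MPEG AAC or of any other real bitstream format.

```python
import struct

# Hypothetical chunk tags; not taken from any real bitstream syntax.
TAG_CORE = 0x01      # perceptually coded downmix
TAG_PS = 0x02        # parametric stereo side information
TAG_SPATIAL = 0x03   # spatial audio side information


def pack_chunks(chunks):
    """Serialize (tag, payload) pairs as tag byte + 16-bit length + payload."""
    return b"".join(
        struct.pack(">BH", tag, len(payload)) + payload
        for tag, payload in chunks
    )


def read_chunks(stream, understood):
    """Return the payloads a decoder understands and skip all other chunks."""
    found, pos = {}, 0
    while pos < len(stream):
        tag, length = struct.unpack_from(">BH", stream, pos)
        pos += 3  # size of the tag and length fields
        if tag in understood:
            found[tag] = stream[pos:pos + length]
        pos += length  # unknown chunks are skipped using the length field
    return found


frame = pack_chunks([
    (TAG_CORE, b"mono-downmix"),
    (TAG_PS, b"ps-params"),
    (TAG_SPATIAL, b"spatial-params"),
])

legacy = read_chunks(frame, {TAG_CORE})
ps_decoder = read_chunks(frame, {TAG_CORE, TAG_PS})
spatial_decoder = read_chunks(frame, {TAG_CORE, TAG_PS, TAG_SPATIAL})
```

Here the legacy decoder obtains only the downmix payload, the parametric stereo decoder additionally obtains its side information, and the spatial audio decoder obtains all three chunks from the same frame.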
When spatial audio coding is based on a mono downmix signal, it is furthermore desirable to increase backward compatibility by providing a bitstream from which not only a legacy perceptual audio decoder can derive the mono downmix signal, but which can also be decoded by a parametric stereo decoder that does not support spatial audio decoding. To achieve this goal, both kinds of side information, the parametric stereo side information and the spatial audio side information, have to be included in the bitstream. This obvious approach leads to an undesirably high amount of side information within the bitstream. In a scenario where a total maximum bit rate has to be maintained to convey the mono signal and the side information, an increase in side information leaves less data rate for the perceptually encoded mono downmix, which reduces the audio quality of the decoded mono downmix signal.
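The trade-off under a fixed total bit rate amounts to simple arithmetic, sketched below. All rates are made-up round numbers chosen for illustration only; they are not measurements and are not taken from any codec specification.

```python
def core_bitrate(total_kbps, side_info_kbps):
    """Rate left for the perceptually coded downmix under a fixed total."""
    core = total_kbps - sum(side_info_kbps)
    if core <= 0:
        raise ValueError("side information exceeds the total bit rate")
    return core


# Hypothetical rates in kbit/s, for illustration only.
TOTAL = 48.0
PS_SIDE_INFO = 2.0
SPATIAL_SIDE_INFO = 4.0

ps_only = core_bitrate(TOTAL, [PS_SIDE_INFO])
both = core_bitrate(TOTAL, [PS_SIDE_INFO, SPATIAL_SIDE_INFO])
lost_to_side_info = ps_only - both
```

Under these assumed numbers, carrying both parameter sets independently takes 4 kbit/s away from the mono downmix compared with carrying the parametric stereo parameters alone.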
Another prior art approach to including both parametric stereo and spatial audio side information requires a set of spatial audio parameters that is structured such that a subset of these parameters permits the reconstruction of a two-channel stereo signal from the mono downmix signal. This subset is embedded as parametric stereo side information within the bitstream in a way compatible with parametric stereo bitstreams, while the remaining spatial audio parameters that do not belong to the subset are embedded as spatial audio side information in a way compatible with spatial audio coders. On the decoder side, a decoder implementing only parametric stereo will reconstruct a two-channel stereo signal based on the subset of parameters that is embedded as parametric stereo side information. On the other hand, a decoder implementing spatial audio will recover the parametric stereo subset and the remaining spatial audio parameters. With this complete set of spatial parameters, the multi-channel signal can be reconstructed.
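The subset approach can be sketched with plain dictionaries. The parameter names and values below are hypothetical placeholders chosen for illustration; they do not reflect the parameter set of any actual spatial audio coder.

```python
# Hypothetical names of the parameters shared with parametric stereo.
PS_SUBSET = ("iid_front", "ic_front")


def split_parameters(spatial_params):
    """Split a full spatial parameter set into the PS subset and the rest."""
    ps_part = {k: v for k, v in spatial_params.items() if k in PS_SUBSET}
    rest = {k: v for k, v in spatial_params.items() if k not in PS_SUBSET}
    return ps_part, rest


def spatial_decoder_view(ps_part, rest):
    """A spatial audio decoder merges both parts into the complete set."""
    return {**ps_part, **rest}


# Hypothetical full parameter set for one band of a 5.1 signal.
full_set = {
    "iid_front": 3.0,
    "ic_front": 0.8,
    "iid_front_surround": -6.0,
    "iid_center": 1.5,
}

ps_part, rest = split_parameters(full_set)
recovered = spatial_decoder_view(ps_part, rest)
```

A decoder implementing only parametric stereo would use `ps_part` alone, whereas a spatial audio decoder recovers the complete set. Since each shared parameter is transmitted once, a single value has to serve both the stereo and the multi-channel reconstruction.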
This approach, however, has the drawback that it compromises the audio quality of either the backward compatible parametric stereo reconstruction or the multi-channel reconstruction. In the first case, the subset of parameters that is also used as spatial audio parameters describes the interrelation between two channels of a 5.1 signal. The most natural choice would be the left-front (l) and the right-front (r) channels, whose interrelation, however, can differ substantially from the correct values for the relationship of the left (l0) and right (r0) channels of a stereo downmix. In the second case, the correct values of a stereo downmix form said subset, which means that they are used to describe an interrelation between the left-front and the right-front channels of a multi-channel surround signal. This, however, can lead to a significant imperfection of the spatial audio reconstruction due to quantization of the parameters, which is required in order to embed them in the bitstream in a multi-channel compatible way.