Recently, multi-channel audio reproduction techniques have become increasingly important. With a view to the efficient transmission of multi-channel audio signals having five or more separate audio channels, several ways of compressing a stereo or multi-channel signal have been developed. Recent approaches for the parametric coding of multi-channel audio signals (parametric stereo (PS), “Binaural Cue Coding” (BCC), etc.) represent a multi-channel audio signal by means of a down-mix signal (which may be monophonic or comprise several channels) and parametric side information, also referred to as “spatial cues”, characterizing its perceived spatial sound stage.
A multi-channel encoding device generally receives at least two channels as input, and outputs one or more carrier channels and parametric data. The parametric data is derived such that, in a decoder, an approximation of the original multi-channel signal can be calculated. Normally, the carrier channel (or channels) will include subband samples, spectral coefficients, time domain samples, etc., which provide a comparatively fine representation of the underlying signal, while the parametric data does not include such samples or spectral coefficients but instead includes control parameters for controlling a certain reconstruction algorithm. Such a reconstruction could comprise weighting by multiplication, time shifting, frequency shifting, phase shifting, etc. Thus, the parametric data includes only a comparatively coarse representation of the signal or the associated channel.
The binaural cue coding (BCC) technique is described in a number of publications, such as “Binaural Cue Coding applied to Stereo and Multi-Channel Audio Compression”, C. Faller, F. Baumgarte, AES convention paper 5574, May 2002, Munich, and in the two ICASSP publications “Estimation of auditory spatial cues for binaural cue coding” and “Binaural cue coding: a novel and efficient representation of spatial audio”, both authored by C. Faller and F. Baumgarte, Orlando, Fla., May 2002.
In BCC encoding, a number of audio input channels are converted to a spectral representation using a DFT (Discrete Fourier Transform) based transform with overlapping windows. The resulting uniform spectrum is then divided into non-overlapping partitions. Each partition has a bandwidth proportional to the equivalent rectangular bandwidth (ERB). Then, spatial parameters called ICLD (Inter-Channel Level Difference) and ICTD (Inter-Channel Time Difference) are estimated for each partition. The ICLD parameter describes a level difference between two channels and the ICTD parameter describes the time difference (phase shift) between two signals of different channels. The level differences and the time differences are normally given for each channel with respect to a reference channel. After the derivation of these parameters, the parameters are quantized and finally encoded for transmission.
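The ICLD estimation described above can be illustrated with a minimal sketch. It assumes a single frame without windowing or overlap, a naive DFT, and hypothetical partition edges given as bin indices (they are not derived from the ERB scale); the function names dft, band_powers and icld_db are illustrative and do not come from the cited publications.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT of one real-valued frame; returns bins 0..N/2."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def band_powers(spectrum, edges):
    """Sum |X[k]|^2 over each non-overlapping partition [edges[i], edges[i+1])."""
    return [
        sum(abs(c) ** 2 for c in spectrum[lo:hi])
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


def icld_db(reference, channel, edges, eps=1e-12):
    """Level difference in dB of `channel` relative to `reference`, per partition.

    `eps` is an assumption of this sketch; it only avoids log(0) for silent bands.
    """
    p_ref = band_powers(dft(reference), edges)
    p_ch = band_powers(dft(channel), edges)
    return [
        10.0 * math.log10((pc + eps) / (pr + eps))
        for pc, pr in zip(p_ch, p_ref)
    ]


if __name__ == "__main__":
    ref = [1.0] + [0.0] * 15          # impulse: energy in every bin
    ch = [0.5 * x for x in ref]       # same signal at half amplitude
    print(icld_db(ref, ch, edges=[0, 1, 2, 4, 9]))
```

In this sketch the partitions become wider toward higher bin indices, which only loosely mimics a bandwidth proportional to the ERB; the quantization and encoding of the resulting values is not shown.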
Although ICLD and ICTD parameters represent the most important sound source localization parameters, a spatial representation using these parameters can be enhanced by introducing additional parameters.
A related technique, called “parametric stereo”, describes the parametric coding of a two-channel stereo signal based on a transmitted mono signal plus parameter side information. There, three types of spatial parameters, referred to as inter-channel intensity differences (IIDs), inter-channel phase differences (IPDs), and inter-channel coherence (IC), are introduced. The extension of the spatial parameter set with a coherence parameter (correlation parameter) enables a parametrization of the perceived spatial “diffuseness” or spatial “compactness” of the sound stage. Parametric stereo is described in more detail in “Parametric Coding of Stereo Audio”, J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers (2005), EURASIP J. Applied Signal Proc. 9, pages 1305-1322, in “High-Quality Parametric Spatial Audio Coding at Low Bitrates”, J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, AES 116th Convention, Preprint 6072, Berlin, May 2004, and in “Low Complexity Parametric Stereo Coding”, E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegard, AES 116th Convention, Preprint 6073, Berlin, May 2004.
The international publication WO 2004/008805 A1 teaches how a multi-channel audio signal can be advantageously compressed by combining several parametric stereo modules, thus realizing a hierarchical structure to derive a representation of the original multi-channel audio signal comprising a down-mix signal and parametric side information.
Within the BCC and parametric stereo (PS) approaches, a representation of the level differences (also called intensity differences ICLD or energy differences IID) between audio channels is a vital part of a parametric representation of a stereophonic/multi-channel audio signal. Such information and other spatial parameters are transmitted from the encoder to the decoder for each time/frequency slot. In view of coding efficiency, it is therefore of high interest to represent these parameters as compactly as possible while preserving audio quality.
In BCC coding, the level differences are represented relative to a so-called “reference channel” and are quantized on a uniform scale in units of dB. This does not optimally exploit the fact that channels with a low level with respect to the reference channel are subject to a significant masking effect when listened to by human listeners. In the extreme case of a channel having no signal at all, the bandwidth used by parameters describing this particular channel is completely wasted. In the more common case, where one channel is much fainter than another channel, that is, where a listener can hardly hear the faint channel during playback, a less precise reproduction of the faint channel would lead to the same perceptual quality for the listener, as the faint signal is mainly masked by the stronger signal.
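One way to see why a uniform dB scale spends precision on faint channels is sketched below. The step size of 2 dB and the function names are assumptions chosen for illustration only; the sketch compares energy deviations and is not a model of auditory masking.

```python
def quantize_uniform_db(level_db, step_db=2.0):
    """Uniform quantizer on a dB scale; returns (index, reconstructed level in dB)."""
    index = round(level_db / step_db)
    return index, index * step_db


def linear_energy_error(level_db, step_db=2.0):
    """Energy deviation caused by quantization, with the reference channel at energy 1."""
    _, reconstructed_db = quantize_uniform_db(level_db, step_db)
    return abs(10.0 ** (reconstructed_db / 10.0) - 10.0 ** (level_db / 10.0))


if __name__ == "__main__":
    # Both levels incur the same 0.9 dB quantization error ...
    for level in (-2.9, -30.9):
        print(level, quantize_uniform_db(level), linear_energy_error(level))
    # ... but the resulting energy deviation, measured against the reference
    # channel, is far smaller for the faint channel.
```

Under these assumptions, the same dB resolution is applied at -2.9 dB and at -30.9 dB, although the energy deviation of the faint channel is orders of magnitude smaller relative to the reference channel.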
To explain the situation and the problems arising when encoding a multi-channel signal, reference is made to FIG. 10a, where a commonly used 5-channel signal is illustrated. The 5-channel configuration has a left rear channel 101 (A, having a signal a(t)), a left front channel 102 (B, having a signal b(t)), a center channel 103 (C, having a signal c(t)), a right front channel 104 (D, having a signal d(t)) and a right back channel 105 (E, having a signal e(t)). Intensity relations between single channels or channel pairs are marked with arrows. Hence, the intensity distribution between the front left channel 102 and the front right channel 104 is marked r1 (110), and the intensity distribution between the left back channel and the right back channel is marked r4 (112). The intensity distribution between the combination of the left front channel 102 and the right front channel 104 and the center channel 103 is marked r2 (114), and the intensity distribution between the combination of the back channels and the combination of the front channels is marked r3 (116).
When, for example, a simple monologue is recorded, most of the energy would be contained in the center channel 103. In this example, the back channels in particular will contain only little (or zero) energy. Therefore, parameters describing the properties of the back channels are merely wasted in this example, since mainly the center channel 103 or the front channels will be active during playback.
Based on FIG. 10a, ways of computing the energy distribution between channels or channel combinations are described within the following paragraph.
FIG. 10a illustrates a multi-channel parameterization for a five-channel speaker set-up where the different audio channels are indicated by 101 to 105; a(t) 101 represents the signal of the left surround channel, b(t) 102 represents the signal of the left front channel, c(t) 103 represents the signal of the center channel, d(t) 104 represents the signal of the right front channel, and e(t) 105 represents the signal of the right surround channel. The speaker set-up is divided into a front part and a back part. The energy distribution between the entire front channel set-up (102, 103 and 104) and the back channels (101 and 105) is illustrated by the arrow in FIG. 10a and indicated by the r3 parameter. The energy distribution between the center channel 103 and the left front 102 and right front 104 channels is indicated by r2. The energy distribution between the left surround channel 101 and the right surround channel 105 is illustrated by r4. Finally, the energy distribution between the left front channel 102 and the right front channel 104 is given by r1. Since r1 to r4 are parameterizations of different regions, it is also clear that besides the energy distribution, other essential region properties can be parameterized as well, for example the correlation between the regions. Additionally, for each parameter r1 to r4 a local energy can be calculated. For example, the local energy of r4 is the summed energy of channel A 101 and channel E 105:
$$\mathrm{LocalEnergy}_{r4} = E[a^2(t)] + E[e^2(t)],$$
where E[.] is the expected value as defined by
$$E[f(x)] = \frac{1}{T} \int_0^T f(x(t))\, dt.$$
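For sampled signals, the expected value above can be approximated by a sample mean. The following minimal sketch assumes uniformly sampled signals of equal length; the function names are illustrative and not taken from the cited prior art.

```python
def expected_value(samples):
    """Discrete stand-in for (1/T) * integral of f(x(t)) dt: the sample mean."""
    return sum(samples) / len(samples)


def channel_energy(signal):
    """E[x^2(t)] approximated as the mean of the squared samples."""
    return expected_value([s * s for s in signal])


def local_energy(*signals):
    """Summed energy of the channels belonging to one parameter region."""
    return sum(channel_energy(s) for s in signals)


if __name__ == "__main__":
    a = [1.0, -1.0, 1.0, -1.0]   # left surround channel A (toy data)
    e = [0.5, 0.5, 0.5, 0.5]     # right surround channel E (toy data)
    print(local_energy(a, e))    # local energy of r4 for this toy frame
```

By analogy, and only as an assumption of this sketch, the local energy of r1 would be obtained by calling local_energy with the signals b(t) and d(t).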
FIG. 10b shows a multi-channel audio decoder built by hierarchically ordering parametric stereo modules, as described, for example, in WO 2004/008805 A1. Here, the audio channels 101 to 105, as introduced in FIG. 10a, are reproduced step by step from a single monophonic down-mix signal 120 (M) and corresponding side information by a first two-channel decoder 122, a second two-channel decoder 124, a third two-channel decoder 126, and a fourth two-channel decoder 128. As can be seen in the treelike structure in FIG. 10b, the first two-channel decoder 122 decomposes the monophonic down-mix signal 120 into two signals fed into the second and the third two-channel decoders 124 and 126. Therein, the channel fed into the third two-channel decoder 126 is a combined channel, formed from the left back channel 101 and the right back channel 105. The channel fed into the second two-channel decoder 124 is a combination of the center channel 103 and a combined channel, which in turn is a combination of the front left channel 102 and the front right channel 104.
Thus, after the second step of the hierarchical decoding, the left back channel 101, the right back channel 105, the center channel 103, and a combined channel, being a combination of the front left channel 102 and the front right channel 104, are reconstructed using the transmitted spatial parameters, which comprise a level parameter for use by each of the two-channel decoders 122, 124, and 126.
In the third step of the hierarchical decoding, the fourth two-channel decoder 128 derives the front left channel 102 and the front right channel 104, using level information transmitted as side information for the fourth two-channel decoder 128. Using a prior art hierarchical decoder as shown in FIG. 10b, the desired energy for each single output channel follows from the various parametric stereo modules between the input signal and each output signal. In other words, the energy of a specific output channel can depend on the IID/ICLD parameters of multiple parametric stereo modules. In such a treelike structure of connected parametric stereo modules, a non-uniform quantization of IID parameters can also be applied within each parametric stereo module to produce IID values, which are then used by a decoder as part of the side information. This would exploit the benefits of non-uniform IID quantization locally (i.e. within each parametric stereo module individually); nonetheless, it is sub-optimal because quantization in each module (the “leaves”) is carried out independently of the energies/levels of other audio channels that may be high in relative level and, therefore, produce masking.
This sub-optimality arises because the “leaf” modules are not aware of the global level distribution at a higher tree level (e.g. the “root” module). Each leaf has its own corresponding IID/ICLD parameter, which indicates the energy distribution from its input toward its output channels. For example, the IID/ICLD parameter of leaf “r3” (processed by the first two-channel decoder 122) may indicate that 90% of the incoming energy should be sent to leaf r2, while the remaining energy (10%) should be sent to leaf r4. This process is repeated for each leaf in the tree. Since each energy distribution parameter is represented with limited accuracy, the deviation between the desired and the actual energy of each output channel A to E depends on the quantization errors in the IID/ICLD parameters, as well as on the energy distribution (and hence on the propagation of quantization errors). In other words, as the same quantization table is used for a certain parameter type, e.g. ICC or IID, within all parameterization stages r1 to r4, the IID/ICLD quantization is optimal only locally. This means that, in prior art implementations, for each parameterization stage r1 to r4 the error in output energy of the (local) output channels is largest for the weakest output channel.
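The propagation of quantization errors through the tree of FIG. 10b can be sketched as follows. This is an illustration under stated assumptions, not the decoder of WO 2004/008805 A1: only energies are modelled (no signals), every stage uses the same uniform quantizer with a hypothetical 3 dB step, the ordering of the two outputs of each stage is a convention of this sketch, and all function names are illustrative.

```python
import math

EPS = 1e-12  # assumption of this sketch: avoids division by zero for silent channels


def iid_db(e_first, e_second):
    """Energy ratio of the two outputs of one stage, in dB."""
    return 10.0 * math.log10((e_first + EPS) / (e_second + EPS))


def quantize(value_db, step_db=3.0):
    """Same uniform quantizer for every stage r1..r4 (hypothetical step size)."""
    return round(value_db / step_db) * step_db


def split(energy, iid):
    """Distribute the incoming energy to two outputs according to an IID in dB."""
    share = 1.0 / (1.0 + 10.0 ** (-iid / 10.0))
    return energy * share, energy * (1.0 - share)


def encode_tree(e):
    """Derive r1..r4 from the channel energies A..E."""
    front_lr = e["B"] + e["D"]
    front = front_lr + e["C"]
    back = e["A"] + e["E"]
    return {
        "r1": iid_db(e["B"], e["D"]),    # left front vs. right front
        "r2": iid_db(e["C"], front_lr),  # center vs. combined front pair
        "r3": iid_db(front, back),       # front vs. back
        "r4": iid_db(e["A"], e["E"]),    # left back vs. right back
    }


def decode_tree(total, p):
    """Reproduce channel energies step by step from the down-mix energy."""
    front, back = split(total, p["r3"])       # first two-channel decoder 122
    c, front_lr = split(front, p["r2"])       # second two-channel decoder 124
    a, e = split(back, p["r4"])               # third two-channel decoder 126
    b, d = split(front_lr, p["r1"])           # fourth two-channel decoder 128
    return {"A": a, "B": b, "C": c, "D": d, "E": e}


if __name__ == "__main__":
    target = {"A": 0.05, "B": 0.2, "C": 0.5, "D": 0.2, "E": 0.05}
    params = encode_tree(target)  # r3 corresponds to a 90%/10% front/back split
    coarse = decode_tree(1.0, {k: quantize(v) for k, v in params.items()})
    for name in sorted(target):
        print(name, target[name], coarse[name])
```

In this toy example the error made in quantizing r3 at the root is inherited by both back channels, even though r4 itself is quantized without error; the leaf stage has no means of compensating for it.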
As detailed in the previous paragraphs, the quantization of level parameters (IID or ICLD) or other parameters such as ICC, phase differences or time differences describing the spatial perception of a multi-channel audio signal is still sub-optimal, since bandwidth may be wasted for spatial parameters describing channels that are mainly masked due to low energy within the channel.