In recent times, multi-channel audio reproduction techniques have become increasingly important. This may be because audio compression/encoding techniques such as the well-known mp3 technique have made it possible to distribute audio records via the Internet or other transmission channels having a limited bandwidth. The mp3 coding technique has become so well known because it allows distribution of records in a stereo format, i.e., a digital representation of the audio record including a first or left stereo channel and a second or right stereo channel.
Nevertheless, there are basic shortcomings of conventional two-channel sound systems. Therefore, the surround technique has been developed. A recommended multi-channel-surround presentation format includes, in addition to two stereo channels L and R, an additional center channel C and two surround channels Ls, Rs. This reference sound format is also referred to as three/two-stereo, which means three front channels and two surround channels. In a playback environment, at least five speakers at five appropriate locations are needed to get an optimum sweet spot at a certain distance from the five well-placed loudspeakers.
Recent approaches for the parametric coding of multi-channel audio signals (parametric stereo (PS), “spatial audio coding”, “binaural cue coding” (BCC) etc.) represent a multi-channel audio signal by means of a downmix signal (which may be monophonic or comprise several channels) and parametric side information (“spatial cues”) characterizing its perceived spatial sound stage. The different approaches and techniques are reviewed briefly in the following paragraphs.
A related technique, also known as parametric stereo, is described in J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, “High-Quality Parametric Spatial Audio Coding at Low Bitrates”, AES 116th Convention, Berlin, Preprint 6072, May 2004, and E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegard, “Low Complexity Parametric Stereo Coding”, AES 116th Convention, Berlin, Preprint 6073, May 2004.
Several techniques are known in the art for reducing the amount of data required for transmission of a multi-channel audio signal. To this end, reference is made to FIG. 11, which shows a joint stereo device 60. This device can be a device implementing e.g. intensity stereo (IS) or binaural cue coding (BCC). Such a device generally receives—as an input—at least two channels (CH1, CH2, . . . CHn), and outputs a single carrier channel and parametric data. The parametric data are defined such that, in a decoder, an approximation of an original channel (CH1, CH2, . . . CHn) can be calculated.
Normally, the carrier channel will include subband samples, spectral coefficients, time domain samples etc., which provide a comparatively fine representation of the underlying signal, while the parametric data do not include such samples or spectral coefficients but include control parameters for controlling a certain reconstruction algorithm such as weighting by multiplication, time shifting, frequency shifting, phase shifting, etc. The parametric data, therefore, include only a comparatively coarse representation of the signal or the associated channel. Stated in numbers, the amount of data required by a carrier channel can be in the range of 60-70 kbit/s in an MPEG coding scheme, while the amount of data required by parametric side information for one channel may be in the range of about 10 kbit/s for a 5.1 channel signal. Examples of parametric data are the well-known scale factors, intensity stereo information or binaural cue parameters as will be described below.
The BCC technique is for example described in the AES convention paper 5574, “Binaural Cue Coding applied to Stereo and Multi-Channel Audio Compression”, C. Faller, F. Baumgarte, May 2002, Munich, in the IEEE WASPAA paper “Efficient representation of spatial audio using perceptual parametrization”, October 2001, Mohonk, N.Y., and in the two ICASSP papers “Estimation of auditory spatial cues for binaural cue coding”, and “Binaural cue coding: a novel and efficient representation of spatial audio”, both authored by C. Faller and F. Baumgarte, Orlando, Fla., May 2002.
In BCC encoding, a number of audio input channels are converted to a spectral representation using a DFT (Discrete Fourier Transform) based transform with overlapping windows. The resulting spectrum is divided into non-overlapping partitions. Each partition has a bandwidth proportional to the equivalent rectangular bandwidth (ERB). The inter-channel level differences (ICLD) and the inter-channel time differences (ICTD) are estimated for each partition. The inter-channel level differences ICLD and inter-channel time differences ICTD are normally given for each channel with respect to a reference channel and furthermore quantized. The transmitted parameters are finally calculated in accordance with prescribed formulae (encoded), which may depend on the specific partitions of the signal to be processed.
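The per-partition level-difference estimation described above can be illustrated by the following minimal sketch. It assumes that the channel spectra are already available as lists of complex DFT coefficients and that the partition boundaries are given as bin indices. The function names, the `edges` argument, the small constant `eps` and the expression of the ICLD in dB as a power ratio are assumptions made for illustration and are not taken from the cited publications.

```python
import math

def partition_powers(spectrum, edges):
    """Sum |X[k]|^2 over each non-overlapping partition [edges[b], edges[b+1])."""
    return [sum(abs(x) ** 2 for x in spectrum[lo:hi])
            for lo, hi in zip(edges[:-1], edges[1:])]

def icld_db(ref_spectrum, ch_spectrum, edges, eps=1e-12):
    """Level difference per partition, in dB, of a channel relative to the reference.

    Assumption: ICLD is expressed as 10*log10 of the power ratio; eps only
    guards against division by zero and log of zero in silent partitions.
    """
    p_ref = partition_powers(ref_spectrum, edges)
    p_ch = partition_powers(ch_spectrum, edges)
    return [10.0 * math.log10((pc + eps) / (pr + eps))
            for pc, pr in zip(p_ch, p_ref)]

# Two partitions of four bins each (hypothetical values).
reference = [1 + 0j] * 8
channel = [2 + 0j] * 4 + [0.5 + 0j] * 4
print(icld_db(reference, channel, edges=[0, 4, 8]))
```

In this example the channel is louder than the reference in the first partition and quieter in the second, so the first value is positive and the second is negative.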
At the decoder side, the decoder receives a mono signal and the BCC bit stream. The mono signal is transformed into the frequency domain and input into a spatial synthesis block, which also receives decoded ICLD and ICTD values. In the spatial synthesis block, the BCC parameter values (ICLD and ICTD) are used to perform a weighting operation of the mono signal in order to synthesize the multi-channel signals, which, after a frequency/time conversion, represent a reconstruction of the original multi-channel audio signal.
In case of BCC, the joint stereo module 60 is operative to output the channel side information such that the parametric channel data are quantized and encoded resulting in ICLD or ICTD parameters, wherein one of the original channels is used as the reference channel while coding the channel side information.
Normally, the carrier channel is formed of the sum of the participating original channels.
Therefore, the above techniques additionally provide a suitable mono representation for playback equipment that can only process the carrier channel and is not able to process the parametric data for generating one or more approximations of more than one input channel.
The audio coding technique known as binaural cue coding (BCC) is also well described in the United States patent application publications US 2003/0219130 A1, 2003/0026441 A1 and 2003/0035553 A1. Additional reference is also made to “Binaural Cue Coding. Part II: Schemes and Applications”, C. Faller and F. Baumgarte, IEEE Trans. on Audio and Speech Proc., Vol. 11, No. 6, November 2003 and to “Binaural cue coding applied to audio compression with flexible rendering”, C. Faller and F. Baumgarte, AES 113th Convention, Los Angeles, October 2002. The cited United States patent application publications and the two cited technical publications on the BCC technique authored by Faller and Baumgarte are incorporated herein by reference in their entireties.
Although ICLD and ICTD parameters represent the most important sound source localization parameters, a spatial representation using these parameters only limits the maximum quality that can be achieved. To overcome this limitation, and hence to enable high-quality parametric coding, parametric stereo (as described in J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers (2005) “Parametric coding of stereo audio”, Eurasip J. Applied Signal Proc. 9, 1305-1322) applies three types of spatial parameters, referred to as Interchannel Intensity Differences (IIDs), Interchannel Phase Differences (IPDs), and Interchannel Coherence (IC). The extension of the spatial parameter set with coherence parameters enables a parameterization of the perceived spatial ‘diffuseness’ or spatial ‘compactness’ of the sound stage.
In the following, a typical generic BCC scheme for multi-channel audio coding is elaborated in more detail with reference to FIGS. 12 to 14. FIG. 12 shows such a generic binaural cue coding scheme for coding/transmission of multi-channel audio signals. The multi-channel audio input signal at an input 110 of a BCC encoder 112 is downmixed in a downmix block 114. In the present example, the original multi-channel signal at the input 110 is a 5-channel surround signal having a front left channel, a front right channel, a left surround channel, a right surround channel and a center channel. In a preferred embodiment of the present invention, the downmix block 114 produces a sum signal by a simple addition of these five channels into a mono signal. Other downmixing schemes are known in the art such that, using a multi-channel input signal, a downmix signal having a single channel can be obtained. This single channel is output at a sum signal line 115. Side information obtained by a BCC analysis block 116 is output at a side information line 117. In the BCC analysis block, inter-channel level differences (ICLD) and inter-channel time differences (ICTD) are calculated as has been outlined above. The BCC analysis block 116 is formed to also calculate inter-channel correlation values (ICC values). The sum signal and the side information are transmitted, preferably in a quantized and encoded form, to a BCC decoder 120. The BCC decoder decomposes the transmitted sum signal into a number of subbands and applies scaling, delays and other processing to generate the subbands of the output multi-channel audio signals. This processing is performed such that ICLD, ICTD and ICC parameters (cues) of a reconstructed multi-channel signal at an output 121 are similar to the respective cues for the original multi-channel signal at the input 110 of the BCC encoder 112. To this end, the BCC decoder 120 includes a BCC synthesis block 122 and a side information processing block 123.
In the following, the internal construction of the BCC synthesis block 122 is explained with reference to FIG. 13. The sum signal on line 115 is input into a time/frequency conversion unit or filter bank FB 125. At the output of block 125, a number N of subband signals are present, or, in an extreme case, a block of spectral coefficients, when the audio filter bank 125 performs a 1:1 transform, i.e., a transform which produces N spectral coefficients from N time domain samples (critical subsampling).
The BCC synthesis block 122 further comprises a delay stage 126, a level modification stage 127, a correlation processing stage 128 and an inverse filter bank stage IFB 129. At the output of stage 129, the reconstructed multi-channel audio signal having for example five channels in case of a 5-channel surround system, can be output to a set of loudspeakers 124 as illustrated in FIG. 12.
As shown in FIG. 13, the input signal s(n) is converted into the frequency domain or filter bank domain by means of element 125. The signal output by element 125 is replicated such that several versions of the same signal are obtained as illustrated by branching node 130. The number of versions of the original signal is equal to the number of output channels in the output signal to be reconstructed. In general, each version of the original signal at node 130 is subjected to a certain delay d1, d2, . . . , di, . . . , dN. The delay parameters are computed by the side information processing block 123 in FIG. 12 and are derived from the inter-channel time differences as determined by the BCC analysis block 116.
The same is true for the multiplication parameters a1, a2, . . . , ai, . . . , aN, which are also calculated by the side information processing block 123 based on the inter-channel level differences as calculated by the BCC analysis block 116.
The ICC parameters calculated by the BCC analysis block 116 are used for controlling the functionality of block 128 such that certain correlations between the delayed and level-manipulated signals are obtained at the outputs of block 128. It is to be noted here that the ordering of the stages 126, 127, 128 may be different from the case shown in FIG. 13.
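The delay stage 126 and the level modification stage 127 can be sketched for a single subband as follows. The sketch assumes integer sample delays and real-valued subband samples, and it omits the correlation processing stage 128 entirely. The function name and the zero-padding at the start of each delayed copy are assumptions made for illustration.

```python
def synthesize_subband(s, delays, gains):
    """Return one delayed and scaled copy of subband signal s per output channel.

    delays: integer delays d_i in samples (assumed non-negative)
    gains:  multiplication parameters a_i
    The correlation processing stage is intentionally left out of this sketch.
    """
    n = len(s)
    outputs = []
    for d, a in zip(delays, gains):
        d = min(d, n)                      # a delay longer than the block yields silence
        delayed = [0.0] * d + list(s[:n - d])
        outputs.append([a * x for x in delayed])
    return outputs

# Two output channels: the first undelayed at full level,
# the second delayed by one sample at half level (hypothetical values).
print(synthesize_subband([1.0, 2.0, 3.0], delays=[0, 1], gains=[1.0, 0.5]))
# prints [[1.0, 2.0, 3.0], [0.0, 0.5, 1.0]]
```

A practical implementation would run this once per subband and per frame, with the delays and gains supplied by the side information processing block.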
One should be aware that, in a frame-wise processing of an audio signal, the BCC analysis is also performed frame-wise, i.e. time-varying, and also frequency-wise. This means that, for each spectral band, the BCC parameters are obtained individually. This further means that, in case the audio filter bank 125 decomposes the input signal into for example 32 band pass signals, the BCC analysis block obtains a set of BCC parameters for each of the 32 bands. Naturally the BCC synthesis block 122 from FIG. 12, which is shown in detail in FIG. 13, performs a reconstruction, which is also based on the 32 bands in the example.
In the following, reference is made to FIG. 14 showing a setup to determine certain BCC parameters. Normally, ICLD, ICTD and ICC parameters can be defined between arbitrary pairs of channels. One method, which will be outlined here, consists of defining ICLD and ICTD parameters between a reference channel and each other channel. This is illustrated in FIG. 14A.
ICC parameters can be defined in different ways. Most generally, one could estimate ICC parameters in the encoder between all possible channel pairs as indicated in FIG. 14B. In this case, a decoder would synthesize ICC such that it is approximately the same as in the original multi-channel signal between all possible channel pairs. It was, however, proposed to estimate only ICC parameters between the strongest two channels at a time. This scheme is illustrated in FIG. 14C, where an example is shown, in which at one time instance, an ICC parameter is estimated between channels 1 and 2, and, at another time instance, an ICC parameter is calculated between channels 1 and 5. The decoder then synthesizes the inter-channel correlation between the strongest channels in the decoder and applies some heuristic rule for computing and synthesizing the inter-channel coherence for the remaining channel pairs.
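The selection of the two strongest channels and the estimation of a correlation value between them can be sketched as follows. The sketch works on time-domain blocks and uses a zero-lag normalized cross-correlation as a simplified stand-in for the ICC estimate. Both choices, as well as the function names, are assumptions made for illustration; the cited publications should be consulted for the actual estimator.

```python
import math

def strongest_pair(channels):
    """Indices of the two channels with the highest energy in the current block."""
    energies = [sum(x * x for x in ch) for ch in channels]
    order = sorted(range(len(channels)), key=lambda i: energies[i], reverse=True)
    return tuple(sorted(order[:2]))

def zero_lag_correlation(a, b, eps=1e-12):
    """Normalized cross-correlation at lag zero (simplified ICC stand-in)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b)) + eps
    return num / den

# Hypothetical block of three channels.
block = [[1.0, 1.0], [0.1, 0.1], [2.0, -2.0]]
i, j = strongest_pair(block)
print(i, j, zero_lag_correlation(block[i], block[j]))
```

Because the pair is re-selected for every block, the channel indices to which the transmitted value refers can change over time, as in the example of FIG. 14C.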
Regarding the calculation of, for example, the multiplication parameters a1, . . . , aN based on transmitted ICLD parameters, reference is made to AES convention paper 5574 cited above. The ICLD parameters represent an energy distribution in an original multi-channel signal. Without loss of generality, it is shown in FIG. 14A that there are four ICLD parameters showing the energy difference between all other channels and the front left channel. In the side information processing block 123, the multiplication parameters a1, . . . , aN are derived from the ICLD parameters such that the total energy of all reconstructed output channels is the same as (or proportional to) the energy of the transmitted sum signal. A simple way for determining these parameters is a 2-stage process, in which, in a first stage, the multiplication factor for the left front channel is set to unity, while multiplication factors for the other channels in FIG. 14A are determined from the transmitted ICLD values. Then, in a second stage, the energy of all five channels is calculated and compared to the energy of the transmitted sum signal. Then, all channels are downscaled using a downscaling factor which is equal for all channels, wherein the downscaling factor is selected such that the total energy of all reconstructed output channels is, after downscaling, equal to the total energy of the transmitted sum signal.
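The 2-stage process described above can be sketched as follows for one frequency band. The sketch assumes that the transmitted ICLD values are given in dB relative to the front left channel and that each output channel is a scaled copy of the sum signal, so that the output energy of channel i is proportional to the square of its multiplication factor. The function name and the dB convention are assumptions made for illustration.

```python
import math

def gains_from_icld(icld_db):
    """Derive multiplication factors a_1..a_N from N-1 ICLD values in dB.

    Stage 1: the reference (front left) factor is set to unity and the
             other factors follow from the ICLD values.
    Stage 2: all factors are downscaled by one common factor so that the
             sum of a_i^2 is one, i.e. the total output energy equals the
             energy of the transmitted sum signal.
    """
    raw = [1.0] + [10.0 ** (d / 20.0) for d in icld_db]   # stage 1
    scale = 1.0 / math.sqrt(sum(g * g for g in raw))      # stage 2
    return [g * scale for g in raw]

# Five channels with equal level: every factor becomes 1/sqrt(5).
gains = gains_from_icld([0.0, 0.0, 0.0, 0.0])
print(gains, sum(g * g for g in gains))
```

Since the reference factor is unity before downscaling, the sum of squares in stage 2 is never smaller than one, so the common factor never amplifies the signal.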
Naturally, there are also other methods for calculating the multiplication factors, which do not rely on the 2-stage process but which only need a 1-stage process.
Regarding the delay parameters, it is to be noted that the delay parameters ICTD, which are transmitted from a BCC encoder, can be used directly when the delay parameter d1 for the left front channel is set to zero. No rescaling has to be done here, since a delay does not alter the energy of the signal.
As has been outlined above with respect to FIG. 14, the parametric side information, i.e., the interchannel level differences (ICLD), the interchannel time differences (ICTD) or the interchannel coherence parameter (ICC) can be calculated and transmitted for each of the five channels. This means that one, normally, transmits four sets of interchannel level differences for a five channel signal. The same is true for the interchannel time differences. With respect to the interchannel coherence parameter, it can also be sufficient to only transmit for example two sets of these parameters.
As has been outlined above with respect to FIG. 13, there is not a single level difference parameter, time difference parameter or coherence parameter for one frame or time portion of a signal. Instead, these parameters are determined for several different frequency bands so that a frequency-dependent parametrization is obtained. Since it is preferred to use for example 32 frequency channels, i.e., a filter bank having 32 frequency bands for BCC analysis and BCC synthesis, the parameters can occupy quite a lot of data. Although—compared to other multi-channel transmissions—the parametric representation results in a quite low data rate, there is a continuing need for further reduction of the necessary data rate to represent a signal having more than two channels such as a multi-channel surround signal.
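As a rough illustration of the amount of side information, the number of parameter values per frame can be counted from the figures mentioned above: four sets of level differences, four sets of time differences, for example two sets of coherence parameters, and 32 frequency bands. The function below is only a counting aid. It assumes one value per band for each parameter set and says nothing about quantization or entropy coding, which determine the actual bit rate.

```python
def side_info_values_per_frame(n_bands=32, icld_sets=4, ictd_sets=4, icc_sets=2):
    """Count parameter values per frame, one value per band and parameter set."""
    return n_bands * (icld_sets + ictd_sets + icc_sets)

print(side_info_values_per_frame())  # 32 * (4 + 4 + 2) = 320
```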
The encoding of a multi-channel audio signal can be advantageously implemented using several existing modules, which perform a parametric stereo coding into a single mono-channel. The international patent application WO2004008805 A1 teaches how parametric stereo coders can be ordered in a hierarchical set-up such that a given number of input audio channels are subsequently downmixed into one single mono-channel. The parametric side information, describing the spatial properties of the downmix mono-channel, finally consists of all the parametric information subsequently produced during the iterative downmixing process. This means that if there are, for example, three stereo-to-mono downmixing processes involved in building the final mono signal, the final set of parameters building the parametric representation of the multi-channel audio signal consists of the three sets of parameters derived during every single stereo-to-mono downmixing process.
A hierarchical downmixing encoder is shown in FIG. 15, to explain the method of the prior art in more detail. FIG. 15 shows six original audio channels 200a to 200f that are transformed into a single monophonic audio channel 202 plus parametric side information. Therefore, the six original audio channels 200a to 200f have to be transformed from the time domain into the frequency domain, which is performed by transforming units 204, transforming the audio channels 200a to 200f into the corresponding channels 206a to 206f in the frequency domain. Following the hierarchical approach, the channels 206a to 206f are pair-wise downmixed into three monophonic channels L, R and C (208a, 208b and 208c, respectively). During the downmixing of the three pairs of channels a parameter set is derived for each channel pair, describing the spatial properties of the original stereophonic signal, downmixed into a monophonic signal. Thus, in this first downmixing step, three parameter sets 210a to 210c are generated to preserve the spatial information of the signals 206a to 206f. 
In the next step of the hierarchical downmixing, channels 208a and 208b are downmixed into a channel 212 (LR), generating a parameter set 210d (parameter set 4). To finally derive only one single monophonic channel, a downmixing of the channels 208c and 212 is necessary, resulting in channel 214 (M). This generates a fifth parameter set 210e (parameter set 5). Finally, the downmixed monophonic audio signal 214 is inversely transformed into the time domain to derive an audio signal 202 that can be played by standard equipment.
As described above, a parametric representation of the downmix audio signal 202 according to the prior art consists of all the parameter sets 210a to 210e, which means that if one wants to rebuild the original multi-channel audio signal (channels 200a to 200f) from the monophonic audio signal 202, all the parameter sets 210a to 210e are required as side information of the monophonic downmix signal 202.
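The hierarchical downmixing of FIG. 15 and the accumulation of the five parameter sets can be sketched as follows. For brevity the sketch operates on time-domain blocks rather than on frequency-domain channels, describes each pair only by a level difference and a zero-lag correlation, and averages the two channels of a pair. All of these choices, the function names and the assumption that the six inputs are ordered as the three pairs forming L, R and C are made for illustration only.

```python
import math

def downmix_pair(a, b, eps=1e-12):
    """Downmix two channels into one and return (mono, parameter_set)."""
    e_a = sum(x * x for x in a)
    e_b = sum(x * x for x in b)
    cross = sum(x * y for x, y in zip(a, b))
    params = {
        "level_diff_db": 10.0 * math.log10((e_a + eps) / (e_b + eps)),
        "correlation": cross / math.sqrt((e_a + eps) * (e_b + eps)),
    }
    mono = [0.5 * (x + y) for x, y in zip(a, b)]  # assumed averaging downmix
    return mono, params

def hierarchical_downmix(channels):
    """Follow the tree: three pairs -> L, R, C; then L+R -> LR; then C+LR -> M."""
    if len(channels) != 6:
        raise ValueError("this sketch expects six input channels")
    parameter_sets = []
    stage_one = []
    for i in range(0, 6, 2):
        mono, params = downmix_pair(channels[i], channels[i + 1])
        stage_one.append(mono)
        parameter_sets.append(params)        # parameter sets 1 to 3
    left, right, center = stage_one
    left_right, params = downmix_pair(left, right)
    parameter_sets.append(params)            # parameter set 4
    mono, params = downmix_pair(center, left_right)
    parameter_sets.append(params)            # parameter set 5
    return mono, parameter_sets

mono, sets = hierarchical_downmix([[1.0, 0.0, -1.0]] * 6)
print(len(sets))  # 5
```

The returned list makes the cost of the approach visible: a decoder that is to rebuild all six channels needs every one of the five parameter sets in addition to the mono signal.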
The U.S. patent application Ser. No. 11/032,689 (from here only referred to as “prior art cue combination”) describes a process for combining several cue values into a single transmitted one in order to save side information in a nonhierarchical coding scheme. To do so, all the channels are downmixed first and the cue codes are later on combined to form transmitted cue values (could also be one single value), the combination being dependent on a predefined mathematical function, in which the spatial parameters, that are derived directly from the input signals, are put in as variables.
State-of-the-art techniques for the parametric coding of two (“stereo”) or more (“multi-channel”) audio input channels derive the spatial parameters directly from the input signals. Examples of such parameters are inter-channel level differences (ICLD) or inter-channel intensity differences (IID), inter-channel time differences (ICTD) or inter-channel phase differences (IPD), and inter-channel correlation/coherence (ICC), each of which is transmitted in a frequency-selective fashion, i.e. per frequency band. The application of the prior art cue combination teaches that several cue values can be combined to a single value that is transmitted from the encoder to the decoder side. The decoding process uses the transmitted single value instead of the originally individually transmitted cue values to reconstruct the multi-channel output signal. In a preferred embodiment, this scheme has been applied to the ICC parameters. It has been shown that this leads to a considerable reduction in the size of the cue side information while preserving the spatial quality of the vast majority of signals. It is, however, not clear how this can be exploited in a hierarchical coding scheme.
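The idea of combining several cue values into a single transmitted value can be illustrated with an energy-weighted mean of ICC values. This particular rule is a hypothetical example chosen for illustration; it is not asserted to be the function used in the prior art cue combination, which is only characterized above as a predefined mathematical function. The function name and the fallback to a plain mean for silent input are likewise assumptions.

```python
def combine_icc(icc_values, energies):
    """Energy-weighted mean of several ICC values (one hypothetical combination rule)."""
    total = sum(energies)
    if total <= 0.0:
        # Fallback for silent input: plain arithmetic mean (assumption).
        return sum(icc_values) / len(icc_values)
    return sum(c * e for c, e in zip(icc_values, energies)) / total

# Two ICC values, the first belonging to a channel pair with three times the energy.
print(combine_icc([0.8, 0.4], [3.0, 1.0]))
```

With such a rule, the decoder receives one value per band instead of one value per channel pair, which is the source of the side information savings described above.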
The patent application on prior art cue combination has detailed the principle of the invention by an example for a system based on two transmitted downmix channels. In the proposed method, with reference to FIG. 15, ICC values of Lf/Lr and Rf/Rr channel pairs are combined into a single transmitted ICC parameter. The two combined ICC values have been obtained during the downmixing of a front-left channel Lf and a rear-left channel Lr into the channel L and during the downmixing of a front-right Rf and a rear-right channel Rr into the channel R. Therefore, the two combined ICC values that are finally being combined into the single transmitted ICC parameter, both carry information about the front/back correlation of the original channels and a combination of these two ICC values will generally preserve most of this information. If one would have to further downmix the L and R channels into one single mono channel, one would get a third ICC value, carrying information about the left/right correlation of the downmix channels L and R. According to the cue combination of prior art, one would now have to combine the three ICC values applying a given function transforming the three ICC values into one transmitted ICC parameter.
One has the problem then that front/back information mixes with left/right information, which is obviously disadvantageous for a reproduction of the original multi-channel audio signal. In the U.S. application Ser. No. 11/032,689, this is avoided by transmitting two downmix channels, the L and R channels, that hold the left/right information, and additionally transmitting one single ICC value, holding front/back information. This preserves the spatial properties of the original channels at the cost of a substantially increased data rate, resulting from the full additional downmix channel to be transmitted.