In addition to the two stereo channels, a recommended multi-channel surround representation includes a center channel C and two surround channels, i.e. the left surround channel Ls and the right surround channel Rs, and additionally, if applicable, a subwoofer channel also referred to as LFE channel (LFE=Low Frequency Enhancement). This reference sound format is also referred to as 3/2 (plus LFE) stereo and recently also as 5.1 multi-channel, which means that there are three front channels and two surround channels. In general, five or six transmission channels are required. In a reproduction environment, at least five loudspeakers are required in the respective five different positions to obtain an optimal so-called sweet spot at a determined distance from the five correctly placed loudspeakers. The subwoofer, however, may be positioned relatively freely.
There are several techniques for reducing the amount of data required to transmit a multi-channel audio signal. Such techniques are also called joint stereo techniques. For this purpose, reference is made to FIG. 5. FIG. 5 shows a joint stereo device 60. This device may be a device implementing, for example, the intensity stereo technique (IS technique) or the binaural cue coding technique (BCC technique). Such a device generally receives at least two channels (CH1, CH2, . . . CHn) as input signal and outputs at least one single carrier channel (downmix) and parametric data, i.e. one or more parameter sets. The parametric data are defined so that an approximation of each original channel (CH1, CH2, . . . CHn) may be calculated in a decoder.
Normally, the carrier channel will include subband samples, spectral coefficients or time domain samples, etc., which provide a comparatively fine representation of the underlying signal, while the parametric data and/or parameter sets do not include any such samples or spectral coefficients. Instead, the parametric data include control parameters for controlling a determined reconstruction algorithm, such as weighting by multiplication, time shifting, frequency shifting, etc. The parametric data thus include only a comparatively rough representation of the signal or the associated channel. Expressed in numbers, the amount of data required by a carrier channel (which is compressed, i.e. coded by means of AAC, for example) is in the range of 60 to 70 kbit/s, while the amount of data required by parametric side information is on the order of 1.5 kbit/s per channel. Examples of parametric data are the known scaling factors, intensity stereo information or binaural cue parameters, as will be described below.
The intensity stereo coding technique is described in the AES preprint 3799 entitled “Intensity stereo coding”, J. Herre, K. H. Brandenburg, D. Lederer, February 1994, Amsterdam. In general, the concept of intensity stereo is based on a main axis transform which is to be applied to data of the two stereophonic audio channels. If most data points are placed around the first main axis, a coding gain may be achieved by rotating both signals by a determined angle prior to the coding. However, this does not always apply to real stereophonic reproduction techniques. The reconstructed signals for the left and right channels consist of differently weighted or scaled versions of the same transmitted signal. The reconstructed signals thus differ in amplitude, but they are identical with respect to their phase information. The energy time envelopes of both original audio channels, however, are maintained by means of the selective scaling operation, which typically operates in a frequency-selective fashion. This corresponds to human sound perception at high frequencies, where the dominant spatial cues are determined by the energy envelopes.
In addition, in practical implementations the transmitted signal, i.e. the carrier channel, is formed of the sum signal of the left channel and the right channel instead of rotating both components. Furthermore, this processing, i.e. the generation of the intensity stereo parameters for performing the scaling operation, is performed in a frequency-selective way, i.e. independently of each other for each scale factor band, i.e. for each encoder frequency partition. Preferably, both channels are combined to form a combined or “carrier” channel. In addition to the combined channel, the intensity stereo information is determined which depends on the energy of the first channel, the energy of the second channel and the energy of the combined or sum channel.
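The band-wise intensity stereo processing described above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the choice of square-root energy ratios as scale factors is an assumption, since the text states only that the intensity stereo information depends on the energies of the first channel, the second channel and the sum channel.

```python
import math

def intensity_stereo_band(left, right):
    """Encode one scale factor band (hypothetical helper).

    Returns the sum ("carrier") signal and one scale factor per channel.
    Assumption: each scale factor is sqrt(channel energy / carrier energy).
    """
    carrier = [l + r for l, r in zip(left, right)]
    e_left = sum(x * x for x in left)
    e_right = sum(x * x for x in right)
    e_sum = sum(x * x for x in carrier)
    if e_sum == 0.0:
        # Silent or fully cancelling band: nothing to scale.
        return carrier, 0.0, 0.0
    return carrier, math.sqrt(e_left / e_sum), math.sqrt(e_right / e_sum)

def reconstruct_band(carrier, scale_left, scale_right):
    """Decode one band: both outputs are scaled copies of the same carrier."""
    rec_left = [scale_left * c for c in carrier]
    rec_right = [scale_right * c for c in carrier]
    return rec_left, rec_right
```

With these assumed scale factors, each reconstructed channel has the same band energy as the corresponding original channel, while both reconstructed channels share the waveform (and hence the phase) of the carrier.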
The BCC technique is described in the AES convention paper 5574 entitled “Binaural cue coding applied to stereo and multi-channel audio compression”, C. Faller, F. Baumgarte, May 2002, Munich. In BCC coding, a number of audio input channels is converted to a spectral representation using a DFT-based transform with overlapping windows. The resulting spectrum is divided into non-overlapping partitions. Each partition has a bandwidth proportional to an equivalent rectangular bandwidth (ERB). So-called inter-channel level differences (ICLD) as well as so-called inter-channel time differences (ICTD) are calculated for each partition, i.e. for each band, and for each frame k, i.e. for each block of time samples. The ICLD and ICTD parameters are quantized and coded to obtain a BCC bit stream. The inter-channel level differences and the inter-channel time differences are given for each channel with respect to a reference channel. In particular, the parameters are calculated according to predetermined formulae depending on the particular partitions of the signal to be processed.
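As an illustration of level differences given with respect to a reference channel, the following sketch computes one ICLD value per non-reference channel for a single partition. It assumes that the ICLD is expressed in decibels as a ratio of partition energies; the text refers only to "predetermined formulae", so the function name, the dB convention and the small energy floor are all assumptions.

```python
import math

def icld_db(band_energies, reference=0, floor=1e-12):
    """Level difference of each channel to the reference channel, in dB.

    band_energies: energy of one partition, one value per input channel.
    The floor avoids log(0) for silent partitions (an illustrative choice).
    """
    ref = max(band_energies[reference], floor)
    return [
        10.0 * math.log10(max(energy, floor) / ref)
        for index, energy in enumerate(band_energies)
        if index != reference
    ]
```

For N input channels this yields N-1 values per partition and frame, one for each channel other than the reference channel.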
On the decoder side, the decoder receives a mono signal and the BCC bit stream, i.e. a first parameter set for the inter-channel time differences and a second parameter set for the inter-channel level differences per frame. The mono signal is transformed to the frequency domain and input into a synthesis block also receiving decoded ICLD and ICTD values. In the synthesis block or reconstruction block, the BCC parameters (ICLD and ICTD) are used to perform a weighting operation of the mono signal to reconstruct the multi-channel signal, which then, after a frequency/time conversion, represents a reconstruction of the original multi-channel audio signal.
In the case of BCC, the joint stereo module 60 operates to output the channel side information so that the parametric channel data are quantized and coded ICLD and ICTD parameters, wherein one of the original channels may be used as reference channel for coding the channel side information. Normally, the carrier channel is formed of the sum of the participating original channels.
Of course, the above technique only provides a mono representation for a decoder which is only able to decode the carrier channel, but which is not capable of processing the parameter data for generating one or more approximations of more than one input channel.
The audio coding technique referred to as BCC technique is further described in the US patent applications US 2003/0219130 A1, 2003/0026441 A1 and 2003/0035553 A1. In addition, see “Binaural Cue Coding. Part II: Schemes and Applications”, C. Faller and F. Baumgarte, IEEE Transactions on Speech and Audio Processing, Vol. 11, No. 6, November 2003. See also C. Faller and F. Baumgarte, “Binaural Cue Coding applied to Stereo and Multi-Channel Audio compression”, Preprint, 112th Convention of the Audio Engineering Society (AES), May 2002, and J. Herre, C. Faller, C. Ertel, J. Hilpert, A. Hoelzer, C. Spenger, “MP3 Surround: Efficient and Compatible Coding of Multi-Channel Audio”, 116th AES Convention, Berlin, 2004, Preprint 6049.
A typical general BCC scheme for multi-channel audio coding is described below in more detail with respect to FIGS. 6 to 8. FIG. 6 shows a general BCC coding scheme for coding/transmission of multi-channel audio signals. The multi-channel audio input signal is input at an input 110 of a BCC encoder 112 and is “mixed down” in a so-called downmix block 114, i.e. converted to a single sum channel. In the present example, the signal at the input 110 is a 5-channel surround signal having a front left channel and a front right channel, a left surround channel and a right surround channel, and a center channel. Typically, the downmix block generates a sum signal by simple addition of these five channels into a mono signal. Other downmix schemes are known in the art, all of which generate, from a multi-channel input signal, a downmix signal having a single channel or a number of downmix channels which, in any case, is less than the number of original input channels. In the present example, a downmix operation would already be achieved if four carrier channels were generated from the five input channels. The single output channel and/or the number of output channels is output on a sum signal line 115.
Side information obtained by a BCC analysis block 116 is output on a side information line 117. In the BCC analysis block, inter-channel level differences (ICLD), inter-channel time differences (ICTD) or inter-channel correlation values (ICC values) may be calculated. Thus, there are three different parameter sets, namely the inter-channel level differences (ICLD), the inter-channel time differences (ICTD) and the inter-channel correlation values (ICC), for the reconstruction in the BCC synthesis block 122.
The sum signal and the side information with the parameter sets are typically transmitted to a BCC decoder 120 in a quantized and coded format. The BCC decoder splits the transmitted (and decoded, in the case of a coded transmission) sum signal into a number of subbands and performs scalings, delays and further processing to generate the subbands of the several channels to be reconstructed. This processing is performed so that the ICLD, ICTD and ICC parameters (cues) of a reconstructed multi-channel signal at output 121 are similar to the respective cues for the original multi-channel signal at input 110 into the BCC encoder 112. For this purpose, the BCC decoder 120 includes a BCC synthesis block 122 and a side information processing block 123.
The following will illustrate the internal structure of the BCC synthesis block 122 with respect to FIG. 7. The sum signal on the line 115 is input into a time/frequency conversion block typically embodied as filter bank FB 125. At the output of block 125, there is a number N of subband signals or, in an extreme case, a block of spectral coefficients, if the audio filter bank 125 performs a transform generating N spectral coefficients from N time domain samples.
The BCC synthesis block 122 further includes a delay stage 126, a level modification stage 127, a correlation processing stage 128 and a stage IFB 129 representing an inverse filter bank. At the output of the stage 129, the reconstructed multi-channel audio signal having, for example, five channels in the case of a 5-channel surround system may be output on a set of loudspeakers 124, as illustrated in FIG. 6.
FIG. 7 further illustrates that the input signal s(n) is converted to the frequency domain or filter bank domain by means of element 125. The signal output by element 125 is replicated so that several versions of the same signal are obtained, as indicated by node 130. The number of versions of the original signal is equal to the number of output channels in the output signal to be reconstructed. If each version of the original signal at the node 130 is subjected to a determined delay d1, d2, . . . di, . . . dN, the result is the situation at the output of blocks 126, which includes the versions of the same signal, but with different delays. The delay parameters are calculated by the side information processing block 123 in FIG. 6 and derived from the inter-channel time differences as they were determined by the BCC analysis block 116.
The same applies to the multiplication parameters a1, a2 . . . ai, aN, which are also calculated by the side information processing block 123 based on the inter-channel level differences determined by the BCC analysis block 116.
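The replication, delay and level modification of one subband signal can be sketched as follows. This is a simplified illustration and not the structure of FIG. 7 itself: it assumes integer sample delays and real-valued subband samples, and the function name is hypothetical.

```python
def synthesize_channels(subband, delays, gains):
    """Return one delayed and scaled copy of `subband` per output channel.

    delays: integer delays d1..dN in samples (assumption: integer delays).
    gains:  multiplication parameters a1..aN.
    """
    n = len(subband)
    outputs = []
    for delay, gain in zip(delays, gains):
        d = min(max(delay, 0), n)  # clamp the delay to the block length
        delayed = [0.0] * d + list(subband[:n - d])
        outputs.append([gain * x for x in delayed])
    return outputs
```

For example, `synthesize_channels([1.0, 2.0, 3.0], [0, 1], [1.0, 0.5])` yields an unmodified first copy and a second copy that is delayed by one sample and halved in amplitude.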
The ICC parameters are calculated by the BCC analysis block 116 and used for controlling the functionality of the block 128 so that determined correlation values between the delayed and level-manipulated signals are obtained at the output of block 128. It is to be noted that the order of the stages 126, 127, 128 may be different from that represented in FIG. 7.
It is further to be noted that, in a blockwise processing of the audio signal, the BCC analysis is also performed blockwise. Furthermore, the BCC analysis is also performed frequency-wise, i.e. in a frequency-selective way. This means that, for each spectral band, there is an ICLD parameter, an ICTD parameter and an ICC parameter for each block. The ICTD parameters for at least one block for at least one channel across all bands thus represent the ICTD parameter set. The same applies to the ICLD parameter set representing all ICLD parameters for at least one block for all frequency bands for the reconstruction of at least one output channel. The same applies, in turn, to the ICC parameter set which again includes several individual ICC parameters for at least one block for various bands for the reconstruction of at least one output channel on the basis of the input channel or sum channel.
In the following, reference is made to FIG. 8 showing a situation from which the determination of BCC parameters may be seen. Normally, the ICLD, ICTD and ICC parameters may be defined between any channel pairs. Typically a determination of the ICLD and the ICTD parameters is performed between a reference channel and each other input channel, so that there is a distinct parameter set for each of the input channels except the reference channel. This is also illustrated in FIG. 8A.
However, the ICC parameters may be defined differently. In general, ICC parameters may be generated in the encoder between any channel pairs, as also illustrated schematically in FIG. 8B. In this case, a decoder would perform an ICC synthesis so that approximately the same result is obtained as it was present in the original signal between any channel pairs. However, there has been the suggestion to calculate only ICC parameters between the two strongest channels at any time, i.e. for each time frame. This scheme is represented in FIG. 8C, which shows an example in which, at one time, an ICC parameter between the channels 1 and 2 is calculated and transmitted, and in which, at another time, an ICC parameter between the channels 1 and 5 is calculated. The decoder then synthesizes the inter-channel correlation between the two strongest channels in the decoder and executes further typically heuristic rules for synthesizing the inter-channel coherence for the remaining channel pairs.
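The scheme of FIG. 8C, in which only one ICC parameter is determined per time frame for the two strongest channels, might be sketched as follows. The correlation measure used here, a normalized cross-correlation at zero lag, is an assumption made for illustration; the text does not specify the formula, and the function name is hypothetical.

```python
import math

def strongest_pair_icc(channels):
    """Pick the two highest-energy channels of a frame and correlate them.

    channels: list of equally long sample lists, one per input channel.
    Returns ((i, j), icc) with zero-based channel indices i < j.
    Assumption: ICC is the zero-lag normalized cross-correlation.
    """
    energies = [sum(x * x for x in ch) for ch in channels]
    ranked = sorted(range(len(channels)), key=lambda k: energies[k], reverse=True)
    i, j = sorted(ranked[:2])
    denom = math.sqrt(energies[i] * energies[j])
    if denom == 0.0:
        return (i, j), 0.0
    cross = sum(a * b for a, b in zip(channels[i], channels[j]))
    return (i, j), cross / denom
```

Because the pair is selected anew for every frame, the returned channel indices may change from frame to frame, which corresponds to the two situations shown in FIG. 8C.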
With respect to the calculation of, for example, the multiplication parameters a1, . . . aN based on the transmitted ICLD parameters, reference is made to the cited AES convention paper 5574. The ICLD parameters represent an energy distribution in an original multi-channel signal. Without loss of generality, FIG. 8A shows that there are four ICLD parameters representing the energy difference between all other channels and the front left channel. In the side information processing block 123, the multiplication parameters a1, . . . aN are derived from the ICLD parameters so that the total energy of all reconstructed output channels is the same energy as present for the transmitted sum signal or is at least proportional to this energy. One way to determine these parameters is a two-stage process in which, in a first stage, the multiplication factor for the left front channel is set to 1, while the multiplication factors for the other channels in FIG. 8A are set to the transmitted ICLD values. Then, in a second stage, the energy of all five channels is calculated and compared to the energy of the transmitted sum signal. Then, all channels are downscaled using a scaling factor which is equal for all channels, wherein the scaling factor is selected so that the total energy of all reconstructed output channels after the scaling is equal to the total energy of the transmitted sum signal and/or the transmitted sum signals.
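The two-stage derivation of the multiplication parameters can be sketched as follows. The sketch assumes that the transmitted ICLD values are given in decibels relative to the reference (front left) channel and are converted to linear amplitude factors in the first stage, and that every output channel is a scaled copy of the same sum signal; neither detail is fixed by the text, and the function name is hypothetical.

```python
import math

def gains_from_icld(icld_db):
    """Derive a1..aN from N-1 ICLD values (assumed to be in dB).

    Stage 1: reference channel factor is 1, the others follow the ICLDs.
    Stage 2: one common scale factor makes the squared factors sum to 1,
             so the total output energy equals the sum signal energy.
    """
    raw = [1.0] + [10.0 ** (value / 20.0) for value in icld_db]
    common_scale = 1.0 / math.sqrt(sum(g * g for g in raw))
    return [g * common_scale for g in raw]
```

Under these assumptions, an output channel with factor a carries a squared times the energy of the sum signal, so normalizing the sum of the squared factors to 1 is what keeps the total reconstructed energy equal to that of the transmitted sum signal. The common scale factor leaves the level ratios between the channels unchanged.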
With respect to the inter-channel coherence measure ICC transmitted from the BCC encoder to the BCC decoder as a further parameter set, it is to be noted that a coherence manipulation could be performed by modification of the multiplication factors, such as by multiplying the weighting factors of all subbands by random numbers having values between 20 log10(−6) and 20 log10(6). The pseudo random sequence is typically selected so that the variance for all critical bands is approximately equal and that the average value within each critical band is zero. The same sequence is used for the spectral coefficients of each different frame or block. Thus, the width of the audio scene is controlled by modifications of the variances of the pseudo random sequence. A larger variance generates a larger hearing width. The variance modification may be performed in individual bands having a width of a critical band. This allows the simultaneous existence of several objects in a hearing scene, wherein each object has a different hearing width. A suitable amplitude distribution for the pseudo random sequence is a uniform distribution on a logarithmic scale, such as represented in the US patent publication 2003/0219130 A1.
In order to transmit the five channels in a compatible way, for example in a bit stream format which is also suitable for a normal stereo decoder, there may be used the so-called matrixing technique described in “MUSICAM Surround: A universal multi-channel coding system compatible with ISO/IEC 11172-3”, G. Theile and G. Stoll, AES Preprint, October 1992, San Francisco.
Further multi-channel coding techniques are described in the publication “Improved MPEG 2 Audio multi-channel encoding”, B. Grill, J. Herre, K. H. Brandenburg, E. Eberlein, J. Koller, J. Miller, AES Preprint 3865, February 1994, Amsterdam, wherein a compatibility matrix is used to obtain the downmix channels from the original input channels.
In summary, it may be said that the BCC technique allows an efficient and also backward-compatible coding of multi-channel audio material, as also described, for example, in the specialist publication by E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegård entitled “Low-Complexity Parametric Stereo Coding”, 116th AES Convention, Berlin, 2004, Preprint 6073. In this context, mention should also be made of the MPEG-4 standard and particularly the expansion to parametric audio techniques, wherein this standard part is also known by the designation ISO/IEC 14496-3: 2001/FDAM 2 (Parametric Audio). In this respect, there should be mentioned, in particular, the syntax in table 8.9 of the MPEG-4 standard entitled “syntax of the ps_data( )”. Here, mention should be made of the syntax elements “enable_icc” and “enable_ipdopd”, which are used to turn on and off a transmission of an ICC parameter and of a phase corresponding to inter-channel time differences. There should further be mentioned the syntax elements “icc_data( )”, “ipd_data( )” and “opd_data( )”.
In summary, it is to be noted that generally such parametric multi-channel techniques are used employing one or several transmitted carrier channels, wherein M transmitted channels are formed from N original channels to reconstruct again the N output channels or a number K of output channels, wherein K is equal to or less than the number of original channels N.
As can be seen from FIG. 6, the BCC analysis is a typical separate preprocessing to generate parameter data on the one hand and one or more transmission channels (downmix channels) on the other hand from a multi-channel signal having N original channels. Typically, these downmix channels are then compressed for example by means of a typical MP3 or AAC stereo/mono encoder, although this is not shown in FIG. 6, so that, on the output side, there is a bit stream representing the transmission channel data in compressed form and that there is further another bit stream representing the parameter data. The BCC analysis thus occurs separately from the actual audio coding of the downmix channels and/or the sum signal 115 of FIG. 6.
The decoder side is similar. A decoder having multi-channel ability will first decode the bit stream including the compressed downmix signal depending on the coding algorithm used and again provide one or more transmission channels on the output side, typically as a time sequence of PCM data (PCM=Pulse Code Modulation). Then, the BCC synthesis takes place as a distinct, separate and isolated postprocessing stage, which is signaled independently via the parameter data stream and is supplied with data in order to generate, on the output side, several output channels, preferably equal in number to the original input channels, from the audio-decoded downmix signal.
Thus, it is an advantage of the BCC analysis that it has a distinct filter bank for the purposes of the BCC analysis and a distinct filter bank for the purposes of the BCC synthesis, for example, so that it is separate from the filter bank of the audio encoder/decoder in order not to have to make any compromises regarding audio compression on the one hand and multi-channel reconstruction on the other hand. Generally speaking, the audio compression is thus done separately from the multi-channel parameter processing to be optimally equipped for both fields of application.
However, this concept has the disadvantage that a complete signaling has to be transmitted both for the multi-channel reconstruction and for the audio decoding. This is particularly disadvantageous when, as will typically be the case, both the audio decoder and the multi-channel reconstruction means perform the same or similar steps and thus require the same and/or mutually dependent configuration settings. Due to the completely separate concept, signaling data are thus transmitted twice resulting in an artificial “expansion” of the data amount, which is ultimately due to the fact that one has chosen the separate concept between audio coding/decoding and multi-channel analysis/synthesis.
On the other hand, a complete “linking” of the multi-channel reconstruction to the audio decoding would considerably restrict the flexibility, because in that case the actually important goal of separating both processing steps, so that each processing step can be performed in an optimal way, would have to be given up. Thus, considerable quality losses would arise, in particular in the case of several successive coding/decoding stages, also referred to as “tandem” coding. If there is a complete linking of the BCC data to the coded audio data, a multi-channel reconstruction has to be performed with each decoding so that a multi-channel analysis can be performed again when re-encoding. Since every parametric technique is lossy by nature, losses will accumulate through repeated analysis, synthesis and analysis so that, with each encoder/decoder stage, the perceptible quality of the audio signal decreases further.
In this case, decoding/encoding of audio data without simultaneous analysis/synthesis processing of the parameter data would only be possible if each audio codec in the tandem chain worked identically, i.e. had the same sampling rate, block length, advance length, windowing, transform, . . . , i.e. had generally the same configuration, and if, in addition, the respective block boundaries also were maintained. Such a concept, however, would considerably restrict the flexibility of the whole concept. Particularly regarding the fact that the parametric multi-channel techniques are intended to supplement already existing stereo data, for example, by additional parameter data, this limitation is all the more painful. Since the already existing stereo data may originate from many different encoders that all use different block lengths or that do not even operate in the frequency domain, but in the time domain etc., such a limitation would take the concept of the later supplementation ad absurdum from the beginning.