In recent times, the multi-channel audio reproduction technique is becoming more and more important. This may be due to the fact that audio compression/encoding techniques such as the well-known mp3 technique have made it possible to distribute audio records via the Internet or other transmission channels having a limited bandwidth. The mp3 coding technique has become so famous because of the fact that it allows distribution of all the records in a stereo format, i.e., a digital representation of the audio record including a first or left stereo channel and a second or right stereo channel.
Nevertheless, there are basic shortcomings of conventional two-channel sound systems. Therefore, the surround technique has been developed. A recommended multi-channel-surround representation includes, in addition to the two stereo channels L and R, an additional center channel C and two surround channels Ls, Rs. This reference sound format is also referred to as three/two-stereo, which means three front channels and two surround channels. Generally, five transmission channels are required. In a playback environment, at least five speakers at the respective five different places are needed to get an optimum sweet spot in a certain distance from the five well-placed loudspeakers.
Several techniques are known in the art for reducing the amount of data required for transmission of a multi-channel audio signal. Such techniques are called joint stereo techniques. To this end, reference is made to FIG. 10, which shows a joint stereo device 60. This device can be a device implementing e.g. intensity stereo (IS) or binaural cue coding (ECC). Such a device generally receives—as an input—at least two channels (CH1, CH2, . . . CHn), and outputs a single carrier channel and parametric data. The parametric data are defined such that, in a decoder, an approximation of an original channel (CH1, CH2, . . . CHn) can be calculated.
Normally, the carrier channel will include subband samples, spectral coefficients, time domain samples etc, which provide a comparatively fine representation of the underlying signal, while the parametric data do not include such samples of spectral coefficients but include control parameters for controlling a certain reconstruction algorithm such as weighting by multiplication, time shifting, frequency shifting, . . . The parametric data, therefore, include only a comparatively coarse representation of the signal or the associated channel. Stated in numbers, the amount of data required by a carrier channel will be in the range of 60-70 kbit/s, while the amount of data required by parametric side information for one channel will be in the range of 1.5-2.5 kbit/s. An example for parametric data are the well-known scale factors, intensity stereo information or binaural cue parameters as will be described below.
Intensity stereo coding is described in AES preprint 3799, “Intensity Stereo Coding”, J. Herre, K. H. Brandenburg, D. Lederer, February 1994, Amsterdam. Generally, the concept of intensity stereo is based on a main axis transform to be applied to the data of both stereophonic audio channels. If most or the data points are concentrated around the first principle axis, a coding gain can be achieved by rotating both signals by a certain angle prior to coding. This is, however, not always true for real stereophonic production techniques. Therefore, this technique is modified by excluding the second orthogonal component from transmission in the bit stream. Thus, the reconstructed signals for the left and right channels consist of differently weighted or scaled versions of the same transmitted signal. Nevertheless, the reconstructed signals differ in their amplitude but are identical regarding their phase information. The energy-time envelopes of both original audio channels, however, are preserved by means of the selective scaling operation, which typically operates in a frequency selective manner. This conforms to the human perception of sound at high frequencies, where the dominant spatial cues are determined by the energy envelopes.
Additionally, in practically implementations, the transmitted signal, i.e. the carrier channel is generated from the sum signal of the left channel and the right channel instead of rotating both components. Furthermore, this processing, i.e., generating intensity stereo parameters for performing the scaling operation, is performed frequency selective, i.e., independently for each scale factor band, i.e., encoder frequency partition. Preferably, both channels are combined to form a combined or “carrier” channel, and, in addition to the combined channel, the intensity stereo information is determined which depend on the energy of the first channel, the energy of the second channel or the energy of the combined or channel.
The BCC technique is described in AES convention paper 5574, “Binaural cue coding applied to stereo and multi-channel audio compression”, C. taller, F. Baumgarte, May 2002, Munich. In BCC encoding, a number of audio input channels are converted to a spectral representation using a DFT based transform with overlapping windows. The resulting uniform spectrum is divided into non-overlapping partitions each having an index. Each partition has a bandwidth proportional to the equivalent rectangular bandwidth (ERB). The inter-channel level differences (ICLD) and the inter-channel time differences (ICTD) are estimated for each partition for each frame k. The ICLD and ICTD are quantized and coded resulting in a BCC bit stream. The inter-channel level differences and inter-channel time differences are given for each channel relative to a reference channel. Then, the parameters are calculated in accordance with prescribed formulae, which depend on the certain partitions of the signal to be processed.
At a decoder-side, the decoder receives a mono signal and the BCC bit stream. The mono signal is transformed into the frequency domain and input into a spatial synthesis block, which also receives decoded ICLD and ICTD values. In the spatial synthesis block, the BCC parameters (ICLD and ICTD) values are used to perform a weighting operation of the mono signal in order to synthesize the multi-channel signals, which, after a frequency/time conversion, represent a reconstruction of the original multi-channel audio signal.
In case of BCC, the joint stereo module 60 is operative to output the channel side information such that the parametric channel data are quantized and encoded ICLD or ICTD parameters, wherein one of the original channels is used as the reference channel for coding the channel side information.
Normally, the carrier channel is formed of the sum of the participating original channels.
Naturally, the above techniques only provide a mono representation for a decoder, which can only process the carrier channel, but is not able to process the parametric data for generating one or more approximations of more than one input channel.
The audio coding technique known as binaural cue coding (BCC) is also well described in the U.S. patent application publications US 2003, 0219130 A1, 2003/0026441 A1 and 2003/0035553 A1. Additional reference is also made to “Binaural Cue Coding. Part II: Schemes and Applications”, C. Faller and F. Baumgarte, IEEE Trans. On Audio and Speech Proc., Vol. 11, No. 6, November 2993. The cited U.S. patent application publications and the two cited technical publications on the BCC technique authored by Faller and Baumgarte are incorporated herein by reference in their entireties.
In the following, a typical generic BCC scheme for multi-channel audio coding is elaborated in more detail with reference to FIGS. 11 to 13. FIG. 11 shows such a generic binaural cue coding scheme for coding/transmission of multi-channel audio signals. The multi-channel audio input signal at an input 110 of a BCC encoder 112 is downmixed in a downmix block 114. In the present example, the original multi-channel signal at the input 110 is a 5-channel surround signal having a front left channel, a front right channel, a left surround channel, a right surround channel and a center channel. In a preferred embodiment of the present invention, the downmix block 114 produces a sum signal by a simple addition of these five channels into a mono signal. Other downmixing schemes are known in the art such that, using a multi-channel input signal, a downmix signal having a single channel can be obtained. This single channel is output at a sum signal line 115. A side information obtained by a BCC analysis block 116 is output at a side information line 117. In the BCC analysis block, inter-channel level differences (ICLD), and inter-channel time differences (ICTD) are calculated as has been outlined above. Recently, the BCC analysis block 116 has been enhanced to also calculate inter-channel correlation values (ICC values). The sum signal and the side information is transmitted, preferably in a quantized and encoded form, to a BCC decoder 120. The BCC decoder decomposes the transmitted sum signal into a number of subbands and applies scaling, delays and other processing to generate the subbands of the output multi-channel audio signals. This processing is performed such that ICLD, ICTD and ICC parameters (cues) of a reconstructed multi-channel signal at an output 121 are similar to the respective cues for the original multi-channel signal at the input 110 into the BCC encoder 112. To this end, the BCC decoder 120 includes a BCC synthesis block 122 and a side information processing block 123.
In the following, the internal construction of the BCC synthesis block 122 is explained with reference to FIG. 12. The sum signal on line 115 is input into a time/frequency conversion unit or filter bank FB 125. At the output of block 125, there exists a number N of sub band signals or, in an extreme case, a block of a spectral coefficients, when the audio filter bank 125 performs a 1:1 transform, i.e., a transform which produces N spectral coefficients from N time domain samples.
The BCC synthesis block 122 further comprises a delay stage 126, a level modification stage 127, a correlation processing stage 128 and an inverse filter bank stage IFB 129. At the output of stage 129, the reconstructed multi-channel audio signal having for example five channels in case of a 5-channel surround system, can be output to a set of loud-speakers 124 as illustrated in FIG. 11.
As shown in FIG. 12, the input signal s(n) is converted into the frequency domain or filter bank domain by means of element 125. The signal output by element 125 is multiplied such that several versions of the same signal are obtained as illustrated by multiplication node 130. The number of versions of the original signal is equal to the number of output channels in the output signal to be reconstructed When, in general, each version of the original signal at node 130 is subjected to a certain delay d1, d2, . . . , di, . . . , dN. The delay parameters are computed by the side information processing block 123 in FIG. 11 and are derived from the inter-channel time differences as determined by the BCC analysis block 116.
The same is true for the multiplication parameters a1, a2, . . . , ai, . . . , aN, which are also calculated by the side information processing block 123 based on the inter-channel level differences as calculated by the BCC analysis block 116.
The ICC parameters calculated by the BCC analysis block 116 are used for controlling the functionality of block 128 such that certain correlations between the delayed and level-manipulated signals are obtained at the outputs of block 128. It is to be noted here that the ordering of the stages 126, 127, 128 may be different from the case shown in FIG. 12.
It is to be noted here that, in a frame-wise processing of an audio signal, the SCC analysis is performed frame-wise, i.e. time-varying, and also frequency-wise. This means that, for each spectral band, the BCC parameters are obtained. This means that, in case the audio filter bank 125 decomposes the input signal into for example 32 band pass signals, the BCC analysis block obtains a set of BCC parameters for each of the 32 bands. Naturally the BCC synthesis block 122 from FIG. 11, which is shown in detail in FIG. 12, performs a reconstruction which is also based on the 32 bands in the example.
In the following, reference is made to FIG. 13 showing a setup to determine certain BCC parameters. Normally, ICLD, ICTD and ICC parameters can be defined between pairs of channels. However, it is preferred to determine ICLD and ICTD parameters between a reference channel and each other channel. This is illustrated in FIG. 13A. ICC parameters can be defined in different ways. Most generally, one could estimate ICC parameters in the encoder between all possible channel pairs as indicated in FIG. 13B. In this case, a decoder would synthesize ICC such that it is approximately the same as in the original multi-channel signal between all possible channel pairs. It was, however, proposed to estimate only ICC parameters between the strongest two channels at each time. This scheme is illustrated in FIG. 13C, where an example is shown, in which at one time instance, an ICC parameter is estimated between channels 1 and 2, and, at another time instance, an ICC parameter is calculated between channels 1 and 5. The decoder then synthesizes the inter-channel correlation between the strongest channels in the decoder and applies some heuristic rule for computing and synthesizing the inter-channel coherence for the remaining channel pairs.
Regarding the calculation of, for example, the multiplication parameters a1, aN based on transmitted ICLD parameters, reference is made to AES convention paper 5574 cited above. The ICLD parameters represent an energy distribution in an original multi-channel signal. Without loss of generality, it is shown in FIG. 13A that there are four ICLD parameters showing the energy difference between all other channels and the front left channel. In the side information processing block 123, the multiplication parameters a1, . . . , aN are derived from the ICLD parameters such chat the total energy of all reconstructed output channels is the same as (or proportional to) the energy of the transmitted sum signal. A simple way for determining these parameters is a 2-stage process, in which, in a first stage, the multiplication factor for the left front channel is set to unity, while multiplication factors for the other channels in FIG. 13A are set to the transmitted ICLD values. Then, in a second stage, the energy of all five channels is calculated and compared to the energy of the transmitted sum signal. Then, all channels are downscaled using a downscaling factor which is equal for all channels, wherein the downscaling factor is selected such that the total energy of all reconstructed output channels is, after downscaling, equal to the total energy of the transmitted sum signal.
Naturally, there are other methods for calculating the multiplication factors, which do not rely on the 2-stage process but which only need a 1-stage process.
Regarding the delay parameters, it is to be noted that the delay parameters ICTD, which are transmitted from a BCC encoder can be used directly, when the delay parameter d, for the left front channel is set to zero. No resealing has to be done here, since a delay does not alter the energy of the signal.
Regarding the inter-channel coherence measure ICC transmitted from the BCC encoder to the BCC decoder, it is to be noted here that a coherence manipulation can be done by modifying the multiplication factors a1, . . . , an such as by multiplying the weighting factors of all subbands with random numbers with values between 20 log 10(−6) and 20 log 10(6). The pseudo-random sequence is preferably chosen such that the variance is approximately constant for all critical bands, and the average is zero within each critical band. The same sequence is applied to the spectral coefficients for each different frame. Thus, the auditory image width is controlled by modifying the variance of the pseudo-random sequence. A larger variance creates a larger image width.
The variance modification can be performed in individual bands that are critical-band wide. This enables the simultaneous existence of multiple objects in an auditory scene, each object having a different image width. A suitable amplitude distribution for the pseudo-random sequence is a uniform distribution on a logarithmic scale as it is outlined in the U.S. patent application publication 2003/0219130 A1. Nevertheless, all BCC synthesis processing is related to a single input channel transmitted as the sum signal from the BCC encoder to the BCC decoder as shown in FIG.
To transmit the five channels in a compatible way, i.e., in a bitstream format, which is also understandable for a normal stereo decoder, the so-called matrixing technique has been used as described in “MUSICAM surround: a universal multi-channel coding system compatible with ISO 11172-3”, G. Theile and G. Stoll, AES preprint 3403, October 1992, San Francisco. The five input channels L, R, C, Ls, and Rs are fed into a matrixing device performing a matrixing operation to calculate the basic or compatible stereo channels Lo, Ro, from the five input channels. In particular, these basic stereo channels Lo/Ro are calculated as set out below:Lo=L+xC+yLsRo=R+xC+yRsx and y are constants. The other three channels C, Ls, Rs are transmitted as they are in an extension layer, in addition to a basic stereo layer, which includes an encoded version of the basic stereo signals Lo/Ro. With respect to the bitstream, this Lo/Ro basic stereo layer includes a header, information such as scale factors and subband samples. The multi-channel extension layer, i.e., the central channel and the two surround channels are included in the multi-channel extension field, which is also called ancillary data field.
At a decoder-side, an inverse matrixing operation is performed in order to form reconstructions of the left and right channels in the five-channel representation using the basic stereo channels Lo, Ro and the three additional channels. Additionally, the three additional channels are decoded from the ancillary information in order to obtain a decoded five-channel or surround representation of the original multi-channel audio signal.
Another approach for multi-channel encoding is described in the publication “Improved MPEG-2 audio multi-channel encoding”, B. Grill, J. Herre, K. H. Brandenburg, E. Eberlein, J. Koller, J. Mueller, AES preprint 3865, February 1994, Amsterdam, in which, in order to obtain backward compatibility, backward compatible modes are considered. To this end, a compatibility matrix is used to obtain two so-called downmix channels Lc, Rc from the original five input channels. Furthermore, it is possible to dynamically select the three auxiliary channels transmitted as ancillary data.
In order to exploit stereo irrelevancy, a joint stereo technique is applied to groups of channels, e. g. the three front channels, i.e., for the left channel, the right channel and the center channel. To this end, these three channels are combined to obtain a combined channel. This combined channel is quantized and packed into the bitstream.
Then, this combined channel together with the corresponding joint stereo information is input into a joint stereo decoding module to obtain joint stereo decoded channels, i.e., a joint stereo decoded left channel, a joint stereo decoded right channel and a joint stereo decoded center channel. These joint stereo decoded channels are, together with the left surround channel and the right surround channel input into a compatibility matrix block to form the first and the second downmix channels Lc, Rc. Then, quantized versions of both downmix channels and a quantized version of the combined channel are packed into the bitstream together with joint stereo coding parameters.
Using intensity stereo coding, therefore, a group of independent original channel signals is transmitted within a single portion of “carrier” data. The decoder then reconstructs the involved signals as identical data, which are rescaled according to their original energy-time envelopes. Consequently, a linear combination of the transmitted channels will lead to results, which are quite different from the original downmix. This applies to any kind of joint stereo coding based on the intensity stereo concept. For a coding system providing compatible downmix channels, there is a direct consequence: The reconstruction by dematrixing, as described in the previous publication, suffers from artifacts caused by the imperfect reconstruction. Using a so-called joint stereo predistortion scheme, in which a joint stereo coding of the left, the right and the center channels is performed before matrixing in the encoder, alleviates this problem. In this way, the dematrixing scheme for reconstruction introduces fewer artifacts, since, on the encoder-side, the joint stereo decoded signals have been used for generating the downmix channels. Thus, the imperfect reconstruction process is shifted into the compatible downmix channels Lc and Pc, where it is much more likely to be masked by the audio signal itself.
Although such a system has resulted in fewer artifacts because of dematrixing on the decoder-side, it nevertheless has some drawbacks. A drawback is that the stereo-compatible downmix channels Lc and Rc are derived not from the original channels but from intensity stereo coded/decoded versions of the original channels. Therefore, data losses because of the intensity stereo coding system are included in the compatible downmix channels. Astereo-only decoder, which only decodes the compatible channels rather than the enhancement intensity stereo encoded channels, therefore, provides an output signal, which is affected by intensity stereo induced data losses.
Additionally, a full additional channel has to be transmitted besides the two downmix channels. This channel is the combined channel, which is formed by means of joint stereo coding of the left channel, the right channel and the center channel. Additionally, the intensity stereo information to reconstruct the original channels L, R, C from the combined channel also has to be transmitted to the decoder. At the decoder, an inverse matrixing, i.e., a dematrixing operation is performed to derive the surround channels from the two downmix channels. Additionally, the original left, right and center channels are approximated by joint stereo decoding using the transmitted combined channel and the transmitted joint stereo parameters. It is to be noted that the original left, right and center channels are derived by joint stereo decoding of the combined channel.
It has been found out that in case of intensity stereo techniques, when used in combination with multi-channel signals, only fully coherent output signals which are based on the same base channel can be produced.
In BCC techniques, it is quite expensive to reduce the inter-channel coherence in a reconstructed multi-channel output signal, since a pseudo-random number generator for influencing the weighting sectors is required. Additionally, it has been shown that this kind of processing is problematic in that artifacts because of randomly manipulating multiplication factors or time delay factors can be introduced which can become audible under certain circumstances and, therefore, deteriorate the quality of the reconstructed multi-channel output signal.