In recent times, the multi-channel audio reproduction technique is becoming more and more important. This may be due to the fact that audio compression/encoding techniques such as the well-known mp3 technique have made it possible to distribute audio records via the Internet or other transmission channels having a limited bandwidth. The mp3 coding technique has become so famous because of the fact that it allows distribution of all the records in a stereo format, i.e., a digital representation of the audio record including a first or left stereo channel and a second or right stereo channel.
Nevertheless, there are basic shortcomings of conventional two-channel sound systems. Therefore, the surround technique has been developed. A recommended multi-channel-surround representation includes, in addition to the two stereo channels L and R, an additional center channel C and two surround channels Ls, Rs. This reference sound format is also referred to as three/two-stereo, which means three front channels and two surround channels. Generally, five transmission channels are required. In a playback environment, at least five speakers at the respective five different places are needed to get an optimum sweet spot in a certain distance from the five well-placed loudspeakers.
Several techniques are known in the art for reducing the amount of data required for transmission of a multi-channel audio signal. Such techniques are called joint stereo techniques. To this end, reference is made to FIG. 10, which shows a joint stereo device 60. This device can be a device implementing, e.g., intensity stereo (IS) or binaural cue coding (BCC). Such a device generally receives, as an input, at least two channels (CH1, CH2, . . . CHn), and outputs a single carrier channel and parametric data. The parametric data are defined such that, in a decoder, an approximation of an original channel (CH1, CH2, . . . CHn) can be calculated.
Normally, the carrier channel will include subband samples, spectral coefficients, time domain samples, etc., which provide a comparatively fine representation of the underlying signal, while the parametric data do not include such samples or spectral coefficients but include control parameters for controlling a certain reconstruction algorithm such as weighting by multiplication, time shifting, frequency shifting, phase shifting, etc. The parametric data, therefore, include only a comparatively coarse representation of the signal or the associated channel. Stated in numbers, the amount of data required by a carrier channel will be in the range of 60-70 kbit/s, while the amount of data required by parametric side information for one channel will be in the range of 1.5-2.5 kbit/s. Examples of parametric data are the well-known scale factors, intensity stereo information or binaural cue parameters as will be described below.
Intensity stereo coding is described in AES preprint 3799, “Intensity Stereo Coding”, J. Herre, K. H. Brandenburg, D. Lederer, February 1994, Amsterdam. Generally, the concept of intensity stereo is based on a main axis transform to be applied to the data of both stereophonic audio channels. If most of the data points are concentrated around the first principal axis, a coding gain can be achieved by rotating both signals by a certain angle prior to coding. This is, however, not always true for real stereophonic production techniques. Therefore, this technique is modified by excluding the second orthogonal component from transmission in the bit stream. Thus, the reconstructed signals for the left and right channels consist of differently weighted or scaled versions of the same transmitted signal. Nevertheless, the reconstructed signals differ in their amplitude but are identical regarding their phase information. The energy-time envelopes of both original audio channels, however, are preserved by means of the selective scaling operation, which typically operates in a frequency selective manner. This conforms to the human perception of sound at high frequencies, where the dominant spatial cues are determined by the energy envelopes.
Additionally, in practical implementations, the transmitted signal, i.e., the carrier channel, is generated from the sum signal of the left channel and the right channel instead of rotating both components. Furthermore, this processing, i.e., generating intensity stereo parameters for performing the scaling operation, is performed in a frequency-selective manner, i.e., independently for each scale factor band (encoder frequency partition). Preferably, both channels are combined to form a combined or “carrier” channel, and, in addition to the combined channel, the intensity stereo information is determined, which depends on the energy of the first channel, the energy of the second channel or the energy of the combined channel.
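The sum-carrier variant of intensity stereo described above can be illustrated by the following minimal sketch for a single scale factor band. The function names and the specific choice of scale factors (square root of the ratio of channel energy to carrier energy) are assumptions made for illustration only and are not taken from the cited preprint.

```python
import math

def band_energy(samples):
    """Energy of one scale factor band (sum of squared samples)."""
    return sum(x * x for x in samples)

def intensity_stereo_encode(left_band, right_band):
    """Form the carrier as the sum signal and derive one scale factor per channel.

    Assumption: each scale factor is sqrt(channel energy / carrier energy),
    so that the decoder restores the band energy of each original channel.
    """
    carrier = [l + r for l, r in zip(left_band, right_band)]
    e_carrier = band_energy(carrier)
    if e_carrier == 0.0:
        # Silent carrier: nothing to scale.
        return carrier, 0.0, 0.0
    g_left = math.sqrt(band_energy(left_band) / e_carrier)
    g_right = math.sqrt(band_energy(right_band) / e_carrier)
    return carrier, g_left, g_right

def intensity_stereo_decode(carrier, g_left, g_right):
    """Both outputs are scaled copies of the same carrier (identical phase)."""
    return [g_left * c for c in carrier], [g_right * c for c in carrier]
```

In this sketch the two reconstructed channels differ only in amplitude, while the band energies of the original channels are preserved, which mirrors the behavior described in the text.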
The BCC technique is described in AES convention paper 5574, “Binaural cue coding applied to stereo and multi-channel audio compression”, C. Faller, F. Baumgarte, May 2002, Munich. In BCC encoding, a number of audio input channels are converted to a spectral representation using a DFT based transform with overlapping windows. The resulting uniform spectrum is divided into non-overlapping partitions each having an index. Each partition has a bandwidth proportional to the equivalent rectangular bandwidth (ERB). The inter-channel level differences (ICLD) and the inter-channel time differences (ICTD) are estimated for each partition for each frame k. The ICLD and ICTD are quantized and coded resulting in a BCC bit stream. The inter-channel level differences and inter-channel time differences are given for each channel relative to a reference channel. Then, the parameters are calculated in accordance with prescribed formulae, which depend on the certain partitions of the signal to be processed.
At the decoder side, the decoder receives a mono signal and the BCC bit stream. The mono signal is transformed into the frequency domain and input into a spatial synthesis block, which also receives decoded ICLD and ICTD values. In the spatial synthesis block, the BCC parameters (ICLD and ICTD values) are used to perform a weighting operation of the mono signal in order to synthesize the multi-channel signals, which, after a frequency/time conversion, represent a reconstruction of the original multi-channel audio signal.
In case of BCC, the joint stereo module 60 is operative to output the channel side information such that the parametric channel data are quantized and encoded ICLD or ICTD parameters, wherein one of the original channels is used as the reference channel for coding the channel side information.
Normally, the carrier channel is formed of the sum of the participating original channels.
Naturally, the above techniques only provide a mono representation for a decoder, which can only process the carrier channel, but is not able to process the parametric data for generating one or more approximations of more than one input channel.
The audio coding technique known as binaural cue coding (BCC) is also well described in the United States patent application publications US 2003/0219130 A1, 2003/0026441 A1 and 2003/0035553 A1. Additional reference is also made to “Binaural Cue Coding. Part II: Schemes and Applications”, C. Faller and F. Baumgarte, IEEE Trans. On Audio and Speech Proc., Vol. 11, No. 6, November 2003. The cited United States patent application publications and the two cited technical publications on the BCC technique authored by Faller and Baumgarte are incorporated herein by reference in their entireties.
In the following, a typical generic BCC scheme for multi-channel audio coding is elaborated in more detail with reference to FIGS. 11 to 13. FIG. 11 shows such a generic binaural cue coding scheme for coding/transmission of multi-channel audio signals. The multi-channel audio input signal at an input 110 of a BCC encoder 112 is down mixed in a down mix block 114. In the present example, the original multi-channel signal at the input 110 is a 5-channel surround signal having a front left channel, a front right channel, a left surround channel, a right surround channel and a center channel. In a preferred embodiment of the present invention, the down mix block 114 produces a sum signal by a simple addition of these five channels into a mono signal. Other down mixing schemes are known in the art such that, using a multi-channel input signal, a down mix signal having a single channel can be obtained. This single channel is output at a sum signal line 115. Side information obtained by a BCC analysis block 116 is output at a side information line 117. In the BCC analysis block, inter-channel level differences (ICLD) and inter-channel time differences (ICTD) are calculated as has been outlined above. Recently, the BCC analysis block 116 has been enhanced to also calculate inter-channel correlation values (ICC values). The sum signal and the side information are transmitted, preferably in a quantized and encoded form, to a BCC decoder 120. The BCC decoder decomposes the transmitted sum signal into a number of subbands and applies scaling, delays and other processing to generate the subbands of the output multi-channel audio signals. This processing is performed such that ICLD, ICTD and ICC parameters (cues) of a reconstructed multi-channel signal at an output 121 are similar to the respective cues for the original multi-channel signal at the input 110 into the BCC encoder 112.
To this end, the BCC decoder 120 includes a BCC synthesis block 122 and a side information processing block 123.
In the following, the internal construction of the BCC synthesis block 122 is explained with reference to FIG. 12. The sum signal on line 115 is input into a time/frequency conversion unit or filter bank FB 125. At the output of block 125, there exists a number N of subband signals or, in an extreme case, a block of spectral coefficients, when the audio filter bank 125 performs a 1:1 transform, i.e., a transform which produces N spectral coefficients from N time domain samples.
The BCC synthesis block 122 further comprises a delay stage 126, a level modification stage 127, a correlation processing stage 128 and an inverse filter bank stage IFB 129. At the output of stage 129, the reconstructed multi-channel audio signal having for example five channels in case of a 5-channel surround system, can be output to a set of loudspeakers 124 as illustrated in FIG. 11.
As shown in FIG. 12, the input signal s(n) is converted into the frequency domain or filter bank domain by means of element 125. The signal output by element 125 is multiplied such that several versions of the same signal are obtained as illustrated by multiplication node 130. The number of versions of the original signal is equal to the number of output channels in the output signal to be reconstructed. Then, in general, each version of the original signal at node 130 is subjected to a certain delay d1, d2, . . . , di, . . . , dN. The delay parameters are computed by the side information processing block 123 in FIG. 11 and are derived from the inter-channel time differences as determined by the BCC analysis block 116.
The same is true for the multiplication parameters a1, a2, . . . , ai, . . . , aN, which are also calculated by the side information processing block 123 based on the inter-channel level differences as calculated by the BCC analysis block 116.
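The combined effect of the delay stage 126 and the level modification stage 127 on one band signal can be sketched as follows. This is a simplified illustration that assumes integer sample delays applied to a time-domain band signal; the function name and the zero-padding of the delayed signal are assumptions and are not taken from the cited references.

```python
def synthesize_channels(sum_band, delays, gains):
    """Return one delayed and scaled copy of sum_band per output channel.

    delays: non-negative integer delays d1..dN in samples (assumed integer here)
    gains:  multiplication parameters a1..aN
    """
    if len(delays) != len(gains):
        raise ValueError("need one delay and one gain per output channel")
    n = len(sum_band)
    outputs = []
    for d, a in zip(delays, gains):
        if d < 0:
            raise ValueError("delays must be non-negative")
        d = min(d, n)  # a delay longer than the block leaves only zeros
        delayed = [0.0] * d + list(sum_band[:n - d])
        outputs.append([a * x for x in delayed])
    return outputs
```

For example, `synthesize_channels([1.0, 2.0, 3.0, 4.0], [0, 1], [1.0, 0.5])` yields an unmodified first channel and a second channel that is delayed by one sample and attenuated by a factor of 0.5.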
The ICC parameters calculated by the BCC analysis block 116 are used for controlling the functionality of block 128 such that certain correlations between the delayed and level-manipulated signals are obtained at the outputs of block 128. It is to be noted here that the ordering of the stages 126, 127, 128 may be different from the case shown in FIG. 12.
It is to be noted here that, in a frame-wise processing of an audio signal, the BCC analysis is performed frame-wise, i.e. time-varying, and also frequency-wise. This means that, for each spectral band, the BCC parameters are obtained. This means that, in case the audio filter bank 125 decomposes the input signal into for example 32 band pass signals, the BCC analysis block obtains a set of BCC parameters for each of the 32 bands. Naturally the BCC synthesis block 122 from FIG. 11, which is shown in detail in FIG. 12, performs a reconstruction which is also based on the 32 bands in the example.
In the following, reference is made to FIG. 13 showing a setup to determine certain BCC parameters. Normally, ICLD, ICTD and ICC parameters can be defined between pairs of channels. However, it is preferred to determine ICLD and ICTD parameters between a reference channel and each other channel. This is illustrated in FIG. 13A.
ICC parameters can be defined in different ways. Most generally, one could estimate ICC parameters in the encoder between all possible channel pairs as indicated in FIG. 13B. In this case, a decoder would synthesize ICC such that it is approximately the same as in the original multi-channel signal between all possible channel pairs. It was, however, proposed to estimate only ICC parameters between the strongest two channels at each time. This scheme is illustrated in FIG. 13C, where an example is shown, in which at one time instance, an ICC parameter is estimated between channels 1 and 2, and, at another time instance, an ICC parameter is calculated between channels 1 and 5. The decoder then synthesizes the inter-channel correlation between the strongest channels in the decoder and applies some heuristic rule for computing and synthesizing the inter-channel coherence for the remaining channel pairs.
Regarding the calculation of, for example, the multiplication parameters a1, . . . , aN based on transmitted ICLD parameters, reference is made to AES convention paper 5574 cited above. The ICLD parameters represent an energy distribution in an original multi-channel signal. Without loss of generality, it is shown in FIG. 13A that there are four ICLD parameters showing the energy difference between all other channels and the front left channel. In the side information processing block 123, the multiplication parameters a1, . . . , aN are derived from the ICLD parameters such that the total energy of all reconstructed output channels is the same as (or proportional to) the energy of the transmitted sum signal. A simple way for determining these parameters is a 2-stage process, in which, in a first stage, the multiplication factor for the left front channel is set to unity, while multiplication factors for the other channels in FIG. 13A are set to the transmitted ICLD values. Then, in a second stage, the energy of all five channels is calculated and compared to the energy of the transmitted sum signal. Then, all channels are downscaled using a downscaling factor which is equal for all channels, wherein the downscaling factor is selected such that the total energy of all reconstructed output channels is, after downscaling, equal to the total energy of the transmitted sum signal.
Naturally, there are other methods for calculating the multiplication factors, which do not rely on the 2-stage process but which only need a 1-stage process.
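The 2-stage process described above can be sketched as follows for one frequency band. The sketch assumes that the transmitted ICLD values are given in dB relative to the reference (left front) channel and are converted to linear amplitude factors via 10^(ICLD/20); this conversion and the function name are assumptions for illustration. Since every output channel is a scaled copy of the same sum signal, the total output energy equals the sum of the squared factors times the sum signal energy, so the common downscaling factor is the reciprocal of the square root of that sum.

```python
import math

def gains_from_icld(icld_db):
    """Derive multiplication factors a1..aN from ICLDs of channels 2..N.

    icld_db: level differences in dB relative to the reference channel
             (assumed representation).
    """
    # Stage 1: reference channel gets unity, other channels follow the ICLDs.
    raw = [1.0] + [10.0 ** (v / 20.0) for v in icld_db]
    # Stage 2: one common downscaling factor so that the summed energy of all
    # output channels equals the energy of the transmitted sum signal.
    norm = math.sqrt(sum(a * a for a in raw))
    return [a / norm for a in raw]
```

The ratios between the resulting factors still reflect the transmitted level differences, since the same downscaling factor is applied to all channels.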
Regarding the delay parameters, it is to be noted that the delay parameters ICTD, which are transmitted from a BCC encoder, can be used directly, when the delay parameter d1 for the left front channel is set to zero. No rescaling has to be done here, since a delay does not alter the energy of the signal.
Regarding the inter-channel coherence measure ICC transmitted from the BCC encoder to the BCC decoder, it is to be noted here that a coherence manipulation can be done by modifying the multiplication factors a1, . . . , aN such as by multiplying the weighting factors of all subbands with random numbers with values between 20log10(−6) and 20log10(6). The pseudo-random sequence is preferably chosen such that the variance is approximately constant for all critical bands, and the average is zero within each critical band. The same sequence is applied to the spectral coefficients for each different frame. Thus, the auditory image width is controlled by modifying the variance of the pseudo-random sequence. A larger variance creates a larger image width. The variance modification can be performed in individual bands that are critical-band wide. This enables the simultaneous existence of multiple objects in an auditory scene, each object having a different image width. A suitable amplitude distribution for the pseudo-random sequence is a uniform distribution on a logarithmic scale as it is outlined in the US patent application publication 2003/0219130 A1. Nevertheless, all BCC synthesis processing is related to a single input channel transmitted as the sum signal from the BCC encoder to the BCC decoder as shown in FIG. 11.
A related technique, also known as parametric stereo, is described in J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, “High-Quality Parametric Spatial Audio Coding at Low Bitrates”, AES 116th Convention, Berlin, Preprint 6072, May 2004, and E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegard, “Low Complexity Parametric Stereo Coding”, AES 116th Convention, Berlin, Preprint 6073, May 2004.
As has been outlined above with respect to FIG. 13, the parametric side information, i.e., the interchannel level differences (ICLD), the interchannel time differences (ICTD) or the interchannel coherence parameter (ICC) can be calculated and transmitted for each of the five channels. This means that one, normally, transmits five sets of inter-channel level differences for a five channel signal. The same is true for the interchannel time differences. With respect to the interchannel coherence parameter, it can also be sufficient to only transmit for example two sets of these parameters.
As has been outlined above with respect to FIG. 12, there is not a single level difference parameter, time difference parameter or coherence parameter for one frame or time portion of a signal. Instead, these parameters are determined for several different frequency bands so that a frequency-dependent parametrization is obtained. Since it is preferred to use for example 32 frequency channels, i.e., a filter bank having 32 frequency bands for BCC analysis and BCC synthesis, the parameters can occupy quite a lot of data. Although—compared to other multi-channel transmissions—the parametric representation results in a quite low data rate, there is a continuing need for further reduction of the necessary data rate for representing a multi-channel signal such as a signal having two channels (stereo signal) or a signal having more than two channels such as a multi-channel surround signal.
To this end, the encoder-side calculated reconstruction parameters are quantized in accordance with a certain quantization rule. This means that unquantized reconstruction parameters are mapped onto a limited set of quantization levels or quantization indices as it is known in the art and described in detail in C. Faller and F. Baumgarte, “Binaural cue coding applied to audio compression with flexible rendering,” AES 113th Convention, Los Angeles, Preprint 5686, October 2002.
Quantization has the effect that all parameter values which are smaller than the quantization step size are quantized to zero. Additionally, mapping a large set of unquantized values to a small set of quantized values results in data savings per se. These data rate savings are further enhanced by entropy-encoding the quantized reconstruction parameters on the encoder-side. Preferred entropy-encoding methods are Huffman methods based on predefined code tables or based on an actual determination of signal statistics and signal-adaptive construction of codebooks. Alternatively, other entropy-encoding tools can be used such as arithmetic encoding.
Generally, one has the rule that the data rate required for the reconstruction parameters decreases with increasing quantizer step size. Stated in other words, a coarser quantization results in a lower data rate, and a finer quantization results in a higher data rate.
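The relation between quantizer step size and data rate can be illustrated with the following sketch, which quantizes a sequence of parameter values with a uniform quantizer and computes the empirical entropy of the resulting indices as a rough estimate of the bits per parameter that an ideal entropy coder would need. The uniform mid-tread quantizer, the function names and the sample values are assumptions chosen for illustration.

```python
import math
from collections import Counter

def quantize_uniform(values, step):
    """Map each value to the index of the nearest multiple of step."""
    return [math.floor(v / step + 0.5) for v in values]

def index_entropy(indices):
    """Empirical entropy of the index sequence in bits per index."""
    counts = Counter(indices)
    total = len(indices)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical level-difference values in dB, for illustration only.
params = [-6.2, -4.8, -3.1, -1.9, -0.4, 0.3, 1.2, 2.6,
          3.9, 5.3, 6.7, 0.1, -0.2, 1.1, 2.4]

fine = quantize_uniform(params, 1.0)    # fine quantization, many distinct indices
coarse = quantize_uniform(params, 4.0)  # coarse quantization, few distinct indices
print(index_entropy(fine), index_entropy(coarse))
```

For these sample values the coarse quantizer produces fewer distinct indices and a lower empirical entropy than the fine quantizer, in line with the rule stated above.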
Since parametric signal representations are normally required for low data rate environments, one tries to quantize the reconstruction parameters as coarsely as possible to obtain a signal representation having a certain amount of data in the base channel, and also having a reasonably small amount of data for the side information, which includes the quantized and entropy-encoded reconstruction parameters.
Prior art methods, therefore, derive the reconstruction parameters to be transmitted directly from the multi-channel signal to be encoded. A coarse quantization as discussed above results in reconstruction parameter distortions, which result in large rounding errors, when the quantized reconstruction parameter is inversely quantized in a decoder and used for multi-channel synthesis. Naturally, the rounding error increases with the quantizer step size, i.e., with the selected “quantizer coarseness”. Such rounding errors may result in a quantization level change, i.e., in a change from a first quantization level at a first time instant to a second quantization level at a later time instant, wherein the difference between one quantizer level and another quantizer level is defined by the quite large quantizer step size, which is preferable for a coarse quantization. Unfortunately, such a quantizer level change amounting to the large quantizer step size can be triggered by only a small parameter change, when the unquantized parameter is in the middle between two quantization levels. It is clear that the occurrence of such quantizer index changes in the side information results in the same strong changes in the signal synthesis stage. When, as an example, the interchannel level difference is considered, it becomes clear that a strong change results in a sharp decrease of loudness of a certain loudspeaker signal and an accompanying sharp increase of the loudness of a signal for another loudspeaker. This situation, which is only triggered by a quantization level change and a coarse quantization, can be perceived as an immediate relocation of a sound source from a (virtual) first place to a (virtual) second place. Such an immediate relocation from one time instant to another time instant sounds unnatural, i.e., is perceived as a modulation effect, since sound sources of, in particular, tonal signals do not change their location very fast.
Generally, transmission errors may also result in sharp changes of quantizer indices, which immediately result in sharp changes in the multi-channel output signal. This is even more true for situations in which a coarse quantizer has been adopted for data rate reasons.
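The quantization level change described above can be reproduced with a small numerical sketch. It assumes a uniform quantizer with a hypothetical step size of 5 dB and a slowly varying interchannel level difference that crosses the decision threshold halfway between two quantization levels; the step size, the sample values and the function names are illustrative assumptions.

```python
import math

def quantize(value, step):
    """Index of the nearest quantization level (uniform quantizer)."""
    return math.floor(value / step + 0.5)

def dequantize(index, step):
    """Inverse quantization as performed in the decoder."""
    return index * step

STEP_DB = 5.0  # hypothetical coarse step size
# Unquantized level difference drifting slowly across the threshold at 2.5 dB.
track_db = [2.3, 2.4, 2.6, 2.7]

reconstructed_db = [dequantize(quantize(v, STEP_DB), STEP_DB) for v in track_db]
largest_input_change = max(abs(b - a) for a, b in zip(track_db, track_db[1:]))
largest_output_change = max(
    abs(b - a) for a, b in zip(reconstructed_db, reconstructed_db[1:])
)
print(reconstructed_db, largest_input_change, largest_output_change)
```

Here a change of roughly 0.2 dB in the unquantized parameter between two consecutive frames produces a jump of a full quantizer step (5 dB) in the inversely quantized parameter that is used for synthesis.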