1. Field of the Invention
The present invention relates to coding of multi-channel representations of audio signals using spatial parameters. The invention teaches new methods for defining and estimating parameters for recreating a multi-channel signal from a number of channels being less than the number of output channels. In particular it aims at minimizing the bitrate for the multi-channel representation, and providing a coded representation of the multi-channel signal enabling easy encoding and decoding of the data for all possible channel configurations.
2. Description of the Related Art
With a growing interest for multi-channel audio in e.g. broadcasting systems, the demand for a digital low bitrate audio coding technique is obvious. It has been shown in PCT/SE02/01372 “Efficient and scalable Parametric Stereo Coding for Low Bitrate Audio Coding Applications”, that it is possible to re-create a stereo image that closely resembles the original stereo image, from a mono downmix signal and an additional very compact parametric representation of the stereo image. The basic principle is to divide the input signal into frequency bands and time segments, and for these frequency bands and time segments, estimate inter-channel intensity difference (IID), and inter-channel coherence (ICC), the first parameter being a measurement of the power distribution between the two channels in the specific frequency band and the second parameter being an estimation of the correlation between the two channels for the specific frequency band. On the decoder side the stereo image is recreated from the mono signal by distributing the mono signal between the two output channels in accordance with the transmitted IID-data, and by adding a decorrelated ambience signal in order to retain the channel correlation properties of the original stereo channels.
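The parameter estimation described above can be illustrated for a single frequency band and time segment. The following is a minimal sketch only, not the method of the cited application: it assumes that IID is expressed as a power ratio in dB and that ICC is taken as the normalised cross-correlation of real-valued subband samples. The function name `estimate_iid_icc` and the sample layout are hypothetical.

```python
import math


def estimate_iid_icc(left, right, eps=1e-12):
    """Estimate IID (in dB) and ICC for one band and one time segment.

    left, right: equal-length sequences of real subband samples
    (assumed layout). eps guards against division by zero and log(0).
    """
    power_left = sum(x * x for x in left)
    power_right = sum(y * y for y in right)
    cross = sum(x * y for x, y in zip(left, right))

    # IID: power distribution between the two channels, here as a dB ratio.
    iid_db = 10.0 * math.log10((power_left + eps) / (power_right + eps))
    # ICC: correlation between the two channels, normalised to [-1, 1].
    icc = cross / math.sqrt((power_left + eps) * (power_right + eps))
    return iid_db, icc


# Illustrative input: the right channel is the left channel at half amplitude.
iid_db, icc = estimate_iid_icc([1.0, -1.0, 1.0, -1.0], [0.5, -0.5, 0.5, -0.5])
```

In this example the two channels are fully correlated and differ only in level, so the ICC is close to one and the IID is approximately 6 dB.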
Several matrixing techniques exist that create multi-channel output from stereo signals. These techniques often rely on phase differences to create the back channels. Often, the back channels are delayed slightly compared to the front channels. To maximise performance the stereo file is created using special down mixing rules on the encoder side from a multi-channel signal to two stereo base channels. These systems generally have a stable front sound image with some ambience sound in the back channels and there is a limited ability to separate complex sound material into different speakers.
Several multi-channel configurations exist. The most commonly known configuration is the 5.1 configuration (centre channel, front left/right, surround left/right, and the LFE channel). ITU-R BS.775 defines several down-mix schemes for obtaining a channel configuration comprising fewer channels than a given channel configuration. Instead of always having to decode all channels and rely on a down-mix, it can be desirable to have a multi-channel representation that enables a receiver to extract the parameters relevant for the playback channel configuration at hand, prior to decoding the channels. Another alternative is to have parameters that can map to any speaker combination at the decoder side. Furthermore, a parameter set that is inherently scalable is desirable from a scalable or embedded coding point of view, where it is possible, e.g., to store the data corresponding to the surround channels in an enhancement layer in the bitstream.
Another representation of multi-channel signals using a sum signal or down mix signal and additional parametric side information is known in the art as binaural cue coding (BCC). This technique is described in “Binaural Cue Coding—Part 1: Psycho-Acoustic Fundamentals and Design Principles”, IEEE Transactions on Speech and Audio Processing, vol. 11, No. 6, November 2003, F. Baumgarte, C. Faller, and “Binaural Cue Coding. Part II: Schemes and Applications”, IEEE Transactions on Speech and Audio Processing vol. 11, No. 6, November 2003, C. Faller and F. Baumgarte.
Generally, binaural cue coding is a method for multi-channel spatial rendering based on one down-mixed audio channel and side information. Several parameters to be calculated by a BCC encoder and to be used by a BCC decoder for audio reconstruction or audio rendering include inter-channel level differences, inter-channel time differences, and inter-channel coherence parameters. These inter-channel cues are the determining factor for the perception of a spatial image. These parameters are given for blocks of time samples of the original multi-channel signal and are also given frequency-selectively, so that each block of multi-channel signal samples has several cues for several frequency bands. In the general case of C playback channels, the inter-channel level differences and the inter-channel time differences are considered in each subband between pairs of channels, i.e., for each channel relative to a reference channel. One channel is defined as the reference channel for each inter-channel level difference. With the inter-channel level differences and the inter-channel time differences, it is possible to render a source to any direction between one of the loudspeaker pairs of a playback set-up that is used. For determining the width or diffuseness of a rendered source, it is enough to consider one parameter per subband for all audio channels. This parameter is the inter-channel coherence parameter. The width of the rendered source is controlled by modifying the subband signals such that all possible channel pairs have the same inter-channel coherence parameter.
In BCC coding, all inter-channel level differences are determined between the reference channel 1 and any other channel. When, for example, the centre channel is determined to be the reference channel, a first inter-channel level difference between the left channel and the centre channel, a second inter-channel level difference between the right channel and the centre channel, a third inter-channel level difference between the left surround channel and the centre channel, and a fourth inter-channel level difference between the right surround channel and the centre channel are calculated. This scenario describes a five-channel scheme. When the five-channel scheme additionally includes a low frequency enhancement channel, which is also known as a “sub-woofer” channel, a fifth inter-channel level difference between the low frequency enhancement channel and the centre channel, which is the single reference channel, is calculated.
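The five level differences of this 5.1 example can be sketched for a single frequency band as follows. This is an illustration under stated assumptions, not the procedure of the cited BCC publications: the level difference is assumed to be a power ratio in dB, and the function name, the channel labels and the band powers are hypothetical.

```python
import math


def iclds_relative_to_reference(band_power, reference="C", eps=1e-12):
    """Return one ICLD (in dB) per non-reference channel for one band.

    band_power: mapping from channel label to the band power of that
    channel (assumed input). reference: label of the reference channel.
    """
    power_ref = band_power[reference] + eps
    return {
        name: 10.0 * math.log10((power + eps) / power_ref)
        for name, power in band_power.items()
        if name != reference
    }


# Made-up band powers for a 5.1 set-up with the centre channel as reference.
powers = {"C": 1.0, "L": 2.0, "R": 0.5, "Ls": 0.1, "Rs": 0.1, "LFE": 0.01}
iclds = iclds_relative_to_reference(powers)
```

Six channels yield five level differences, each of which refers to the same single reference channel.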
When reconstructing the original multi-channel signal using the single down mix channel, which is also termed the “mono” channel, and the transmitted cues such as ICLD (Interchannel Level Difference), ICTD (Interchannel Time Difference), and ICC (Interchannel Coherence), the spectral coefficients of the mono signal are modified using these cues. The level modification is performed using a positive real number determining the level modification for each spectral coefficient. The inter-channel time difference is generated using a complex number of magnitude one determining a phase modification for each spectral coefficient. Another function determines the coherence influence. The factors for level modifications of each channel are computed by firstly calculating the factor for the reference channel. The factor for the reference channel is computed such that for each frequency partition, the sum of the power of all channels is the same as the power of the sum signal. Then, based on the level modification factor for the reference channel, the level modification factors for the other channels are calculated using the respective ICLD parameters.
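The modification of one spectral coefficient by a level factor and a phase factor can be sketched as follows. The sketch assumes that a time difference maps to a phase rotation of exp(-j2πfτ) at the centre frequency f of the coefficient; this mapping, the function name and the argument names are assumptions made for illustration, and the coherence-related function is omitted.

```python
import cmath
import math


def modify_coefficient(mono_coeff, level_factor, delay_s, freq_hz):
    """Apply a level factor and a phase factor to one mono coefficient.

    level_factor: positive real number (level modification).
    delay_s, freq_hz: time difference in seconds and coefficient
    frequency in Hz, mapped to a complex number of magnitude one.
    """
    phase_factor = cmath.exp(-2j * math.pi * freq_hz * delay_s)
    return level_factor * phase_factor * mono_coeff


# A delay of a quarter period at 1 kHz rotates the coefficient by -90 degrees.
out = modify_coefficient(1.0 + 0.0j, 0.5, 0.00025, 1000.0)
```

The magnitude of the output depends only on the level factor, whereas the phase factor changes only the angle of the coefficient.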
Thus, in order to perform BCC synthesis, the level modification factor for the reference channel is to be calculated. For this calculation, all ICLD parameters for a frequency band are necessary. Then, based on this level modification for the single channel, the level modification factors for the other channels, i.e., the channels, which are not the reference channel, can be calculated.
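The order of calculation described above may be sketched for one frequency partition as follows. The sketch assumes that each ICLD is a power ratio in dB relative to the reference channel and that the power of the sum signal is normalised to one, so that the squared factors of all channels sum to one. The function name and the example values are hypothetical.

```python
import math


def bcc_level_factors(icld_db):
    """Return level factors [reference, other channels...] for one partition.

    icld_db: ICLDs in dB of every non-reference channel relative to the
    reference channel (assumed convention).
    """
    # Power of each non-reference channel relative to the reference channel.
    power_ratios = [10.0 ** (d / 10.0) for d in icld_db]
    # The reference factor needs all ICLDs of the partition, because the
    # total power of all output channels must match the sum signal power.
    g_ref = 1.0 / math.sqrt(1.0 + sum(power_ratios))
    # The remaining factors follow from the reference factor.
    return [g_ref] + [g_ref * math.sqrt(r) for r in power_ratios]


# Made-up ICLDs for four channels relative to a fifth reference channel.
factors = bcc_level_factors([-3.0, -3.0, -6.0, -6.0])
total_power = sum(g * g for g in factors)
```

Because the reference factor is calculated from the sum over all ICLDs, every other factor depends on every transmitted ICLD of the partition.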
This approach is disadvantageous in that, for a perfect reconstruction, one needs each and every inter-channel level difference. This requirement is even more problematic, when an error-prone transmission channel is present. Each error within a transmitted inter-channel level difference will result in an error in the reconstructed multi-channel signal, since each inter-channel level difference is required to calculate each one of the multi-channel output signals. Additionally, no reconstruction is possible, when an inter-channel level difference has been lost during transmission, although this inter-channel level difference was only necessary for e.g. the left surround channel or the right surround channel, which channels are not so important to multi-channel reconstruction, since most of the information is included in the front left channel, which is subsequently called the left channel, the front right channel, which is subsequently called the right channel, or the centre channel. This situation becomes even worse, when the inter-channel level difference of the low frequency enhancement channel has been lost during transmission. In this situation, no or only an erroneous multi-channel reconstruction is possible, although the low frequency enhancement channel is not so decisive for the listeners' listening comfort. Thus, errors in a single inter-channel level difference are propagated to errors within each of the reconstructed output channels.
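The error propagation described above can be made concrete with a small numerical sketch. It assumes the same conventions as the prior-art description, namely ICLDs as power ratios in dB relative to a single reference channel and level factors normalised so that the total output power equals the power of the sum signal. The function names and all ICLD values are made up for illustration.

```python
import math


def level_factors(icld_db):
    """Level factors [reference, others...] under the assumed normalisation."""
    ratios = [10.0 ** (d / 10.0) for d in icld_db]
    g_ref = 1.0 / math.sqrt(1.0 + sum(ratios))
    return [g_ref] + [g_ref * math.sqrt(r) for r in ratios]


def gain_errors_db(true_icld_db, received_icld_db):
    """Per-channel level error in dB caused by corrupted ICLDs."""
    clean = level_factors(true_icld_db)
    corrupt = level_factors(received_icld_db)
    return [20.0 * math.log10(c / t) for t, c in zip(clean, corrupt)]


# Order of ICLDs: L, R, Ls, Rs, LFE, each relative to the centre channel.
true_icld = [-2.0, -2.0, -8.0, -8.0, -15.0]
received = list(true_icld)
received[4] = 0.0  # only the LFE level difference is corrupted

# Order of errors: C (reference), L, R, Ls, Rs, LFE.
errors = gain_errors_db(true_icld, received)
```

Although only the level difference of the low frequency enhancement channel is corrupted in this example, every entry of `errors` is non-zero, because the reference factor, and with it every other factor, changes.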
While such multi-channel parameterization schemes are based on the intention to fully reconstruct the energy distribution, the price one has to pay for this correct reconstruction of the energy distribution is an increased bit rate, since a lot of inter-channel level differences or balance parameters for the spatial energy distribution have to be transmitted. Although these energy distribution schemes naturally do not perform an exact reconstruction of the time waveforms of the original channels, they nevertheless result in a sufficient output channel quality because of the exact energy distribution property.
For low bit rate applications, however, these schemes still require too many bits. As a consequence, a multi-channel reconstruction has not been considered for such low bit rate applications, and one has been satisfied with a mono or stereo reconstruction only.