1. Field of the Invention
The present invention relates to coding of multi-channel representations of audio signals using spatial parameters. The present invention teaches new methods for estimating and defining proper parameters for recreating a multi-channel signal from a number of channels being less than the number of output channels. In particular it aims at minimizing the bit rate for the multi-channel representation, and providing a coded representation of the multi-channel signal enabling easy encoding and decoding of the data for all possible channel configurations.
2. Description of the Related Art
It has been shown in PCT/SE02/01372 “Efficient and scalable Parametric Stereo Coding for Low Bit rate Audio Coding Applications”, that it is possible to re-create a stereo image that closely resembles the original stereo image, from a mono signal given a very compact representation of the stereo image. The basic principle is to divide the input signal into frequency bands and time segments, and for these frequency bands and time segments, estimate inter-channel intensity difference (IID), and inter-channel coherence (ICC). The first parameter is a measurement of the power distribution between the two channels in the specific frequency band and the second parameter is an estimation of the correlation between the two channels for the specific frequency band. On the decoder side the stereo image is recreated from the mono signal by distributing the mono signal between the two output channels in accordance with the IID-data, and by adding a decorrelated signal in order to retain the channel correlation of the original stereo channels.
For a multi-channel case (multi-channel in this context meaning more than two output channels), several additional issues have to be accounted for. Several multi-channel configurations exist. The most commonly known is the 5.1 configuration (center channel, front left/right, surround left/right, and the LFE channel). However, many other configurations exist. From the complete encoder/decoder systems point-of-view, it is desirable to have a system that can use the same parameter set (e.g. IID and ICC) or subsets thereof for all channel configurations. ITU-R BS.775 defines several down-mix schemes to be able to obtain a channel configuration comprising fewer channels from a given channel configuration. Instead of always having to decode all channels and rely on a down-mix, it can be desirable to have a multi-channel representation that enables a receiver to extract the parameters relevant for the channel configuration at hand, prior to decoding the channels. Further, a parameter set that is inherently scaleable is desirable from a scalable or embedded coding point of view, where it is e.g. possible to store the data corresponding to the surround channels in an enhancement layer in the bitstream.
Contrary to the above it can also be desirable to be able to use different parameter definitions based on the characteristics of the signal being processed, in order to switch between the parameterization that results in the lowest bit rate overhead for the current signal segment being processed.
Another representation of multi-channel signals using a sum signal or down mix signal and additional parametric side information is known in the art as binaural cue coding (BCC). This technique is described in “Binaural Cue Coding—Part 1: Psycho-Acoustic Fundamentals and Design Principles”, IEEE Transactions on Speech and Audio Processing, vol. 11, No. 6, November 2003, F. Baumgarte, C. Faller, and “Binaural Cue Coding. Part II: Schemes and Applications”, IEEE Transactions on Speech and Audio Processing vol. 11, No. 6, November 2003, C. Faller and F. Baumgarte.
Generally, binaural cue coding is a method for multi-channel spatial rendering based on one down-mixed audio channel and side information. Several parameters to be calculated by a BCC encoder and to be used by a BCC decoder for audio reconstruction or audio rendering include inter-channel level differences, inter-channel time differences, and inter-channel coherence parameters. These inter-channel cues are the determining factor for the perception of a spatial image. These parameters are given for blocks of time samples of the original multi-channel signal and are also given frequency-selective so that each block of multi-channel signal samples have several cues for several frequency bands. In the general case of C playback channels, the inter-channel level differences and the inter-channel time differences are considered in each subband between pairs of channels, i.e., for each channel relative to a reference channel. One channel is defined as the reference channel for each inter-channel level difference. With the inter-channel level differences and the inter-channel time differences, it is possible to render a source to any direction between one of the loudspeaker pairs of a playback set-up that is used. For determining the width or diffuseness of a rendered source, it is enough to consider one parameter per subband for all audio channels. This parameter is the inter-channel coherence parameter. The width of the rendered source is controlled by modifying the subband signals such that all possible channel pairs have the same inter-channel coherence parameter.
In BCC coding, all inter-channel level differences are determined between the reference channel 1 and any other channel. When, for example, the center channel is determined to be the reference channel, a first inter-channel level difference between the left channel and the center channel, a second inter-channel level difference between the right channel and the center channel, a third inter-channel level difference between the left surround channel and the center channel, and a forth inter-channel level difference between the right surround channel and the center channel are calculated. This scenario describes a five-channel scheme. When the five-channel scheme additionally includes a low frequency enhancement channel, which is also known as a “sub-woofer” channel, a fifth inter-channels level difference between the low frequency enhancement channel and the center channel, which is the single reference channel, is calculated.
When reconstructing the original multi-channel using the single down mix channel, which is also termed as the “mono” channel, and the transmitted cues such as ICLD (Interchannel Level Difference), ICTD (Interchannel Time Difference), and ICC (Interchannel Coherence), the spectral coefficients of the mono signal are modified using these cues. The level modification is performed using a positive real number determining the level modification for each spectral coefficient. The inter-channel time difference is generated using a complex number of magnitude of one determining a phase modification for each spectral coefficient. Another function determines the coherence influence. The factors for level modifications of each channel are computed by firstly calculating the factor for the reference channel. The factor for the reference channel is computed such that for each frequency partition, the sum of the power of all channels is the same as the power of the sum signal. Then, based on the level modification factor for the reference channel, the level modification factors for the other channels are calculated using the respective ICLD parameters.
Thus, in order to perform BCC synthesis, the level modification factor for the reference channel is to be calculated. For this calculation, all ICLD parameters for a frequency band are necessary. Then, based on this level modification for the single channel, the level modification factors for the other channels, i.e., the channels, which are not the reference channel, can be calculated.
This approach is disadvantageous in that, for a perfect reconstruction, one needs each and every inter-channel level difference. This requirement is even more problematic, when an error-prone transmission channel is present. Each error within a transmitted inter-channel level difference will result in an error in the reconstructed multi-channel signal, since each inter-channel level difference is required to calculate each one of the multi-channel output signal. Additionally, no reconstruction is possible, when an inter-channel level difference has been lost during transmission, although this inter-channel level difference was only necessary for e.g. the left surround channel or the right surround channel, which channels are not so important to multi-channel reconstruction, since most of the information is included in the front left channel, which is subsequently called the left channel, the front right channel, which is subsequently called the right channel, or the center channel. This situation becomes even worse, when the inter-channel level difference of the low frequency enhancement channel has been lost during transmission. In this situation, no or only an erroneous multi-channel reconstruction is possible, although the low frequency enhancement channel is not so decisive for the listeners' listening comfort. Thus, errors in a single inter-channel level difference are propagated to errors within each of the reconstructed output channels.
Additionally, the existing BCC scheme, which is also described in AES convention paper 5574, “Binaural Cue Coding applied to Stereo and Multi-channel Audio Compression”, C. Faller, F. Baumgarte, May 10 to 13, 2002, Munich, Germany, is not so well-suited, when an intuitive listening scenario is considered because of the single reference channel. It is not natural for a human being, which is, of course, the ultimate goal of the whole audio processing, that everything is related to a single reference channel. Instead, a human being has two ears, which are positioned at different sides of the human being's head. Thus, a human being's natural listening impression is, whether a signal is balanced more to the left or more to the right, or is balanced between the front and back. Contrary thereto, it is unnatural for a human being to feel whether a certain sound source in the auditory field is in a certain balance between each speaker with respect to a single reference speaker. This divergence between the natural listening impression on the one hand and the mathematical/physical model of BCC on the other hand may lead to negative consequences of the encoding scheme, when bit rate requirements, scalability requirements, flexibility requirements, reconstruction artefact requirements, or error-robustness requirements are considered.