1. Field of the Invention
The present invention relates to coding of multi-channel representations of audio signals using spatial parameters. The present invention teaches new methods for estimating and defining proper parameters for recreating a multi-channel signal from a number of channels being less than the number of output channels. In particular it aims at minimizing the bit rate for the multi-channel representation, and providing a coded representation of the multi-channel signal enabling easy encoding and decoding of the data for all possible channel-configurations.
2. Description of the Related Art
It has been shown in PCT/SE02/01372 “Efficient and scalable Parametric Stereo Coding for Low Bit rate Audio Coding Applications”, that it is possible to re-create a stereo image that closely resembles the original stereo image, from a mono signal given a very compact representation of the stereo image. The basic principle is to divide the input signal into frequency bands and time segments, and for these frequency bands and time segments, estimate inter-channel intensity difference (IID), and inter-channel coherence (ICC). The first parameter is a measurement of the power distribution between the two channels in the specific frequency band and the second parameter is an estimation of the correlation between the two channels for the specific frequency band. On the decoder side the stereo image is recreated from the mono signal by distributing the mono signal between the two output channels in accordance with the IID-data, and by adding a decorrelated signal in order to retain the channel correlation of the original stereo channels.
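As an illustration of the two parameters described above, the following minimal sketch estimates an IID and an ICC value for a single frequency band and time segment. It assumes that the band samples have already been produced by some analysis filterbank, that the IID is expressed in dB as a power ratio, and that the ICC is taken as the normalized cross-correlation at zero lag. The function name `iid_icc` and the `eps` guard are hypothetical choices made for this example and are not taken from the cited application.

```python
import math


def iid_icc(left, right, eps=1e-12):
    """Return (IID in dB, ICC) for one band and one time segment.

    `left` and `right` are real-valued samples of the same band; `eps`
    only guards against division by zero for silent segments.
    """
    p_left = sum(x * x for x in left)                    # power of left channel
    p_right = sum(x * x for x in right)                  # power of right channel
    cross = sum(x * y for x, y in zip(left, right))      # zero-lag cross term
    iid_db = 10.0 * math.log10((p_left + eps) / (p_right + eps))
    icc = cross / math.sqrt((p_left + eps) * (p_right + eps))
    return iid_db, icc


# Example: the right channel is a copy of the left channel at half amplitude,
# so the power ratio is 4:1 and the two channels are fully correlated.
left = [1.0, -1.0, 1.0, -1.0]
right = [0.5, -0.5, 0.5, -0.5]
iid_db, icc = iid_icc(left, right)
```

In a complete system this estimate would be repeated for every frequency band and time segment, and the resulting values would be quantized before transmission.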
For a multi-channel case (multi-channel in this context meaning more than two output channels), several additional issues have to be accounted for. Several multi-channel configurations exist. The most commonly known is the 5.1 configuration (center channel, front left/right, surround left/right, and the LFE channel). However, many other configurations exist. From the point of view of the complete encoder/decoder system, it is desirable to have a system that can use the same parameter set (e.g. IID and ICC) or subsets thereof for all channel configurations. ITU-R BS.775 defines several down-mix schemes for obtaining a channel configuration comprising fewer channels from a given channel configuration. Instead of always having to decode all channels and rely on a down-mix, it can be desirable to have a multi-channel representation that enables a receiver to extract the parameters relevant for the channel configuration at hand, prior to decoding the channels. Further, a parameter set that is inherently scalable is desirable from a scalable or embedded coding point of view, where it is, for example, possible to store the data corresponding to the surround channels in an enhancement layer in the bitstream.
Contrary to the above it can also be desirable to be able to use different parameter definitions based on the characteristics of the signal being processed, in order to switch between the parameterization that results in the lowest bit rate overhead for the current signal segment being processed.
Another representation of multi-channel signals using a sum signal or down mix signal and additional parametric side information is known in the art as binaural cue coding (BCC). This technique is described in “Binaural Cue Coding—Part 1: Psycho-Acoustic Fundamentals and Design Principles”, IEEE Transactions on Speech and Audio Processing, vol. 11, No. 6, November 2003, F. Baumgarte, C. Faller, and “Binaural Cue Coding. Part II: Schemes and Applications”, IEEE Transactions on Speech and Audio Processing vol. 11, No. 6, November 2003, C. Faller and F. Baumgarte.
Generally, binaural cue coding is a method for multi-channel spatial rendering based on one down-mixed audio channel and side information. The parameters to be calculated by a BCC encoder and to be used by a BCC decoder for audio reconstruction or audio rendering include inter-channel level differences, inter-channel time differences, and inter-channel coherence parameters. These inter-channel cues are the determining factor for the perception of a spatial image. The parameters are given for blocks of time samples of the original multi-channel signal and are also given in a frequency-selective manner, so that each block of multi-channel signal samples has several cues for several frequency bands. In the general case of C playback channels, the inter-channel level differences and the inter-channel time differences are considered in each subband between pairs of channels, i.e., for each channel relative to a reference channel. One channel is defined as the reference channel for each inter-channel level difference. With the inter-channel level differences and the inter-channel time differences, it is possible to render a source to any direction between one of the loudspeaker pairs of the playback set-up that is used. For determining the width or diffuseness of a rendered source, it is enough to consider one parameter per subband for all audio channels. This parameter is the inter-channel coherence parameter. The width of the rendered source is controlled by modifying the subband signals such that all possible channel pairs have the same inter-channel coherence parameter.
In BCC coding, all inter-channel level differences are determined between the reference channel 1 and any other channel. When, for example, the center channel is determined to be the reference channel, a first inter-channel level difference between the left channel and the center channel, a second inter-channel level difference between the right channel and the center channel, a third inter-channel level difference between the left surround channel and the center channel, and a fourth inter-channel level difference between the right surround channel and the center channel are calculated. This scenario describes a five-channel scheme. When the five-channel scheme additionally includes a low frequency enhancement channel, which is also known as a “sub-woofer” channel, a fifth inter-channel level difference between the low frequency enhancement channel and the center channel, which is the single reference channel, is calculated.
When reconstructing the original multi-channel signal using the single down-mix channel, which is also termed the “mono” channel, and the transmitted cues such as ICLD (Interchannel Level Difference), ICTD (Interchannel Time Difference), and ICC (Interchannel Coherence), the spectral coefficients of the mono signal are modified using these cues. The level modification is performed using a positive real number determining the level modification for each spectral coefficient. The inter-channel time difference is generated using a complex number of magnitude one determining a phase modification for each spectral coefficient. Another function determines the coherence influence. The factors for the level modifications of each channel are computed by first calculating the factor for the reference channel. The factor for the reference channel is computed such that, for each frequency partition, the sum of the powers of all channels is the same as the power of the sum signal. Then, based on the level modification factor for the reference channel, the level modification factors for the other channels are calculated using the respective ICLD parameters.
Thus, in order to perform BCC synthesis, the level modification factor for the reference channel is to be calculated. For this calculation, all ICLD parameters for a frequency band are necessary. Then, based on this level modification for the single channel, the level modification factors for the other channels, i.e., the channels, which are not the reference channel, can be calculated.
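The order of calculation described above can be sketched as follows for one frequency partition. The sketch assumes that each ICLD is given in dB as the power of a non-reference channel relative to the reference channel, and that the power of the sum signal is normalized to one, so that the squared factors of all channels have to add up to one. The function name `bcc_level_factors` and the example values are hypothetical and only serve as an illustration.

```python
import math


def bcc_level_factors(icld_db):
    """Return level modification factors [a_ref, a_2, ..., a_C].

    `icld_db` holds the ICLDs (in dB) of the C-1 non-reference channels
    relative to the reference channel, for one frequency partition.
    """
    # Power ratio of each non-reference channel versus the reference channel.
    ratios = [10.0 ** (d / 10.0) for d in icld_db]
    # Step 1: the reference factor needs ALL ICLDs of the partition, because
    # it normalizes the total output power to the power of the sum signal.
    a_ref = 1.0 / math.sqrt(1.0 + sum(ratios))
    # Step 2: every other factor is derived from the reference factor.
    return [a_ref] + [a_ref * math.sqrt(r) for r in ratios]


# Five-channel example with the center channel as reference:
# left, right, left surround, right surround relative to center.
gains = bcc_level_factors([0.0, 0.0, -6.0, -6.0])
total_power = sum(g * g for g in gains)  # equals the normalized sum-signal power
```

The sketch makes the dependency explicit: `a_ref` cannot be formed until every entry of `icld_db` is available, and every other factor depends on `a_ref`.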
This approach is disadvantageous in that, for a perfect reconstruction, one needs each and every inter-channel level difference. This requirement is even more problematic when an error-prone transmission channel is present. Each error within a transmitted inter-channel level difference will result in an error in the reconstructed multi-channel signal, since each inter-channel level difference is required to calculate each one of the multi-channel output signals. Additionally, no reconstruction is possible when an inter-channel level difference has been lost during transmission, even if this inter-channel level difference was only necessary for, e.g., the left surround channel or the right surround channel. These channels are not so important to multi-channel reconstruction, since most of the information is included in the front left channel, which is subsequently called the left channel, the front right channel, which is subsequently called the right channel, or the center channel. The situation becomes even worse when the inter-channel level difference of the low frequency enhancement channel has been lost during transmission. In this situation, no multi-channel reconstruction or only an erroneous one is possible, although the low frequency enhancement channel is not so decisive for the listener's listening comfort. Thus, errors in a single inter-channel level difference are propagated to errors within each of the reconstructed output channels.
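The error propagation described above can be made visible with a small numerical experiment. The sketch below uses the same assumptions as a single-reference scheme with ICLDs in dB and a sum-signal power normalized to one; the helper `level_factors` and the chosen ICLD values are hypothetical. Only the ICLD of the last channel (standing in for the right surround channel) is corrupted, and the resulting gain change of every channel is reported in dB.

```python
import math


def level_factors(icld_db):
    """Gains [a_ref, a_2, ..., a_C] for ICLDs (dB) relative to the reference."""
    ratios = [10.0 ** (d / 10.0) for d in icld_db]
    a_ref = 1.0 / math.sqrt(1.0 + sum(ratios))
    return [a_ref] + [a_ref * math.sqrt(r) for r in ratios]


# ICLDs for left, right, left surround, right surround versus center.
clean = level_factors([0.0, 0.0, -6.0, -6.0])
# A transmission error flips the sign of the right surround ICLD only.
corrupt = level_factors([0.0, 0.0, -6.0, 6.0])

# Gain change per output channel in dB (center, L, R, Ls, Rs).
changes_db = [20.0 * math.log10(c / k) for c, k in zip(corrupt, clean)]
```

Because the reference factor is normalized over all channels, the single wrong value shifts the gain of the center, left, right, and left surround channels as well, not only that of the right surround channel.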
Parametric multi-channel representations are problematic in that, normally, inter-channel level differences such as ICLDs in BCC coding or balance values in other parametric multi-channel representations are given as relative values rather than absolute values. In BCC, an ICLD parameter describes the level difference between a channel and a reference channel. Balance values can also be given as a ratio between two channels in a channel pair. When reconstructing the multi-channel signal, such level differences or balance parameters are applied to a base channel, which can be a mono base channel or a stereo base channel signal having two base channels. The energy included in the at least one base channel is thus distributed among the, for example, five or six reconstructed output channels. Consequently, the absolute energy in a reconstructed output channel is determined both by the inter-channel level difference or the balance parameter and by the energy of the down-mix signal at the receiver input.
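The dependence of the absolute output energy on the energy of the received down-mix signal can be illustrated with a short sketch. It assumes a mono base channel and a set of relative channel powers for one frequency band; the function `distribute_energy` and the numbers used are hypothetical and chosen only for this example.

```python
def distribute_energy(downmix_energy, relative_powers):
    """Split the energy of one band of the base channel among output channels.

    `relative_powers` only fixes the proportions between the channels; the
    absolute result scales directly with `downmix_energy`.
    """
    total = sum(relative_powers)
    return [downmix_energy * p / total for p in relative_powers]


# Relative powers for center, left, right, left surround, right surround.
relative = [1.0, 1.0, 1.0, 0.25, 0.25]

at_encoder = distribute_energy(1.0, relative)   # down-mix as output by the encoder
at_receiver = distribute_energy(0.5, relative)  # same band, energy halved on the way
```

The proportions between the channels are identical in both cases, but every reconstructed channel in `at_receiver` carries only half the energy of its counterpart in `at_encoder`, since the relative parameters carry no information about the absolute level.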
In situations in which the energy of the down-mix signal at the receiver input differs from the energy of the down-mix signal output by the encoder, level variations will occur. In this context, it is to be emphasized that, depending on the parameterization scheme used, such level variations will not only result in a general loudness variation of the reconstructed signal, but can also result in serious artefacts when the parameters are given in a frequency-selective manner. When, for example, a certain frequency band of the down-mix signal is manipulated more than a frequency band at another place on the frequency scale, this manipulation will be readily apparent in the reconstructed output signal, since the frequency components of the output channel in that frequency band have a level which is too low or too high.
Additionally, time-varying level manipulations will result in an overall level of the reconstructed output signal which varies over time and is, therefore, perceived as an annoying artefact.
While the above situations concentrated on level manipulations resulting from encoding, transmitting, and decoding a down-mix signal, other level deviations can occur. Due to phase dependencies between the different channels being down-mixed into one or two channels, a situation can occur in which the mono signal has an energy which is not equal to the sum of the energies in the original signal. Since the down-mix is normally performed sample-wise, i.e., by adding time waveforms, a phase difference between the left signal and the right signal of, for example, 180 degrees will result in a complete cancellation of both channels in the down-mix signal, which would result in zero energy, although both signals have, of course, a certain signal energy. Although such an extreme situation is not very probable under normal conditions, energy variations still occur, since the signals are generally not completely uncorrelated. Such variations can also result in loudness fluctuations in the reconstructed output signal and will also result in artefacts, since the energy of the reconstructed output signal will be different from the energy of the original multi-channel signal.
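The effect of the phase relation on the energy of a sample-wise down-mix can be reproduced with two sinusoids of equal amplitude. The sketch below is a minimal illustration under assumed conditions (a 64-sample segment containing four full periods, a plain sum as down-mix); the names `energy` and `downmix_energy` are hypothetical.

```python
import math

N = 64          # segment length in samples (assumed)
CYCLES = 4      # full periods of the test tone within the segment (assumed)


def energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)


def downmix_energy(phase):
    """Return (energy of L+R, energy of L plus energy of R) for a phase offset."""
    left = [math.sin(2.0 * math.pi * CYCLES * k / N) for k in range(N)]
    right = [math.sin(2.0 * math.pi * CYCLES * k / N + phase) for k in range(N)]
    mono = [l + r for l, r in zip(left, right)]  # sample-wise down-mix
    return energy(mono), energy(left) + energy(right)


in_phase = downmix_energy(0.0)             # fully correlated channels
quadrature = downmix_energy(math.pi / 2)   # uncorrelated channels
anti_phase = downmix_energy(math.pi)       # 180 degrees: complete cancellation
```

Only in the uncorrelated case does the energy of the down-mix match the sum of the channel energies. For in-phase channels the down-mix carries twice that energy, and for a 180 degree offset it carries essentially none, although the sum of the channel energies is the same in all three cases.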