Recent developments in audio coding make it possible to recreate a multi-channel representation of an audio signal from a stereo (or mono) signal and corresponding control data. These methods differ substantially from older matrix-based solutions, such as Dolby Prologic, because additional control data is transmitted to control the recreation, also referred to as up-mix, of the surround channels from the transmitted mono or stereo channels. Such parametric multi-channel audio decoders reconstruct N channels from M transmitted channels, where N>M, and the additional control data. Transmitting the M channels together with the control data requires a significantly lower data rate than transmitting all N channels, making the coding very efficient, while at the same time ensuring compatibility with both M-channel devices and N-channel devices. The M channels can be a single mono channel, a stereo channel pair, or a 5.1 channel representation. Hence, it is possible to have a 7.2 channel original signal, downmixed to a backwards-compatible 5.1 channel signal, together with spatial audio parameters that enable a spatial audio decoder to reproduce a close approximation of the original 7.2 channels at a small additional bit rate overhead.
These parametric surround coding methods usually comprise a parameterisation of the surround signal based on time- and frequency-variant ILD (Inter Channel Level Difference) and ICC (Inter Channel Coherence) quantities. These parameters describe, for example, power ratios and correlations between channel pairs of the original multi-channel signal. In the decoding process, the re-created multi-channel signal is obtained by distributing the energy of the received downmix channels between all the channel pairs, as described by the transmitted ILD parameters. However, a multi-channel signal can have an equal power distribution between all channels while the signals in the different channels are very different, which gives the listening impression of a very wide sound. The correct width is therefore obtained by mixing the signals with decorrelated versions of themselves, as described by the ICC parameter.
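The role of the two parameters can be illustrated with a minimal sketch for a single channel pair in a single time/frequency tile. It assumes that the dry and wet signals have equal power and are mutually uncorrelated, and it uses a simple rotation-style mixing rule chosen for illustration; the function names `upmix_pair`, `power` and `coherence` are hypothetical and are not taken from any particular codec.

```python
import math

def upmix_pair(dry, wet, ild_db, icc):
    """Create a left/right pair from a dry signal and its decorrelated (wet) version.

    Assumption: dry and wet have equal power and are uncorrelated, so that the
    output pair has a level difference of ild_db and a coherence of icc.
    """
    # ILD sets the power ratio between the two outputs (total power preserved).
    ratio = 10.0 ** (ild_db / 10.0)
    g_left = math.sqrt(ratio / (1.0 + ratio))
    g_right = math.sqrt(1.0 / (1.0 + ratio))
    # ICC sets how much wet signal is mixed in: coherence = cos(2 * alpha).
    alpha = 0.5 * math.acos(max(-1.0, min(1.0, icc)))
    c, s = math.cos(alpha), math.sin(alpha)
    left = [g_left * (c * d + s * w) for d, w in zip(dry, wet)]
    right = [g_right * (c * d - s * w) for d, w in zip(dry, wet)]
    return left, right

def power(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)

def coherence(a, b):
    """Normalised cross-correlation at lag zero."""
    return sum(x * y for x, y in zip(a, b)) / math.sqrt(power(a) * power(b))
```

With an ICC of 1 no wet signal is added and both outputs are scaled copies of the dry signal; as the ICC decreases, the wet signal is mixed in with opposite signs in the two outputs, which widens the perceived image.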
The decorrelated version of the signal, often referred to as the wet signal, is obtained by passing the signal (also called the dry signal) through a reverberator, such as an all-pass filter. The output of the decorrelator has a time response that is usually very flat. Hence, a Dirac input signal produces a decaying noise burst at the output. When mixing the decorrelated and the original signal, it is important for some transient signal types, such as applause, to shape the time envelope of the decorrelated signal to better match that of the dry signal. Failing to do so results in a perception of a larger room size and in unnatural-sounding transients due to pre-echo type artifacts.
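The smearing of a single impulse can be reproduced with a small sketch of an all-pass decorrelator. The cascade of three Schroeder all-pass sections, the delays of 7, 11 and 13 samples and the coefficient of 0.5 are illustrative assumptions, not values from any specific decoder.

```python
def schroeder_allpass(x, delay, g):
    """One all-pass section: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay]."""
    y = []
    for n, xn in enumerate(x):
        x_delayed = x[n - delay] if n >= delay else 0.0
        y_delayed = y[n - delay] if n >= delay else 0.0
        y.append(-g * xn + x_delayed + g * y_delayed)
    return y

def decorrelate(x, delays=(7, 11, 13), g=0.5):
    """Cascade of all-pass sections (hypothetical delays and coefficient)."""
    for d in delays:
        x = schroeder_allpass(x, d, g)
    return x

# A Dirac impulse as the dry signal.
impulse = [1.0] + [0.0] * 511
wet = decorrelate(impulse)

# All-pass filtering preserves energy but spreads it over time.
energy = sum(v * v for v in wet)
first_half = sum(v * v for v in wet[:256])
second_half = sum(v * v for v in wet[256:])
```

The energy of the output stays (almost exactly) equal to that of the input impulse, but it is spread over many samples and decays over time, which is why a sharp transient in the dry signal has no correspondingly sharp envelope in the wet signal.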
In systems where the multi-channel reconstruction is done in a frequency transform domain having a low time resolution, temporal envelope shaping techniques can be employed, similar to those used for shaping quantization noise, such as Temporal Noise Shaping [J. Herre and J. D. Johnston, “Enhancing the performance of perceptual audio coding by using temporal noise shaping (TNS),” in 101st AES Convention, Los Angeles, November 1996] in perceptual audio codecs like MPEG-4 AAC. This is accomplished by means of prediction across frequency bins: the temporal envelope is estimated by linear prediction in the frequency direction on the dry signal, and the filter obtained is applied, again in the frequency direction, to the wet signal.
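The prediction across frequency bins can be sketched as follows. This is a simplified illustration that assumes real-valued spectral coefficients for one frame and uses the autocorrelation method with a Levinson-Durbin recursion; the helper names are hypothetical, and practical systems may differ, for instance by working on complex-valued spectra or by flattening the wet spectrum first.

```python
def autocorr(x, max_lag):
    """Autocorrelation of the spectral coefficients along the frequency axis."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input, keep the coefficients found so far
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        refl = -acc / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + refl * prev[m - k]
        a[m] = refl
        err *= 1.0 - refl * refl
    return a, err

def shape_across_frequency(wet_spec, a):
    """Apply the all-pole filter 1/A(z) to the wet spectrum, bin by bin."""
    out = []
    for k, x in enumerate(wet_spec):
        y = x - sum(a[j] * out[k - j] for j in range(1, len(a)) if k - j >= 0)
        out.append(y)
    return out
```

In use, the predictor would be estimated from the dry spectrum of a frame, for example `a, _ = levinson_durbin(autocorr(dry_spec, order), order)`, and then applied to the wet spectrum of the same frame with `shape_across_frequency(wet_spec, a)`. Because filtering along frequency corresponds to an envelope imposed along time, the wet signal inherits an approximation of the temporal envelope of the dry signal.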
Consider, for example, a delay line as the decorrelator and a strongly transient signal, such as applause or a gunshot, as the signal to be up-mixed. If no envelope shaping were performed, a delayed version of the signal would be combined with the original signal to reconstruct a stereo or multi-channel signal. The transient would thus be present twice in the up-mixed signal, separated by the delay time, causing an unwanted echo type effect.
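The echo effect is easy to reproduce in a toy example. The signal length, the transient position and the delay of 20 samples below are arbitrary illustrative values.

```python
def delay_line(signal, delay):
    """Delay a signal by `delay` samples, keeping the original length."""
    return ([0.0] * delay + list(signal))[:len(signal)]

# A single transient (one-sample click) at index 10.
dry = [0.0] * 64
dry[10] = 1.0

# Delay-line decorrelator, followed by an up-mix without envelope shaping.
wet = delay_line(dry, 20)
upmix = [d + w for d, w in zip(dry, wet)]

# Positions at which the up-mixed signal contains the transient.
peaks = [n for n, v in enumerate(upmix) if v != 0.0]
```

Here `peaks` contains two positions, the original transient and its copy one delay time later, which corresponds to the echo described above.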
In order to achieve good results on highly critical signals, the time envelope of the decorrelated signal needs to be shaped with a very high time resolution, thus cancelling out a delayed echo of a transient signal or masking it by reducing its energy to the energy contained in the carrier channel at that time.
This broadband gain adjustment of the decorrelated signal can be done over windows as short as 1 ms [U.S. patent application, “Diffuse Sound Shaping for BCC Schemes and the Like”, Ser. No. 11/006,492, Dec. 7, 2004]. Such a high time resolution of the gain adjustment for the decorrelated signal inevitably leads to additional distortion. In order to minimise the added distortion for non-critical signals, i.e. signals for which the temporal shaping of the decorrelated signal is not crucial, detection mechanisms are incorporated in the encoder or decoder that switch the temporal shaping algorithm on and off according to pre-defined criteria. The drawback is that the system can become extremely sensitive to detector tuning.
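A broadband gain adjustment over short windows might look like the sketch below. It is not the method of the cited application; it assumes a sampling rate of 48 kHz, so that a window of 48 samples corresponds to 1 ms, and the gain limit `max_gain` and the regularisation constant `eps` are hypothetical choices.

```python
import math

def shape_wet_envelope(dry, wet, window=48, max_gain=4.0, eps=1e-12):
    """Scale each short window of the wet signal to the energy of the dry signal.

    Assumption: 48 samples per window, i.e. 1 ms at a 48 kHz sampling rate.
    """
    shaped = []
    for start in range(0, len(wet), window):
        d = dry[start:start + window]
        w = wet[start:start + window]
        e_dry = sum(v * v for v in d)
        e_wet = sum(v * v for v in w)
        # One broadband gain per window, limited to avoid extreme boosts.
        gain = min(max_gain, math.sqrt(e_dry / (e_wet + eps)))
        shaped.extend(gain * v for v in w)
    return shaped
```

For a transient whose delayed copy falls into a window in which the dry signal is silent, the gain becomes zero and the echo is removed. For a stationary signal the gain still changes from window to window with the measured energies, and this gain modulation is a source of the additional distortion mentioned above.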
Throughout the following description, the term decorrelated signal or wet signal is used for the decorrelated version of a downmix signal, possibly gain adjusted according to the ILD and ICC parameters, and the term downmix signal, direct signal or dry signal is used for the possibly gain-adjusted downmix signal.
In prior art implementations, a high time-resolution gain adjustment, i.e. a gain adjustment based on segments of the dry signal as short as milliseconds, leads to significant additional distortion for non-critical signals. These are non-transient signals with a smooth temporal evolution, for example music signals. The prior art approach of switching the gain adjustment off for such non-critical signals introduces a new and strong dependency of the perceived audio quality on the detection mechanism, which is disadvantageous and may even introduce additional distortion when the detection fails.