Recent developments in audio coding enable the recreation of a multi-channel representation of an audio signal based on a stereo (or mono) signal and corresponding control data. These methods differ substantially from older matrix-based solutions, since additional control data is transmitted to control the recreation, also referred to as up-mix, of the surround channels based on the transmitted mono or stereo channels. Such parametric multi-channel audio decoders reconstruct N channels based on M transmitted channels, where N>M, and the additional control data. Transmitting the M channels together with the additional control data results in a significantly lower data rate than transmitting all N channels, making the coding very efficient while at the same time ensuring compatibility with both M channel devices and N channel devices. The M channels can be a single mono channel, a stereo channel, or a 5.1 channel representation. Hence, it is possible to have a 7.2 channel original signal, downmixed to a 5.1 channel backwards compatible signal, and spatial audio parameters enabling a spatial audio decoder to reproduce a closely resembling version of the original 7.2 channels at a small additional bit rate overhead.
These parametric surround coding methods usually comprise a parameterization of the surround signal based on time- and frequency-variant ILD (Inter Channel Level Difference) and ICC (Inter Channel Coherence) parameters. These parameters describe e.g. power ratios and correlations between channel pairs of the original multi-channel signal. In the decoding process, the re-created multi-channel signal is obtained by distributing the energy of the received downmix channels between all the channel pairs as described by the transmitted ILD parameters. However, a multi-channel signal can have an equal power distribution between all channels while the signals in the different channels are very different, which gives the listening impression of a very wide sound. The correct width is therefore obtained by mixing the signals with decorrelated versions of the same, as described by the ICC parameter.
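The decoding step described above can be illustrated for a single channel pair. The following is a minimal sketch only, not the method of any cited system: it assumes that the direct and the decorrelated signal have equal power and are mutually uncorrelated, and the names `upmix_pair`, `power` and `correlation` are hypothetical. The ILD sets the two channel gains so that the downmix energy is split in the given ratio, and the ICC sets a rotation angle that controls how much decorrelated signal is mixed in.

```python
import math

def upmix_pair(direct, diffuse, ild_db, icc):
    """Create two channels from a direct and a decorrelated (diffuse) signal.

    Assumption: direct and diffuse have equal power and are uncorrelated.
    """
    ratio = 10.0 ** (ild_db / 10.0)           # power ratio channel 1 / channel 2
    g1 = math.sqrt(ratio / (1.0 + ratio))     # gains split the energy so that
    g2 = math.sqrt(1.0 / (1.0 + ratio))       # g1**2 + g2**2 == 1
    # Rotation angle chosen so that the channel correlation equals icc.
    angle = 0.5 * math.acos(max(-1.0, min(1.0, icc)))
    c, s = math.cos(angle), math.sin(angle)
    ch1 = [g1 * (c * m + s * d) for m, d in zip(direct, diffuse)]
    ch2 = [g2 * (c * m - s * d) for m, d in zip(direct, diffuse)]
    return ch1, ch2

def power(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)

def correlation(x, y):
    """Normalized cross-correlation at lag zero."""
    return sum(a * b for a, b in zip(x, y)) / math.sqrt(power(x) * power(y))

# Toy signals that are orthogonal and have equal power.
direct = [1.0, 1.0, -1.0, -1.0]
diffuse = [1.0, -1.0, 1.0, -1.0]
ch1, ch2 = upmix_pair(direct, diffuse, ild_db=6.0, icc=0.5)
```

With ICC = 1 the angle is zero and both channels are scaled copies of the direct signal; lowering the ICC mixes in more of the decorrelated signal with opposite sign in the two channels, which widens the perceived sound.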
The decorrelated version of the signal, often also referred to as wet or diffuse signal, is obtained by passing the signal through a reverberator, such as an all-pass filter. A simple form of decorrelation is applying a specific delay to the signal. Generally, many different reverberators are known in the art, and the precise implementation of the reverberator used is of minor importance.
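Both forms of decorrelation mentioned above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the reverberator of any particular decoder: the function names are hypothetical, and the all-pass filter shown is the classic Schroeder structure y[n] = -g·x[n] + x[n-D] + g·y[n-D], chosen here only because it is a well-known all-pass form.

```python
def delay(signal, num_samples):
    """Decorrelate by a plain delay: prepend zeros and keep the input length."""
    if num_samples <= 0:
        return list(signal)
    return ([0.0] * num_samples + list(signal))[:len(signal)]

def schroeder_allpass(signal, delay_samples, gain):
    """Schroeder all-pass: y[n] = -g*x[n] + x[n-D] + g*y[n-D]."""
    if delay_samples < 1:
        raise ValueError("delay_samples must be at least 1")
    out = []
    for n, x in enumerate(signal):
        has_history = n >= delay_samples
        x_delayed = signal[n - delay_samples] if has_history else 0.0
        y_delayed = out[n - delay_samples] if has_history else 0.0
        out.append(-gain * x + x_delayed + gain * y_delayed)
    return out

# A unit impulse is spread into a decaying sequence of pulses,
# while the total energy stays (almost exactly) the same.
impulse = [1.0] + [0.0] * 199
wet = schroeder_allpass(impulse, delay_samples=3, gain=0.5)
wet_energy = sum(v * v for v in wet)
```

The impulse response shows the temporal spreading that a decorrelator introduces: a single input pulse produces output energy long after the input has ended.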
The output from the decorrelator has a time response that is usually very flat. Hence, a Dirac input signal yields a decaying noise burst at the output. When mixing the decorrelated and the original signal, it is important for some transient signal types, like applause signals, to perform some post-processing on the signal to avoid the perceptibility of additionally introduced artefacts that may result in a larger perceived room size and pre-echo type artefacts.
Generally, the invention relates to a system that represents multi-channel audio as a combination of audio downmix data (e.g. one or two channels) and related parametric multi-channel data. In such a scheme (for example in binaural cue coding) an audio downmix data stream is transmitted, wherein it may be noted that the simplest form of downmix is simply adding the different signals of a multi-channel signal. Such a signal (sum signal) is accompanied by a parametric multi-channel data stream (side info). The side info comprises for example one or more of the parameter types discussed above to describe the spatial interrelation of the original channels of the multi-channel signal. In a sense, the parametric multi-channel scheme acts as a pre-/post-processor to the sending/receiving end of the downmix data, e.g. having the sum signal and the side information. It shall be noted that the sum signal of the downmix data may additionally be coded using any audio or speech coder.
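The encoder side of such a scheme can be sketched as follows. This is a simplified illustration with hypothetical function names: it uses the simplest downmix mentioned above (a plain sum of the channels) and estimates one broadband ILD and ICC value for a channel pair, whereas practical systems estimate such parameters per time and frequency tile. Both channels are assumed to have non-zero power.

```python
import math

def downmix_sum(channels):
    """Simplest downmix: add the channels sample by sample."""
    return [sum(samples) for samples in zip(*channels)]

def estimate_pair_parameters(ch1, ch2):
    """Return (ILD in dB, ICC) for one channel pair.

    Assumption: both channels have non-zero power.
    """
    p1 = sum(v * v for v in ch1)
    p2 = sum(v * v for v in ch2)
    cross = sum(a * b for a, b in zip(ch1, ch2))
    ild_db = 10.0 * math.log10(p1 / p2)   # level difference as a power ratio
    icc = cross / math.sqrt(p1 * p2)      # normalized correlation at lag zero
    return ild_db, icc

left = [2.0, 0.0, 2.0, 0.0]
right = [1.0, 0.0, 1.0, 0.0]
sum_signal = downmix_sum([left, right])            # transmitted downmix
ild_db, icc = estimate_pair_parameters(left, right)  # transmitted side info
```

Here the sum signal plays the role of the downmix data and the pair `(ild_db, icc)` plays the role of the side info that accompanies it.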
As the transmission of multi-channel signals over low-bandwidth carriers is becoming more and more popular, these systems, also known as “spatial audio coding” or “MPEG surround”, have been well developed recently.
The following publications are known in the context of these technologies:
    [1] C. Faller and F. Baumgarte, “Efficient representation of spatial audio using perceptual parametrization,” in Proc. IEEE WASPAA, Mohonk, N.Y., October 2001.
    [2] F. Baumgarte and C. Faller, “Estimation of auditory spatial cues for binaural cue coding,” in Proc. ICASSP 2002, Orlando, Fla., May 2002.
    [3] C. Faller and F. Baumgarte, “Binaural cue coding: a novel and efficient representation of spatial audio,” in Proc. ICASSP 2002, Orlando, Fla., May 2002.
    [4] F. Baumgarte and C. Faller, “Why binaural cue coding is better than intensity stereo coding,” in Proc. AES 112th Conv., Munich, Germany, May 2002.
    [5] C. Faller and F. Baumgarte, “Binaural cue coding applied to stereo and multi-channel audio compression,” in Proc. AES 112th Conv., Munich, Germany, May 2002.
    [6] F. Baumgarte and C. Faller, “Design and evaluation of binaural cue coding,” in AES 113th Conv., Los Angeles, Calif., October 2002.
    [7] C. Faller and F. Baumgarte, “Binaural cue coding applied to audio compression with flexible rendering,” in Proc. AES 113th Conv., Los Angeles, Calif., October 2002.
    [8] J. Breebaart, J. Herre, C. Faller, J. Rödén, F. Myburg, S. Disch, H. Purnhagen, G. Hotho, M. Neusinger, K. Kjörling, W. Oomen: “MPEG Spatial Audio Coding/MPEG Surround: Overview and Current Status”, 119th AES Convention, New York 2005, Preprint 6599.
    [9] J. Herre, H. Purnhagen, J. Breebaart, C. Faller, S. Disch, K. Kjörling, E. Schuijers, J. Hilpert, F. Myburg, “The Reference Model Architecture for MPEG Spatial Audio Coding”, 118th AES Convention, Barcelona 2005, Preprint 6477.
    [10] J. Herre, C. Faller, S. Disch, C. Ertel, J. Hilpert, A. Hoelzer, K. Linzmeier, C. Spenger, P. Kroon: “Spatial Audio Coding: Next-Generation Efficient and Compatible Coding of Multi-Channel Audio”, 117th AES Convention, San Francisco 2004, Preprint 6186.
    [11] J. Herre, C. Faller, C. Ertel, J. Hilpert, A. Hoelzer, C. Spenger: “MP3 Surround: Efficient and Compatible Coding of Multi-Channel Audio”, 116th AES Convention, Berlin 2004, Preprint 6049.
A related technique, focusing on the transmission of two channels via one transmitted mono signal, is called “parametric stereo” and is described more extensively, for example, in the following publications:
    [12] J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, “High-Quality Parametric Spatial Audio Coding at Low Bitrates”, AES 116th Convention, Berlin, Preprint 6072, May 2004.
    [13] E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegard, “Low Complexity Parametric Stereo Coding”, AES 116th Convention, Berlin, Preprint 6073, May 2004.
In a spatial audio decoder, the multi-channel upmix is computed from a direct signal part and a diffuse signal part, which is derived by means of decorrelation from the direct part, as already mentioned above. Thus, in general, the diffuse part has a different temporal envelope than the direct part. The term “temporal envelope” describes in this context the variation of the energy or amplitude of the signal with time. The differing temporal envelope leads to artifacts (pre- and post-echoes, temporal “smearing”) in the upmix signals for input signals that have a wide stereo image and, at the same time, a transient envelope structure. Transient signals are generally signals that vary strongly within a short time period.
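The notion of a temporal envelope can be made concrete with a block-wise energy measure. The sketch below is one possible definition, chosen for illustration; the function name, the block size and the use of mean energy per block are assumptions. It compares a single click (a transient) with a delayed copy of it, standing in for a diffuse signal.

```python
def energy_envelope(signal, block_size):
    """Mean energy per non-overlapping block of block_size samples."""
    envelope = []
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        envelope.append(sum(v * v for v in block) / len(block))
    return envelope

# A click at sample 0 and a copy of it delayed by 5 samples.
direct = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
diffuse = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

direct_envelope = energy_envelope(direct, block_size=4)
diffuse_envelope = energy_envelope(diffuse, block_size=4)
```

The two signals carry the same total energy, but it falls into different blocks, so the diffuse part puts energy where the direct part is silent. Mixed into an upmix, such misplaced energy corresponds to the echo and smearing artifacts described above.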
Probably the most important examples of this class of signals are applause-like signals, which are frequently present in live recordings.
In order to avoid artefacts caused by introducing diffuse/decorrelated sound with an inappropriate temporal envelope into the upmix signal, a number of techniques have been proposed:
The U.S. application Ser. No. 11/006,492 (“Diffuse Sound Shaping for BCC Schemes and The Like”) shows that the perceptual quality of critical transient signals can be improved by shaping the temporal envelope of the diffuse signal to match the temporal envelope of the direct signal.
This approach has already been introduced into MPEG surround technology by different tools, such as “temporal envelope shaping” (TES) and “temporal processing” (TP). Since the target temporal envelope of the diffuse signal is derived from the envelope of the transmitted downmix signal, this method does not require additional side information to be transmitted. However, as a consequence, the temporal fine structure of the diffuse sound is the same for all output channels. As the direct signal part, which is directly derived from the transmitted downmix signal, also has a similar temporal envelope, this method may improve the perceptual quality of applause-like signals in terms of “crisp-ness”. However, because the direct signal and the diffuse signal then have similar temporal envelopes for all channels, such techniques may enhance the subjective quality of applause-like signals but cannot improve the spatial distribution of single applause events in the signal. That would only be possible if one reconstructed channel were much more intense than the other channels at the occurrence of the transient signal, which is impossible when the signals share basically the same temporal envelope.
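The general idea of shaping the diffuse signal so that its temporal envelope follows that of the direct signal can be sketched as a block-wise gain correction. This is a simplified broadband illustration with hypothetical names and an assumed block size; it is not the TES or TP algorithm of MPEG surround. Note that no side information is needed, since the target envelope is taken from the direct signal itself.

```python
import math

def block_energies(signal, block_size):
    """Energy (sum of squares) of each non-overlapping block."""
    return [sum(v * v for v in signal[i:i + block_size])
            for i in range(0, len(signal), block_size)]

def shape_to_envelope(diffuse, direct, block_size, eps=1e-12):
    """Scale each block of diffuse so its energy matches the direct block."""
    target = block_energies(direct, block_size)
    actual = block_energies(diffuse, block_size)
    shaped = []
    for index, (e_target, e_actual) in enumerate(zip(target, actual)):
        gain = math.sqrt(e_target / (e_actual + eps))  # eps avoids division by zero
        start = index * block_size
        shaped.extend(gain * v for v in diffuse[start:start + block_size])
    return shaped

direct = [1.0, -1.0, 0.5, 0.5, 0.1, 0.1, -0.1, 0.1]    # loud block, then quiet block
diffuse = [0.2, 0.3, -0.2, 0.1, 0.4, -0.4, 0.3, 0.2]   # flat, noise-like
shaped = shape_to_envelope(diffuse, direct, block_size=4)
```

Because the same target envelope is applied regardless of the output channel, every channel receives a diffuse part with the same temporal structure, which is the limitation discussed above.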
An alternative method to overcome the problem is described by U.S. application Ser. No. 11/006,482 (“Individual Channel Shaping for BCC Schemes and The Like”). This approach employs fine-grain temporal broadband side information that is transmitted by the encoder to perform a fine temporal shaping of both the direct and the diffuse signal. Evidently, this approach allows a temporal fine structure that is individual for each output channel and is thus also able to accommodate signals for which transient events occur in only a subset of the output channels. A further variation of this approach is described in U.S. 60/726,389 (“Methods for Improved Temporal and Spatial Shaping of Multi-Channel Audio Signals”). Both discussed approaches to enhance the perceptual quality of transient coded signals comprise a temporal shaping of the envelope of the diffuse signal intended to match the temporal envelope of a corresponding direct signal.
While both previously described prior-art methods can enhance the subjective quality of applause-like signals in terms of crisp-ness, only the latter approach can also improve the spatial redistribution of the reconstructed signal. Still, the subjective quality of the synthesized applause signals remains unsatisfactory, because the temporal shaping of the combination of dry and diffuse sound leads to characteristic distortions: the attacks of the individual claps are perceived as not “tight” when only a loose temporal shaping is performed, and distortions are introduced if shaping with a very high temporal resolution is applied to the signal. This becomes evident when the diffuse signal is simply a delayed copy of the direct signal. Then, the diffuse signal mixed into the direct signal is likely to have a different spectral composition than the direct signal. Thus, even if the envelope is scaled to match the envelope of the direct signal, different spectral contributions that do not originate directly from the original signal will be present in the reconstructed signal. The introduced distortions may become even worse when the diffuse signal part is emphasized (made louder) during the reconstruction as it is scaled to match the envelope of the direct signal.
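The emphasis effect described above can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not taken from the cited applications: the function name, the block size, and the choice to apply a gain of zero to silent diffuse blocks are hypothetical. The direct signal is quiet background followed by a loud clap, and the diffuse signal is a copy delayed by one block, so that during the clap the diffuse signal still contains the quiet background.

```python
import math

def shaping_gains(direct, diffuse, block_size):
    """Per-block gain that matches the diffuse block energy to the direct one.

    Assumption: a silent diffuse block gets gain 0.0 instead of an unbounded gain.
    """
    gains = []
    for start in range(0, len(direct), block_size):
        e_direct = sum(v * v for v in direct[start:start + block_size])
        e_diffuse = sum(v * v for v in diffuse[start:start + block_size])
        gains.append(math.sqrt(e_direct / e_diffuse) if e_diffuse > 0.0 else 0.0)
    return gains

direct = [0.1] * 4 + [1.0] * 4    # quiet background, then a clap
diffuse = [0.0] * 4 + [0.1] * 4   # the same signal delayed by 4 samples

gains = shaping_gains(direct, diffuse, block_size=4)
```

In the block containing the clap, the gain is about 10, so the earlier background material is amplified by roughly 20 dB and mixed in at the level of the clap. The envelope matches, but the content that fills it does not originate from the clap itself.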