Recent developments in audio coding have made it possible to recreate a multi-channel representation of an audio signal based on a stereo (or mono) signal and corresponding control data. These methods differ substantially from older matrix-based solutions such as Dolby Pro Logic, since additional control data is transmitted to control the recreation, also referred to as up-mix, of the surround channels based on the transmitted mono or stereo channels.
Hence, the parametric multi-channel audio decoders reconstruct N channels based on M transmitted channels, where N>M, and based on the additional control data. The additional control data represents a significantly lower data rate than transmitting all N channels, making the coding very efficient while at the same time ensuring compatibility with both M channel devices and N channel devices. The M channels can be a single mono, a stereo, or a 5.1 channel representation. Hence, it is possible to have, e.g., a 7.2 channel original signal downmixed to a 5.1 channel backwards-compatible signal, together with spatial audio parameters enabling a spatial audio decoder to reproduce a closely resembling version of the original 7.2 channels at a small additional bit rate overhead.
These parametric surround-coding methods usually comprise a parameterisation of the surround signal based on ILD (Inter-channel Level Difference) and ICC (Inter-Channel Coherence). These parameters describe, e.g., power ratios and correlation between channel pairs of the original multi-channel signal. In the decoding process, the re-created multi-channel signal is obtained by distributing the energy of the received downmix channels between all the channel pairs described by the transmitted ILD parameters. However, a multi-channel signal can have an equal power distribution between all channels while the signals in the different channels are very different, which gives the listening impression of a very wide (diffuse) sound. The correct wideness (diffuseness) is therefore obtained by mixing the signals with decorrelated versions of the same. This mixing is described by the ICC parameter. The decorrelated version of the signal is obtained by passing the signal through an all-pass filter such as a reverberator.
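The role of the two parameters can be illustrated with a minimal sketch that creates two output channels from one downmix channel. It is not taken from any standard: the function name upmix_one_to_two, the symmetric mixing-angle rule, and the stand-in for the decorrelator output (noise made orthogonal to the dry signal and scaled to equal energy) are assumptions chosen for illustration only.

```python
import numpy as np

def upmix_one_to_two(dry, wet, ild_db, icc):
    """Hypothetical one-to-two up-mix driven by an ILD and an ICC value.

    dry    : transmitted downmix channel
    wet    : decorrelated version of dry (assumed: equal energy, uncorrelated)
    ild_db : level difference between output 1 and output 2 in dB
    icc    : target correlation between the two outputs
    """
    ratio = 10.0 ** (ild_db / 10.0)        # power ratio P1 / P2 from the ILD
    c1 = np.sqrt(ratio / (1.0 + ratio))    # channel gains, c1^2 + c2^2 = 1
    c2 = np.sqrt(1.0 / (1.0 + ratio))
    alpha = 0.5 * np.arccos(icc)           # mixing angle derived from the ICC
    ch1 = c1 * (np.cos(alpha) * dry + np.sin(alpha) * wet)
    ch2 = c2 * (np.cos(alpha) * dry - np.sin(alpha) * wet)
    return ch1, ch2

rng = np.random.default_rng(0)
dry = rng.standard_normal(4096)
noise = rng.standard_normal(4096)
# Stand-in for a decorrelator output (assumption): remove the component that
# is correlated with dry, then match the energy of dry.
wet = noise - dry * np.dot(noise, dry) / np.dot(dry, dry)
wet *= np.sqrt(np.dot(dry, dry) / np.dot(wet, wet))

ch1, ch2 = upmix_one_to_two(dry, wet, ild_db=6.0, icc=0.3)
measured_ild = 10.0 * np.log10(np.dot(ch1, ch1) / np.dot(ch2, ch2))
measured_icc = np.dot(ch1, ch2) / np.sqrt(np.dot(ch1, ch1) * np.dot(ch2, ch2))
```

Under the stated assumptions on the wet signal, the measured level difference and correlation of the two outputs reproduce the requested ILD and ICC values, while the sum of the output energies equals the energy of the downmix.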
The decorrelated version of the signal is thus created on the decoder side and is not, like the downmix channels, transmitted from the encoder to the decoder. The output signals from the all-pass filters (decorrelators) have a time response that is usually very flat. Hence, a Dirac input signal gives a decaying noise burst at the output. Therefore, when mixing the decorrelated and the original signal, it is important for some signal types, such as dense transients (applause signals), to shape the time envelope of the decorrelated signal to better match that of the downmix channel, which is often also called the dry signal. Failing to do so will result in a perception of larger room size and unnatural-sounding transient signals. With transient signals and a reverberator as the all-pass filter, even echo-type artefacts can be introduced when shaping of the decorrelated (wet) signals is omitted.
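The behaviour of such a decorrelator for a Dirac input can be made concrete with a small sketch. A cascade of Schroeder all-pass sections is used here as a stand-in for the reverberator; the delays, the gain, and the function names are arbitrary illustrative choices and do not describe the decorrelator of any particular codec.

```python
import numpy as np

def schroeder_allpass(x, delay, gain):
    """All-pass section y[n] = -g*x[n] + x[n-D] + g*y[n-D]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        x_del = x[n - delay] if n >= delay else 0.0
        y_del = y[n - delay] if n >= delay else 0.0
        y[n] = -gain * x[n] + x_del + gain * y_del
    return y

def decorrelate(x, delays=(23, 37, 53), gain=0.6):
    """Cascade of all-pass sections (delays and gain are assumptions)."""
    for d in delays:
        x = schroeder_allpass(x, d, gain)
    return x

# Dirac input: a single unit sample followed by silence
impulse = np.zeros(4096)
impulse[0] = 1.0
response = decorrelate(impulse)

# The magnitude response stays flat, but the energy is smeared over time
magnitude = np.abs(np.fft.rfft(response))
early_energy = np.sum(response[:256] ** 2)
late_energy = np.sum(response[256:512] ** 2)
spread = np.count_nonzero(np.abs(response) > 1e-3)
```

The single input sample is turned into many output samples whose energy decays over time, while the magnitude spectrum remains flat. This temporal smearing is what has to be corrected when the wet signal is mixed with a transient dry signal.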
From a technical point of view, one of the key challenges in reconstructing multi-channel signals, as for example within an MPEG sound synthesis, consists in the proper reproduction of multi-channel signals with a very wide sound image. Technically speaking, this corresponds to the generation of several signals with low inter-channel correlation (or coherence), but still tightly controlled spectral and temporal envelopes. Examples of such signals are “applause” items, which exhibit both a high degree of decorrelation and sharp transient events (claps). As a consequence, these items are most critical for the MPEG Surround technology, which is elaborated in more detail, for example, in the “Report on MPEG Spatial Audio Coding RMO Listening Tests”, ISO/IEC JTC1/SC29/WG11 (MPEG), Document N7138, Busan, Korea, 2005. Generally, previous work has focussed on a number of aspects relating to the optimal reproduction of wide/diffuse signals such as applause, by providing solutions that
1. adapt the temporal (and spectral) shape of the decorrelated signal to that of the transmitted downmix signal in order to prevent pre-echo-like artefacts (note: this does not require sending any side information from the spatial audio encoder to the spatial audio decoder).
2. adapt the temporal envelopes of the synthesized output channels to their original envelope shapes (present at the input of the corresponding encoder) using side information that describes the temporal envelopes of the original input signals and which is transmitted from the spatial audio encoder to the spatial audio decoder.
Currently, the MPEG Surround Reference Model already contains several tools supporting the coding of such signals, e.g. Time Domain Temporal Shaping (TP) and Temporal Envelope Shaping (TES).
In an MPEG Surround synthesis system, decorrelated sound is generated and mixed with the “dry” signal in order to control the correlation of the synthesized output channels according to the transmitted ICC values. From here onwards, the decorrelated signal will be referred to as the ‘diffuse’ signal, although the term ‘diffuse’ reflects properties of the reconstructed spatial sound field rather than properties of a signal itself. For transient signals, the diffuse sound generated in the decoder does not automatically match the fine temporal shape of the dry signals and does not fuse well perceptually with the dry signal. This results in poor transient reproduction, in analogy to the “pre-echo problem” which is known from perceptual audio coding. The TP tool implementing Time Domain Temporal Shaping is designed to address this problem by processing the diffuse sound.
The TP tool is applied in the time domain, as illustrated in FIG. 14. It basically consists of a temporal envelope estimation of dry and diffuse signals with a higher temporal resolution than that provided by the filter bank of an MPEG Surround coder. The diffuse signal is re-scaled in its temporal envelope to match the envelope of the dry signal. This results in a significant increase in sound quality for critical transient signals with a broad spatial image/low correlation between channel signals, such as applause.
The envelope shaping (adjusting the temporal evolution of the energy contained within a channel) is done by matching the normalized short-time energy of the wet signal to that of the dry signal. This is achieved by means of a time-varying gain function that is applied to the diffuse signal, such that the time envelope of the diffuse signal is shaped to match that of the dry signal.
Note that this does not require any side information to be transmitted from the encoder to the decoder in order to process the temporal envelope of the signal (only control information for selectively enabling/disabling TP is transmitted by the surround encoder).
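The gain function described above can be sketched as follows. This is a simplified illustration and not the reference implementation: the block length of 16 samples, the name shape_wet_envelope, and the normalization by the total energy of the processed segment are assumptions.

```python
import numpy as np

def shape_wet_envelope(dry, wet, block=16, eps=1e-12):
    """Scale wet block by block so that its normalized short-time energy
    follows that of dry. Input lengths are assumed to be equal and a
    multiple of `block`; extra samples would be dropped.
    """
    n_blocks = len(dry) // block
    dry_b = dry[:n_blocks * block].reshape(n_blocks, block)
    wet_b = wet[:n_blocks * block].reshape(n_blocks, block)
    e_dry = np.sum(dry_b ** 2, axis=1) + eps     # short-time energies
    e_wet = np.sum(wet_b ** 2, axis=1) + eps
    # Normalizing by the totals keeps the overall wet level unchanged,
    # only the distribution of the energy over time is modified.
    gain = np.sqrt((e_dry / e_dry.sum()) / (e_wet / e_wet.sum()))
    shaped = (wet_b * gain[:, None]).reshape(-1)
    return shaped, gain

rng = np.random.default_rng(1)
n = 1024
dry = np.zeros(n)
# Clap-like dry signal: a noise burst with an exponentially decaying envelope
dry[200:] = rng.standard_normal(n - 200) * np.exp(-np.arange(n - 200) / 40.0)
# Temporally flat wet signal, as typically delivered by a decorrelator
wet = rng.standard_normal(n)

shaped, gain = shape_wet_envelope(dry, wet)
```

Only the dry and wet signals available in the decoder enter the computation, which is consistent with the observation that no envelope side information has to be transmitted.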
FIG. 14 illustrates the time domain temporal shaping, as applied within MPEG Surround coding. A direct signal 10 and a diffuse signal 12 which is to be shaped are the signals to be processed, both supplied in a filter bank domain. Within MPEG Surround, optionally a residual signal 14 may be available that is added to the direct signal 10 still within the filter bank domain. In the special case of an MPEG Surround decoder, only high-frequency parts of the diffuse signal 12 are shaped; therefore, the low-frequency parts 16 of the signal are added to the direct signal 10 within the filter bank domain.
The direct signal 10 and the diffuse signal 12 are separately converted into the time domain by filter bank synthesis devices 18a and 18b. The actual time domain temporal shaping is performed after the synthesis filter bank. Since only the high-frequency parts of the diffuse signal 12 are to be shaped, the time domain representations of the direct signal 10 and the diffuse signal 12 are input into high-pass filters 20a and 20b that guarantee that only the high-frequency portions of the signals are used in the following filtering steps. A subsequent spectral whitening of the signals may be performed in spectral whiteners 22a and 22b to ensure that the amplitude (energy) ratios of the full spectral range of the signals are accounted for in the following envelope estimation 24, which compares the energies that are contained in the direct signal and in the diffuse signal within a given time portion. This time portion is usually defined by the frame length. The envelope estimation 24 has as an output a scale factor 26 that is applied to the diffuse signal 12 in the envelope shaping 28 in the time domain to guarantee that the signal envelope is basically the same for the diffuse signal 12 and the direct signal 10 within each frame.
Finally, the envelope-shaped diffuse signal is again high-pass filtered by a high-pass filter 29 to guarantee that no artefacts of lower frequency bands are contained in the envelope-shaped diffuse signal. The combination of the direct signal and the diffuse signal is performed by an adder 30. The output signal 32 then contains signal parts of the direct signal 10 and of the diffuse signal 12, wherein the diffuse signal was envelope shaped to ensure that the signal envelope is basically the same for the diffuse signal 12 and the direct signal 10 before the combination.
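The signal flow of FIG. 14 for one time portion can be summarized in a short sketch. Several simplifications are assumed: the inputs are already time domain signals (the filter bank synthesis 18a, 18b is not modelled), the optional spectral whitening 22a, 22b is omitted, and a crude FFT brick-wall filter with an arbitrary cut-off bin stands in for the high-pass filters 20a, 20b and 29. The function names are hypothetical.

```python
import numpy as np

def highpass(x, cutoff_bin):
    """Crude FFT brick-wall high-pass (placeholder for filters 20a, 20b, 29)."""
    spec = np.fft.rfft(x)
    spec[:cutoff_bin] = 0.0
    return np.fft.irfft(spec, n=len(x))

def tp_time_portion(direct, diffuse, cutoff_bin=32, eps=1e-12):
    """Process one time portion (frame) following the FIG. 14 flow."""
    direct_hp = highpass(direct, cutoff_bin)       # high-pass filter 20a
    diffuse_hp = highpass(diffuse, cutoff_bin)     # high-pass filter 20b
    # Envelope estimation 24: compare the energies in the time portion
    e_direct = np.sum(direct_hp ** 2) + eps
    e_diffuse = np.sum(diffuse_hp ** 2) + eps
    scale = np.sqrt(e_direct / e_diffuse)          # scale factor 26
    shaped = scale * diffuse                       # envelope shaping 28
    shaped_hp = highpass(shaped, cutoff_bin)       # high-pass filter 29
    return direct + shaped_hp, scale               # adder 30, output 32

rng = np.random.default_rng(2)
direct = rng.standard_normal(1024) * np.exp(-np.arange(1024) / 100.0)
diffuse = 0.3 * rng.standard_normal(1024)
out, scale = tp_time_portion(direct, diffuse)
```

In this sketch the diffuse contribution to the output contains no components below the cut-off, and its high-band energy within the time portion equals that of the direct signal. Calling the function on successive short portions yields a time-varying gain.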
The problem of precise control of the temporal shape of the diffuse sound can also be addressed by the so-called Temporal Envelope Shaping (TES) tool, which is designed to be a low-complexity alternative to the Temporal Processing (TP) tool. While TP operates in the time domain by a time-domain scaling of the diffuse sound envelope, the TES approach achieves the same principal effect by controlling the diffuse sound envelope in a spectral domain representation. This is done similarly to the Temporal Noise Shaping (TNS) approach, as it is known from MPEG-2/4 Advanced Audio Coding (AAC). Manipulation of the diffuse sound fine temporal envelope is achieved by convolution of its spectral coefficients across frequency with a suitable shaping filter derived from an LPC analysis of spectral coefficients of the dry signal. Due to the quite high time resolution of the MPEG Surround filter bank, TES processing requires only low-order filtering (1st order complex prediction) and is thus low in its computational complexity. On the other hand, due to limitations, e.g. related to temporal aliasing, it cannot provide the full extent of temporal control that the TP tool offers.
Note that, similarly to the case of TP, TES does not require any side information to be transmitted from the encoder to the decoder in order to describe the temporal envelope of the signal.
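The principle behind TES, namely that filtering spectral coefficients across frequency corresponds to multiplying the time signal by an envelope, can be illustrated with a toy example. It is not the standardized TES algorithm: a plain DFT of one block replaces the MPEG Surround filter bank, the shaping filter is a first-order recursive filter, and the damping factor of 0.9 as well as the function name tes_sketch are assumptions.

```python
import numpy as np

def tes_sketch(dry, diffuse, damping=0.9):
    """Toy first-order shaping of diffuse in the DFT domain."""
    X = np.fft.fft(dry)
    D = np.fft.fft(diffuse)
    # 1st-order complex linear prediction across frequency on the dry spectrum
    a = np.sum(X[1:] * np.conj(X[:-1])) / (np.sum(np.abs(X[:-1]) ** 2) + 1e-12)
    coeff = damping * a                 # damping keeps the recursion stable
    # Shaping filter across frequency: Y[k] = D[k] + coeff * Y[k-1].
    # Two passes over the spectrum approximate circular wrap-around.
    Y = np.zeros(len(D), dtype=complex)
    prev = 0.0
    for _ in range(2):
        for k in range(len(D)):
            prev = D[k] + coeff * prev
            Y[k] = prev
    return np.fft.ifft(Y), coeff        # complex-valued in this toy setting

N, n0 = 256, 64
dry = np.zeros(N)
dry[n0] = 1.0                           # idealized transient at sample n0
rng = np.random.default_rng(3)
diffuse = rng.standard_normal(N)        # temporally flat diffuse signal

shaped, coeff = tes_sketch(dry, diffuse)
# Equivalent time-domain envelope implied by the filter across frequency
envelope = 1.0 / (1.0 - coeff * np.exp(2j * np.pi * np.arange(N) / N))
peak_position = int(np.argmax(np.abs(envelope)))
```

In this setting the filtered diffuse signal equals the original diffuse signal multiplied by a smooth envelope whose maximum lies at the position of the dry transient. With only one prediction coefficient the envelope is necessarily coarse, which gives an intuition for the limited temporal control mentioned above.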
Both tools, TP and TES, successfully address the problem of temporal shaping of the diffuse sound by adapting its temporal shape to that of the transmitted downmix signal. While this avoids the pre-echo type of unmasking, it cannot compensate for a second type of deficiency in the multi-channel output signal, which is due to the lack of spatial redistribution:
An applause signal consists of a dense mixture of transient events (claps), several of which typically fall into the same parameter frame. Clearly, not all claps in a frame originate from the same (or similar) spatial direction. For the MPEG Surround decoder, however, the temporal granularity of the decoder is largely determined by the frame size and the parameter slot temporal granularity. Thus, after synthesis, all claps that fall into a frame appear with the same spatial orientation (level distribution between output channels), in contrast to the original signal, for which each clap may be localized (and, in fact, perceived) individually.
In order to also achieve good results in terms of spatial redistribution of highly critical signals such as applause signals, the time-envelopes of the upmixed signal need to be shaped with a very high time resolution.