The present invention relates to the field of audio processing and audio decoding, in particular to decoding a signal comprising transients.
Audio processing and/or decoding has advanced in many ways. In particular, spatial audio applications have become increasingly important. Audio signal processing is often used to decorrelate or render signals. Moreover, decorrelation and rendering of signals are employed in mono-to-stereo upmix, mono/stereo to multi-channel upmix, artificial reverberation, stereo widening or user-interactive mixing/rendering.
Several audio signal processing systems employ decorrelators. An important example is the application of decorrelating systems in parametric spatial audio decoders to restore specific decorrelation properties between two or more signals that are reconstructed from one or several downmix signals. The application of decorrelators significantly improves the perceptual quality of the output signal, e.g., when compared to intensity stereo. Specifically, the use of decorrelators enables the proper synthesis of spatial sound with a wide sound image, several concurrent sound objects and/or ambience. However, decorrelators are also known to introduce artifacts like changes in temporal signal structure, timbre, etc.
Other application examples of decorrelators in audio processing are, e.g., the generation of artificial reverberation to change the spatial impression or the use of decorrelators in multichannel acoustic echo cancellation systems to improve the convergence behavior.
A typical state of the art application of a decorrelator in a mono to stereo up-mixer, e.g. applied in Parametric Stereo (PS), is illustrated in FIG. 1, where a mono input signal M (a “dry” signal) is provided to a decorrelator 110. The decorrelator 110 decorrelates the mono input signal M according to a decorrelation method to provide a decorrelated signal D (a “wet” signal) at its output. The decorrelated signal D is fed into a mixer 120 as a first mixer input signal, along with the dry mono signal M as a second mixer input signal. Furthermore, an up-mix control unit 130 feeds up-mix control parameters into the mixer 120. The mixer 120 then generates two output channels L and R (L=left stereo output channel; R=right stereo output channel) according to a mixing matrix H. The coefficients of the mixing matrix can be fixed, signal dependent or controlled by a user.
Alternatively, the mixing matrix is controlled by side information that is transmitted along with the downmix containing a parametric description on how to up-mix the signals of the downmix to form the desired multi-channel output. This spatial side information is usually generated during the mono downmix process in an accordant signal encoder.
This principle is widely applied in spatial audio coding, e.g. Parametric Stereo, see, for example, J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, “High-Quality Parametric Spatial Audio Coding at Low Bitrates” in Proceedings of the AES 116th Convention, Berlin, Preprint 6072, May 2004.
A further typical state of the art structure of a parametric stereo decoder is illustrated in FIG. 2, wherein a decorrelation process is performed in a transform domain. An analysis filterbank 210 transforms a mono input signal into a transform domain, for example into a frequency domain. Decorrelation of the transformed mono input signal M is then performed by a decorrelator 220 which generates a decorrelated signal D. Both the transformed mono input signal M and the decorrelated signal D are fed into a mixing matrix 230. The mixing matrix 230 then generates two output signals L and R taking up-mix parameters into account, which are provided by parameter modification unit 240, which is provided with spatial parameters and which is coupled to a parameter control unit 250. In FIG. 2, the spatial parameters can be modified by a user or additional tools, e.g., post-processing for binaural rendering/presentation. In this example, the up-mix parameters are combined with the parameters from the binaural filters to form the input parameters for the up-mix matrix. Finally, the output signals generated by the mixing matrix 230 are fed into a synthesis filterbank 260, which determines the stereo output signal.
The output L/R of the mixing matrix 230 is computed from the mono input signal M and the decorrelated signal D according to a mixing rule, e.g. by applying the following formula:
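\[
\begin{bmatrix} L \\ R \end{bmatrix}
=
\begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix}
\begin{bmatrix} M \\ D \end{bmatrix}
\]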
In the mixing matrix, the amount of decorrelated sound fed to the output is controlled on the basis of transmitted parameters, e.g., Inter-Channel Correlation/Coherence (ICC) and/or fixed or user-defined settings.
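For illustration only, the mixing rule above may be sketched in Python as follows. The function name `upmix` and the coefficient values in `H` are hypothetical and chosen merely to show how the dry signal M and the wet signal D are combined sample by sample; in a real decoder the coefficients would be derived from the transmitted parameters.

```python
def upmix(m, d, h):
    """Apply a 2x2 mixing matrix h to dry samples m and wet samples d."""
    (h11, h12), (h21, h22) = h
    left = [h11 * ms + h12 * ds for ms, ds in zip(m, d)]
    right = [h21 * ms + h22 * ds for ms, ds in zip(m, d)]
    return left, right


# Hypothetical coefficients: equal dry share, wet share added with opposite sign.
H = [[1.0, 0.5],
     [1.0, -0.5]]
left, right = upmix([1.0, 2.0], [0.5, -0.5], H)
```

A larger magnitude of h12 and h22 feeds more decorrelated sound to the outputs and thus lowers the correlation between L and R.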
Conceptually, the decorrelator output D replaces a residual signal that would ideally allow for a perfect decoding of the original L/R signals. Utilizing the decorrelator output D instead of a residual signal in the upmixer saves the bit rate that would otherwise have been needed to transmit the residual signal. The aim of the decorrelator is thus to generate, from the mono signal M, a signal D which exhibits properties similar to those of the residual signal that it replaces.
Correspondingly, on the encoder side, two types of spatial parameters are extracted: A first group of parameters comprises correlation/coherence parameters (e.g., ICCs=Inter-Channel Correlation/Coherence parameters) representing the coherence or cross correlation between two input channels that shall be encoded. A second group of parameters comprises level difference parameters (e.g., ILDs=Inter Channel Level Difference parameters) representing the level difference between the two input channels.
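As a minimal sketch of the two parameter groups, the following Python functions compute one ICC value and one ILD value for a block of samples. The definitions used here (normalized cross-correlation at lag zero, and the energy ratio expressed in dB) are assumptions made for illustration; actual encoders typically compute such parameters per frequency band and time frame.

```python
import math


def icc(left, right):
    """Normalized cross-correlation at lag zero (assumed definition)."""
    num = sum(l * r for l, r in zip(left, right))
    den = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return num / den if den > 0.0 else 0.0


def ild_db(left, right):
    """Left/right energy ratio in dB (assumed definition, nonzero energies)."""
    e_left = sum(l * l for l in left)
    e_right = sum(r * r for r in right)
    return 10.0 * math.log10(e_left / e_right)
```

For two channels that differ only by a gain factor, `icc` returns 1 and `ild_db` returns the gain difference in dB.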
Furthermore, a downmix signal is generated by downmixing the two input channels. Moreover, a residual signal is generated. Residual signals are signals which can be used to regenerate the original signals by additionally employing the downmix signal and an upmix matrix. When, for example, N signals are downmixed to 1 signal, the downmix is typically 1 of the N components which result from the mapping of the N input signals. The remaining components resulting from the mapping (e.g., N−1 components) are the residual signals and allow reconstructing the original N signals by an inverse mapping. The mapping may, for example, be a rotation. The mapping shall be conducted such that the downmix signal is maximized and the residual signals are minimized, e.g., similar to a principal axis transformation. For example, the energy of the downmix signal shall be maximized and the energies of the residual signals shall be minimized. When downmixing 2 signals to 1 signal, the downmix is normally one of the two components which result from the mapping of the 2 input signals. The remaining component resulting from the mapping is the residual signal and allows reconstructing the original 2 signals by an inverse mapping.
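The two-channel case may be sketched as follows, assuming that the mapping is a plane rotation whose angle is chosen as in a principal axis transformation. The function names and the closed-form angle are illustrative assumptions and not a description of any particular encoder.

```python
import math


def rotate_downmix(left, right):
    """Rotate (left, right) so that the first component carries maximal energy."""
    a = sum(l * l for l in left)                    # energy of left
    b = sum(r * r for r in right)                   # energy of right
    c = sum(l * r for l, r in zip(left, right))     # cross term
    theta = 0.5 * math.atan2(2.0 * c, a - b)        # principal axis angle
    cs, sn = math.cos(theta), math.sin(theta)
    downmix = [cs * l + sn * r for l, r in zip(left, right)]
    residual = [-sn * l + cs * r for l, r in zip(left, right)]
    return downmix, residual, theta


def inverse_rotation(downmix, residual, theta):
    """Inverse mapping: recover the two original channels."""
    cs, sn = math.cos(theta), math.sin(theta)
    left = [cs * m - sn * s for m, s in zip(downmix, residual)]
    right = [sn * m + cs * s for m, s in zip(downmix, residual)]
    return left, right
```

With both components available, the inverse rotation recovers the original channels up to rounding; when the residual is discarded, a decoder has to substitute something else for it, which is the role taken by the decorrelator output D.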
In some cases, the residual signal may represent an error associated with representing the two signals by their downmix and associated parameters. For example, the residual signal may be an error signal which represents the error between original channels L, R and channels L′, R′, resulting from upmixing the downmix signal that was generated based on the original channels L and R.
In other words, a residual signal can be considered as a signal in the time domain, a frequency domain or a subband domain which, together with the downmix signal alone or with the downmix signal and parametric information, allows a correct or nearly correct reconstruction of an original channel. “Nearly correct” is to be understood to mean that the reconstruction using the residual signal having an energy greater than zero is closer to the original channel than a reconstruction using the downmix without the residual signal, or using the downmix and the parametric information without the residual signal.
Considering MPEG Surround (MPS), structures similar to PS, termed One-To-Two boxes (OTT boxes), are employed in spatial audio decoding trees. This can be seen as a generalization of the concept of mono-to-stereo upmix to multichannel spatial audio coding/decoding schemes. In MPS, two-to-three upmix systems (TTT boxes) also exist that may apply decorrelators depending on the TTT mode of operation. Details are described in J. Herre, K. Kjörling, J. Breebaart, et al., “MPEG surround—the ISO/MPEG standard for efficient and compatible multi-channel audio coding,” in Proceedings of the 122nd AES Convention, Vienna, Austria, May 2007.
Regarding Directional Audio Coding (DirAC), DirAC relates to a parametric sound field coding scheme that is not bound to a fixed number of audio output channels with fixed loudspeaker positions. DirAC applies decorrelators in the DirAC renderer, i.e., in the spatial audio decoder to synthesize non-coherent components of sound fields. More information relating to directional audio coding can be found in Pulkki, Ville: “Spatial Sound Reproduction with Directional Audio Coding,” in J. Audio Eng. Soc., Vol. 55, No. 6, 2007.
Regarding state of the art decorrelators in spatial audio decoders, reference is made to ISO/IEC International Standard “Information Technology-MPEG audio technologies—Part 1: MPEG Surround”, ISO/IEC 23003-1:2007 and also to J. Engdegard, H. Purnhagen, J. Röden, L. Liljeryd, “Synthetic Ambience in Parametric Stereo Coding” in Proceedings of the AES 116th Convention, Berlin, Preprint, May 2004. IIR lattice allpass structures are used as decorrelators in spatial audio decoders like MPS as described in J. Herre, K. Kjörling, J. Breebaart, et al., “MPEG surround—the ISO/MPEG standard for efficient and compatible multi-channel audio coding,” in Proceedings of the 122nd AES Convention, Vienna, Austria, May 2007, and as described in ISO/IEC International Standard “Information Technology-MPEG audio technologies—Part 1: MPEG Surround”, ISO/IEC 23003-1:2007. Other state of the art decorrelators apply (potentially frequency dependent) delays to decorrelate signals or convolve the input signals, e.g., with exponentially decaying noise bursts. For an overview of state of the art decorrelators for spatial audio upmix systems, see “Synthetic Ambience in Parametric Stereo Coding” in Proceedings of the AES 116th Convention, Berlin, Preprint, May 2004.
Another technique of processing signals is “semantic upmix processing”. Semantic upmix processing is a technique to decompose signals into components with different semantic properties (i.e., signal classes) and apply different upmix strategies to the different signal components. The different upmix algorithms can be optimized according to the different semantic properties in order to improve the overall signal processing scheme. This concept is described in WO/2010/017967, An apparatus for determining a spatial output multichannel-channel audio signal, International patent application, PCT/EP2009/005828, 11.8.2009, 11.6.2010 (FH090802PCT).
A further spatial audio coding scheme is the “temporal permutation method”, as described in Hotho, G., van de Par, S., and Breebaart, J.: “Multichannel coding of applause signals”, EURASIP Journal on Advances in Signal Processing, January 2008, art. 10. DOI=http://dx.doi.org/10.1155/2008/. In this document, a spatial audio coding scheme is proposed that is tailored to the coding/decoding of applause-like signals. This scheme relies on the perceptual similarity of segments of a monophonic audio signal, especially a downmix signal of a spatial audio coder. The monophonic audio signal is segmented into overlapping time segments. These segments are temporally permuted in a pseudo-random manner (mutually independently for n output channels) within a “super”-block to form the decorrelated output channels.
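The permutation idea may be illustrated by the following simplified Python sketch. It deliberately departs from the cited scheme in that it uses non-overlapping segments without cross-fading, and the function name, the seed handling and the use of a whole input list as one “super”-block are assumptions made for illustration.

```python
import random


def permute_segments(signal, seg_len, n_channels, seed=0):
    """Shuffle fixed-length segments of one block independently per channel."""
    segments = [signal[i:i + seg_len] for i in range(0, len(signal), seg_len)]
    outputs = []
    for ch in range(n_channels):
        rng = random.Random(seed + ch)   # separate pseudo-random order per channel
        order = list(range(len(segments)))
        rng.shuffle(order)
        outputs.append([s for idx in order for s in segments[idx]])
    return outputs
```

Each output channel contains exactly the samples of the input block, only at different points in time, which is the property that the scheme exploits and which also underlies its drawbacks discussed further below.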
A further spatial audio coding technique is the “temporal delay and swapping method”. In DE 10 2007 018032 A: 20070417, Erzeugung dekorrelierter Signale, 17.4.2007, 23.10.2008 (FH070414PDE), a scheme is proposed that is also tailored to the coding/decoding of applause-like signals for binaural presentation. This scheme also relies on the perceptual similarity of segments of a monophonic audio signal and delays one output channel with respect to the other one. In order to avoid a localization bias towards the leading channel, the leading and lagging channels are swapped periodically.
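A rough sketch of the delay and swapping principle is given below. The hard switching between leading and lagging roles, the parameter names `delay` and `swap_period`, and the zero padding at the start are simplifying assumptions; the cited document is not reproduced here.

```python
def delay_and_swap(signal, delay, swap_period):
    """Return two channels; one lags by `delay` samples, roles swap periodically."""
    leading = list(signal)
    lagging = ([0.0] * delay + leading)[:len(leading)]   # delayed copy, zero padded
    left, right = [], []
    for n, (lead, lag) in enumerate(zip(leading, lagging)):
        if (n // swap_period) % 2 == 0:
            left.append(lead)      # left channel leads in even periods
            right.append(lag)
        else:
            left.append(lag)       # roles swapped in odd periods
            right.append(lead)
    return left, right
```

Swapping the roles at regular intervals keeps either channel from leading permanently, which is what counteracts the localization bias mentioned above.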
In general, stereo or multichannel applause-like signals coded/decoded in parametric spatial audio coders are known to result in reduced signal quality (see, for example, Hotho, G., van de Par, S., and Breebaart, J.: “Multichannel coding of applause signals”, EURASIP Journal on Advances in Signal Processing, January 2008, art. 10. DOI=http://dx.doi.org/10.1155/2008/531693, see also DE 10 2007 018032 A). Applause-like signals are characterized by containing temporally dense mixtures of transients from different directions. Examples of such signals are applause, the sound of rain, galloping horses, etc. Applause-like signals often also contain sound components from distant sound sources that are perceptually fused into a noise-like, smooth, background sound field.
State of the art decorrelation techniques employed in spatial audio decoders like MPEG Surround contain lattice allpass structures. These act as artificial reverb generators and are consequently well suited for generating homogeneous, smooth, noise-like, immersive sounds (like room reverberation tails). However, there are examples of sound fields with a non-homogeneous spatio-temporal structure that still immerse the listener: one prominent example is applause-like sound fields, which create listener envelopment not only by homogeneous noise-like fields, but also by rather dense sequences of single claps from different directions. Hence, the non-homogeneous component of applause sound fields may be characterized by a spatially distributed mixture of transients. These distinct claps are not homogeneous, smooth or noise-like at all.
Due to their reverb-like behavior, lattice allpass decorrelators are incapable of generating an immersive sound field with the characteristics of, e.g., applause. Instead, when applied to applause-like signals, they tend to temporally smear the transients in the signals. The undesired result is a noise-like immersive sound field without the distinctive spatio-temporal structure of applause-like sound fields. Further, transient events like a single handclap might evoke ringing artifacts of the decorrelator filters.
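The smearing effect can be made visible with a much simpler stand-in for a lattice allpass decorrelator, namely a single Schroeder allpass section. The section, its delay of 4 samples and its gain of 0.6 are assumptions chosen for illustration and do not correspond to the filters of any standardized decoder.

```python
def schroeder_allpass(x, delay, g):
    """Single allpass section: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay]."""
    y = []
    for n, xn in enumerate(x):
        x_del = x[n - delay] if n >= delay else 0.0
        y_del = y[n - delay] if n >= delay else 0.0
        y.append(-g * xn + x_del + g * y_del)
    return y


# A single click (idealized handclap) followed by silence.
click = [1.0] + [0.0] * 63
smeared = schroeder_allpass(click, delay=4, g=0.6)
```

The energy of the click is preserved, but it is spread over a decaying train of pulses instead of remaining concentrated in one sample, which corresponds to the temporal smearing and ringing described above.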
A system according to Hotho, G., van de Par, S., and Breebaart, J.: “Multichannel coding of applause signals”, EURASIP Journal on Advances in Signal Processing, January 2008, art. 10. DOI=http://dx.doi.org/10.1155/2008/531693, will exhibit perceivable degradation of the output sound due to a certain repetitive quality in the output audio signal. This is because one and the same segment of the input signal appears unaltered in every output channel (though at a different point in time). Furthermore, to avoid increased applause density, some original channels have to be dropped in the upmix, and thus some important auditory event might be missed in the resulting upmix. The method is only applicable if it is possible to find signal segments that share the same perceptual properties, i.e., signal segments that sound similar. The method in general heavily changes the temporal structure of the signals, which might be acceptable only for very few signals. In the case of applying the scheme to non-applause-like signals (e.g., due to signal misclassification), the temporal permutation will most often lead to unacceptable results. The temporal permutation further limits the applicability to cases where several signal segments may be mixed together without artifacts like echoes or comb-filtering. Similar drawbacks apply to the method described in DE 10 2007 018032 A.
The semantic upmix processing described in WO/2010/017967 separates the transient components of signals prior to the application of decorrelators. The remaining (transient-free) signal is fed to the conventional decorrelation and upmix processor, whereas the transient signals are handled differently: the latter are (e.g., randomly) distributed to different channels of the stereo or multichannel output signal by application of amplitude panning techniques. The amplitude panning shows several disadvantages:
Amplitude panning does not necessarily produce an output signal that is close to the original. The output signal may only be close to the original if the distribution of the transients in the original signal can be described by amplitude panning laws. That is, amplitude panning can only reproduce purely amplitude-panned events correctly, but not phase or time differences between the transient components in different output channels.
Moreover, application of the amplitude panning approach in MPS would need bypassing not only the decorrelator but also the upmix matrix. Since the upmix matrix reflects the spatial parameters (inter channel correlations: ICCs, inter channel level differences: ILDs) that are needed to synthesize an upmix output that shows the correct spatial properties, the panning system itself has to apply some rule to synthesize output signals with the correct spatial properties. A generic rule for doing so is not known. Further, this structure adds complexity since the spatial parameters have to be taken care of twice: once for the non-transient part of the signal and a second time for the amplitude-panned transient part of the signal.
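The first of the above limitations of amplitude panning can be seen in a minimal sketch. A constant-power (sine/cosine) panning law is assumed here purely for illustration; the cited application does not prescribe this particular law, and the function names are hypothetical.

```python
import math


def pan_gains(pan):
    """Constant-power gains for pan in [-1, 1] (-1 = fully left, +1 = fully right)."""
    angle = (pan + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)


def pan_transient(transient, pan):
    """Distribute one transient to two channels by gain scaling only."""
    g_left, g_right = pan_gains(pan)
    return [g_left * s for s in transient], [g_right * s for s in transient]
```

Both output channels are scaled copies of the same waveform, so the transient peaks at the same sample index in each channel; a time or phase difference between the channels, as may be present in the original signal, cannot be produced in this way.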