Audio processing and/or coding has advanced in many ways, and demand for spatial audio applications is growing. In many applications, audio signal processing is utilized to decorrelate or render signals. Such applications may, for example, carry out mono-to-stereo up-mix, mono/stereo to multi-channel up-mix, artificial reverberation, stereo widening or user interactive mixing/rendering.
For certain classes of signals, for example noise-like signals such as applause, conventional methods and systems suffer from either unsatisfactory perceptual quality or, if an object-oriented approach is used, high computational complexity due to the number of auditory events to be modeled or processed. Other examples of problematic audio material are ambience recordings in general, for example the noise emitted by a flock of birds, a sea shore, galloping horses, a division of marching soldiers, etc.
Conventional concepts use, for example, parametric stereo or MPEG Surround coding (MPEG=Moving Picture Experts Group). FIG. 6 shows a typical application of a decorrelator in a mono-to-stereo up-mixer. A mono input signal is provided to a decorrelator 610, which provides a decorrelated signal at its output. The original input signal is provided to an up-mix matrix 620 together with the decorrelated signal. Dependent on up-mix control parameters 630, a stereo output signal is rendered. The signal decorrelator 610 generates a decorrelated signal D, which is fed to the matrixing stage 620 along with the dry mono signal M. Inside the mixing matrix 620, the stereo channels L (L=Left stereo channel) and R (R=Right stereo channel) are formed according to a mixing matrix H. The coefficients in the matrix H can be fixed, signal dependent or controlled by a user.
Alternatively, the matrix can be controlled by side information, transmitted along with the down-mix, containing a parametric description on how to up-mix the signals of the down-mix to form the desired multi-channel output. This spatial side information is usually generated by a signal encoder prior to the up-mix process.
This is typically done in parametric spatial audio coding as, for example, in Parametric Stereo, cf. J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, “High-Quality Parametric Spatial Audio Coding at Low Bitrates” in AES 116th Convention, Berlin, Preprint 6072, May 2004 and in MPEG Surround, cf. J. Herre, K. Kjörling, J. Breebaart, et al., “MPEG Surround - the ISO/MPEG Standard for Efficient and Compatible Multi-Channel Audio Coding” in Proceedings of the 122nd AES Convention, Vienna, Austria, May 2007. A typical structure of a parametric stereo decoder is shown in FIG. 7. In this example, the decorrelation process is performed in a transform domain. This is indicated by the analysis filterbank 710, which transforms an input mono signal to the transform domain, for example the frequency domain, where the signal is represented in a number of frequency bands.
In the frequency domain, the decorrelator 720 generates the corresponding decorrelated signal, which is to be up-mixed in the up-mix matrix 730. The up-mix matrix 730 considers up-mix parameters, which are provided by the parameter modification box 740, which is provided with spatial input parameters and coupled to a parameter control stage 750. In the example shown in FIG. 7, the spatial parameters can be modified by a user or by additional tools, for example post-processing for binaural rendering/presentation. In this case, the up-mix parameters can be merged with the parameters from the binaural filters to form the input parameters for the up-mix matrix 730. The measuring of the parameters may be carried out by the parameter modification block 740. The output of the up-mix matrix 730 is then provided to a synthesis filterbank 760, which determines the stereo output signal.
As described above, the output L/R of the mixing matrix H can be computed from the mono input signal M and the decorrelated signal D, for example according to
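\[
\begin{bmatrix} L \\ R \end{bmatrix}
=
\begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix}
\begin{bmatrix} M \\ D \end{bmatrix}.
\]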
In the mixing matrix, the amount of decorrelated sound fed to the output can be controlled on the basis of transmitted parameters as, for example, ICC (ICC=Interchannel Correlation) and/or mixed or user-defined settings.
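The up-mix of FIG. 6 may be illustrated by the following minimal sketch. It is not taken from the cited references. The function names, the use of a pure delay as the decorrelator, and the particular mapping from ICC to the coefficients of H are assumptions made for illustration. The mapping assumes that M and D are uncorrelated and of equal power, in which case the correlation between L and R equals cos(2a) and therefore the target ICC.

```python
import numpy as np

def delay_decorrelator(m, delay=441):
    """Hypothetical stand-in for decorrelator 610: a pure delay.
    Practical decorrelators typically use all-pass or reverberation filters."""
    m = np.asarray(m, dtype=float)
    d = np.zeros_like(m)
    d[delay:] = m[:len(m) - delay]
    return d

def icc_mixing_matrix(icc):
    """Build a 2x2 matrix H from a target inter-channel correlation.
    Assumed mapping: with M and D uncorrelated and of equal power,
    corr(L, R) = cos(a)^2 - sin(a)^2 = cos(2a) = icc."""
    a = 0.5 * np.arccos(np.clip(icc, -1.0, 1.0))
    return np.array([[np.cos(a),  np.sin(a)],
                     [np.cos(a), -np.sin(a)]])

def upmix(m, d, H):
    """Form [L, R]^T = H [M, D]^T sample by sample."""
    left, right = H @ np.vstack([m, d])
    return left, right

# Example: white noise as the mono input, target ICC of 0.3
rng = np.random.default_rng(0)
mono = rng.standard_normal(200_000)
decorrelated = delay_decorrelator(mono)
left, right = upmix(mono, decorrelated, icc_mixing_matrix(0.3))
measured_icc = np.corrcoef(left, right)[0, 1]
```

With ICC=1 the matrix feeds none of D to the output and both channels equal M. With ICC=0 the dry and decorrelated signals are mixed with equal weight and opposite sign of D in the two channels.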
Another conventional approach is the temporal permutation method. A dedicated proposal on decorrelation of applause-like signals can be found, for example, in Gerard Hotho, Steven van de Par, Jeroen Breebaart, “Multichannel Coding of Applause Signals,” in EURASIP Journal on Advances in Signal Processing, Vol. 1, Art. 10, 2008. Here, a monophonic audio signal is segmented into overlapping time segments, which are temporally permuted pseudo-randomly within a “super”-block to form the decorrelated output channels. The permutations are mutually independent for a number n of output channels.
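The principle of temporal permutation may be sketched as follows. This is a simplified illustration and not the method of the cited publication. It uses non-overlapping segments without windowing or overlap-add, whereas the cited approach uses overlapping segments. The function name and all parameters are hypothetical.

```python
import numpy as np

def permute_segments(x, seg_len, segs_per_block, n_channels, seed=0):
    """Shuffle fixed-length segments within each super-block, using an
    independent pseudo-random permutation per output channel and block.
    Simplification: segments do not overlap and are not windowed."""
    x = np.asarray(x)
    block_len = seg_len * segs_per_block
    n_blocks = len(x) // block_len
    if n_blocks == 0:
        raise ValueError("input is shorter than one super-block")
    # Drop trailing samples that do not fill a complete super-block
    segs = x[:n_blocks * block_len].reshape(n_blocks, segs_per_block, seg_len)
    rng = np.random.default_rng(seed)
    out = np.empty((n_channels, n_blocks * block_len), dtype=x.dtype)
    for ch in range(n_channels):
        shuffled = np.stack([blk[rng.permutation(segs_per_block)] for blk in segs])
        out[ch] = shuffled.reshape(-1)
    return out

# Example: 24 samples, segments of 2 samples, 3 segments per super-block
channels = permute_segments(np.arange(24.0), seg_len=2, segs_per_block=3, n_channels=2)
```

Each output channel contains exactly the same segments as the input, only reordered within each super-block, so every segment of the input appears unaltered in every output channel.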
Another approach is the alternating channel swap of original and delayed copy in order to obtain a decorrelated signal, cf. German patent application 102007018032.4-55.
In some conventional conceptual object-oriented systems, e.g. in Wagner, Andreas; Walther, Andreas; Melchior, Frank; Strauß, Michael; “Generation of Highly Immersive Atmospheres for Wave Field Synthesis Reproduction” at 116th International AES Convention, Berlin, 2004, it is described how to create an immersive scene out of many objects, for example single claps, by application of wave field synthesis.
Yet another approach is the so-called “directional audio coding” (DirAC=Directional Audio Coding), which is a method for spatial sound representation, applicable for different sound reproduction systems, cf. Pulkki, Ville, “Spatial Sound Reproduction with Directional Audio Coding” in J. Audio Eng. Soc., Vol. 55, No. 6, 2007. In the analysis part, the diffuseness and direction of arrival of sound are estimated in a single location dependent on time and frequency. In the synthesis part, microphone signals are first divided into non-diffuse and diffuse parts and are then reproduced using different strategies.
Conventional approaches have a number of disadvantages. For example, guided or unguided up-mix of audio signals having content such as applause may require strong decorrelation.
Consequently, on the one hand, strong decorrelation is needed to restore the ambience sensation of being, for example, in a concert hall. On the other hand, suitable decorrelation filters, for example all-pass filters, degrade the reproduction quality of transient events, like a single handclap, by introducing temporal smearing effects such as pre- and post-echoes and filter ringing. Moreover, spatial panning of single clap events has to be done on a rather fine time grid, while ambience decorrelation should be quasi-stationary over time.
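The smearing of a transient by an all-pass filter may be illustrated with the following sketch. It assumes a first-order Schroeder all-pass section as a stand-in for a decorrelation filter. The delay and gain values are arbitrary and not taken from any cited system. The sketch shows only post-ringing. Pre-echoes, which may arise for example from block-wise transform-domain processing, are not modeled.

```python
import numpy as np

def schroeder_allpass(x, delay, g):
    """Schroeder all-pass section:
    y[n] = -g * x[n] + x[n - delay] + g * y[n - delay]."""
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x))
    for n in range(len(x)):
        x_delayed = x[n - delay] if n >= delay else 0.0
        y_delayed = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + x_delayed + g * y_delayed
    return y

# A single click as an idealized handclap transient
click = np.zeros(2000)
click[0] = 1.0
smeared = schroeder_allpass(click, delay=37, g=0.7)

energy = np.sum(smeared ** 2)               # approximately preserved by an all-pass
peak = np.max(np.abs(smeared))              # lower than the input peak of 1.0
n_echoes = np.count_nonzero(np.abs(smeared) > 1e-3)  # click becomes a decaying train
```

The filter leaves the magnitude spectrum, and hence the energy, essentially unchanged, but it distributes the single click over a decaying train of echoes spaced by the delay. This is the kind of temporal smearing that degrades the reproduction of a single handclap.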
State-of-the-art systems according to J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, “High-Quality Parametric Spatial Audio Coding at Low Bitrates” in AES 116th Convention, Berlin, Preprint 6072, May 2004 and J. Herre, K. Kjörling, J. Breebaart, et al., “MPEG Surround - the ISO/MPEG Standard for Efficient and Compatible Multi-Channel Audio Coding” in Proceedings of the 122nd AES Convention, Vienna, Austria, May 2007 compromise temporal resolution vs. ambience stability and transient quality degradation vs. ambience decorrelation.
A system utilizing the temporal permutation method, for example, will exhibit perceivable degradation of the output sound due to a certain repetitive quality in the output audio signal. This is because one and the same segment of the input signal appears unaltered in every output channel, though at a different point in time. Furthermore, to avoid increased applause density, some original channels have to be dropped in the up-mix and, thus, some important auditory events might be missed in the resulting up-mix.
In object-oriented systems, such sound events are typically spatialized as a large group of point-like sources, which leads to a computationally complex implementation.