The present invention relates to an apparatus and a method for generating decorrelated signals, and in particular to deriving decorrelated signals from a signal containing transients such that reconstructing a multi-channel audio signal and/or a subsequent combination of the decorrelated signal and the transient signal will not result in any audible signal degradation.
Many applications in the field of audio signal processing require generating a decorrelated signal from a given audio input signal. Examples include the stereo upmix of a mono signal, the multi-channel upmix based on a mono or stereo signal, the generation of artificial reverberation, and the widening of the stereo basis.
Current methods and/or systems suffer from extensive degradation of the quality and/or the perceivable sound impression when confronted with a special class of signals (applause-like signals). This is particularly the case when playback takes place via headphones. In addition, standard decorrelators use methods exhibiting high complexity and/or high computing expenditure.
To illustrate the problem, FIGS. 7 and 8 show the use of decorrelators in signal processing. Here, brief reference is made to the mono-to-stereo decoder shown in FIG. 7.
Same comprises a standard decorrelator 10 and a mix matrix 12. The mono-to-stereo decoder serves for converting a fed-in mono signal 14 to a stereo signal 16 consisting of a left channel 16a and a right channel 16b. From the fed-in mono signal 14, the standard decorrelator 10 generates a decorrelated signal 18 (D) which, together with the fed-in mono signal 14, is applied to the inputs of the mix matrix 12. In this context, the untreated mono signal is often also referred to as a “dry” signal, whereas the decorrelated signal D is referred to as a “wet” signal.
The mix matrix 12 combines the decorrelated signal 18 and the fed-in mono signal 14 so as to generate the stereo signal 16. Here, the coefficients of the mix matrix 12 (H) may either be fixedly given, signal-dependent or dependent on a user input. In addition, this mixing process performed by the mix matrix 12 may also be frequency-selective. I.e., different mixing operations and/or matrix coefficients may be employed for different frequency ranges (frequency bands). For this purpose, the fed-in mono signal 14 may be preprocessed by a filter bank so that same, together with the decorrelated signal 18, is present in a filter bank representation, in which the signal portions pertaining to different frequency bands are each processed separately.
The control of the upmix process, i.e. of the coefficients of the mix matrix 12, may be performed by user interaction via a mix control 20. In addition, the coefficients of the mix matrix 12 (H) may also be set via so-called “side information”, which is transferred together with the fed-in mono signal 14 (the downmix). Here, the side information contains a parametric description of how the multi-channel signal is to be generated from the fed-in mono signal 14 (the transmitted signal). This spatial side information is typically generated by an encoder prior to the actual downmix, i.e. the generation of the fed-in mono signal 14.
The above-described process is normally employed in parametric (spatial) audio coding. As an example, the so-called “Parametric Stereo” coding (H. Purnhagen: “Low Complexity Parametric Stereo Coding in MPEG-4”, 7th International Conference on Audio Effects (DAFX-04), Naples, Italy, October 2004) and the MPEG Surround method (L. Villemoes, J. Herre, J. Breebaart, G. Hotho, S. Disch, H. Purnhagen, K. Kjörling: “MPEG Surround: The forthcoming ISO standard for spatial audio coding”, AES 28th International Conference, Piteå, Sweden, 2006) use such a method.
One typical example of a Parametric Stereo decoder is shown in FIG. 8. Compared with the simple, non-frequency-selective case shown in FIG. 7, the decoder shown in FIG. 8 additionally comprises an analysis filter bank 30 and a synthesis filter bank 32. This is because decorrelation is performed here in a frequency-dependent manner (in the spectral domain). For this reason, the fed-in mono signal 14 is first split into signal portions for different frequency ranges by the analysis filter bank 30. That is, for each frequency band its own decorrelated signal is generated analogously to the example described above. In addition to the fed-in mono signal 14, spatial parameters 34 are transferred, which serve to determine or vary the matrix elements of the mix matrix 12 so as to generate a mixed signal which, by means of the synthesis filter bank 32, is transformed back into the time domain so as to form the stereo signal 16.
In addition, the spatial parameters 34 may optionally be altered via a parameter control 36 so as to generate the upmix and/or the stereo signal 16 differently for different playback scenarios and/or to optimally adjust the playback quality to the respective scenario. If the spatial parameters 34 are adjusted for binaural playback, for example, the spatial parameters 34 may be combined with parameters of the binaural filters so as to form the parameters controlling the mix matrix 12. Alternatively, the parameters may be altered by direct user interaction or other tools and/or algorithms (see, for example: Breebaart, Jeroen; Herre, Jurgen; Jin, Craig; Kjörling, Kristofer; Koppens, Jeroen; Plogsties, Jan; Villemoes, Lars: Multi-Channel Goes Mobile: MPEG Surround Binaural Rendering. AES 29th International Conference, Seoul, Korea, 2006 Sep. 2-4).
The output of the channels L and R of the mix matrix 12 (H) is generated from the fed-in mono signal 14 (M) and the decorrelated signal 18 (D) as follows, for example:
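\[
\begin{bmatrix} L \\ R \end{bmatrix}
=
\begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix}
\begin{bmatrix} M \\ D \end{bmatrix}
\]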
Therefore, the portion of the decorrelated signal 18 (D) contained in the output signal is adjusted in the mix matrix 12. In the process, the mixing ratio is varied over time based on the spatial parameters 34 transferred. These parameters may, for example, describe the correlation of two original signals (parameters of this kind are used in MPEG Surround coding, for example, where they are referred to, among other things, as ICC). In addition, parameters may be transferred that describe the energy ratios of two originally present channels contained in the fed-in mono signal 14 (ICLD and/or ICD in MPEG Surround). Alternatively, or in addition, the matrix elements may be varied by direct user input.
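As a minimal sketch of the matrix relation given above, the following Python snippet applies a 2x2 mix matrix to a dry signal M and a wet signal D sample by sample. The function name `upmix` and the coefficient values are illustrative assumptions and are not taken from any standard; in an actual decoder the coefficients would be derived from the spatial parameters and could differ per frequency band and per time frame.

```python
def upmix(m, d, h):
    """Apply a 2x2 mix matrix h to dry samples m and wet samples d.

    h is ((h11, h12), (h21, h22)); returns the (left, right) sample lists.
    """
    (h11, h12), (h21, h22) = h
    left = [h11 * ms + h12 * ds for ms, ds in zip(m, d)]
    right = [h21 * ms + h22 * ds for ms, ds in zip(m, d)]
    return left, right


# Hypothetical coefficients: equal dry share, wet share with opposite signs.
H = ((1.0, 0.5), (1.0, -0.5))
dry = [1.0, 0.0, -1.0]   # fed-in mono signal M
wet = [0.0, 2.0, 0.0]    # decorrelated signal D
left, right = upmix(dry, wet, H)
```

Adding the wet signal with opposite signs to the two outputs is one simple way in which the share of D controls how different L and R become; with h12 = h22 = 0 both outputs would be identical copies of M.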
For the generation of the decorrelated signals, a series of different methods have so far been used.
Parametric Stereo and MPEG Surround use all-pass filters, i.e. filters passing the entire spectral range but having a spectrally dependent filter characteristic. In Binaural Cue Coding (BCC, Faller and Baumgarte, see, for example: C. Faller: “Parametric Coding Of Spatial Audio”, Ph.D. thesis, EPFL, 2004) a “group delay” for decorrelation is proposed. For this purpose, a frequency-dependent group delay is applied to the signal by altering the phases in the DFT spectrum of the signal. That is, different frequency ranges are delayed for different periods of time. Such a method usually falls under the category of phase manipulations.
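The phase manipulation described above may be sketched as follows. The snippet is an illustration under stated assumptions and not the BCC implementation: it uses a naive DFT on a single short block, a hypothetical delay profile `lambda k: 0.5 * k`, and the function name `group_delay_decorrelate` is invented for this example. The phase of each bin is the running sum of the requested delays, since group delay is the negative slope of the phase over frequency.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) DFT, adequate for a short illustration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def group_delay_decorrelate(x, group_delay):
    """Apply a frequency-dependent group delay by rotating DFT phases.

    group_delay(k) is the desired delay in samples around bin k. Delays are
    circular because the DFT treats the block as periodic.
    """
    n = len(x)
    spec = dft(x)
    out = list(spec)
    phase = 0.0
    for k in range(1, n // 2):
        phase -= 2.0 * math.pi * group_delay(k) / n
        out[k] = spec[k] * cmath.exp(1j * phase)
        out[n - k] = out[k].conjugate()  # mirror bin keeps the output real
    return [v.real for v in idft(out)], spec, out


impulse = [1.0] + [0.0] * 31   # a single idealized transient
# Assumed delay profile: higher bins are delayed longer (0.5 to 7.5 samples).
wet, spec_in, spec_out = group_delay_decorrelate(impulse, lambda k: 0.5 * k)
peak = max(abs(v) for v in wet)
energy = sum(v * v for v in wet)
```

In this sketch the bin magnitudes and the total energy are unchanged, while the peak of the output is lower than that of the input impulse, because the different frequency ranges no longer arrive at the same instant.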
In addition, the use of simple delays, i.e. fixed time delays, is known. This method is used, for example, for generating surround signals for the rear speakers in a multi-channel configuration so as to perceptually decorrelate them from the front signals. A typical matrix surround system of this kind is Dolby ProLogic II, which uses a time delay of 20 to 40 ms for the rear audio channels. Such a simple implementation may be used for decorrelating the front and rear speakers because this decorrelation is substantially less critical, as far as the listening experience is concerned, than the decorrelation of the left and right channels. The latter is of substantial importance for the “width” of the reconstructed signal as perceived by the listener (see: J. Blauert: “Spatial hearing: The psychophysics of human sound localization”; MIT Press, Revised edition, 1997).
The popular decorrelation methods described above exhibit the following substantial drawbacks:
- spectral coloration of the signal (comb-filter effect)
- reduced “crispness” of the signal
- disturbing echo and reverberation effects
- unsatisfactorily perceived decorrelation and/or unsatisfactory width of the audio mapping
- repetitive sound character.
Here, it has been found that signals having a high temporal density and spatial distribution of transient events, transferred together with a broadband noise-like signal component, are the most critical for this type of signal processing. This is in particular the case for applause-like signals, which possess the above-mentioned properties. The reason is that decorrelation may smear each single transient signal (event) in time, while at the same time the noise-like background is rendered spectrally colored due to comb-filter effects, which is easily perceived as a change in the timbre of the signal.
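The temporal smearing of a transient can be illustrated with a textbook Schroeder all-pass section. This is a swapped-in example chosen for brevity and is not the specific filter structure used in Parametric Stereo or MPEG Surround; the delay of 8 samples and the gain of 0.7 are assumed values, and the function name `schroeder_allpass` is introduced here for illustration only.

```python
def schroeder_allpass(x, delay, g):
    """y[n] = -g*x[n] + x[n-delay] + g*y[n-delay] (unit magnitude response)."""
    y = []
    for n, xn in enumerate(x):
        x_del = x[n - delay] if n >= delay else 0.0
        y_del = y[n - delay] if n >= delay else 0.0
        y.append(-g * xn + x_del + g * y_del)
    return y


DELAY = 8      # assumed delay in samples
G = 0.7        # assumed feedback gain
click = [1.0] + [0.0] * (40 * DELAY)   # a single idealized clap
response = schroeder_allpass(click, DELAY, G)
energy = sum(v * v for v in response)
# Positions at which the filtered click still has noticeable amplitude.
audible_taps = [n for n, v in enumerate(response) if abs(v) > 1e-3]
```

Although the total energy of the click is preserved, it is distributed over a decaying train of echoes spaced `DELAY` samples apart instead of being concentrated in one sample, which corresponds to the smearing of the attack described above.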
To summarize, the known decorrelation methods either generate the above-mentioned artifacts or else are unable to generate the necessitated degree of decorrelation.
It is especially to be noted that listening via headphones is generally more critical than listening via speakers. For this reason, the above-described drawbacks are particularly relevant for applications that generally involve listening via headphones. This is generally the case for portable playback devices, which, in addition, have only a limited energy supply. In this context, the computing capacity that has to be spent on the decorrelation is also an important aspect. Most of the known decorrelation algorithms are extremely computationally intensive. An implementation therefore requires a relatively high number of calculation operations, which calls for fast processors that inevitably consume large amounts of energy. In addition, a large amount of memory is required for implementing such complex algorithms. This, in turn, results in increased energy demand.
Particularly in the playback of binaural signals (and in listening via headphones), a number of special problems occur concerning the perceived reproduction quality of the rendered signal. For one thing, in the case of applause signals, it is particularly important to correctly render the attack of each clapping event so as not to corrupt the transient event. A decorrelator is therefore required that does not smear the attack in time, i.e. that does not exhibit any temporally dispersive characteristic. The filters described above, which introduce a frequency-dependent group delay, and all-pass filters in general are not suitable for this purpose. In addition, it is necessary to avoid a repetitive sound impression as is caused by a simple time delay, for example. If such a simple time delay were used to generate a decorrelated signal, which was then added to the direct signal by means of a mix matrix, the result would sound extremely repetitive and therefore unnatural. In addition, such a static delay generates comb-filter effects, i.e. undesired spectral colorations in the reconstructed signal.
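The comb-filter effect of a static delay can be made concrete with a short calculation. Adding a copy delayed by d samples to the direct signal yields the frequency response H(f) = 1 + g·exp(−j2πfd/fs), whose magnitude has notches at odd multiples of fs/(2d). The sketch below assumes a sample rate of 48 kHz, a delay of 20 ms and equal gains for the direct and delayed paths; these values and the function name `comb_magnitude` are illustrative assumptions.

```python
import cmath
import math


def comb_magnitude(freq_hz, delay_samples, sample_rate_hz, gain=1.0):
    """Magnitude of H(f) = 1 + gain * exp(-j*2*pi*f*delay/fs)."""
    omega = 2.0 * math.pi * freq_hz / sample_rate_hz
    return abs(1.0 + gain * cmath.exp(-1j * omega * delay_samples))


FS = 48000.0                      # assumed sample rate in Hz
DELAY = 960                       # assumed delay: 20 ms at 48 kHz
first_notch = FS / (2 * DELAY)    # 25 Hz: direct and delayed parts cancel
first_peak = FS / DELAY           # 50 Hz: direct and delayed parts add up

notch_level = comb_magnitude(first_notch, DELAY, FS)
peak_level = comb_magnitude(first_peak, DELAY, FS)
```

With these assumed values the pattern of cancellation and reinforcement repeats every 50 Hz across the whole spectrum, which is why a broadband noise-like background is audibly colored.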
The use of simple time delays additionally results in the known precedence effect (see, for example: J. Blauert: “Spatial hearing: The psychophysics of human sound localization”; MIT Press, Revised edition, 1997). This effect originates from the fact that, when a simple time delay is used, there is one output channel leading in time and one output channel following in time. The human ear perceives the origin of a tone, sound or object in the spatial direction from which it first hears the sound. That is, the signal source is perceived in the direction in which the signal portion of the temporally leading output channel (leading signal) happens to be played back, irrespective of whether the spatial parameters actually responsible for the spatial allocation indicate something different.