Embodiments according to the invention relate to an apparatus and a method for generating a multi-channel audio signal based on an input audio signal.
Some embodiments according to the invention relate to audio signal processing, in particular to concepts for generating multi-channel signals in cases where a separate signal was not transmitted for each loudspeaker.
When a signal with N audio channels is reproduced by an audio system with M reproduction channels (M>N), for example, the following possibilities exist:
1) Only some of the available loudspeakers are used.
2) A signal is generated that makes use of the complete available reproduction system.
The second possibility is the preferable solution and is called upmix in the following text.
In the context of upmixing, there are two different kinds of methods for generating a multi-channel signal. In the first kind, an existing multi-channel signal is mixed down to a smaller number of channels, and the original signal is regenerated at the receiver based on additional data. This method is also called guided upmix.
The other possibility is a so-called blind upmix method, that is, a multi-channel extension without previous knowledge. There is no additional data that controls the process. There is also no original or reference sound impression that has to be reproduced or reached by the blind upmix.
Therefore, different approaches for realizing a blind upmix exist.
One possible approach is known as the direct ambience concept. In this case, direct sound sources are reproduced by the three front channels (for example, in a so-called 5.1 home cinema system), so that a listener hears the direct sound sources at the same positions as in the original two-channel version (for example, when the input signal is a stereo signal).
FIG. 2 shows a schematic illustration of an audio signal reproduction 200 for a two-channel system. An original two-channel version is shown, for example, with three direct sound sources S1, S2, S3, 240. The audio signal is reproduced for a listener 210 by a left loudspeaker 220 and a right loudspeaker 230 and comprises signal portions of the three direct sound sources and an ambience portion 250 indicated by the encircled area. This is, for example, a standard two-channel stereo reproduction (3 sources and ambience).
FIG. 3 shows a schematic illustration of an audio signal reproduction 300 of a blind upmix according to the direct ambience concept. Five loudspeakers (center 310, front left 320, front right 330, rear left 340 and rear right 350) are shown for reproducing a multi-channel audio signal.
Direct sound sources 240 are reproduced by the three loudspeakers 310, 320, 330 in front. Ambience portions 250 contained in the audio track are reproduced by the front channels and the surround channels in order to envelop a listener 210.
Ambience portions are portions of the signal that cannot be assigned to a single source but are instead assigned to a combination of all sound components that creates an impression of the audible environment. Ambience portions may comprise, for example, room reflections and room reverberation, but also audience sounds (for example, applause), natural sounds (for example, rain) or artificial sound effects (for example, vinyl crackling).
A further possible concept is often referred to as the in-the-band concept. FIG. 4 shows a schematic illustration of an audio signal reproduction 400 according to the in-the-band concept. The arrangement of the loudspeakers corresponds to the arrangement of the loudspeakers in FIG. 3. However, each sound type, for example direct sound sources and ambience-like sounds, is positioned around the listener.
Since all output signals are generated from the same input signal, the output signals should be further decorrelated. Many known methods may be used for this, for example a temporal delay or an all-pass filter. In addition to the decorrelation effect, however, these simple methods often exhibit disturbing drawbacks.
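The decorrelating effect of a temporal delay can be illustrated with a minimal sketch. The noise test signal, the delay of 441 samples and the helper names `normalized_correlation` and `delay` are assumptions chosen for illustration only; they are not taken from any of the cited methods.

```python
import random

def normalized_correlation(x, y):
    """Zero-lag normalized cross-correlation of two equal-length sequences."""
    num = sum(a * b for a, b in zip(x, y))
    den = (sum(a * a for a in x) * sum(b * b for b in y)) ** 0.5
    return num / den if den else 0.0

def delay(signal, samples):
    """Delay a signal by an integer number of samples, keeping its length."""
    if samples <= 0:
        return list(signal)
    return [0.0] * samples + list(signal[:len(signal) - samples])

# Hypothetical noise-like input standing in for an ambience signal.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(4000)]

front = source
rear = delay(source, 441)  # assumed delay, e.g. 10 ms at 44.1 kHz

corr_same = normalized_correlation(front, front)    # identical channels
corr_delayed = normalized_correlation(front, rear)  # delayed copy
print(corr_same, corr_delayed)
```

For a noise-like input, the zero-lag correlation between the original and the delayed copy is close to zero, although both channels carry the same content.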
For example, one drawback is that nearly all decorrelation methods distort the temporal structure of the input signals, so that transient structures lose their transient character. As a result, an applause-like ambience signal, for example, may achieve only an enveloping effect but no immersion.
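The loss of transient character can be made visible with a small sketch that passes a single click through an all-pass filter. A Schroeder-type all-pass section is assumed here as one common realization; the function name `schroeder_allpass` and the parameter values are illustrative and not prescribed by the text.

```python
def schroeder_allpass(x, delay, gain):
    """Schroeder all-pass: y[n] = -g*x[n] + x[n-D] + g*y[n-D]."""
    y = []
    for n, xn in enumerate(x):
        x_d = x[n - delay] if n >= delay else 0.0
        y_d = y[n - delay] if n >= delay else 0.0
        y.append(-gain * xn + x_d + gain * y_d)
    return y

# A single click: all energy is concentrated in one sample.
click = [1.0] + [0.0] * 199

# Assumed illustrative parameters (delay in samples, feedback gain).
smeared = schroeder_allpass(click, delay=5, gain=0.7)

total_energy = sum(v * v for v in smeared)
peak_energy = max(v * v for v in smeared)
nonzero_samples = sum(1 for v in smeared if abs(v) > 1e-6)
print(total_energy, peak_energy, nonzero_samples)
```

The total energy of the click is preserved, as expected for an all-pass filter, but it is spread over many samples instead of one, which is the kind of temporal smearing described above.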
Special signal types, such as applause or rain, occupy an exceptional position among the ambience signals. They are ambience signals that do not necessarily give a room impression. Rather, they create an enveloping feeling through the vast number of temporal and spatial overlays of single portions, each of which has a direct-sound character of its own, for example single claps or single raindrops. Through the overlay, the resulting overall signal acquires largely the same statistical properties as those known from room reverberation.
These signal types in particular are difficult to handle with an upmix method (guided upmix as well as blind upmix). They also often lead to a faulty upmix; for example, a comb-filter-like effect can often be heard.
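A comb-filter-like coloration arises, for example, when a signal is superimposed with a delayed copy of itself. The following sketch assumes this mechanism purely for illustration (the text does not state how the effect arises in a particular upmix) and evaluates the magnitude response of the sum of a signal and its delayed copy; the function name and the parameter values are hypothetical.

```python
import cmath
import math

def comb_magnitude(freq_hz, delay_samples, sample_rate):
    """Magnitude of H(f) = 1 + exp(-j*2*pi*f*D/fs) for a delay of D samples."""
    omega = 2.0 * math.pi * freq_hz / sample_rate
    return abs(1.0 + cmath.exp(-1j * omega * delay_samples))

sample_rate = 48000   # assumed sampling rate in Hz
delay_samples = 48    # assumed delay of 1 ms

# Notches lie at odd multiples of fs / (2 * D), peaks at multiples of fs / D.
notch = comb_magnitude(500.0, delay_samples, sample_rate)
peak = comb_magnitude(1000.0, delay_samples, sample_rate)
print(notch, peak)
```

With these values, the response is cancelled almost completely at 500 Hz and doubled at 1000 Hz; this regular pattern of notches and peaks over frequency is what is perceived as comb filtering.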
Known blind upmix methods that create the signal portions for the rear channels in such a way that these artifacts do not occur generate a limited sound impression, for example one in which the audience claps in front of the listener and the surround channels only convey an impression of the room in which the applause takes place (enveloping ambience). Especially for these ambiences, however, it is desirable to be part of the clapping audience or to stand in the rain (immersive ambience). For this, all portions (similar to the in-the-band concept) should be distributed around the listener, but without further measures this would once again lead to a sound impression with artifacts.
In “A. Wagner, A. Walther, F. Melchior, M. Strauß; “Generation of Highly Immersive Atmospheres for Wave Field Synthesis Reproduction”; Presented at the AES 116th Convention, Berlin, 2004”, a method is described by which an immersive ambience may be generated for wave field synthesis. For that, a listener is surrounded by a 360° decorrelated, enveloping sound field, which gives an impression of the represented acoustic environment.
To reach an immersion effect, so-called focused sources are added. A focused source is a point sound source, which is perceptible as a single source and represents characteristic single sounds of the enveloping sound field.
According to the publication, single sources (sound particles) have to be available for each ambience in large numbers and may either be separately recorded sounds or artificial sounds generated by a synthesizer.
This object-oriented approach has the drawback that different audio signals for each ambience type should already be available: on the one hand, the enveloping ambience signals as decorrelated single tracks; on the other hand, the single sound sources as separate audio files. A mentioned alternative is to generate these artificially (for example, with synthesizer software) for each ambience type (if it is known), which carries the risk that they do not fit the reproduced ambience. Additionally, such a generation needs, for example, a mathematical model of the particle sounds and a lot of computing time. In general, the effort for wave field synthesis is very high.
In “Gerard Hotho; Steven van de Par; Jeroen Breebaart; “Multichannel Coding of Applause Signals”; Research Article”, a method for multi-channel coding of applause signals is described, which in particular includes a method for decorrelating random ambiences (named there as applause, rain and crackling).
There, it is mentioned that a frequency-selective coder degrades the quality of the signals, and therefore a purely time-domain-based coder is presented.
In this context, only a decorrelation is to be performed, which basically means that all signals sound the same (or the same as the input). A decorrelation method is introduced that is intended to successfully reproduce a reference sound.
In an earlier, non-prepublished European patent application with the application number EP 08018793, a method is introduced that decomposes an applause-like signal into a foreground sound and a background sound. Reference is also made to “A. Wagner, A. Walther, F. Melchior, M. Strauß; “Generation of Highly Immersive Atmospheres for Wave Field Synthesis Reproduction”; Presented at the AES 116th Convention, Berlin, 2004”. An enveloping ambience is separated from the perceptible single sounds of which the ambience consists, and these two parts can then be handled separately from each other.
In the mentioned non-prepublished patent application, a method is described that includes one embodiment (guided mode) which tries to reproduce the original ambience. In principle, the background sounds (unlike the foreground sounds) are only decorrelated, and the foreground sounds are only placed at different positions at different times. It may therefore be said that this is merely a decorrelation method.
The overall signal is decomposed into a foreground and a background. It can be assumed that only a joint reproduction of the separated parts will sound good again, whereas each part on its own may comprise artifacts.
Further known upmix methods are described for example in “Roy Irwan and Ronaldus Aarts, “Multi-Channel Audio Converter”, International Publication Number: WO 02/052896 A2”, in “Carlos Avendano and Jean-Marc Jot, “Stream Segregation For Stereo Signals”, Pub. No. US 2007/0041592 A1”, in “David Griesinger, “Multichannel Active Matrix Encoder And Decoder With Maximum Lateral Separation”, Patent Number US005870480A” and in “Jan Petersen, “Multi-Channel Sound Reproduction System For Stereophonic Signals”, International Publication Number WO 01/62045 A1”, which do not differentiate between different input signals.