Recent developments in audio coding have brought forward several parametric audio coding techniques for jointly coding a multi-channel audio signal (e.g. 5.1 channels) into one (or more) down-mix channels plus a side information stream. Generally, the side information stream contains parameters relating to properties of the original channels of the multi-channel signal, either with respect to other original channels of the multi-channel signal or with respect to the down-mix channel. The particular definition of the parameters and of the reference channel to which these parameters relate depends on the specific implementation. Some of the techniques known in the art are “binaural cue coding”, “spatial audio coding”, and “parametric stereo”.
For details of these particular implementations, reference is herewith made to related publications. Binaural cue coding is for example detailed in:
C. Faller and F. Baumgarte, “Efficient representation of spatial audio using perceptual parametrization,” IEEE WASPAA, Mohonk, N.Y., October 2001; F. Baumgarte and C. Faller, “Estimation of auditory spatial cues for binaural cue coding,” ICASSP, Orlando, Fla., May 2002; C. Faller and F. Baumgarte, “Binaural cue coding: a novel and efficient representation of spatial audio,” ICASSP, Orlando, Fla., May 2002; C. Faller and F. Baumgarte, “Binaural cue coding applied to audio compression with flexible rendering,” AES 113th Convention, Los Angeles, Preprint 5686, October 2002; C. Faller and F. Baumgarte, “Binaural Cue Coding—Part II: Schemes and applications,” IEEE Trans. on Speech and Audio Proc., vol. 11, no. 6, November 2003.
While binaural cue coding uses multiple original channels, parametric stereo is a related technique for the parametric coding of a two-channel stereo signal resulting in a transmitted mono signal and parameter side information, as for example reviewed in the following publications: J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, “High-Quality Parametric Spatial Audio Coding at Low Bitrates”, AES 116th Convention, Berlin, Preprint 6072, May 2004; E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegard, “Low Complexity Parametric Stereo Coding”, AES 116th Convention, Berlin, Preprint 6073, May 2004.
Other technologies are based on multiplexing of arbitrary numbers of audio sources or objects into a single transmission audio channel. Schemes based on multiplexing are, for example, introduced as “flexible rendering” in BCC (binaural cue coding) related publications or, more recently, by a scheme called “joint source coding” (JSC). Related publications are, for example: C. Faller, “Parametric Joint Coding of Audio Sources”, Convention Paper 6752, 120th AES Convention, Paris, May 2006. Similar to the parametric stereo and binaural cue coding schemes, these techniques are intended to encode multiple original audio objects (channels) for transmission by fewer down-mix channels. By additionally deriving object-based parameters for each input channel, which can be encoded at a very low data rate and which are also transmitted to a receiver, these objects can be separated at the receiver side and rendered (mixed) to a certain number of output devices, for example headphones, two-channel stereo loudspeakers, or multi-channel loudspeaker set-ups. This approach allows for level adjustment and redistribution (panning) of the different audio objects to different locations in the reproduction set-up, i.e. at the receiver side.
Basically, such techniques operate as an M-k-N transmitter, with M being the number of audio objects at the input, k being the number of transmitted down-mix channels (typically k≦2), and N being the number of audio channels at the renderer output, for example the number of loudspeakers. That is, N=2 for a stereo renderer or N=6 for a 5.1 multi-channel speaker set-up. In terms of compression efficiency, typical values are e.g. 64 kbps or less for a perceptually coded down-mix channel (consisting of k audio channels) and approximately 3 kbps for object parameters per transmitted audio object.
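The bit-rate figures above can be turned into a rough budget. The following is a minimal sketch, not part of any of the cited schemes: it assumes that the 64 kbps figure covers the whole down-mix (all k channels together) and that each of the M objects adds about 3 kbps of parameter data. The function name and its default values are illustrative assumptions.

```python
def total_bitrate_kbps(num_objects, downmix_kbps=64.0, per_object_kbps=3.0):
    """Estimate the total rate of an M-k-N transmission in kbps.

    Assumptions (illustrative only): downmix_kbps is the rate of the
    complete perceptually coded down-mix, and per_object_kbps is the
    parameter side-information rate for each transmitted audio object.
    """
    if num_objects < 0:
        raise ValueError("num_objects must be non-negative")
    return downmix_kbps + per_object_kbps * num_objects


# Example: M = 4 audio objects carried by one down-mix stream.
rate = total_bitrate_kbps(4)
print(rate)  # 64 + 4 * 3 = 76.0
```

Under these assumptions the side information grows linearly with the number of objects, while the down-mix rate stays fixed.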
Application scenarios for the above techniques are, for example, the encoding of spatial audio scenes related to cinema-movie productions to allow for a spatial reproduction of sound in a home-theatre system. Common examples are the widely known 5.1 and 7.1 surround-sound tracks on movie media, such as DVD and the like. Movie productions are becoming more and more complex with respect to the audio scenes, which are intended to provide a spatial listening experience and thus have to be mixed with great care. Different sound engineers may be commissioned with the mixing of different surround sources or sound effects, and therefore transmission of parametrically encoded multi-channel scenarios between the individual sound engineers is desirable in order to transport their audio streams efficiently.
Another application scenario for such a technology is teleconferencing with multiple talkers at either end of a point-to-point connection. To save bandwidth, most teleconferencing set-ups operate with monophonic transmission. Using, for example, joint source coding or one of the other multi-channel encoding techniques for transmission, redistribution and level-alignment of the different talkers at the receiving end (each end) can be achieved, and thus the intelligibility and balance of the talkers is enhanced by spending a marginally increased bit rate as compared to a monophonic system. The advantage of increased intelligibility becomes particularly evident in the special case of assigning each individual participant of the conference to a single channel (and thus loudspeaker) of a multi-channel speaker set-up at a receiving end. This, however, is a special case. In general, the number of participants will not match the number of loudspeakers at the receiving end. However, using the existing loudspeaker set-up, it is possible to render the signal associated with each participant such that it appears to originate from any desired position. That is, the individual participant is recognized not only by his/her distinct voice but also by the location of the audio source related to the talking participant.
While the state of the art techniques implement concepts as to how to efficiently encode multiple channels or audio objects, all of the presently known techniques lack the possibility to combine two or more of these transmitted audio-streams efficiently to derive an output stream (output signal), which is a representation of all of the input audio-streams (input audio signals).
The problem arises, for example, when a teleconferencing scenario with more than two locations is considered, each location having one or more talkers. Then, an intermediate instance is required to receive the audio input signals of the individual sources and to generate, for each teleconferencing location, an audio output signal containing only the information of the remaining teleconferencing locations. That is, the intermediate instance has to generate an output signal which is derived from a combination of two or more audio input signals and which allows for a reproduction of the individual audio channels or audio objects of the two or more input signals.
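The routing task of such an intermediate instance can be illustrated as follows. This is only a sketch of the bookkeeping, under the assumption that each location contributes one input stream with a known number of audio objects; it does not perform any audio combination. The function name, the site labels, and the object counts are hypothetical.

```python
def plan_outputs(objects_per_site):
    """For each destination site, list the other sites whose streams
    must be combined and the number of audio objects the combined
    output has to represent.

    objects_per_site maps a (hypothetical) site label to the number of
    audio objects in the stream sent by that site.
    """
    plan = {}
    for dest in objects_per_site:
        # A site receives every stream except its own.
        sources = [site for site in objects_per_site if site != dest]
        plan[dest] = {
            "sources": sources,
            "num_objects": sum(objects_per_site[s] for s in sources),
        }
    return plan


# Example: three conference locations with 2, 1 and 3 talkers.
plan = plan_outputs({"A": 2, "B": 1, "C": 3})
print(plan["A"])  # {'sources': ['B', 'C'], 'num_objects': 4}
```

With more than two locations, every output is a combination of at least two input streams, which is exactly the case the known techniques do not handle efficiently.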
A similar scenario may occur when two audio-engineers in a cinema-movie production want to combine their spatial-audio signals to check for the listening impression generated by both signals. Then, it may be desirable to directly combine two encoded multi-channel signals to check for the combined listening impression. That is, a combined signal needs to be such that it resembles all of the audio objects (sources) of the two audio-engineers.
However, according to prior art techniques, such a combination is only feasible by first decoding the audio signals (streams). The decoded audio signals may then be re-encoded by prior art multi-channel encoders to generate a combined signal in which all of the original audio channels or audio objects are represented appropriately.
This has the disadvantage of high computational complexity, which wastes a lot of energy and sometimes even makes it infeasible to apply the concept, especially in real-time scenarios. Furthermore, a combination by subsequent audio decoding and re-encoding can cause a considerable delay due to the two processing steps, which is unacceptable for certain applications, such as teleconferencing/telecommunications.