The present application is concerned with audio coding using down-mixing of signals.
Many audio encoding algorithms have been proposed in order to effectively encode or compress audio data of one channel, i.e., mono audio signals. Using psychoacoustics, audio samples are appropriately scaled, quantized or even set to zero in order to remove irrelevancy from, for example, the PCM coded audio signal. Redundancy removal is also performed.
As a further step, the similarity between the left and right channel of stereo audio signals has been exploited in order to effectively encode/compress stereo audio signals.
However, upcoming applications pose further demands on audio coding algorithms. For example, in teleconferencing, computer games, music performance and the like, several audio signals which are partially or even completely uncorrelated have to be transmitted in parallel. In order to keep the bit rate for encoding these audio signals low enough in order to be compatible to low-bit rate transmission applications, recently, audio codecs have been proposed which downmix the multiple input audio signals into a downmix signal, such as a stereo or even mono downmix signal. For example, the MPEG Surround standard downmixes the input channels into the downmix signal in a manner prescribed by the standard. The downmixing is performed by use of so-called OTT−1 and TTT−1 boxes for downmixing two signals into one and three signals into two, respectively. In order to downmix more than three signals, a hierarchic structure of these boxes is used. Each OTT−1 box outputs, besides the mono downmix signal, channel level differences between the two input channels, as well as inter-channel coherence/cross-correlation parameters representing the coherence or cross-correlation between the two input channels. The parameters are output along with the downmix signal of the MPEG Surround coder within the MPEG Surround data stream. Similarly, each TTT−1 box transmits channel prediction coefficients enabling recovering the three input channels from the resulting stereo downmix signal. The channel prediction coefficients are also transmitted as side information within the MPEG Surround data stream. The MPEG Surround decoder upmixes the downmix signal by use of the transmitted side information and recovers, the original channels input into the MPEG Surround encoder.
However, MPEG Surround, unfortunately, does not fulfill all requirements posed by many applications. For example, the MPEG Surround decoder is dedicated for upmixing the downmix signal of the MPEG Surround encoder such that the input channels of the MPEG Surround encoder are recovered as they are. In other words, the MPEG Surround data stream is dedicated to be played back by use of the loudspeaker configuration having been used for encoding.
However, according to some implications, it would be favorable if the loudspeaker configuration could be changed at the decoder's side.
In order to address the latter needs, the spatial audio object coding (SAOC) standard is currently designed. Each channel is treated as an individual object, and all objects are downmixed into a downmix signal. However, in addition the individual objects may also comprise individual sound sources as e.g. instruments or vocal tracks. However, differing from the MPEG Surround decoder, the SAOC decoder is free to individually upmix the downmix signal to replay the individual objects onto any loudspeaker configuration. In order to enable the SAOC decoder to recover the individual objects having been encoded into the SAOC data stream, object level differences and, for objects forming together a stereo (or multi-channel) signal, inter-object cross correlation parameters are transmitted as side information within the SAOC bitstream. Besides this, the SAOC decoder/transcoder is provided with information revealing how the individual objects have been downmixed into the downmix signal. Thus, on the decoder's side, it is possible to recover the individual SAOC channels and to render these signals onto any loudspeaker configuration by utilizing user-controlled rendering information.
However, although the SAOC codec has been designed for individually handling audio objects, some applications are even more demanding. For example, Karaoke applications necessitate a complete separation of the background audio signal from the foreground audio signal or foreground audio signals. Vice versa, in the solo mode, the foreground objects have to be separated from the background object. However, owing to the equal treatment of the individual audio objects it was not possible to completely remove the background objects or the foreground objects, respectively, from the downmix signal.