I. Introduction
In a general coding problem, we have a number of (mono) source signals si(n) (1≤i≤M) and a scene description vector S(n), where n is the time index. The scene description vector contains parameters such as (virtual) source positions, source widths, and acoustic parameters such as (virtual) room parameters. The scene description may be time-invariant or may change over time. The source signals and scene description are coded and transmitted to a decoder. The coded source signals, ŝi(n), are subsequently mixed as a function of the scene description, Ŝ(n), to generate wavefield synthesis, multi-channel, or stereo signals. The decoder output signals are denoted x̂i(n) (0≤i≤N). Note that the scene description vector S(n) may not be transmitted but may instead be determined at the decoder. In this document, the term “stereo audio signal” always refers to two-channel stereo audio signals.
ISO/IEC MPEG-4 addresses the described coding scenario. It defines the scene description and uses a separate mono audio coder, e.g. an AAC audio coder, for each (“natural”) source signal. However, when a complex scene with many sources is to be mixed, the bitrate becomes high, because it scales with the number of sources. Coding one source signal with high quality requires about 60-90 kb/s.
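The scaling argument above can be made concrete with a small calculation. This is only an illustrative sketch: it assumes the total bitrate is exactly the number of sources times the per-source rate, and it ignores the scene description and any container overhead. The function name `total_bitrate_kbps` is hypothetical.

```python
def total_bitrate_kbps(num_sources, per_source_kbps):
    """Total bitrate when every source signal gets its own mono coder.

    Assumption: purely linear scaling, no side information or overhead.
    """
    return num_sources * per_source_kbps


# Range quoted in the text for one high-quality source: about 60-90 kb/s.
for m in (1, 4, 16):
    low = total_bitrate_kbps(m, 60)
    high = total_bitrate_kbps(m, 90)
    print(f"{m} sources: {low}-{high} kb/s")
```

Under these assumptions, a scene with 16 sources already needs on the order of 1 Mb/s.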
Previously, we addressed a special case of the described coding problem [1][2] with a scheme denoted Binaural Cue Coding (BCC) for Flexible Rendering. By transmitting only the sum of the given source signals plus low-bitrate side information, a low bitrate is achieved. However, the source signals cannot be recovered at the decoder, and the scheme was limited to stereo and multi-channel surround signal generation. Also, only simplistic mixing was used, based on amplitude and delay panning. Thus, the direction of sources could be controlled, but no other auditory spatial image attributes. Another limitation of this scheme was its limited audio quality, in particular a decrease in audio quality as the number of source signals is increased.
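To illustrate what mixing "based on amplitude and delay panning" means, the following minimal sketch pans mono sources into a stereo pair. It is not the BCC decoder, which operates on the transmitted sum signal and side information. It only shows the panning operation itself, applied to separately available sources. The constant-power (sine/cosine) gain law, the function names `pan_source` and `mix`, and the integer-sample delay are all assumptions made for the example.

```python
import math


def pan_source(signal, pan, delay_samples=0):
    """Pan a mono signal to stereo using a gain pair and a relative delay.

    pan: 0.0 = fully left, 1.0 = fully right (constant-power law, assumed).
    delay_samples: positive delays the right channel, negative the left.
    """
    theta = pan * math.pi / 2
    g_left, g_right = math.cos(theta), math.sin(theta)
    d_left = max(-delay_samples, 0)
    d_right = max(delay_samples, 0)
    n = len(signal) + abs(delay_samples)
    left = [0.0] * n
    right = [0.0] * n
    for k, s in enumerate(signal):
        left[k + d_left] += g_left * s
        right[k + d_right] += g_right * s
    return left, right


def mix(sources, pans, delays):
    """Sum the panned versions of all sources into one stereo signal."""
    length = max(len(s) + abs(d) for s, d in zip(sources, delays))
    left = [0.0] * length
    right = [0.0] * length
    for s, p, d in zip(sources, pans, delays):
        l, r = pan_source(s, p, d)
        for k in range(len(l)):
            left[k] += l[k]
            right[k] += r[k]
    return left, right


# Two short sources: one panned hard left, one panned hard right.
left, right = mix([[1.0, 0.5], [0.2, 0.4]], pans=[0.0, 1.0], delays=[0, 0])
```

The gain pair and the delay together only determine the perceived direction of each source, which is why such mixing cannot control other spatial attributes such as source width.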
Document [1] (Binaural Cue Coding, Parametric Stereo, MP3 Surround, MPEG Surround) covers the case where N audio channels are encoded and N audio channels with cues similar to those of the original audio channels are decoded. The transmitted side information includes inter-channel cue parameters relating to differences between the input channels.
The channels of stereo and multi-channel audio signals contain mixes of audio source signals and are thus different in nature from pure audio source signals. Stereo and multi-channel audio signals are mixed such that, when played back over an appropriate playback system, the listener will perceive an auditory spatial image (“sound stage”) as captured by the recording setup or designed by the recording engineer during mixing. A number of schemes for joint coding of the channels of a stereo or multi-channel audio signal have been proposed previously.