The present application relates to binaural rendering of a multi-channel audio signal.
Many audio encoding algorithms have been proposed in order to effectively encode or compress audio data of one channel, i.e., mono audio signals. Using psychoacoustics, audio samples are appropriately scaled, quantized or even set to zero in order to remove irrelevancy from, for example, the PCM coded audio signal. Redundancy removal is also performed.
As a further step, the similarity between the left and right channel of stereo audio signals has been exploited in order to effectively encode/compress stereo audio signals.
However, upcoming applications pose further demands on audio coding algorithms. For example, in teleconferencing, computer games, music performance and the like, several audio signals which are partially or even completely uncorrelated have to be transmitted in parallel. In order to keep the bit rate necessary for encoding these audio signals low enough to be compatible with low-bit-rate transmission applications, audio codecs have recently been proposed which downmix the multiple input audio signals into a downmix signal, such as a stereo or even mono downmix signal. For example, the MPEG Surround standard downmixes the input channels into the downmix signal in a manner prescribed by the standard. The downmixing is performed by use of so-called OTT⁻¹ and TTT⁻¹ boxes for downmixing two signals into one and three signals into two, respectively. In order to downmix more than three signals, a hierarchic structure of these boxes is used. Each OTT⁻¹ box outputs, besides the mono downmix signal, channel level differences between the two input channels, as well as inter-channel coherence/cross-correlation parameters representing the coherence or cross-correlation between the two input channels. The parameters are output along with the downmix signal of the MPEG Surround coder within the MPEG Surround data stream. Similarly, each TTT⁻¹ box transmits channel prediction coefficients enabling recovery of the three input channels from the resulting stereo downmix signal. The channel prediction coefficients are also transmitted as side information within the MPEG Surround data stream. The MPEG Surround decoder upmixes the downmix signal by use of the transmitted side information and recovers the original channels input into the MPEG Surround encoder.
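By way of illustration only, the two-to-one downmix box described above may be sketched as follows. The sketch is a simplification under stated assumptions: it operates on one broadband block of time-domain samples, whereas an actual MPEG Surround encoder works per time/frequency tile in a filter bank domain, and the function name `ott_inverse_box`, the equal downmix weights of 0.5 and the regularization constant `eps` are hypothetical choices not taken from the standard.

```python
import math


def ott_inverse_box(ch1, ch2, eps=1e-12):
    """Simplified, hypothetical two-to-one box.

    Returns the mono downmix, the channel level difference (CLD) in dB
    and the inter-channel cross-correlation (ICC) of the two inputs.
    """
    if len(ch1) != len(ch2):
        raise ValueError("both channels must have the same number of samples")

    # Mono downmix; equal weights of 0.5 are an assumption of this sketch.
    downmix = [0.5 * (a + b) for a, b in zip(ch1, ch2)]

    # Channel energies and cross term over the block.
    e1 = sum(a * a for a in ch1)
    e2 = sum(b * b for b in ch2)
    cross = sum(a * b for a, b in zip(ch1, ch2))

    # Level difference between the two input channels, in dB.
    cld_db = 10.0 * math.log10((e1 + eps) / (e2 + eps))
    # Normalized cross-correlation between the two input channels.
    icc = cross / math.sqrt((e1 + eps) * (e2 + eps))
    return downmix, cld_db, icc


# Example: the second channel is a half-amplitude copy of the first,
# so the ICC is close to 1 and the CLD is about 6 dB.
downmix, cld_db, icc = ott_inverse_box([1.0, 0.0, -1.0, 0.0],
                                       [0.5, 0.0, -0.5, 0.0])
```

The downmix together with the two parameters is what a decoder of this kind would use to reconstruct an approximation of the two input channels; cascading such boxes gives the hierarchic structure mentioned above.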
However, MPEG Surround, unfortunately, does not fulfill all requirements posed by many applications. For example, the MPEG Surround decoder is dedicated for upmixing the downmix signal of the MPEG Surround encoder such that the input channels of the MPEG Surround encoder are recovered as they are. In other words, the MPEG Surround data stream is dedicated to be played back by use of the loudspeaker configuration having been used for encoding, or by typical configurations like stereo.
However, according to some applications, it would be favorable if the loudspeaker configuration could be changed at the decoder's side freely.
In order to address the latter needs, the spatial audio object coding (SAOC) standard is currently being designed. Each channel is treated as an individual object, and all objects are downmixed into a downmix signal. That is, the objects are handled as audio signals being independent from each other without adhering to any specific loudspeaker configuration but with the ability to place the (virtual) loudspeakers at the decoder's side arbitrarily. The individual objects may comprise individual sound sources such as instruments or vocal tracks. Differing from the MPEG Surround decoder, the SAOC decoder is free to individually upmix the downmix signal to replay the individual objects onto any loudspeaker configuration. In order to enable the SAOC decoder to recover the individual objects having been encoded into the SAOC data stream, object level differences and, for objects forming together a stereo (or multi-channel) signal, inter-object cross correlation parameters are transmitted as side information within the SAOC bitstream. Besides this, the SAOC decoder/transcoder is provided with information revealing how the individual objects have been downmixed into the downmix signal. Thus, on the decoder's side, it is possible to recover the individual SAOC channels and to render these signals onto any loudspeaker configuration by utilizing user-controlled rendering information.
However, although the afore-mentioned codecs, i.e. MPEG Surround and SAOC, are able to transmit and render multi-channel audio content onto loudspeaker configurations having more than two speakers, the increasing interest in headphones as an audio reproduction system necessitates that these codecs are also able to render the audio content onto headphones. In contrast to loudspeaker playback, stereo audio content reproduced over headphones is perceived inside the head. The absence of the effect of the acoustical pathway from sources at certain physical positions to the eardrums causes the spatial image to sound unnatural, since the cues that determine the perceived azimuth, elevation and distance of a sound source are essentially missing or very inaccurate. Thus, to resolve the unnatural sound stage caused by inaccurate or absent sound source localization cues on headphones, various techniques have been proposed to simulate a virtual loudspeaker setup. The idea is to superimpose sound source localization cues onto each loudspeaker signal. This is achieved by filtering audio signals with so-called head-related transfer functions (HRTFs) or, if room acoustic properties are included in the measurement data, binaural room impulse responses (BRIRs). However, filtering each loudspeaker signal with the just-mentioned functions would necessitate a significantly higher amount of computation power at the decoder/reproduction side. In particular, rendering the multi-channel audio signal onto the “virtual” loudspeaker locations would have to be performed first, and each loudspeaker signal thus obtained would then have to be filtered with the respective transfer function or impulse response to obtain the left and right channel of the binaural output signal.
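The virtual loudspeaker approach just described may be sketched as follows, merely to make the computational burden visible. The sketch assumes that the loudspeaker signals have already been obtained and that a pair of time-domain impulse responses (left ear, right ear) is available per virtual loudspeaker; the function names `convolve`, `mix_into` and `binaural_render` are hypothetical, and the direct-form convolution is chosen for clarity rather than efficiency.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * max(len(signal) + len(impulse_response) - 1, 0)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


def mix_into(accumulator, contribution):
    """Sample-wise sum of two sequences of possibly different length."""
    length = max(len(accumulator), len(contribution))
    return [
        (accumulator[i] if i < len(accumulator) else 0.0)
        + (contribution[i] if i < len(contribution) else 0.0)
        for i in range(length)
    ]


def binaural_render(speaker_signals, impulse_response_pairs):
    """Filter every virtual loudspeaker signal with its left/right
    impulse response and sum the results per ear."""
    if len(speaker_signals) != len(impulse_response_pairs):
        raise ValueError("one impulse response pair is needed per loudspeaker")
    left, right = [], []
    for signal, (h_left, h_right) in zip(speaker_signals, impulse_response_pairs):
        # Two convolutions per virtual loudspeaker: this is the cost
        # that grows with the number of loudspeakers.
        left = mix_into(left, convolve(signal, h_left))
        right = mix_into(right, convolve(signal, h_right))
    return left, right


# Toy example with two virtual loudspeakers and made-up two-tap responses.
speakers = [[1.0, 0.0], [0.0, 1.0]]
responses = [([1.0, 0.5], [0.25, 0.0]), ([0.25, 0.0], [1.0, 0.5])]
left, right = binaural_render(speakers, responses)
```

For N virtual loudspeakers this scheme needs 2N filtering operations in addition to the upmix that produces the loudspeaker signals in the first place, which is the overhead referred to above.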
Even worse: the thus obtained binaural output signal would have a poor audio quality due to the fact that in order to achieve the virtual loudspeaker signals, a relatively large amount of synthetic decorrelation signals would have to be mixed into the upmixed signals in order to compensate for the correlation between originally uncorrelated audio input signals, the correlation resulting from downmixing the plurality of audio input signals into the downmix signal.
In the current version of the SAOC codec, the SAOC parameters within the side information allow the user-interactive spatial rendering of the audio objects using, in principle, any playback setup, including headphones. Binaural rendering to headphones allows spatial control of virtual object positions in 3D space using head-related transfer function (HRTF) parameters. For example, binaural rendering in SAOC could be realized by restricting this case to the mono downmix SAOC case where the input signals are mixed into the mono channel equally. Unfortunately, a mono downmix necessitates all audio signals to be mixed into one common mono downmix signal, so that the original correlation properties between the original audio signals are maximally lost and, therefore, the rendering quality of the binaural rendering output signal is non-optimal.
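The loss of correlation properties in the equal mono downmix case can be illustrated with a toy example. The sketch below is not an SAOC decoder: it assumes two hypothetical, mutually uncorrelated object signals and an upmix that merely scales the mono downmix by a gain per object, without any decorrelation. The function names `correlation`, `mono_downmix` and `gain_only_upmix` are chosen for illustration only.

```python
import math


def correlation(x, y):
    """Normalized cross-correlation of two equally long sequences."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator if denominator > 0.0 else 0.0


def mono_downmix(objects):
    """Mix all object signals equally into one mono signal."""
    count = len(objects)
    return [sum(samples) / count for samples in zip(*objects)]


def gain_only_upmix(downmix, gains):
    """Estimate each object by scaling the downmix (no decorrelation)."""
    return [[g * s for s in downmix] for g in gains]


# Two objects that are uncorrelated over this block.
object_a = [1.0, 0.0, -1.0, 0.0]
object_b = [0.0, 1.0, 0.0, -1.0]

downmix = mono_downmix([object_a, object_b])
estimate_a, estimate_b = gain_only_upmix(downmix, [1.0, 0.5])

corr_original = correlation(object_a, object_b)       # objects before downmixing
corr_estimates = correlation(estimate_a, estimate_b)  # estimates after upmixing
```

In this example the original objects have zero correlation, whereas the two estimates are scaled copies of the same downmix and are therefore fully correlated. Restoring the original correlation would call for mixing in synthetic decorrelation signals, with the quality impact noted above.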