Any discussion of the background art throughout the specification should in no way be considered as an admission that such art is widely known or forms part of common general knowledge in the field.
Content creation, coding, distribution and reproduction of audio are traditionally performed in a channel-based format, that is, one specific target playback system is envisioned for content throughout the content ecosystem. Examples of audio formats for such target playback systems are mono, stereo, 5.1, 7.1, and the like.
If content is to be reproduced on a playback system different from the intended one, a downmixing or upmixing process can be applied. For example, 5.1 content can be reproduced over a stereo playback system by employing specific downmix equations. Another example is playback of stereo encoded content over a 7.1 speaker setup, which may comprise a so-called upmixing process that may or may not be guided by information present in the stereo signal. A system capable of upmixing is Dolby Pro Logic from Dolby Laboratories, Inc. (Roger Dressler, “Dolby Pro Logic Surround Decoder, Principles of Operation”, www.Dolby.com).
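By way of illustration only, a 5.1-to-stereo downmix of the kind referred to above may be sketched as follows. The specific downmix equations are not given here; the sketch assumes one commonly used choice in which the centre and surround channels are attenuated by about 3 dB (a gain of 1/sqrt(2)) and the LFE channel is discarded. The function name, channel ordering and gain values are assumptions made for the example.

```python
import math


def downmix_51_to_stereo(frame,
                         center_gain=math.sqrt(0.5),
                         surround_gain=math.sqrt(0.5)):
    """Downmix one 5.1 sample frame to a stereo pair (Lo, Ro).

    `frame` is assumed to be ordered (L, R, C, LFE, Ls, Rs).
    The gains are illustrative defaults, not values from this document.
    """
    left, right, center, _lfe, left_sur, right_sur = frame
    # The LFE channel is dropped in this sketch (an assumption).
    lo = left + center_gain * center + surround_gain * left_sur
    ro = right + center_gain * center + surround_gain * right_sur
    return lo, ro
```

In this sketch a signal present only in the centre channel appears at equal level in both stereo outputs, so it is still perceived in the middle of the stereo image.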
An alternative audio format system is an audio object format such as that provided by the Dolby Atmos system. In this type of format, objects are defined to have a particular location around a listener, which may be time varying. Audio content in this format is sometimes referred to as immersive audio content.
When stereo or multi-channel content is to be reproduced over headphones, it is often desirable to simulate a multi-channel speaker setup by means of head-related impulse responses (HRIRs), or binaural room impulse responses (BRIRs), which simulate the acoustical pathway from each loudspeaker to the ear drums, in an anechoic or echoic (simulated) environment, respectively. In particular, audio signals can be convolved with HRIRs or BRIRs to re-instate inter-aural level differences (ILDs), inter-aural time differences (ITDs) and spectral cues that allow the listener to determine the location of each individual channel. The simulation of an acoustic environment (reverberation) also helps to achieve a certain perceived distance. FIG. 1 illustrates a schematic overview of the processing flow for rendering two object or channel signals xi 10, 11, being read out of a content store 12 for processing by 4 HRIRs e.g. 14. The HRIR outputs are then summed 15, 16, for each channel signal, so as to produce headphone speaker outputs for playback to a listener via headphones 18. The basic principle of HRIRs is, for example, explained in Wightman, Frederic L., and Doris J. Kistler. “Sound localization.” Human psychophysics. Springer N.Y., 1993. 155-192.
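The processing flow of FIG. 1 may be sketched, purely as an illustration, as follows. Each input signal is convolved with a left-ear and a right-ear impulse response, and the results are summed per ear, so two input signals require four convolutions. The function names, the use of plain Python lists, and the assumption that all signals and all impulse responses have equal lengths are choices made for the example only.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


def render_binaural(signals, hrirs):
    """Render channel or object signals to a (left, right) headphone pair.

    signals: list of equal-length sample lists, one per channel or object.
    hrirs:   list of (h_left, h_right) pairs, one per signal
             (assumed equal length; illustrative layout).
    """
    n_out = len(signals[0]) + len(hrirs[0][0]) - 1
    left = [0.0] * n_out
    right = [0.0] * n_out
    for x, (h_left, h_right) in zip(signals, hrirs):
        # Two convolutions per input signal: one per ear.
        for i, v in enumerate(convolve(x, h_left)):
            left[i] += v
        for i, v in enumerate(convolve(x, h_right)):
            right[i] += v
    return left, right
```

A practical implementation would typically use FFT-based (fast) convolution rather than the direct form shown, but the number of convolutions still scales with the number of input signals.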
The HRIR/BRIR convolution approach comes with several drawbacks, one of them being the substantial amount of convolution processing that is required for headphone playback. The HRIR or BRIR convolution needs to be applied for every input object or channel separately, and hence complexity typically grows linearly with the number of channels or objects. As headphones are often used in conjunction with battery-powered portable devices, a high computational complexity is not desirable as it may substantially shorten battery life. Moreover, with the introduction of object-based audio content, which may comprise, say, more than 100 simultaneously active objects, the complexity of HRIR convolution can be substantially higher than for traditional channel-based content.
For this purpose, co-pending and non-published PCT application PCT/US2016/048497, filed Aug. 24, 2016 describes a dual-ended approach for presentation transformations that can be used to efficiently transmit and decode immersive audio for headphones. The coding efficiency and decoding complexity reduction are achieved by splitting the rendering process across encoder and decoder, rather than relying on the decoder alone to render all objects.
FIG. 2 gives a schematic overview of such a dual-ended approach to deliver immersive audio on headphones. With reference to FIG. 2, in the dual-ended approach any acoustic environment simulation algorithm (for example an algorithmic reverberation, such as a feedback delay network or FDN, a convolution reverberation algorithm, or other means to simulate acoustic environments) is driven by a simulation input signal f̂ that is derived from a core decoder output stereo signal z by application of time and frequency dependent parameters w that are included in the bit stream. The parameters w are used as matrix coefficients to perform a matrix transform of the stereo signal z, to generate an anechoic binaural signal ŷ and the simulation input signal f̂. It is important to realize that the simulation input signal f̂ typically consists of a mixture of several of the objects that were provided to the encoder as input, and moreover the contribution of these individual input objects can vary depending on the object distance, the headphone rendering metadata, semantic labels, and the like. Subsequently the input signal f̂ is used to produce the output of the acoustic environment simulation algorithm, which is mixed with the anechoic binaural signal ŷ to create the echoic, final binaural presentation.
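As a minimal sketch of the decoder-side matrix transform described above, the following applies a coefficient matrix to the samples of the stereo signal z within a single time/frequency tile. It is an illustration under stated assumptions, not the method of the referenced application: the coefficients are taken to be real-valued, the simulation input signal (named f_hat in the code) is taken to be a single channel, and the function name, matrix layout and numeric values are hypothetical.

```python
def apply_presentation_matrix(w, z_left, z_right):
    """Apply a (rows x 2) coefficient matrix to stereo samples of one tile.

    Each row of `w` yields one output signal as a weighted sum of the
    left and right samples of z. The layout is an assumption.
    """
    return [
        [row[0] * l + row[1] * r for l, r in zip(z_left, z_right)]
        for row in w
    ]


# Hypothetical parameters for one time/frequency tile.
w = [
    [0.9, 0.1],  # row producing the left anechoic binaural signal
    [0.1, 0.9],  # row producing the right anechoic binaural signal
    [0.3, 0.3],  # row producing the simulation input signal (assumed mono)
]
z_left = [1.0, 0.0]
z_right = [0.0, 1.0]

y_left, y_right, f_hat = apply_presentation_matrix(w, z_left, z_right)
```

Because the coefficients vary over time and frequency, a complete decoder would repeat this step for every tile with the parameters signalled for that tile.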
Although the acoustic environment simulation input signal f̂ is derived from a stereo signal using the set of parameters, its level (for example its energy as a function of frequency) is neither known a priori nor otherwise available. Such properties can be measured in a decoder at the expense of introducing additional complexity and latency, both of which are undesirable on mobile platforms.
Further, the environment simulation input signal typically increases in level with object distance to simulate the decreasing direct-to-late reverberation ratio that occurs in physical environments. This implies that there is no well-defined upper bound on the level of the input signal f̂, which is problematic for an implementation that requires a bounded dynamic range.
Also, if the simulation algorithm is end-user configurable, the transfer function of the acoustic environment simulation algorithm is not known during encoding. As a consequence, the signal level (and hence the perceived loudness) of the binaural presentation after mixing in the acoustic environment simulation output signal is unknown.
The fact that both the input signal level and the transfer function of the acoustic environment simulation are unknown makes it difficult to control the loudness of the binaural presentation. Such loudness preservation is generally very desirable for end-user convenience as well as for broadcast loudness compliance as standardized in, for example, ITU-R BS.1770 and EBU R128.