Digital encoding of various source signals has become increasingly important over the last decades as digital signal representation and communication increasingly has replaced analogue representation and communication. For example, audio content, such as speech and music, is increasingly based on digital content encoding. Furthermore, audio consumption has increasingly become an enveloping three dimensional experience with e.g. surround sound and home cinema setups becoming prevalent.
Audio encoding formats have been developed to provide increasingly capable, varied and flexible audio services and in particular audio encoding formats supporting spatial audio services have been developed.
Well known audio coding technologies like DTS and Dolby Digital produce a coded multi-channel audio signal that represents the spatial image as a number of channels that are placed around the listener at fixed positions. For a speaker setup which is different from the setup that corresponds to the multi-channel signal, the spatial image will be suboptimal. Also, channel based audio coding systems are typically not able to cope with a different number of speakers.
(ISO/IEC MPEG-D) MPEG Surround provides a multi-channel audio coding tool that allows existing mono- or stereo-based coders to be extended to multi-channel audio applications. FIG. 1 illustrates an example of the elements of an MPEG Surround system. Using spatial parameters obtained by analysis of the original multichannel input, an MPEG Surround decoder can recreate the spatial image by a controlled upmix of the mono- or stereo signal to obtain a multichannel output signal.
Since the spatial image of the multi-channel input signal is parameterized, MPEG Surround allows for decoding of the same multi-channel bit-stream by rendering devices that do not use a multichannel speaker setup. An example is virtual surround reproduction on headphones, which is referred to as the MPEG Surround binaural decoding process. In this mode a realistic surround experience can be provided while using regular headphones. Another example is the pruning of higher order multichannel outputs, e.g. 7.1 channels, to lower order setups, e.g. 5.1 channels.
Indeed, the variation and flexibility in the rendering configurations used for rendering spatial sound has increased significantly in recent years with more and more reproduction formats becoming available to the mainstream consumer. This requires a flexible representation of audio. Important steps have been taken with the introduction of the MPEG Surround codec. Nevertheless, audio is still produced and transmitted for a specific loudspeaker setup, e.g. an ITU 5.1 speaker setup. Reproduction over different setups and over non-standard (i.e. flexible or user-defined) speaker setups is not specified. Indeed, there is a desire to make audio encoding and representation increasingly independent of specific predetermined and nominal speaker setups. It is increasingly preferred that flexible adaptation to a wide variety of different speaker setups can be performed at the decoder/rendering side.
In order to provide for a more flexible representation of audio, MPEG standardized a format known as ‘Spatial Audio Object Coding’ (ISO/IEC MPEG-D SAOC). In contrast to multichannel audio coding systems such as DTS, Dolby Digital and MPEG Surround, SAOC provides efficient coding of individual audio objects rather than audio channels. Whereas in MPEG Surround, each speaker channel can be considered to originate from a different mix of sound objects, SAOC makes individual sound objects available at the decoder side for interactive manipulation as illustrated in FIG. 2. In SAOC, multiple sound objects are coded into a mono or stereo downmix together with parametric data allowing the sound objects to be extracted at the rendering side thereby allowing the individual audio objects to be available for manipulation e.g. by the end-user.
Indeed, similarly to MPEG Surround, SAOC also creates a mono or stereo downmix. In addition object parameters are calculated and included. At the decoder side, the user may manipulate these parameters to control various features of the individual objects, such as position, level, equalization, or even to apply effects such as reverb. FIG. 3 illustrates an interactive interface that enables the user to control the individual objects contained in an SAOC bitstream. By means of a rendering matrix individual sound objects are mapped onto speaker channels.
SAOC allows a more flexible approach and in particular allows more rendering based adaptability by transmitting audio objects in addition to only reproduction channels. This allows the decoder-side to place the audio objects at arbitrary positions in space, provided that the space is adequately covered by speakers. This way there is no relation between the transmitted audio and the reproduction or rendering setup, hence arbitrary speaker setups can be used. This is advantageous for e.g. home cinema setups in a typical living room, where the speakers are almost never at the intended positions. In SAOC, it is decided at the decoder side where the objects are placed in the sound scene, which is often not desired from an artistic point-of-view. The SAOC standard does provide ways to transmit a default rendering matrix in the bitstream, eliminating the decoder responsibility. However the provided methods rely on either fixed reproduction setups or on unspecified syntax. Thus SAOC does not provide normative means to fully transmit an audio scene independently of the speaker setup. Also, SAOC is not well equipped to the faithful rendering of diffuse signal components. Although there is the possibility to include a so called Multichannel Background Object (MBO) to capture the diffuse sound, this object is tied to one specific speaker configuration.
Another specification for an audio format for 3D audio is being developed by the 3D Audio Alliance (3DAA) which is an industry alliance. 3DAA is dedicated to develop standards for the transmission of 3D audio, that “will facilitate the transition from the current speaker feed paradigm to a flexible object-based approach”. In 3DAA, a bitstream format is to be defined that allows the transmission of a legacy multichannel downmix along with individual sound objects. In addition, object positioning data is included. The principle of generating a 3DAA audio stream is illustrated in FIG. 4.
In the 3DAA approach, the sound objects are received separately in the extension stream and these may be extracted from the multi-channel downmix. The resulting multi-channel downmix is rendered together with the individually available objects.
The objects may consist of so called stems. These stems are basically grouped (downmixed) tracks or objects. Hence, an object may consist of multiple sub-objects packed into a stem. In 3DAA, a multichannel reference mix can be transmitted with a selection of audio objects. 3DAA transmits the 3D positional data for each object. The objects can then be extracted using the 3D positional data. Alternatively, the inverse mix-matrix may be transmitted, describing the relation between the objects and the reference mix.
From the description of 3DAA, sound-scene information is likely transmitted by assigning an angle and distance to each object, indicating where the object should be placed relative to e.g. the default forward direction. Thus, positional information is transmitted for each object. This is useful for point-sources but fails to describe wide sources (like e.g. a choir or applause) or diffuse sound fields (such as ambience). When all point-sources are extracted from the reference mix, an ambient multichannel mix remains. Similar to SAOC, the residual in 3DAA is fixed to a specific speaker setup.
Thus, both the SAOC and 3DAA approaches incorporate the transmission of individual audio objects that can be individually manipulated at the decoder side. A difference between the two approaches is that SAOC provides information on the audio objects by providing parameters characterizing the objects relative to the downmix (i.e. such that the audio objects are generated from the downmix at the decoder side) whereas 3DAA provides audio objects as full and separate audio objects (i.e. that can be generated independently from the downmix at the decoder side). For both approaches, position data may be communicated for the audio objects.
Binaural processing where a spatial experience is created by virtual positioning of sound sources using individual signals for the listener's ears is becoming increasingly widespread. Virtual surround is a method of rendering the sound such that audio sources are perceived as originating from a specific direction, thereby creating the illusion of listening to a physical surround sound setup (e.g. 5.1 speakers) or environment (concert). With an appropriate binaural rendering processing, the signals required at the eardrums for the listener to perceive sound from any direction can be calculated and the signals rendered such that they provide the desired effect. As illustrated in FIG. 5, these signals are then recreated at the eardrum using either headphones or a crosstalk cancelation method (suitable for rendering over closely spaced speakers).
Next to the direct rendering of FIG. 5, specific technologies that can be used to render virtual surround include MPEG Surround and Spatial Audio Object Coding, as well as the upcoming work item on 3D Audio in MPEG. These technologies provide for a computationally efficient virtual surround rendering.
The binaural rendering is based on binaural filters which vary from person to person due to different acoustic properties of the head and reflective surfaces such as the shoulders. For example, binaural filters can be used to create a binaural recording simulating multiple sources at various locations. This can be realized by convolving each sound source with the pair of Head Related Impulse Responses (HRIRs) that corresponds to the position of the sound source.
By measuring e.g. the impulse responses from a sound source at a specific location in 2D or 3D space at microphones placed in or near the human ears, the appropriate binaural filters can be determined. Typically, such measurements are made e.g. using models of human heads, or indeed in some cases the measurements may be made by attaching microphones close to the eardrums of a person. The binaural filters can be used to create a binaural recording simulating multiple sources at various locations. This can be realized e.g. by convolving each sound source with the pair of measured impulse responses for a position at the desired position of the sound source. In order to create the illusion that a sound source is moved around the listener, a large number of binaural filters is required with adequate spatial resolution, e.g. 10 degrees.
The binaural filter functions may be represented e.g. as a Head Related Impulse Responses (HRIR) or equivalently as Head Related Transfer Functions (HRTFs) or a Binaural Room Impulse Response (BRIR) or a Binaural Room Transfer Function (BRTF). The (e.g. estimated or assumed) transfer function from a given position to the listener's ears (or eardrums) is known as a head related binaural transfer function. This function may for example be given in the frequency domain in which case it is typically referred to as an HRTF or BRTF or in the time domain in which case it is typically referred to as a HRIR or BRIR. In some scenarios, the head related binaural transfer functions are determined to include aspects or properties factors of the acoustic environment and specifically of the room in which the measurements are made whereas in other examples only the user characteristics are considered. Examples of the first type of functions are the BRIRs and BRTFs, and examples of the latter type of functions are the HRIR and HRTF.
Accordingly, the underlying head related binaural transfer function can be represented in many different ways including HRIRs, HRTFs, etc. Furthermore, for each of these main representations, there are a large number of different ways to represent the specific function, e.g. with different levels of accuracy and complexity. Different processors may use different approaches and thus be based on different representations. Thus, a large number of head related binaural transfer functions are typically required in any audio system. Indeed, a large variety of how to represent head related binaural transfer functions exist and this is further exacerbated by a large variability of possible parameters for each head related binaural transfer functions. For example, a BRIR may sometimes be represented by a FIR filter with, say, 9 taps but in other scenarios by a FIR filter with, say, 16 taps etc. As another example, HRTFs can be represented in the frequency domain using a parameterized representation where a small set of parameters is used to represent a complete frequency spectrum.
It is in many scenarios desirable to allow for communicating parameters of a desired binaural rendering, such as the specific head related binaural transfer functions that may be used. However, due to the large variability in possible representations of the underlying head related binaural transfer function, it may be difficult to ensure commonality between the originating and receiving devices.
The Audio Engineering Society (AES) sc-02 technical committee has recently announced the start of a new project on the standardization of a file format to exchange binaural listening parameters in the form of head related binaural transfer functions. The format will be scalable to match the available rendering process. The format will be designed to include source materials from different HRTF databases. A challenge exists in how such multiple head related binaural transfer functions can be best supported, used and distributed in an audio system.
Accordingly, an improved approach for supporting binaural processing, and especially for communicating data for binaural rendering would be desired. In particular, an approach allowing improved representation and communication of binaural rendering data, reduced data rate, reduced overhead, facilitated implementation, and/or improved performance would be advantageous.