The rendering of audio has become increasingly varied, and a range of rendering approaches and user experiences have been introduced. For example, spatial audio as part of an audio-visual experience has become widespread, in particular in the form of surround sound. In such systems, an image or video is presented while an associated spatial audio environment is created.
In order to support the variation and flexibility in spatial audio rendering, a number of formats for representing spatial audio have been developed.
A recent format is the MPEG Surround format. However, although this provides a suitable format for many applications, it is still not as flexible as desired for others. For example, audio is still produced and transmitted for a specific loudspeaker setup, e.g. an ITU 5.1 loudspeaker setup. Reproduction over different setups, and over non-standard (i.e. flexible or user-defined) loudspeaker setups, is not specified.
In order to provide a more flexible representation of audio, formats are being developed that represent individual audio sources as individual audio objects. Thus, rather than representing an audio scene by audio channels corresponding to specific (nominal or reference) positions, it has been proposed to provide individual audio objects, each representing a specific audio source (including e.g. background, diffuse and ambient sound sources). Typically, the audio objects are provided with (optional) position information which indicates a target position of the audio object in the sound stage. Thus, in such approaches, an audio source may be represented as a separate, single audio object rather than by the contribution it makes to audio channels associated with specific, predetermined positions.
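The notion of an audio object carrying optional position metadata can be illustrated with a minimal data structure. The names and fields below are hypothetical and are not taken from any of the standards discussed; the point is only that an object bundles its audio samples with an optional target position in the sound stage:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class AudioObject:
    """A single audio source with optional target position metadata.

    Hypothetical structure for illustration only; real formats
    (SAOC, MPEG-H, ADM, ...) define their own, richer representations.
    """
    name: str
    samples: List[float]  # mono PCM samples for this source
    # Optional target position (azimuth, elevation) in degrees;
    # None for e.g. diffuse or ambient sources with no fixed position.
    position: Optional[Tuple[float, float]] = None

# A dialogue object anchored slightly left of center, and an
# ambient object with no target position.
dialogue = AudioObject("dialogue", [0.0] * 480, position=(-10.0, 0.0))
ambience = AudioObject("room_tone", [0.0] * 480)
```

The separation of signal and position is what distinguishes this from channel-based audio: the same two objects can be rendered to any loudspeaker setup at the decoder side.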
In order to support such an approach, MPEG has standardized a format known as ‘Spatial Audio Object Coding’ (ISO/IEC MPEG-D SAOC). In contrast to multichannel audio coding systems such as DTS, Dolby Digital and MPEG Surround, SAOC provides efficient coding of individual audio objects rather than audio channels. Whereas in MPEG Surround each loudspeaker channel can be considered to originate from a different mix of sound objects, SAOC allows for interactive manipulation of the location of the individual sound objects in a multi-channel mix.
Similarly to MPEG Surround, SAOC also creates a mono or stereo downmix. In addition, object parameters are calculated and included. At the decoder side, the user may manipulate these parameters to control various features of the individual objects, such as position, level, equalization, or even to apply effects such as reverb.
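The decoder-side manipulation described above can be sketched in a much-simplified form. This is not SAOC's actual decoding process, which reconstructs objects parametrically from the downmix; it merely illustrates the idea of applying user-controlled per-object levels when mixing objects to an output signal:

```python
from typing import Dict, List

def mix_objects(objects: Dict[str, List[float]],
                gains: Dict[str, float]) -> List[float]:
    """Sum equal-length object signals, scaling each by a
    user-controlled gain (per-object level manipulation)."""
    length = len(next(iter(objects.values())))
    out = [0.0] * length
    for name, signal in objects.items():
        g = gains.get(name, 1.0)  # objects without a user setting: unity gain
        for i, s in enumerate(signal):
            out[i] += g * s
    return out

# Boost the dialogue object and attenuate the music object.
mixed = mix_objects({"dialogue": [0.5, 0.5], "music": [1.0, 1.0]},
                    {"dialogue": 2.0, "music": 0.5})
# mixed == [1.5, 1.5]
```

Position, equalization and effects manipulation follow the same pattern: because each object remains separate until rendering, per-object processing can be applied before the final mix.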
SAOC allows a more flexible approach and, in particular, allows more rendering-based adaptability by transmitting audio objects in addition to only reproduction channels. This allows the decoder side to place the audio objects at arbitrary positions in space, provided that the space is adequately covered by loudspeakers. In this way there is no fixed relation between the transmitted audio and the reproduction or rendering setup, and hence arbitrary loudspeaker setups can be used. This is advantageous for e.g. home cinema setups in a typical living room, where the loudspeakers are almost never at the intended positions. In SAOC, it is decided at the decoder side where the objects are placed in the sound scene. However, whereas rendering-side manipulation of audio objects is supported, it is typically desired that the audio can be rendered without requiring user inputs while still providing a suitable sound stage. In particular, when the audio is provided together with a linked video signal, it is desired that the audio sources are rendered at positions corresponding to their positions in the image. Accordingly, audio objects may often be provided with target position data which indicates a suggested rendering position for the individual audio object.
Other examples of audio object based formats include MPEG-H 3D Audio [ISO/IEC 23008-3 (DIS): Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D audio, 2014.], ADM [EBU Tech 3364 “Audio Definition Model Ver. 1.0”, 2014] and proprietary standards such as Dolby Atmos [SMPTE TC-25CSS10 WG on “Interoperable Immersive Sound Systems for Digital Cinema”, 2014] and DTS-MDA [ETSI document TS 103 223, “The Multi-Dimensional Audio (MDA) Content Creation Format Specification with Extensions for Consumer Environments”, 2014].
The concept of object-based audio production and reproduction offers many advantages over the traditional channel-based approach. In particular, the possibility to assign a specific position in space to individual sound objects offers a large degree of flexibility, scalability and new possibilities for interactivity.
If suitable audio rendering techniques are used, object-based audio enables positioning an object in a perceptually realistic way at any position in 3D space, including accurate localization in azimuth, elevation and distance relative to the listener. Some examples of such rendering techniques are: binaural headphone reproduction, transaural loudspeaker reproduction, Wave Field Synthesis loudspeaker reproduction and, to some extent, VBAP (Vector Base Amplitude Panning) loudspeaker reproduction.
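To illustrate one of these techniques: in its pairwise (2D) form, VBAP derives the gains for the two loudspeakers adjacent to a target direction by expressing the source direction vector in the basis of the two loudspeaker direction vectors, then normalizing the gains to constant power. A minimal sketch, assuming a 2D setup with azimuths given in degrees:

```python
import math

def vbap_2d(source_az: float, spk1_az: float, spk2_az: float):
    """Pairwise amplitude-panning gains for a source between two
    loudspeakers (azimuths in degrees, e.g. +30 and -30 for stereo).

    Solves [l1 l2] g = p for loudspeaker unit vectors l1, l2 and
    source unit vector p; the closed-form solution of this 2x2
    system reduces to ratios of sines of angle differences.
    """
    t, t1, t2 = (math.radians(a) for a in (source_az, spk1_az, spk2_az))
    det = math.sin(t2 - t1)
    if abs(det) < 1e-9:
        raise ValueError("loudspeakers must not be (anti)collinear")
    g1 = math.sin(t2 - t) / det
    g2 = math.sin(t - t1) / det
    # Normalize to constant power: g1**2 + g2**2 == 1.
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm

# A source midway between +30/-30 degree loudspeakers gets equal
# gains of 1/sqrt(2); a source at a loudspeaker gets gains (1, 0).
g_mid = vbap_2d(0.0, 30.0, -30.0)
```

Note that this only places the source on the arc between the loudspeaker pair; full 3D VBAP uses loudspeaker triplets, and distance rendering requires additional techniques.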
Typically, object-based audio content is presented together with corresponding video content rendered on a video display. If an audio object corresponds to a visual object that is present on the screen, it is usually desirable that there is some spatial synchronization or congruency between the perceived auditory and visual object positions, i.e. that the sound and image of the object match in space. If such synchronization is absent, i.e. if the perceived positions of the auditory object and the corresponding visual object differ significantly, this may be confusing to the user and degrade the overall perceived quality or immersion of the audio-visual presentation.
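The geometry underlying such congruency can be made concrete: given the horizontal position of a visual object on the screen, the screen width, and the viewing distance, the azimuth at which the corresponding sound should be rendered follows from simple trigonometry. The function below is an illustrative assumption, not part of any of the formats discussed:

```python
import math

def screen_x_to_azimuth(x_norm: float, screen_width_m: float,
                        viewing_distance_m: float) -> float:
    """Map a normalized horizontal screen position (0.0 = left edge,
    0.5 = center, 1.0 = right edge) to an azimuth in degrees for a
    listener centered in front of the screen.

    Convention here: 0 degrees = straight ahead, positive = to the
    right. Illustrative geometry only; a real system must account
    for display-size metadata and the actual listener position.
    """
    offset_m = (x_norm - 0.5) * screen_width_m
    return math.degrees(math.atan2(offset_m, viewing_distance_m))

# Center of the screen -> 0 degrees; on a 1 m wide screen viewed
# from 0.5 m, the right edge -> atan(0.5 / 0.5) = 45 degrees.
az_edge = screen_x_to_azimuth(1.0, 1.0, 0.5)
```

The same mapping also makes the problem described next apparent: the computed azimuth depends on the screen width and viewing distance, both of which vary from one rendering setup to another.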
However, as the rendering setups, and especially the video rendering setups, typically vary substantially, it can be difficult to achieve tight spatial synchronization and this may in many situations result in a degraded user experience. In particular, the capabilities and rendering characteristics of different displays may vary substantially and this may cause different rendering in different scenarios.
Hence, an improved approach for processing spatial audio signals for rendering would be advantageous, and in particular an approach allowing increased flexibility, facilitated operation, reduced complexity and/or resource demand, improved spatial synchronization to associated video and/or an improved user experience would be advantageous.