In recent decades, the variety and flexibility of audio applications has increased immensely with e.g. the variety of audio rendering applications varying substantially. On top of that, the audio rendering setups are used in diverse acoustic environments and for many different applications.
Traditionally, spatial sound reproduction systems have always been developed for one or more specified loudspeaker configurations. As a result, the spatial experience is dependent on how closely the actual loudspeaker configuration used matches the defined nominal configuration, and a high quality spatial experience is typically only achieved for a system that has been set up substantially correctly, i.e. according to the specified loudspeaker configuration.
However, the requirement to use specific loudspeaker configurations with typically a relatively high number of loudspeakers is cumbersome and disadvantageous. Indeed, a significant inconvenience perceived by consumers when deploying e.g. home cinema surround sound systems is the need for a relatively large number of loudspeakers to be positioned at specific locations. Typically, practical surround sound loudspeaker setups will deviate from the ideal setup due to users finding it impractical to position the loudspeakers at the optimal locations, for example due to restrictions on available speaker locations in a living room. Accordingly the experience, and in particular the spatial experience, which is provided by such setups is suboptimal.
In recent years, there has therefore been a strong trend towards consumers demanding less stringent requirements for the location of their loudspeakers. Even more so, their primary requirement is that the loudspeaker set-up fits their home environment, while at the same time they of course expect the system to still provide a high quality sound experience and in particular an accurate spatial experience. These conflicting requirements become more prominent as the number of loudspeakers increases. Furthermore, the issues have become more relevant due to a current trend towards the provision of full three dimensional sound reproduction with sound coming to the listener from multiple directions.
Audio encoding formats have been developed to provide increasingly capable, varied and flexible audio services and in particular, audio encoding formats supporting spatial audio services have been developed.
Well known audio coding technologies like MPEG, DTS and Dolby Digital produce a coded multi-channel audio signal that represents the spatial image as a number of channels placed around the listener at fixed positions. For a loudspeaker setup which is different from the setup that corresponds to the multi-channel signal, the spatial image will be suboptimal. Also, channel based audio coding systems are typically not able to cope with a different number of loudspeakers.
(ISO/IEC) MPEG-2 provides a multi-channel audio coding tool where the bitstream format comprises both a 2 channel and a 5 multichannel mix of the audio signal. When decoding the bitstream with a (ISO/IEC) MPEG-1 decoder, the 2 channel backwards compatible mix is reproduced. When decoding the bitstream with a MPEG-2 decoder, three auxiliary data channels are decoded that when combined (de-matrixed) with the stereo channels result in the 5 channel mix of the audio signal.
(ISO/IEC MPEG-D) MPEG Surround provides a multi-channel audio coding tool that allows existing mono- or stereo-based coders to be extended to multi-channel audio applications. FIG. 1 illustrates an example of the elements of an MPEG Surround system. Using spatial parameters obtained by analysis of the original multichannel input, an MPEG Surround decoder can recreate the spatial image by a controlled upmix of the mono- or stereo signal to obtain a multichannel output signal.
Since the spatial image of the multi-channel input signal is parameterized, MPEG Surround allows for decoding of the same multi-channel bit-stream by rendering devices that do not use a multichannel loudspeaker setup. An example is virtual surround reproduction on headphones, which is referred to as the MPEG Surround binaural decoding process. In this mode a realistic surround experience can be provided while using regular headphones. Another example is the pruning of higher order multichannel outputs, e.g. 7.1 channels, to lower order setups, e.g. 5.1 channels.
As mentioned, the variation and flexibility in the rendering configurations used for rendering spatial sound has increased significantly in recent years with more and more reproduction formats becoming available to the mainstream consumer. This requires a flexible representation of audio. Important steps have been taken with the introduction of the MPEG Surround codec. Nevertheless, audio is still produced and transmitted for a specific loudspeaker setup, e.g. an ITU 5.1 loudspeaker setup. Reproduction over different setups and over non-standard (i.e. flexible or user-defined) loudspeaker setups is not specified. Indeed, there is a desire to make audio encoding and representation increasingly independent of specific predetermined and nominal loudspeaker setups. It is increasingly preferred that flexible adaptation to a wide variety of different loudspeaker setups can be performed at the decoder/rendering side.
In order to provide for a more flexible representation of audio, MPEG standardized a format known as ‘Spatial Audio Object Coding’ (ISO/IEC MPEG-D SAOC). In contrast to multichannel audio coding systems such as DTS, Dolby Digital and MPEG Surround, SAOC provides efficient coding of individual audio objects rather than audio channels. Whereas in MPEG Surround, each loudspeaker channel can be considered to originate from a different mix of sound objects, SAOC allows for interactive manipulation of the location of the individual sound objects in a multichannel mix as illustrated in FIG. 2.
Similarly to MPEG Surround, SAOC also creates a mono or stereo downmix. In addition, object parameters are calculated and included. At the decoder side, the user may manipulate these parameters to control various features of the individual objects, such as position, level, equalization, or even to apply effects such as reverb. FIG. 3 illustrates an interactive interface that enables the user to control the individual objects contained in an SAOC bitstream. By means of a rendering matrix individual sound objects are mapped onto loudspeaker channels.
SAOC allows a more flexible approach and in particular allows more rendering based adaptability by transmitting audio objects in addition to only reproduction channels. This allows the decoder-side to place the audio objects at arbitrary positions in space, provided that the space is adequately covered by loudspeakers. This way there is no relation between the transmitted audio and the reproduction or rendering setup, hence arbitrary loudspeaker setups can be used. This is advantageous for e.g. home cinema setups in a typical living room, where the loudspeakers are almost never at the intended positions. In SAOC, it is decided at the decoder side where the objects are placed in the sound scene (e.g. by means of an interface as illustrated in FIG. 3), which may not always be desired from an artistic point-of-view. The SAOC standard does provide ways to transmit a default rendering matrix in the bitstream, eliminating the decoder responsibility. However the provided methods rely on either fixed reproduction setups or on unspecified syntax. Thus SAOC does not provide normative means to fully transmit an audio scene independently of the loudspeaker setup. Also, SAOC is not well equipped to the faithful rendering of diffuse signal components. Although there is the possibility to include a so called Multichannel Background Object (MBO) to capture the diffuse sound, this object is tied to one specific loudspeaker configuration.
Another specification for an audio format for 3D audio has been developed by DTS Inc. (Digital Theater Systems). DTS, Inc. has developed Multi-Dimensional Audio (MDA™) an open object-based audio creation and authoring platform to accelerate next-generation content creation. The MDA platform supports both channel and audio objects and adapts to any loudspeaker quantity and configuration. The MDA format allows the transmission of a legacy multichannel downmix along with individual sound objects. In addition, object positioning data is included. The principle of generating an MDA audio stream is illustrated in FIG. 4.
In the MDA approach, the sound objects are received separately in the extension stream and these may be extracted from the multi-channel downmix. The resulting multi-channel downmix is rendered together with the individually available objects.
The objects may consist of so called stems. These stems are basically grouped (downmixed) tracks or objects. Hence, an object may consist of multiple sub-objects packed into a stem. In MDA, a multichannel reference mix can be transmitted with a selection of audio objects. MDA transmits the 3D positional data for each object. The objects can then be extracted using the 3D positional data. Alternatively, the inverse mix-matrix may be transmitted, describing the relation between the objects and the reference mix.
From the MDA description, sound-scene information is likely transmitted by assigning an angle and distance to each object, indicating where the object should be placed relative to e.g. the default forward direction. Thus, positional information is transmitted for each object. This is useful for point-sources but fails to describe wide sources (like e.g. a choir or applause) or diffuse sound fields (such as ambiance). When all point-sources are extracted from the reference mix, an ambient multichannel mix remains. Similar to SAOC, the residual in MDA is fixed to a specific loudspeaker setup.
Thus, both the SAOC and MDA approaches incorporate the transmission of individual audio objects that can be individually manipulated at the decoder side. A difference between the two approaches is that SAOC provides information on the audio objects by providing parameters characterizing the objects relative to the downmix (i.e. such that the audio objects are generated from the downmix at the decoder side) whereas MDA provides audio objects as full and separate audio objects (i.e. that can be generated independently from a downmix at the decoder side). For both approaches, position data may be communicated for the audio objects.
Currently, within ISO/IEC MPEG, a standard MPEG-H 3D Audio is being prepared to facilitate the transport and rendering of 3D audio. MPEG-H 3D Audio is intended to become part of the MPEG-H suite along with HEVC video coding and MMT (MPEG Media Transport) systems layer. FIG. 5 illustrates the current high level block diagram of the intended MPEG 3D Audio system.
In addition to the traditional channel based format, the approach is intended to also support object based and scene based formats. An important aspect of the system is that its quality should scale to transparency for increasing bitrate, i.e. that as the data rate increases the degradation caused by the encoding and decoding should continue to reduce until it is insignificant. However, such a requirement tends to be problematic for parametric coding techniques that have been used quite heavily in the past (viz. MPEG-4 HE-AAC v2, MPEG Surround, MPEG-D SAOC and MPEG-D USAC). In particular, the compensation of information loss for the individual signals tends to not be fully compensated by the parametric data even at very high bit rates. Indeed, the quality will be limited by the intrinsic quality of the parametric model.
MPEG-H 3D Audio furthermore seeks to provide a resulting bitstream which is independent of the reproduction setup. Envisioned reproduction possibilities include flexible loudspeaker setups up to 22.2 channels, as well as virtual surround over headphones and closely spaced loudspeakers.
In summary, the majority of existing sound reproduction systems only allow for a modest amount of flexibility in terms of loudspeaker set-up. Because almost every existing system has been developed from certain basic assumptions regarding either the general configuration of the loudspeakers (e.g. loudspeakers positioned more or less equidistantly around the listener, or loudspeakers arranged on a line in front of the listener, or headphones), or regarding the nature of the content (e.g. consisting of a small number of separate localizable sources, or consisting of a highly diffuse sound scene), every system is only able to deliver an optimal experience for a limited range of loudspeaker configurations that may occur in the rendering environment (such as in a user's home). A new class of sound rendering systems that allow a flexible loudspeaker set-up is therefore desired.
Thus, various activities are currently undertaken in order to develop more flexible audio systems. In particular, audio standardization activity to develop the audio standard known as the ISO/IEC MPEG-H 3D audio standard is undertaken with the aim of providing a single efficient format that delivers immersive audio experiences to consumers for headphones and flexible loudspeaker set-ups.
The activity acknowledges that that most consumers are not able and/or willing (e.g. due to physical limitations of the room) to comply with the standardized loudspeaker set-up requirements of conventional standards. Instead, they place their loudspeakers in their home environment wherever it suits them, which in general results in a sub-optimal sound experience. Given the fact that this is simply the everyday reality, the MPEG-H 3D Audio initiative aims to provide the consumer with an optimal experience given his preferred loudspeaker set-up. Thus, rather than assuming that the loudspeakers are at any specific positions, and thus requiring the user to adapt the loudspeaker setup to the requirements of the audio standard, the initiative seeks to develop an audio system which adapts to any specific loudspeaker configuration that the user has established.
The reference renderer in the MPEG-H 3D Audio Call for Proposals is based on the use of Vector Base Amplitude Panning (VBAP). This is a well-established technology that corrects for deviations from standardized loudspeaker configurations (e.g. 5.1, 7.1 or 22.2) by applying re-panning of sources/channels between pairs of loudspeakers (or triplets in set-ups including loudspeakers at different heights).
VBAP is generally considered to be the reference technology for correcting for non-standard loudspeaker placement due to it offering a reasonable solution in many situations. However, it has also become clear that there are limitations to the deviations of the loudspeaker positions that the technology can effectively handle. For example, since VBAP relies on amplitude panning it does not give very satisfactory results in use-cases with large gaps between loudspeakers, especially between front and rear. Also, it is completely incapable of handling a use-case with surround content and only front loudspeakers. Another specific use-case in which VBAP gives sub-optimal results is when a subset of the available loudspeakers is clustered within a small region, such as e.g. around (or maybe even integrated in) a TV. Accordingly, improved rendering and adaptation approaches would be desirable.
Hence, an improved audio rendering approach would be advantageous and in particular an approach allowing increased flexibility, facilitated implementation and/or operation, allowing a more flexible positioning of loudspeakers, improved adaptation to different loudspeaker configurations and/or improved performance would be advantageous.