Digital encoding of various source signals has become increasingly important over the last decades as digital signal representation and communication have progressively replaced analogue representation and communication. For example, audio content, such as speech and music, is increasingly based on digital content encoding.
Audio encoding formats have been developed to provide increasingly capable, varied and flexible audio services and in particular audio encoding formats supporting spatial audio services have been developed.
Well-known audio coding technologies like DTS and Dolby Digital produce a coded multi-channel audio signal that represents the spatial image as a number of channels that are placed around the listener at fixed positions. For a speaker setup that is different from the setup that corresponds to the multi-channel signal, the spatial image will be suboptimal. Also, these channel-based audio coding systems are typically not able to cope with a different number of speakers.
MPEG Surround provides a multi-channel audio coding tool that allows existing mono- or stereo-based coders to be extended to multi-channel audio applications. FIG. 1 illustrates an example of elements of an MPEG Surround system. Using spatial parameters obtained by analysis of the original multichannel input, an MPEG Surround decoder can recreate the spatial image by a controlled upmix of the mono- or stereo signal to obtain a multichannel output signal.
Since the spatial image of the multi-channel input signal is parameterized, MPEG Surround allows for decoding of the same multi-channel bit-stream by rendering devices that do not use a multichannel speaker setup. An example is virtual surround reproduction on headphones, which is referred to as the MPEG Surround binaural decoding process. In this mode a realistic surround experience can be provided while using regular headphones. Another example is the pruning of higher order multichannel outputs, e.g. 7.1 channels, to lower order setups, e.g. 5.1 channels.
In order to provide for a more flexible representation of audio, MPEG standardized a format known as ‘Spatial Audio Object Coding’ (MPEG-D SAOC). In contrast to multichannel audio coding systems such as DTS, Dolby Digital and MPEG Surround, SAOC provides efficient coding of individual audio objects rather than audio channels. Whereas in MPEG Surround, each speaker channel can be considered to originate from a different mix of sound objects, SAOC makes individual sound objects available at the decoder side for interactive manipulation as illustrated in FIG. 2. In SAOC, multiple sound objects are coded into a mono or stereo downmix together with parametric data allowing the sound objects to be extracted at the rendering side thereby allowing the individual audio objects to be available for manipulation e.g. by the end-user.
Indeed, similarly to MPEG Surround, SAOC also creates a mono or stereo downmix. In addition, object parameters are calculated and included. At the decoder side, the user may manipulate these parameters to control various features of the individual objects, such as position, level, equalization, or even to apply effects such as reverb. FIG. 3 illustrates an interactive interface that enables the user to control the individual objects contained in an SAOC bitstream. By means of a rendering matrix, individual sound objects are mapped onto speaker channels.
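As a non-limiting illustration of such a mapping, the following minimal sketch applies a rendering matrix to two object signals for a two-speaker setup. The function names, the speaker angles of plus and minus 30 degrees, and the constant-power panning law are assumptions made for this example only; they are not taken from the SAOC standard.

```python
import math

def pan_gains(azimuth_deg, left_deg=30.0, right_deg=-30.0):
    """Constant-power gains (left, right) for a source between two speakers.

    Hypothetical helper: angles are in degrees, positive towards the left.
    """
    # Map the azimuth onto [0, 1] between the right and the left speaker.
    t = (azimuth_deg - right_deg) / (left_deg - right_deg)
    t = min(1.0, max(0.0, t))
    # The sine/cosine law keeps g_left**2 + g_right**2 equal to 1.
    return math.sin(t * math.pi / 2), math.cos(t * math.pi / 2)

def render(objects, rendering_matrix):
    """Map object signals onto speaker channels, y = R x for every sample.

    objects: list of N object signals (each a list of samples).
    rendering_matrix: list of M rows (speakers), each holding N gains.
    """
    n_samples = len(objects[0])
    return [
        [sum(g * obj[k] for g, obj in zip(row, objects)) for k in range(n_samples)]
        for row in rendering_matrix
    ]

# Two objects: the first placed at the left speaker, the second in the centre.
objects = [[1.0, 0.5, -0.5], [0.2, 0.2, 0.2]]
gains = [pan_gains(30.0), pan_gains(0.0)]
# Rows are speakers (left, right), columns are objects.
R = [[g[0] for g in gains], [g[1] for g in gains]]
speakers = render(objects, R)
```

Changing an entry of R, for example in response to user interaction, repositions or rescales the corresponding object without altering the object signals themselves.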
Indeed, the variation and flexibility in the rendering configurations used for rendering spatial sound have increased significantly in recent years with more and more reproduction formats becoming available to the mainstream consumer. This requires flexible representation of audio. Important steps have been taken with the introduction of the MPEG Surround codec. Nevertheless, audio is still produced and transmitted for a specific loudspeaker setup. Reproduction over different setups and over non-standard (i.e. flexible or user-defined) speaker setups is not specified.
This problem can be partly solved by SAOC, which transmits audio objects instead of reproduction channels. This allows the decoder side to place the audio objects at arbitrary positions in space, provided that the space is adequately covered by speakers. This way there is no relation between the transmitted audio and the reproduction setup, hence arbitrary speaker setups can be used. This is advantageous for e.g. home cinema setups in a typical living room, where the speakers are almost never at the intended positions. In SAOC, it is decided at the decoder side where the objects are placed in the sound scene, which is often not desired from an artistic point of view. The SAOC standard does provide ways to transmit a default rendering matrix in the bitstream, eliminating the decoder responsibility. However, the provided methods rely on either fixed reproduction setups or on unspecified syntax. Thus SAOC does not provide normative means to transmit an audio scene independently of the speaker setup. More importantly, SAOC is not well equipped for the faithful rendering of diffuse signal components. Although there is the possibility to include a so-called multichannel background object to capture the diffuse sound, this object is tied to one specific speaker configuration.
Another specification for an audio format for 3D audio is being developed by the 3D Audio Alliance (3DAA), which is an industry alliance initiated by SRS (Sound Retrieval System) Labs. 3DAA is dedicated to developing standards for the transmission of 3D audio that “will facilitate the transition from the current speaker feed paradigm to a flexible object-based approach”. In 3DAA, a bitstream format is to be defined that allows the transmission of a legacy multichannel downmix along with individual sound objects. In addition, object positioning data is included. The principle of generating a 3DAA audio stream is illustrated in FIG. 4.
In the 3DAA approach, the sound objects are received separately in the extension stream and these may be extracted from the multi-channel downmix. The resulting multi-channel downmix is rendered together with the individually available objects.
The objects may consist of so-called stems. These stems are basically grouped (downmixed) tracks or objects. Hence, an object may consist of multiple sub-objects packed into a stem. In 3DAA, a multichannel reference mix can be transmitted with a selection of audio objects. 3DAA transmits the 3D positional data for each object. The objects can then be extracted using the 3D positional data. Alternatively, the inverse mix-matrix may be transmitted, describing the relation between the objects and the reference mix.
From the description of 3DAA, sound-scene information is likely transmitted by assigning an angle and distance to each object, indicating where the object should be placed relative to e.g. the default forward direction. This is useful for point sources but fails to describe wide sources (such as a choir or applause) or diffuse sound fields (such as ambiance). When all point sources are extracted from the reference mix, an ambient multichannel mix remains. Similar to SAOC, the residual in 3DAA is fixed to a specific speaker setup.
Thus, both the SAOC and 3DAA approaches incorporate the transmission of individual audio objects that can be individually manipulated at the decoder side. A difference between the two approaches is that SAOC provides information on the audio objects by providing parameters characterizing the objects relative to the downmix (i.e. such that the audio objects are generated from the downmix at the decoder side) whereas 3DAA provides audio objects as full and separate audio objects (i.e. that can be generated independently from the downmix at the decoder side).
A typical audio scene will comprise different types of sound. In particular, an audio scene will often include a number of specific and spatially well-defined audio sources. In addition, the audio scene may typically contain diffuse sound components representing the general ambient audio environment. Such diffuse sounds may include e.g. reverberation effects, non-directional noise, etc.
A critical problem is how to handle such different audio types and in particular how to handle such different types of audio in different speaker configurations. Formats such as SAOC and 3DAA can flexibly render point sources. However, although such approaches may be advantageous over channel based approaches, the rendering of diffuse sound sources at different speaker configurations is suboptimal.
A different approach for differentiating the rendering of point sound sources and diffuse sounds has been proposed in the article “Spatial Sound Reproduction with Directional Audio Coding”, by Ville Pulkki, Journal of the Audio Engineering Society, Vol. 55, No. 6, June 2007. The article proposes an approach referred to as DirAC (Directional Audio Coding) wherein a downmix is transmitted along with parameters that enable a reproduction of a spatial image at the synthesis side. The parameters communicated in DirAC are obtained by a direction and diffuseness analysis. Specifically, DirAC discloses that in addition to communicating azimuth and elevation for sound sources, a diffuseness indication is also communicated. During synthesis the downmix is divided dynamically into two streams, one that corresponds to non-diffuse sound, and another that corresponds to the diffuse sound. The non-diffuse sound stream is reproduced with a technique aiming at point-like sound sources, and the diffuse sound stream is rendered by a technique aiming at the perception of sound which lacks prominent direction.
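The division of the downmix into two streams may be sketched as follows. This is a minimal illustration only: the function name is hypothetical, the diffuseness value psi is assumed to lie between 0 and 1 for each time-frequency tile, and the square-root weights are an assumption chosen so that the energies of the two streams sum to the energy of the downmix.

```python
import math

def split_streams(downmix, diffuseness):
    """Split downmix values into (non_diffuse, diffuse) streams.

    downmix: one value per time-frequency tile.
    diffuseness: psi per tile, assumed to lie in [0, 1].
    """
    non_diffuse, diffuse = [], []
    for x, psi in zip(downmix, diffuseness):
        psi = min(1.0, max(0.0, psi))  # guard against out-of-range estimates
        # Assumed energy-preserving weights: (1 - psi) + psi == 1.
        non_diffuse.append(math.sqrt(1.0 - psi) * x)
        diffuse.append(math.sqrt(psi) * x)
    return non_diffuse, diffuse

# Three tiles: fully directional, mostly directional, fully diffuse.
direct, ambient = split_streams([1.0, 2.0, -1.0], [0.0, 0.25, 1.0])
```

In this sketch the diffuse stream can never carry more energy than the diffuseness estimate assigns to it, so its level is tied to the analysis of the same downmix that contains the point sources.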
The downmixes described in the article are either a mono or a B-format type of downmix. In the case of a mono downmix, diffuse speaker signals are obtained by decorrelating the downmix using a separate decorrelator for each loudspeaker position. In the case of a B-format downmix, virtual microphone signals are extracted for each loudspeaker position from the B-format signals by modeling cardioids in the direction of the reproduction speakers. These signals are split into a part representing the directional sources and a part representing diffuse sources. For the diffuse components, decorrelated versions of the ‘virtual signals’ are added to the obtained point source contribution for each loudspeaker position.
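For the B-format case, the extraction of a virtual cardioid microphone signal may be illustrated by the following sketch. It assumes a horizontal-only, traditional first-order B-format in which the omnidirectional component W carries a gain of 1/sqrt(2); the function names and the encoding helper are hypothetical and serve only to make the example self-contained.

```python
import math

def encode_b_format(sample, source_deg):
    """Hypothetical helper: encode one sample of a plane wave into (W, X, Y)."""
    theta = math.radians(source_deg)
    return sample / math.sqrt(2.0), sample * math.cos(theta), sample * math.sin(theta)

def virtual_cardioid(w, x, y, speaker_deg):
    """Virtual cardioid microphone signal pointing towards a loudspeaker.

    For a single plane wave this equals 0.5 * s * (1 + cos(source - speaker)).
    """
    phi = math.radians(speaker_deg)
    return 0.5 * (math.sqrt(2.0) * w + x * math.cos(phi) + y * math.sin(phi))

# A unit-amplitude source straight ahead, picked up for four speaker directions.
w, x, y = encode_b_format(1.0, 0.0)
signals = {deg: virtual_cardioid(w, x, y, deg) for deg in (0.0, 90.0, 180.0, -90.0)}
```

The cardioid pattern is broad, so a single source contributes to several virtual microphone signals; the subsequent split into directional and diffuse parts operates on these signals.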
However, although DirAC provides an approach that may improve audio quality over some systems that do not consider separate processing of spatially defined sound sources and diffuse sounds, it tends to provide suboptimal sound quality. In particular, when adapting the system to different speaker configurations, the specific rendering of diffuse sounds based only on a relatively simple division of downmix signals into diffuse/non-diffuse components tends to result in a less than ideal rendering of the diffuse sound. In DirAC, the energy of the diffuse signal component is directly determined by the point sources present in the input signal. Therefore, it is not possible to e.g. generate a truly diffuse signal in the presence of point sources.
Hence, an improved approach would be advantageous and in particular an approach allowing increased flexibility, improved audio quality, improved adaptation to different rendering configurations, improved rendering of diffuse sounds and/or audio point sources of a sound scene and/or improved performance would be advantageous.