Digital encoding of various source signals has become increasingly important over the last decades as digital signal representation and communication have increasingly replaced analogue representation and communication. For example, audio content, such as speech and music, is increasingly based on digital content encoding.
Audio encoding formats have been developed to provide increasingly capable, varied and flexible audio services and in particular audio encoding formats supporting spatial audio services have been developed.
Well known spatial audio coding technologies like DTS and Dolby Digital produce a coded multi-channel audio signal that represents the spatial image as a number of channels that are placed around the listener at fixed positions. For a speaker setup which is different from the setup that corresponds to the multi-channel signal, the spatial image will be suboptimal. Also, these channel based audio coding systems are typically not able to cope with a different number of speakers.
Such conventional approaches are illustrated in FIG. 1 (where the letter c refers to an audio channel). The input channels (e.g. 5.1 channels) are provided to an encoder that performs matrixing to exploit inter-channel relations, followed by coding of the matrixed signal into a bitstream. In addition, the matrixing information may also be conveyed to the decoder as part of the bitstream. At the decoder side this process is reversed.
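As a minimal sketch of matrixing and its reversal at the decoder, the following example applies a sum/difference matrix to a two-channel signal. The sum/difference matrix, the function names and the sample values are assumptions chosen for illustration; the actual matrices used by systems such as DTS or Dolby Digital are not described here.

```python
def matrix_encode(left, right):
    """Sum/difference matrixing of two channels (assumed illustrative matrix)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def matrix_decode(mid, side):
    """Inverse of matrix_encode: recover the original channel pair."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Hypothetical sample values for two correlated channels.
left = [0.5, 0.25, -0.75, 1.0]
right = [0.5, 0.75, -0.25, 0.0]

mid, side = matrix_encode(left, right)
rec_left, rec_right = matrix_decode(mid, side)
```

When the channels are strongly correlated, the difference signal carries little energy and can be coded cheaply, which is the inter-channel relation that the matrixing is intended to exploit.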
MPEG Surround provides a multi-channel audio coding tool that allows existing mono- or stereo-based coders to be extended to multi-channel audio applications. FIG. 2 illustrates an example of elements of an MPEG Surround system. Using spatial parameters obtained by analysis of the original multichannel input, an MPEG Surround decoder can recreate the spatial image by a controlled upmix of the mono- or stereo signal to obtain a multichannel output signal.
Since the spatial image of the multi-channel input signal is parameterized, MPEG Surround allows for decoding of the same multi-channel bit-stream by rendering devices that do not use a multichannel speaker setup. An example is virtual surround reproduction on headphones, which is referred to as the MPEG Surround binaural decoding process. In this mode a realistic surround experience can be provided while using regular headphones. Another example is the transformation of higher order multichannel outputs, e.g. 7.1 channels, to lower order setups, e.g. 5.1 channels.
The approach of MPEG Surround (and similar parametric multi-channel coding approaches such as Binaural Cue Coding or Parametric Stereo) is illustrated in FIG. 3. In contrast to the discrete or waveform coding approach, the input channels are downmixed (e.g. to a stereo mix). This downmix is subsequently coded using traditional coding techniques such as the AAC family of codecs. In addition to the coded downmix, a representation of the spatial image is also transmitted in the bit-stream. The decoder reverses the process.
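The parametric principle can be sketched with a toy example in which two channels are downmixed to mono and one energy ratio per channel is transmitted as the spatial parameter. This is a simplified stand-in under stated assumptions, not the MPEG Surround algorithm: real systems compute several parameters per time/frequency tile, and all names and values below are hypothetical.

```python
import math


def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def parametric_encode(left, right):
    """Downmix to mono and compute per-channel energy ratios (assumed parameters)."""
    downmix = [(l + r) / 2 for l, r in zip(left, right)]
    d_energy = energy(downmix)
    params = (energy(left) / d_energy, energy(right) / d_energy)
    return downmix, params


def parametric_decode(downmix, params):
    """Upmix by scaling the downmix so that each channel's energy is restored."""
    g_left, g_right = (math.sqrt(p) for p in params)
    return [g_left * d for d in downmix], [g_right * d for d in downmix]


# Hypothetical sample values.
left = [1.0, 0.5, -0.5, 0.25]
right = [0.5, 0.25, 0.25, 0.0]

downmix, params = parametric_encode(left, right)
left_hat, right_hat = parametric_decode(downmix, params)
```

The decoded channels have the same energies as the originals, but their waveforms differ from the originals. This illustrates why such a scheme is compact yet not waveform-preserving.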
In order to provide for a more flexible representation of audio, MPEG standardized a format known as ‘Spatial Audio Object Coding’ (MPEG-D SAOC). In contrast to multichannel audio coding systems such as DTS, Dolby Digital and MPEG Surround, SAOC provides efficient coding of individual audio objects rather than audio channels. Whereas in MPEG Surround, each speaker channel can be considered to originate from a different mix of sound objects, SAOC makes individual sound objects available at the decoder side for interactive manipulation as illustrated in FIG. 4. In SAOC, multiple sound objects are coded into a mono or stereo downmix together with parametric data allowing the sound objects to be extracted at the rendering side thereby allowing the individual audio objects to be available for manipulation e.g. by the end-user.
Indeed, similarly to MPEG Surround, SAOC also creates a mono or stereo downmix. In addition object parameters are calculated and included. At the decoder side, the user may manipulate these parameters to control various features of the individual objects, such as position, level, equalization, or even to apply effects such as reverb. FIG. 5 illustrates an interactive interface that enables the user to control the individual objects contained in an SAOC bitstream. By means of a rendering matrix individual sound objects are mapped onto speaker channels.
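The mapping of objects onto speaker channels by a rendering matrix can be sketched as follows. The example assumes a two-speaker setup and a constant-power panning law for deriving the matrix entries; both are illustrative assumptions, as are the function names and sample values.

```python
import math


def pan_gains(position):
    """Constant-power stereo gains for a position in [0, 1] (0 = left, 1 = right)."""
    angle = position * math.pi / 2
    return math.cos(angle), math.sin(angle)


def render(objects, rendering_matrix):
    """Multiply the rendering matrix (speakers x objects) with the object signals."""
    n_samples = len(objects[0])
    return [
        [sum(g * obj[t] for g, obj in zip(row, objects)) for t in range(n_samples)]
        for row in rendering_matrix
    ]


# Two hypothetical objects: the first placed hard left, the second hard right.
objects = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
positions = [0.0, 1.0]

gains = [pan_gains(p) for p in positions]
# One row per speaker, one column per object.
rendering_matrix = [[g[0] for g in gains], [g[1] for g in gains]]
channels = render(objects, rendering_matrix)
```

Changing an entry in `positions` corresponds to the user moving an object in the interface; only the rendering matrix changes, not the object signals.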
FIG. 6 provides a high level block diagram of a parametric approach of SAOC (or similar object coding systems). The object signals (o) are downmixed and the resulting downmix is coded. In addition parametric object data is transmitted in the bit-stream relating the individual objects to the downmix. At the decoder side, the objects are decoded and rendered to channels according to the speaker configuration. Typically, in such an approach it is more efficient to combine the decoding of the objects and the speaker rendering.
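A toy version of this parametric object approach is sketched below: the objects are summed into a mono downmix, and the relative energy of each object is sent as the parametric data. The extraction rule (scaling the downmix by each object's energy share) is an assumption made for illustration and is not the SAOC algorithm, which operates per time/frequency tile.

```python
def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def object_encode(objects):
    """Mono downmix plus each object's share of the total energy (assumed parameters)."""
    downmix = [sum(samples) for samples in zip(*objects)]
    energies = [energy(obj) for obj in objects]
    total = sum(energies)
    params = [e / total for e in energies]
    return downmix, params


def object_decode(downmix, params):
    """Estimate each object by scaling the downmix with its energy share."""
    return [[p * d for d in downmix] for p in params]


# Two hypothetical objects.
objects = [[1.0, -1.0, 1.0, -1.0], [0.5, 0.5, 0.5, 0.5]]

downmix, params = object_encode(objects)
estimates = object_decode(downmix, params)

# Energy of the difference between each object and its estimate.
errors = [
    energy([o - e for o, e in zip(obj, est)])
    for obj, est in zip(objects, estimates)
]
```

Each estimate is a scaled copy of the downmix and therefore still contains the other object, so the entries of `errors` are non-zero however precisely the parameters are transmitted.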
The variation and flexibility in the rendering configurations used for rendering spatial sound has increased significantly in recent years with more and more reproduction formats becoming available to the mainstream consumer. This requires flexible representation of audio. Important steps have been taken with the introduction of the MPEG Surround codec. Nevertheless, audio is still produced and transmitted for a specific loudspeaker setup. Reproduction over different setups and over non-standard (i.e. flexible or user-defined) speaker setups is not specified.
This problem can be partly solved by SAOC, which transmits audio objects instead of reproduction channels. This allows the decoder side to place the audio objects at arbitrary positions in space, provided that the space is adequately covered by speakers. This way there is no relation between the transmitted audio and the reproduction setup, hence arbitrary speaker setups can be used. This is advantageous for e.g. home cinema setups in a typical living room, where the speakers are almost never at the intended positions because of the layout of the living room. In SAOC, it is decided at the decoder side where the objects are placed in the sound scene. This is often not desired from an artistic point of view, and therefore the SAOC standard does provide ways to transmit a default rendering matrix in the bitstream, eliminating the decoder responsibility. These rendering matrices are again tied to specific speaker configurations.
In SAOC, as a result of the downmixing, the object extraction only works within certain boundaries. It is typically not possible to extract a single object with high enough separation from the other objects for reproduction without the other objects, e.g. in a Karaoke use case. Furthermore, because of the parameterization, the SAOC technology does not scale well with bitrate. In particular, the approach of downmixing and extracting (upmixing) audio objects results in some inherent information loss that is not fully compensated even at very high bitrates. Thus, even if the bitrate is increased, the resulting audio quality typically remains degraded, which prevents the encoding/decoding operations from being fully transparent.
In order to address this, SAOC supports so called residual coding which can be applied for a limited set of objects (up to and including 4, which has been a design choice). The residual coding basically transmits additional bitstream components that code the error signals (including the crosstalk from the other objects in that object) such that a limited number of objects can be extracted with a high degree of object separation. Residual waveform components may be supplied up to a specific frequency such that the quality can be gradually increased. The resulting object is thus a combination of a parametric component and a waveform component.
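The combination of a parametric component and a waveform component can be sketched as follows. The example assumes a single target object, a mono downmix and a fixed parametric gain; the residual is simply the difference between the object and its parametric estimate. The names and values are hypothetical, and quantization and band-limiting of the residual are omitted.

```python
def parametric_estimate(downmix, gain):
    """Parametric component: a scaled copy of the downmix (assumed model)."""
    return [gain * d for d in downmix]


# Hypothetical target object and one interfering object.
target = [1.0, -1.0, 1.0, -1.0]
other = [0.5, 0.5, 0.5, 0.5]
downmix = [a + b for a, b in zip(target, other)]
gain = 0.8  # assumed parametric gain for the target object

# Encoder side: the residual codes the error of the parametric estimate,
# including the crosstalk from the other object.
residual = [t - e for t, e in zip(target, parametric_estimate(downmix, gain))]

# Decoder side: parametric component plus transmitted waveform component.
decoded_parametric = parametric_estimate(downmix, gain)
decoded_with_residual = [e + r for e, r in zip(decoded_parametric, residual)]
```

Without the residual the decoded object still contains crosstalk; with the unquantized residual it matches the target up to rounding error.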
Another specification for an audio format for 3D audio is being developed by the 3D Audio Alliance (3DAA) which is an industry alliance initiated by SRS (Sound Retrieval System) Labs. 3DAA is dedicated to develop standards for the transmission of 3D audio, that “will facilitate the transition from the current speaker feed paradigm to a flexible object-based approach”. In 3DAA, a bitstream format is to be defined that allows the transmission of a legacy multichannel downmix along with individual sound objects. In addition, object positioning data is included. The principle of generating a 3DAA audio stream is illustrated in FIG. 7.
In the 3DAA approach, the sound objects are received separately in the extension stream and these may be extracted from the multi-channel downmix. The resulting multi-channel downmix is rendered together with the individually available objects.
In 3DAA, a multichannel reference mix can be transmitted with a selection of audio objects. 3DAA transmits the 3D positional data for each object. The objects can then be extracted using the 3D positional data. Alternatively, the inverse mix-matrix may be transmitted, describing the relation between the objects and the reference mix. The illustration of FIG. 6 may be considered to also correspond to the approach of 3DAA.
Both the SAOC and 3DAA approaches incorporate the transmission of individual audio objects that can be individually manipulated at the decoder side. A difference between the two approaches is that SAOC provides information on the audio objects by providing parameters characterizing the objects relative to the downmix (i.e. such that the audio objects are generated from the downmix at the decoder side) whereas 3DAA provides audio objects as full and separate audio objects (i.e. that can be generated independently from the downmix at the decoder side).
In MPEG a new work item on 3D Audio is under construction. This is referred to as MPEG-3D Audio and is intended to become part of the MPEG-H suite along with HEVC video coding and DASH systems. FIG. 8 illustrates the current high level block diagram of the intended MPEG 3D Audio system.
In addition to the traditional channel based format, the approach is intended to also support object based and scene based formats. An important aspect of the system is that its quality should scale to transparency for increasing bitrate, i.e. that as the data rate increases the degradation caused by the encoding and decoding should continue to reduce until it is insignificant. However, such a requirement tends to be problematic for parametric coding techniques that have been used quite heavily in the past (viz. HE-AAC v2, MPEG Surround, SAOC, USAC). In particular, the information loss for the individual signals tends not to be fully compensated by the parametric data even at very high bit rates. Indeed, the quality will be limited by the intrinsic quality of the parametric model.
MPEG-3D Audio furthermore seeks to provide a resulting bitstream which is independent of the reproduction setup. Envisioned reproduction possibilities include flexible loudspeaker setups up to 22.2 channels, as well as virtual surround over headphones and closely spaced speakers.
Another approach is known as Directional Audio Coding (DirAC), which is similar to MPEG Surround and SAOC in the sense that a downmix is transmitted along with parameters that enable a reproduction of a spatial image at the synthesis side. In DirAC these parameters represent results from direction and diffuseness analysis (azimuth, elevation and diffuseness Ψ(t/f)). During synthesis the downmix is divided dynamically into two streams, one that corresponds to non-diffuse sound (weight √(1−Ψ)), and another that corresponds to the diffuse sound (weight √Ψ). The non-diffuse sound stream is reproduced with a technique aiming at point-like sound sources, and the diffuse sound stream with a technique aiming at the perception of sound lacking prominent direction. The approach of DirAC is illustrated in FIG. 9.
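The division of the downmix into a non-diffuse and a diffuse stream can be sketched as follows, using the square-root weights of (1−Ψ) and Ψ. As a simplifying assumption, one diffuseness value is given per sample here, whereas DirAC derives Ψ per time/frequency tile; the function name and the values are hypothetical.

```python
import math


def dirac_split(downmix, diffuseness):
    """Split a downmix into non-diffuse and diffuse streams.

    diffuseness holds one value of psi in [0, 1] per downmix sample
    (an assumption made to keep the sketch short).
    """
    direct = [math.sqrt(1.0 - psi) * x for x, psi in zip(downmix, diffuseness)]
    diffuse = [math.sqrt(psi) * x for x, psi in zip(downmix, diffuseness)]
    return direct, diffuse


# Hypothetical downmix samples and diffuseness estimates.
downmix = [1.0, -0.5, 0.25, 0.8]
diffuseness = [0.0, 0.25, 0.5, 1.0]

direct, diffuse = dirac_split(downmix, diffuseness)
```

Because the squared weights sum to one, the two streams together carry the same energy as the downmix for every value of Ψ.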
DirAC can be considered a recording based encoding/decoding system in accordance with the approach of FIG. 10. In the system, the microphone signals (m) are coded. This can e.g. be performed similarly to the parametric approach using downmixing and coding of spatial information. At the decoder, the microphone signals can be reconstructed, and based on a provided speaker configuration, the microphone signals can be rendered to channels. It is noted that for efficiency reasons, the decoding and rendering process can be integrated into a single step.
In “The continuity illusion revisited: coding of multiple concurrent sound sources”, M. Kelly et al., Proc. MPCA-2002, Louvain, Belgium, Nov. 15, 2002, it is suggested to not use parametric encoding and downmixing but instead to encode the individual audio objects individually using discrete/waveform encoding. The approach is illustrated in FIG. 11. As illustrated, all objects are coded simultaneously and transmitted to the decoder. At the decoder side, the objects are decoded and rendered to channels according to a speaker configuration. The approach may provide improved audio quality, and in particular has the potential of scaling to transparency. However, the system does not provide significant coding efficiency and requires relatively high data rates even for lower audio quality.
Thus, there are a number of different approaches seeking to provide efficient audio encoding.
Audio content is nowadays shared between an increasing number of different reproduction devices. For example, the audio may be experienced over headphones, small speakers, via a docking station, and/or using various multichannel setups. For multichannel setups, the ITU recommended 5.1 speaker setup, which conventionally has been assumed as the nominal speaker setup, is often not even approximately applied when rendering the audio content. For example, an accurate positioning of five spatial speakers in accordance with the setup is rarely found in typical living rooms. Speakers are placed at convenient locations instead of at the recommended angles and distances. Furthermore, alternative setups like 4.1, 6.1, 7.1 or even 22.2 configurations may be used. In order to provide the best experience in all of these reproduction schemes, a trend towards object coding or scene coding can be observed. Such approaches are increasingly introduced (currently mainly for cinema applications but domestic use is expected to become more common) to replace the conventional audio channel approach where each audio channel is associated with a nominal position.
When the number of reproduction channels (i.e. speakers) and their locations are unknown, an audio scene can best be represented by the individual audio objects in the scene. At the decoder side the objects can then each be rendered separately on the reproduction channels such that the spatial perception is closest to the intended perception.
Coding the objects as separate audio signals/streams requires a relatively high bitrate. The available solutions (viz. SAOC, DirAC, 3DAA, etc) transmit downmixed object signals and means to reconstruct the object signals from this downmix. This results in a significant bitrate reduction.
SAOC provides speaker independent audio by efficient object coding in a downmix with object extraction parameters, whereas 3DAA defines a format where the scene is described in terms of object positions. DirAC attempts an efficient coding of audio objects by using a B-format downmix.
Thus, these systems are suitable for efficient and flexible coding and rendering of audio content. Significant data rate reductions can be achieved and accordingly relatively low data rate implementations can still provide reasonable or good audio quality. However, an issue with such systems is that the audio quality is inherently limited by the parametric encoding and downmixing. Even as the available data rate is increased, it is not possible to achieve full transparency where the impact of the encoding/decoding operations cannot be detected. In particular, objects cannot be reconstructed without cross-talk from other objects even at high data rates. This results in a reduction of audio quality and spatial perception when objects are separated in spatial reproduction (i.e. rendered at different positions). A further drawback is that inter-object coherence is mostly not reconstructed properly, which is an important characteristic for creating spatial perception. Attempts to reconstruct the coherence are based on use of decorrelators and tend to result in suboptimal audio quality.
An alternative approach of individually waveform encoding the audio objects may allow high quality at high data rates, and may in particular provide full scalability including a fully transparent encoding/decoding. However, such approaches are unsuitable for low data rates where they do not provide an efficient encoding.
Thus, parametric downmix based encodings are suitable for low data rates and scalability towards lower data rates whereas waveform object encodings are suitable for high data rates and scalability towards high data rates.
Scalability is a very important criterion for future audio systems, and therefore it is highly desirable to have efficient scalability that extends to both very low data rates and to very high data rates, and in particular to full transparency. Furthermore, it is desirable that such scalability is fine-grained, i.e. that the data rate can be adjusted in small steps.
Hence, an improved audio coding/decoding approach would be advantageous and in particular a system allowing increased flexibility, reduced complexity, improved scalability and/or improved performance would be advantageous.