The advent of object-based audio has significantly increased the amount of audio data and the complexity of rendering this data within high-end playback or rendering systems. For example, cinema soundtracks may comprise many different sound elements corresponding to images on the screen, dialog, noises, and sound effects that emanate from different places on the screen and combine with background music and ambient effects to create the overall auditory experience. Accurate playback by a renderer requires that sounds be reproduced in a way that corresponds as closely as possible to what is shown on screen with respect to sound source position, intensity, movement, and depth. Object-based audio represents a significant improvement over traditional channel-based audio systems, which send audio content in the form of speaker feeds to individual speakers in a listening environment and are thus relatively limited with respect to the spatial playback of specific audio objects.
In order to make object-based audio (also referred to as immersive audio) backward-compatible with channel-based rendering devices and/or in order to reduce the data rate of object-based audio, it may be beneficial to perform a downmix of some or all of the audio objects into one or more audio channels, e.g. into 5.1 or 7.1 audio channels. The downmix channels may be provided along with metadata which describes the properties of the original audio objects, and which allows a corresponding audio decoder to recreate (an approximation of) the original audio objects.
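By way of illustration only, a downmix of audio objects into audio channels may be sketched as a weighted sum of the object signals. The function name `downmix`, the gain matrix `gains` and the toy signal values are assumptions made for this sketch; the present document does not prescribe how the downmix gains are derived.

```python
from typing import List


def downmix(objects: List[List[float]], gains: List[List[float]]) -> List[List[float]]:
    """Mix N object signals into M downmix channels.

    objects[n][t] is sample t of audio object n.
    gains[m][n] is the (assumed, hypothetical) gain with which
    object n contributes to downmix channel m.
    """
    num_samples = len(objects[0])
    channels = []
    for row in gains:
        channel = [0.0] * num_samples
        for obj, gain in zip(objects, row):
            for t, sample in enumerate(obj):
                channel[t] += gain * sample
        channels.append(channel)
    return channels


# Toy example: two objects mixed into two channels
# (a small stand-in for a 5.1 or 7.1 channel layout).
objects = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
gains = [[1.0, 0.5],   # channel 0 receives object 0 fully, object 1 at half gain
         [0.0, 0.5]]   # channel 1 receives only object 1 at half gain
mixed = downmix(objects, gains)
```

In such a sketch, the downmix channels alone no longer identify the individual objects, which is why the accompanying metadata is needed for a decoder to recreate an approximation of the original audio objects.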
Furthermore, so-called unified object and channel coding systems may be provided which are configured to process a combination of object-based audio and channel-based audio. Unified object and channel encoders typically provide metadata which is referred to as side information (sideinfo) and which may be used by a decoder to perform a parameterized upmix of one or more downmix channels to one or more audio objects. Furthermore, unified object and channel encoders may provide object audio metadata (referred to herein as OAMD) which may describe the position, the gain and other properties of an audio object, e.g. of an audio object which has been re-created using the parameterized upmix.
As indicated above, unified object and channel encoders (also referred to as immersive audio encoding systems) may be configured to provide a backward-compatible multi-channel downmix (e.g. a 5.1 channel downmix). The provision of such a backward-compatible downmix is beneficial, as it allows for the use of low-complexity decoders in legacy playback systems. Even if the downmix channels which have been generated by the encoder are not directly backward-compatible, additional downmix metadata may be provided which allows the downmix channels to be transformed into backward-compatible downmix channels, thereby allowing the use of low-complexity decoders for the playback of the audio within a legacy playback system. This additional downmix metadata may be referred to as “SimpleRendererInfo”.
As such, an immersive audio encoder may provide various different types or sets of metadata. In particular, an immersive audio encoder may encode up to three (or more) types or sets of metadata (sideinfo, OAMD and SimpleRendererInfo) into a single bitstream. The provision of different types or sets of metadata provides flexibility with regard to the type of decoder which receives and decodes the bitstream. On the other hand, the provision of different sets of metadata leads to a substantial increase in the data rate of the bitstream.
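The trade-off described above may be illustrated by a minimal sketch of a frame which carries the three sets of metadata alongside the coded audio. The class `Frame`, its field names and the byte counts are purely hypothetical and do not reflect any actual bitstream syntax; they merely show how each additional set of metadata adds to the size of a frame.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical container for one frame of an immersive audio bitstream."""
    audio: bytes                          # coded downmix channels
    sideinfo: bytes = b""                 # parameters for the parameterized upmix
    oamd: bytes = b""                     # object audio metadata (position, gain, ...)
    simple_renderer_info: bytes = b""     # additional downmix metadata

    def metadata_bytes(self) -> int:
        """Total number of bytes spent on the three sets of metadata."""
        return len(self.sideinfo) + len(self.oamd) + len(self.simple_renderer_info)

    def metadata_share(self) -> float:
        """Fraction of the frame which is occupied by metadata."""
        total = len(self.audio) + self.metadata_bytes()
        return self.metadata_bytes() / total if total else 0.0


# Illustrative sizes only; these are not measured values.
frame = Frame(
    audio=bytes(600),
    sideinfo=bytes(200),
    oamd=bytes(150),
    simple_renderer_info=bytes(50),
)
```

With these assumed sizes, the call `frame.metadata_share()` indicates the fraction of the frame which is taken up by metadata, and this fraction grows with every set of metadata that is carried in the bitstream.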
In view of the above, the present document addresses the technical problem of reducing the data rate of the metadata which is generated by an immersive audio encoder.