Spatial audio coding tools are well-known in the art and are, for example, standardized in the MPEG-surround standard. Spatial audio coding starts from original input channels such as five or seven channels which are identified by their placement in a reproduction setup, i.e., a left channel, a center channel, a right channel, a left surround channel, a right surround channel and a low frequency enhancement channel. A spatial audio encoder typically derives one or more downmix channels from the original channels and, additionally, derives parametric data relating to spatial cues such as interchannel level differences, interchannel phase differences, interchannel time differences, etc. The one or more downmix channels are transmitted together with the parametric side information indicating the spatial cues to a spatial audio decoder which decodes the downmix channel and the associated parametric data in order to finally obtain output channels which are an approximated version of the original input channels. The placement of the channels in the output setup is typically fixed and is, for example, a 5.1 format, a 7.1 format, etc.
Such channel-based audio formats are widely used for storing or transmitting multi-channel audio content where each channel relates to a specific loudspeaker at a given position. A faithful reproduction of these kind of formats involves a loudspeaker setup where the speakers are placed at the same positions as the speakers that were used during the production of the audio signals. While increasing the number of loudspeakers improves the reproduction of truly immersive 3D audio scenes, it becomes more and more difficult to fulfill this requirement—especially in a domestic environment like a living room.
The necessity of having a specific loudspeaker setup can be overcome by an object-based approach where the loudspeaker signals are rendered specifically for the playback setup.
For example, spatial audio object coding tools are well-known in the art and are standardized in the MPEG SAOC standard (SAOC=Spatial Audio Object Coding). In contrast to spatial audio coding starting from original channels, spatial audio object coding starts from audio objects which are not automatically dedicated for a certain rendering reproduction setup. Instead, the placement of the audio objects in the reproduction scene is flexible and can be determined by the user by inputting certain rendering information into a spatial audio object coding decoder. Alternatively or additionally, rendering information, i.e., information at which position in the reproduction setup a certain audio object is to be placed typically over time can be transmitted as additional side information or metadata. In order to obtain a certain data compression, a number of audio objects are encoded by an SAOC encoder which calculates, from the input objects, one or more transport channels by downmixing the objects in accordance with certain downmixing information. Furthermore, the SAOC encoder calculates parametric side information representing inter-object cues such as object level differences (OLD), object coherence values, etc. The inter object parametric data is calculated for parameter time/frequency tiles, i.e., for a certain frame of the audio signal comprising, for example, 1024 or 2048 samples, 28, 20, 14 or 10, etc., processing bands are considered so that, in the end, parametric data exists for each frame and each processing band. As an example, when an audio piece has 20 frames and when each frame is subdivided into 28 processing bands, then the number of time/frequency tiles is 560.
In an object-based approach, the sound field is described by discrete audio objects. This involves object metadata that describes among others the time-variant position of each sound source in 3D space.
A first metadata coding concept in conventional technology is the spatial sound description interchange format (SpatDIF), an audio scene description format which is still under development [M1]. It is designed as an interchange format for object-based sound scenes and does not provide any compression method for object trajectories. SpatDIF uses the text-based Open Sound Control (OSC) format to structure the object metadata [M2]. A simple text-based representation, however, is not an option for the compressed transmission of object trajectories.
Another metadata concept in conventional technology is the Audio Scene Description Format (ASDF) [M3], a text-based solution that has the same disadvantage. The data is structured by an extension of the Synchronized Multimedia Integration Language (SMIL) which is a sub set of the Extensible Markup Language (XML) [M4], [M5].
A further metadata concept in conventional technology is the audio binary format for scenes (AudioBIFS), a binary format that is part of the MPEG-4 specification [M6], [M7]. It is closely related to the XML-based Virtual Reality Modeling Language (VRML) which was developed for the description of audio-visual 3D scenes and interactive virtual reality applications [M8]. The complex AudioBIFS specification uses scene graphs to specify routes of object movements. A major disadvantage of AudioBIFS is that is not designed for real-time operation where a limited system delay and random access to the data stream are a requirement. Furthermore, the encoding of the object positions does not exploit the limited localization performance of human listeners. For a fixed listener position within the audio-visual scene, the object data can be quantized with a much lower number of bits [M9]. Hence, the encoding of the object metadata that is applied in AudioBIFS is not efficient with regard to data compression.