With increasing multimedia content consumption in daily life, the demand for sophisticated multimedia solutions steadily increases. In this context, positioning of audio objects plays an important role. An optimal positioning of audio objects for an existing loudspeaker setup would be desirable.
In the state of the art, audio objects are known. Audio objects may, e.g., be considered as sound tracks with associated metadata. The metadata may, e.g., describe the characteristics of the raw audio data, e.g., the desired playback position or the volume level. An advantage of object-based audio is that a predefined movement can be reproduced by a special rendering process on the playback side in the best way possible for all reproduction loudspeaker layouts.
Geometric metadata can be used to define where an audio object should be rendered, e.g., angles in azimuth or elevation or absolute positions relative to a reference point, e.g., the listener. The metadata is stored or transmitted along with the object audio signals.
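As a minimal sketch of such geometric metadata, the following Python fragment models one metadata record with azimuth, elevation, radius and gain, and converts it to a Cartesian position relative to the listener. The class name `ObjectMetadata`, its field names and the axis convention (x to the front, y to the left, z upward, azimuth counted counterclockwise from the front) are assumptions made for illustration only; they are not taken from any of the cited documents.

```python
import math
from dataclasses import dataclass


@dataclass
class ObjectMetadata:
    """Hypothetical metadata record accompanying one object audio signal."""
    azimuth_deg: float    # horizontal angle, 0 = front (assumed convention)
    elevation_deg: float  # vertical angle, 0 = ear height (assumed convention)
    radius: float = 1.0   # distance from the reference point (the listener)
    gain: float = 1.0     # linear volume level

    def to_cartesian(self):
        """Return (x, y, z) relative to the listener under the assumed axes."""
        az = math.radians(self.azimuth_deg)
        el = math.radians(self.elevation_deg)
        x = self.radius * math.cos(el) * math.cos(az)
        y = self.radius * math.cos(el) * math.sin(az)
        z = self.radius * math.sin(el)
        return (x, y, z)
```

For example, `ObjectMetadata(azimuth_deg=90.0, elevation_deg=0.0).to_cartesian()` yields a point on the positive y axis under this convention, i.e., directly to the listener's left.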
In the context of MPEG-H, the audio group of MPEG (Moving Picture Experts Group) reviewed the requirements and timelines of different application standards at the 105th MPEG meeting. According to that review, it would be essential to meet certain points in time and specific requirements for a next generation broadcast system. In particular, such a system should be able to accept audio objects at the encoder input. Moreover, the system should support signaling, delivery and rendering of audio objects and should enable user control of objects, e.g., for dialog enhancement, alternative language tracks and audio description language.
In the state of the art, different concepts are known. A first concept is reflected sound rendering for object-based audio (see [2]). There, snap-to-speaker-location information is included in a metadata definition as useful rendering information. However, [2] provides no information on how this information is used in the playback process. Moreover, no information is provided on how a distance between two positions is determined.
Another concept of the state of the art, system and tools for enhanced 3D audio authoring and rendering, is described in [5]. FIG. 6B of document [5] is a diagram illustrating how a “snapping” to a speaker might be algorithmically realized. In detail, according to document [5], if it is determined to snap the audio object position to a speaker location (see block 665 of FIG. 6B of document [5]), the audio object position will be mapped to a speaker location (see block 670 of FIG. 6B of document [5]), generally the one closest to the intended (x,y,z) position received for the audio object. According to [5], the snapping might be applied to a small group of reproduction speakers and/or to an individual reproduction speaker. However, [5] employs Cartesian (x,y,z) coordinates instead of spherical coordinates. Moreover, the renderer behavior is merely described as mapping the audio object position to a speaker location if the snap flag is one; no detailed description is provided. Furthermore, no details are provided on how the closest speaker is determined.
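To make the gap concrete, the snapping behavior outlined for [5] can be sketched as follows. Because [5] does not state how the closest speaker is determined, the sketch assumes the Euclidean distance between Cartesian (x,y,z) positions; this choice, the function name `snap_to_speaker` and the parameter names are hypothetical and do not reproduce the method of [5].

```python
import math


def snap_to_speaker(position, speaker_positions, snap_flag):
    """Map an object position to a speaker position if snapping is enabled.

    position          -- intended (x, y, z) position of the audio object
    speaker_positions -- list of (x, y, z) loudspeaker positions
    snap_flag         -- truthy if the object should be snapped

    ASSUMPTION: "closest" is taken to mean the smallest Euclidean distance.
    The cited document does not define the distance measure.
    """
    if not snap_flag:
        # Without the snap flag, the position is passed on unchanged
        # for normal rendering.
        return position
    return min(speaker_positions, key=lambda s: math.dist(s, position))
```

Under this assumption, an object at (0.9, 0.2, 0.0) with speakers at (1, 0, 0), (0, 1, 0) and (0, 0, 1) would be snapped to (1, 0, 0). A different distance measure, e.g., one based on angular differences, could select a different speaker for the same input, which is why the missing definition matters.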
According to another conventional technology, System and Method for Adaptive Audio Signal Generation, Coding and Rendering, described in document [1], metadata information (metadata elements) specifies that “one or more sound components are rendered to a speaker feed for playback through a speaker nearest an intended playback location of the sound component, as indicated by the position metadata”. However, no information is provided on how the nearest speaker is determined.
In a further conventional technology, the audio definition model described in document [4], a metadata flag called “channelLock” is defined. If the flag is set to 1, a renderer can lock the object to the nearest channel or speaker instead of rendering it normally. However, it is not described how the nearest channel is determined.
In another conventional technology, upmixing of object-based audio is described (see [3]). Document [3] describes a method that uses a distance measure of speakers in a different field of application, namely for upmixing object-based audio material. The rendering system is configured to determine, from an object-based audio program (and knowledge of the positions of the speakers to be employed to play the program), the distance between each position of an audio source indicated by the program and the position of each of the speakers. Furthermore, the rendering system of [3] is configured to determine, for each actual source position (e.g., each source position along a source trajectory) indicated by the program, a subset of the full set of speakers (a “primary” subset) consisting of those speakers of the full set which are (or the speaker of the full set which is) closest to the actual source position, where “closest” in this context is defined in some reasonably defined sense. However, no information is provided on how the distance should be calculated.
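The selection of a “primary” subset described for [3] could, for instance, look like the following sketch. Since [3] leaves the meaning of “closest” open, the sketch assumes the Euclidean distance and a fixed subset size `k`; both, as well as the function name `primary_subset` and the speaker labels in the usage note, are hypothetical choices made for illustration.

```python
import math


def primary_subset(source_position, speakers, k=2):
    """Return the names of the k speakers closest to a source position.

    source_position -- (x, y, z) position of the audio source
    speakers        -- dict mapping a speaker name to its (x, y, z) position
    k               -- size of the primary subset (assumed parameter)

    ASSUMPTION: closeness is measured by Euclidean distance; the cited
    document only requires "some reasonably defined sense".
    """
    ranked = sorted(
        speakers,
        key=lambda name: math.dist(speakers[name], source_position),
    )
    return ranked[:k]


def primary_subsets_along_trajectory(trajectory, speakers, k=2):
    """Apply primary_subset to each source position of a trajectory."""
    return [primary_subset(pos, speakers, k) for pos in trajectory]
```

With four hypothetical speakers `{"L": (1, 1, 0), "R": (1, -1, 0), "Ls": (-1, 1, 0), "Rs": (-1, -1, 0)}` and a source at (1, 0.5, 0), the sketch selects `["L", "R"]` for `k=2`. The result depends directly on the chosen distance measure, which [3] does not specify.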