With increasing multimedia content consumption in daily life, the demand for sophisticated multimedia solutions steadily increases. In this context, the integration of visual and audio content plays an important role. An optimal adjustment of visual and audio multimedia content to the available visual and audio replay setup would be desirable.
In the state of the art, audio objects are known. Audio objects may, e.g., be considered as sound tracks with associated metadata. The metadata may, e.g., describe the characteristics of the raw audio data, e.g., the desired playback position or the volume level. An advantage of object-based audio is that a predefined movement can be reproduced by a special rendering process on the playback side in the best way possible for all reproduction loudspeaker layouts.
Geometric metadata can be used to define where an audio object should be rendered, e.g., angles in azimuth or elevation or absolute positions relative to a reference point, e.g., the listener. The metadata is stored or transmitted along with the object audio signals.
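As an illustration of the object-plus-metadata idea described above, the following minimal sketch pairs a raw track with geometric metadata. The class name `AudioObject`, its field names, and the axis convention (x to the front, y to the left, z upward, angles in degrees relative to the listener) are assumptions made for this example only; they are not taken from any standard or from the cited documents.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AudioObject:
    """Hypothetical container: one sound track plus its metadata."""
    samples: List[float]   # raw audio data of the track
    azimuth_deg: float     # horizontal angle relative to the listener
    elevation_deg: float   # vertical angle relative to the listener
    gain: float = 1.0      # linear volume level

    def direction(self) -> Tuple[float, float, float]:
        """Unit vector from the listener towards the object
        (assumed convention: x front, y left, z up)."""
        az = math.radians(self.azimuth_deg)
        el = math.radians(self.elevation_deg)
        return (math.cos(el) * math.cos(az),
                math.cos(el) * math.sin(az),
                math.sin(el))


# The metadata travels together with the object audio signal.
obj = AudioObject(samples=[0.0, 0.5, -0.5], azimuth_deg=30.0, elevation_deg=0.0)
```

A renderer on the playback side could read `obj.direction()` and `obj.gain` to place the track on whatever loudspeaker layout is available.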
In the context of MPEG-H (MPEG=Moving Picture Experts Group), at the 105th MPEG meeting, the audio group reviewed the requirements and timelines of different application standards. According to that review, it would be essential to meet certain points in time and specific requirements for a next generation broadcast system. In particular, a system should be able to accept audio objects at the encoder input. Moreover, the system should support signaling, delivery and rendering of audio objects and should enable user control of objects, e.g., for dialog enhancement, alternative language tracks and audio description language.
In the state of the art, different concepts are provided. According to a first conventional technology, presented in “Method and apparatus for playback of a higher-order ambisonics audio signal” (see [1]), the playback of spatial sound field-oriented audio is adapted to its linked visible objects by applying space warping processing. A possibility is included to encode and transmit the reference size (or the viewing angle from a reference listening position) of the screen used in the content production as metadata together with the content. Alternatively, a fixed reference screen size is assumed in encoding and for decoding, and the decoder knows the actual size of the target screen. In this conventional technology, the decoder warps the sound field in such a manner that all sound objects in the direction of the screen are compressed or stretched according to the ratio of the size of the target screen and the size of the reference screen. So-called “two-segment piecewise linear” warping functions are used. The stretching is limited to the angular positions of sound items. In that conventional technology, for centered screens, the definition of the warping function is similar to the definition of the mapping function for screen-related remapping. The first and the third segment of the three-segment piecewise linear mapping function could be defined as a two-segment piecewise linear function. However, with that conventional technology, the application is limited to HOA (HOA=higher order ambisonics) (sound field-oriented) signals in space domain. Moreover, the warping function depends only on the ratio of the reference screen and the reproduction screen, and no definition for non-centered screens is provided.
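To make the notion of a two-segment piecewise linear warping function more concrete, the following sketch maps a single azimuth angle for a centered screen. It is not the function defined in [1]: the name `warp_azimuth`, the parameterization by screen half-widths in degrees, and the choice to keep 0° and ±180° fixed are assumptions for illustration, and the sketch operates on one angle rather than on an HOA sound field.

```python
def warp_azimuth(azimuth_deg, ref_half_width_deg, target_half_width_deg):
    """Illustrative two-segment piecewise linear warp for a centered screen.

    Segment 1: angles inside the reference screen are scaled by the ratio
               of target to reference half-width.
    Segment 2: the remaining angles up to 180 degrees are scaled so that
               180 degrees maps onto itself.
    Negative azimuths are handled by symmetry.
    """
    if not 0.0 < ref_half_width_deg < 180.0:
        raise ValueError("reference half-width must lie in (0, 180)")
    if not 0.0 < target_half_width_deg < 180.0:
        raise ValueError("target half-width must lie in (0, 180)")

    sign = -1.0 if azimuth_deg < 0.0 else 1.0
    a = abs(azimuth_deg)

    if a <= ref_half_width_deg:
        # On-screen segment: compress or stretch by the size ratio.
        warped = a * target_half_width_deg / ref_half_width_deg
    else:
        # Off-screen segment: fill the remaining range up to 180 degrees.
        slope = (180.0 - target_half_width_deg) / (180.0 - ref_half_width_deg)
        warped = target_half_width_deg + (a - ref_half_width_deg) * slope
    return sign * warped


# A source at the edge of a 30-degree reference screen lands on the edge
# of a 15-degree target screen.
edge = warp_azimuth(30.0, 30.0, 15.0)
```

Because the two slopes depend only on the two half-widths, a mapping of this kind has no parameter for a screen that is shifted away from the center, which matches the limitation noted above.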
In another conventional technology, “Vorrichtung und Verfahren zum Bestimmen einer Wiedergabeposition” (see [2]), a method to adapt the position of a sound source to the video reproduction is described. The playback position of the sound source is determined individually for each sound object depending on the direction and distance to the reference point and on camera parameters. That conventional technology also assumes a screen with a fixed reference size. A linear scaling of all position parameters (in Cartesian coordinates) is conducted for adapting the scene to a reproduction screen that is larger or smaller than the reference screen. However, according to that conventional technology, the incorporation of physical camera and projection parameters is complex, and such parameters are not always available. Moreover, the method of that conventional technology works in Cartesian coordinates (x,y,z), so not just the position but also the distance of an object changes with scene scaling. Furthermore, this conventional technology is not applicable for an adaptation of the object's position with respect to changes of relative screen size (aperture angle, viewing angle) in angular coordinates.
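The effect of scaling in Cartesian coordinates on object distance can be checked with a short calculation. The sketch below assumes a uniform scale factor applied to all three coordinates and uses hypothetical helper names (`scale_position`, `to_spherical`); the actual scaling in [2] also involves camera and projection parameters that are not modeled here.

```python
import math


def scale_position(pos, factor):
    """Linearly scale a Cartesian position (x, y, z) by one factor."""
    return tuple(factor * c for c in pos)


def to_spherical(pos):
    """Return (azimuth_deg, elevation_deg, distance) relative to the origin."""
    x, y, z = pos
    dist = math.sqrt(x * x + y * y + z * z)
    az = math.degrees(math.atan2(y, x))
    el = math.degrees(math.asin(z / dist)) if dist > 0.0 else 0.0
    return az, el, dist


original = (2.0, 1.0, 0.0)
scaled = scale_position(original, 1.5)   # hypothetical larger reproduction screen

az0, el0, d0 = to_spherical(original)
az1, el1, d1 = to_spherical(scaled)
# d1 equals 1.5 * d0: the object's distance changes with the scene scaling.
```

Under this uniform-scaling assumption the distance grows by the scale factor, which illustrates why a Cartesian approach alters more than the on-screen position of an object.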
In a further conventional technology, “Verfahren zur Audiocodierung” (see [3]), a method is described which includes a transmission of the current (time-varying) horizontal and vertical viewing angle in the data stream (reference viewing angle, in relation to the listener's position in the original scene). On the reproduction side, the size and position of the reproduction are analyzed, and the playback of the sound objects is individually optimized to match the reference screen.
In another conventional technology, “Acoustical Zooming Based on a parametric Sound Field Representation” (see [4]), a method is described which provides audio rendering that follows the movement of the visual scene (“Acoustical zoom”). The acoustical zooming process is defined as a shift of the virtual recording position. The scene-model for the zooming algorithm places all sound sources on a circle with an arbitrary but fixed radius. However, the method of that conventional technology works in the DirAC (DirAC=Directional Audio Coding) parameter domain, both distance and angles (direction of arrival) are changed, the mapping function is non-linear and depends on a zoom factor/parameter, and non-centered screens are not supported.
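The geometry behind such a shift of the virtual recording position can be sketched as follows. This is plain trigonometry under stated assumptions, not the algorithm of [4]: sources sit on a circle of radius `radius` around the original recording position, the listener moves by `shift` along the 0° look direction, and the function name `zoom_shift` is hypothetical.

```python
import math


def zoom_shift(doa_deg, radius, shift):
    """Direction of arrival and distance of a source after moving the
    virtual recording position forward by `shift` along the 0-degree axis.

    The source is assumed to lie on a circle of the given radius around
    the original recording position.
    """
    theta = math.radians(doa_deg)
    # Source position relative to the shifted recording position.
    x = radius * math.cos(theta) - shift
    y = radius * math.sin(theta)
    return math.degrees(math.atan2(y, x)), math.hypot(x, y)


# Two sources at 30 and 60 degrees, zooming halfway towards the circle.
doa_a, dist_a = zoom_shift(30.0, 1.0, 0.5)
doa_b, dist_b = zoom_shift(60.0, 1.0, 0.5)
```

In this sketch both the angle and the distance of each source change, and doubling the input angle does not double the output angle, which is consistent with the non-linear, zoom-dependent mapping described above.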