Audio for TV and cinema is typically linear and channel-based. Here, linear means that the audio starts at one point and moves at a constant rate and channel-based means that the audio tracks correspond directly to the loudspeaker positioning. For example, Dolby 5.1 surround sound defines six loudspeakers surrounding the listener and Dolby 22.2 surround sound defines 24channels with loudspeakers surrounding the listener at multiple height levels, enabling a 3D audio effect.
Object-based audio was introduced to decouple production and rendering. Each audio object represents a particular piece of audio content that has a spatial position in a 3D space (hereafter is referred to as the audio space) and other properties such as loudness and content type. Audio content of an audio object associated with a certain position in the audio space will be rendered by a rendering system such that the listener perceives the audio originates from that position in audio space. The same object-based audio can be rendered in any loudspeaker set-up, such as mono, stereo, Dolby 5.1, 7.1, 9.2 or 22.2 or a proprietary speaker system. The audio rendering system knows the loudspeaker set up and renders the audio for each loudspeaker. Audio object positions may be time-variable and audio objects do not need to be point objects, but can have a size and shape.
In certain situations however the number of audio objects can become too large for the available bandwidth of the delivery method. The number of audio objects can be reduced by processing the audio objects before transmission to a client, e.g. by removing or masking audio objects that are perceptually irrelevant and by clustering audio objects into an audio object cluster. Here, an audio object cluster is a single data object comprising audio data and metadata wherein the metadata is an aggregation of the metadata of its audio object components, e.g. the average of spatial positions, dimensions and loudness information.
US 20140079225 A1 describes an approach for efficiently capturing, processing, presenting, and/or associating audio objects with content items and geo-locations. A processing platform may determine a viewpoint of a viewer of at least one content item associated with a geo-location. Further, the processing platform and/or a content provider may determine at least one audio object associated with the at least one content item, the geo-location, or a combination thereof. Furthermore, the processing platform may process the at least one audio object for rendering one or more elements of the at least one audio object based, at least in part, on the viewpoint.
WO2014099285 describes examples of perception-based clustering of audio objects for rendering object-based audio content. Parameters used for clustering may include position (spatial proximity), width (similarity of the size of the audio objects), loudness and content type (dialog, music, ambient, effects, etc.). All audio objects (possibly compressed), audio object clusters and associated metadata are delivered together in a single data container on the basis of a standard delivery method (Blue-ray, broadcast, 3G or 4G or over-the-top, OTT) to the client.
One problem of the audio object clustering schemes described in WO2014099285 is that the position of the audio objects and audio object clusters are static with respect to the listener position and the listener orientation. The position and orientation of the listener are static and set by the audio producer in the production studio. When generating the audio object metadata, the audio object clusters and the associated metadata are determined relative to the static listener position and orientation (e.g. the position and orientation of a listener in a cinema or home theatre) and thereafter sent in a single data container to the client.
Hence, applications wherein a listener position is dynamic, such as for example an “audio-zoom” function in reality television wherein a listener can zoom into a specified direction or into a specific conversation or an “augmented audio” function wherein a listener is able to “walk around” in a real or virtual world, cannot be realized.
Such applications would require transmitting all individual audio objects for multiple listener positions to the client device without any clustering, thus re-introducing the bandwidth problem. Such scheme would require high bandwidth resources for distributing all audio objects, as well substantial processing power for rendering all audio data at the client side. Alternatively, a real-time, personalized rendering of the required audio objects, object clusters and metadata for a requested listener position may be considered. However, such solution would require a substantial amount of processing power at the server side, as well as a high aggregate bandwidth for the total number of listeners. None of these solutions provide a scalable solution for rendering audio objects on the basis of listener positions and orientations that can change in time and/or determined by the user or another application or party.
Hence, from the above it follows that there is a need in the art for improved methods, server and client devices that enable large groups of listeners to select and consume personalized surround-sound or 3D audio for different listener positions using only a limited amount of processing power and bandwidth.