Any discussion of the background art throughout the specification should in no way be considered as an admission that such art is widely known or forms part of common general knowledge in the field.
The new Dolby Atmos™ cinema system introduced the concept of a hybrid audio authoring, distribution and playback representation that includes both audio beds (audio channels, also referred to as static objects) and dynamic audio objects. In the present description, the term ‘audio objects’ relates to particular components of a captured audio input that are spatially, spectrally or otherwise distinct. Audio objects often originate from different physical sources. Examples of audio objects include audio such as voices, instruments, music, ambience, background noise and other sound effects such as approaching cars.
In the Atmos™ system, audio beds (or static objects) refer to audio channels that are meant to be reproduced at predefined, fixed loudspeaker locations. Dynamic audio objects, on the other hand, refer to individual audio elements that may exist for a defined duration in time and have spatial information describing certain properties of the object, such as its intended position, the object size, information indicating a specific subset of loudspeakers to be enabled for reproduction of the dynamic objects, and the like. This additional information is referred to as object metadata and allows the authoring of audio content independent of the end-point loudspeaker setup, since dynamic objects are not linked to specific loudspeakers. Furthermore, object properties may change over time, and consequently metadata can be time varying.
Reproduction of hybrid audio requires a renderer to transform the object-based audio representation to loudspeaker signals. A renderer takes as inputs (1) the object audio signals, (2) the object metadata, (3) the end-point loudspeaker setup, indicating the locations of the loudspeakers, and outputs loudspeaker signals. The aim of the renderer is to produce loudspeaker signals that result in a perceived object location that is equal to the intended location as specified by the object metadata. In the case that no loudspeaker is available at the intended position, a so-called phantom image is created by panning the object across two or more loudspeakers in the vicinity of the intended object position. In mathematical form, a conventional renderer can be described by a set of time-varying panning gains $g_{i,j}(t)$ being applied to a set of object audio signals $x_j(t)$ to result in a set of loudspeaker signals $s_i(t)$:
$$s_i(t) = \sum_j g_{i,j}(t)\, x_j(t) \qquad \text{(Eq 1)}$$
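By way of illustration only, the rendering operation of Eq 1 may be sketched as follows. The function name `render` and the nested-list layout of the gains and signals are assumptions made for this sketch and do not describe any particular renderer implementation.

```python
def render(gains, objects):
    """Apply Eq 1: s_i(t) = sum over j of g_ij(t) * x_j(t).

    gains[t][i][j] is the panning gain of object j into loudspeaker i
    at sample t (hypothetical layout chosen for this sketch).
    objects[j][t] is sample t of object audio signal j.
    Returns speakers[i][t], the loudspeaker signals.
    """
    num_samples = len(gains)
    num_speakers = len(gains[0])
    num_objects = len(objects)
    return [
        [
            sum(gains[t][i][j] * objects[j][t] for j in range(num_objects))
            for t in range(num_samples)
        ]
        for i in range(num_speakers)
    ]


# One object with constant amplitude that moves from loudspeaker 0
# to loudspeaker 1 over three samples (time-varying gains).
objects = [[1.0, 1.0, 1.0]]
gains = [
    [[1.0], [0.0]],  # t = 0: object entirely in loudspeaker 0
    [[0.6], [0.8]],  # t = 1: phantom image between the two loudspeakers
    [[0.0], [1.0]],  # t = 2: object entirely in loudspeaker 1
]
speakers = render(gains, objects)
```

In this sketch the gains at each sample satisfy 0.6² + 0.8² = 1, so the total power delivered to the loudspeakers stays constant while the object moves.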
In this formulation, index $i$ refers to a loudspeaker, and index $j$ is the object index. The panning gains $g_{i,j}(t)$ result from the loudspeaker positions $P_i$ in the loudspeaker set $P$ and time-varying object position metadata $M_j(t)$
$$M_j(t) = \begin{bmatrix} X_j(t) \\ Y_j(t) \\ Z_j(t) \end{bmatrix} \qquad \text{(Eq 2)}$$
based on a panning law or panning function $\mathcal{F}$:
$$g_{i,j}(t) = \mathcal{F}\big(P, M_j(t)\big) \qquad \text{(Eq 3)}$$
A wide range of methods of specifying $\mathcal{F}$ to compute panning gains for a given loudspeaker with index $i$ and position $P_i$ have been proposed in the past. These include, but are not limited to, the sine-cosine panning law, the tangent panning law, and the sine panning law (cf. Breebaart, 2013 for an overview). Furthermore, multi-channel panning laws such as vector-based amplitude panning (VBAP) have been proposed for 3-dimensional panning (Pulkki, 2002).
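As a non-limiting example of one such panning function, the sine-cosine panning law for a single pair of loudspeakers may be sketched as follows. The normalized pan position `p` (0 at the first loudspeaker, 1 at the second) and the function name are assumptions introduced for this sketch only.

```python
import math


def sine_cosine_pan(p):
    """Return (g_a, g_b) for a pan position p in [0, 1].

    p = 0 places the object at loudspeaker a, p = 1 at loudspeaker b.
    The gains satisfy g_a**2 + g_b**2 == 1 (energy preserving).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("pan position must lie in [0, 1]")
    angle = p * math.pi / 2.0
    return math.cos(angle), math.sin(angle)


g_a, g_b = sine_cosine_pan(0.5)  # object midway between the loudspeakers
```

For an object midway between the loudspeakers, both gains are approximately 0.707, that is, each loudspeaker is driven about 3 dB below the level used when the object coincides with a single loudspeaker.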
Amplitude panning has been shown to work well when applied to pair-wise panning across loudspeakers in the horizontal (left-right) plane that are symmetrically placed in terms of their azimuth. The maximum azimuth aperture angle between loudspeakers for panning to work well amounts to approximately 60 degrees, allowing a phantom image to be created between −30 and +30 degrees azimuth. Panning across loudspeakers lateral to the listener (front to rear in the listening frame), however, causes a variety of problems:
        When the listener is not exactly positioned in a desired audio ‘sweet spot’, or whenever loudspeakers are not exactly delay aligned at the listener's position, combing artifacts will arise when an object is panned across two loudspeakers. This combing effect deteriorates the perceived timbre of the phantom source, and results in a collapse of the spaciousness of the overall scene. Moreover, small changes in the orientation and position of the head will cause comb-filter notches and peaks to shift in frequency. As a result, the sweet spot in a multi-channel loudspeaker setup is often small and the perceived timbre strongly depends on the head orientation and position. This is sometimes referred to as ‘the rocking chair’ problem.
        In pair-wise panning using symmetrically-placed loudspeakers in front of the listener, the contribution of the two loudspeakers results in sound-source localization cues at the level of the listener's eardrums that closely correspond to those arising from the intended sound source location. This process does not work reliably for panning across loudspeakers in the front-to-rear direction. As a result, the perceived phantom source location can be ambiguous, or may be very different from the intended source location.
Downmixing of rendered audio content (for example from Dolby Digital 5.1, per the ATSC A/52 standard, to stereo) causes an increase in the audio level of audio objects that are panned across front and surround loudspeakers. This is caused by the fact that panning laws are typically energy preserving, i.e.:
$$\sum_i g_{i,j}^2 = 1 \qquad \text{(Eq 4)}$$
When the corresponding loudspeaker signals are downmixed electrically, a gain buildup will occur because for any gains $0 \le g_{i,j} \le 1$:
$$\sum_i g_{i,j} \ge \sqrt{\sum_i g_{i,j}^2} \qquad \text{(Eq 5)}$$
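The gain buildup expressed by Eq 4 and Eq 5 may be illustrated numerically as follows. The sketch assumes, purely for illustration, that the loudspeaker signals are summed with unity downmix coefficients; practical downmix equations may apply additional attenuation to individual channels. The function name is hypothetical.

```python
import math


def downmix_buildup_db(gains):
    """Level increase, in dB, when loudspeaker signals carrying the same
    object are summed electrically (unity downmix coefficients assumed).

    Compares the amplitude sum (left side of Eq 5) with the square root
    of the summed energy (right side of Eq 5).
    """
    amplitude_sum = sum(gains)
    energy = sum(g * g for g in gains)
    return 20.0 * math.log10(amplitude_sum / math.sqrt(energy))


# Energy-preserving gains for an object panned midway between a front
# and a surround loudspeaker (Eq 4 holds: 0.5 + 0.5 = 1).
centre_gains = (math.sqrt(0.5), math.sqrt(0.5))
buildup_centre = downmix_buildup_db(centre_gains)

# An object reproduced by a single loudspeaker shows no buildup.
buildup_single = downmix_buildup_db((1.0, 0.0))
```

Under these assumptions the object panned midway between the two loudspeakers is raised by approximately 3 dB after downmixing, whereas the object reproduced by a single loudspeaker is unchanged.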
The limitations of existing audio systems are particularly relevant for Dolby Digital 5.1 playback, and/or for loudspeaker configurations with 4 overhead loudspeakers such as 5.1.4 or 7.1.4. For such loudspeaker configurations, (dynamic) objects with metadata indicating a position in the middle of the room, or in the middle of the ceiling plane will typically be phantom-imaged between pair-wise remotely placed front and rear loudspeakers. Furthermore, side-surround channels may be produced as phantom images as well. An example of such a phantom-imaging problem is visualized in FIG. 1, which illustrates a square room with four corner loudspeakers labeled ‘Lf’, ‘Rf’, ‘Ls’, and ‘Rs’, which are placed in the corners of the square room. A fifth center loudspeaker labeled ‘C’ is positioned directly in front of a listener's position (which corresponds roughly to the center of the room). An audio object with metadata coordinates (x=0, y=0.4) as depicted by the circle labeled ‘object’ is typically amplitude panned between loudspeakers labeled ‘Lf’ and ‘Ls’, as indicated by the arrows originating from ‘object’. Furthermore, if the content comprises more than five channels, for example also comprising a right side-surround channel (dashed-line loudspeaker icon labeled ‘Rss’ in FIG. 1), the signal associated with that channel may be reproduced by loudspeakers labeled ‘Rf’ and ‘Rs’ to preserve the spatial intent of that particular channel.
Amplitude panning as depicted in FIG. 1 can be thought of as trading timbre and sweet-spot size for the preservation of spatial artistic intent in sweet-spot listening.
Note that with a 7-channel loudspeaker setup (e.g. including ‘Lss’ and ‘Rss’ loudspeakers), the content depicted in FIG. 1 would have significantly less phantom-imaging applied. In particular, the ‘Rss’ channel would be reproduced by a dedicated ‘Rss’ loudspeaker, while the object at y=0.4 would be reproduced mostly by the ‘Lss’ loudspeaker, with only a small amount of leakage to the ‘Lf’ loudspeaker.
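The difference between the two loudspeaker setups for the object at y=0.4 may be illustrated by the following sketch. It assumes, for illustration only, a sine-cosine panning law, a coordinate convention in which ‘Lf’ lies at y=0 and ‘Ls’ at y=1, and an ‘Lss’ loudspeaker at y=0.5; none of these values is specified by FIG. 1, and the function name is hypothetical.

```python
import math


def pairwise_gains(y, y_front, y_rear):
    """Sine-cosine gains for an object at coordinate y located between
    a loudspeaker at y_front and a loudspeaker at y_rear."""
    p = (y - y_front) / (y_rear - y_front)  # normalized pan position
    angle = p * math.pi / 2.0
    return math.cos(angle), math.sin(angle)


# Assumed y coordinates of the left-wall loudspeakers (not from FIG. 1).
Y_LF, Y_LSS, Y_LS = 0.0, 0.5, 1.0
OBJECT_Y = 0.4

# 5-channel setup: the object is phantom-imaged between 'Lf' and 'Ls'.
g_lf_5, g_ls_5 = pairwise_gains(OBJECT_Y, Y_LF, Y_LS)

# 7-channel setup: the object is panned between 'Lf' and 'Lss'.
g_lf_7, g_lss_7 = pairwise_gains(OBJECT_Y, Y_LF, Y_LSS)
```

Under these assumptions, the 5-channel setup distributes the object over ‘Lf’ and ‘Ls’ with gains of roughly 0.81 and 0.59, whereas the 7-channel setup feeds ‘Lss’ with a gain of roughly 0.95 and ‘Lf’ with only about 0.31.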
There is a desire to mitigate the limitations imposed by prior-art amplitude panning.