The Dolby® Atmos cinema system introduced a hybrid audio authoring, distribution and playback format for audio information that includes both “audio beds” and “audio objects.” The term “audio beds” refers to conventional audio channels that are intended to be reproduced by acoustic transducers at predefined, fixed locations. The term “audio objects” refers to individual audio elements or sources of aural content that may exist for a limited duration in time and have spatial information or “spatial metadata” describing one or more spatial characteristics such as position, velocity and size of each object. The audio information representing beds and objects can be stored or transmitted separately and used by a spatial reproduction system to recreate the artistic intent of the audio information using a variety of configurations of acoustic transducers. The numbers and locations of the acoustic transducers may vary from one configuration to another.
Motion picture soundtracks that comply with Dolby Atmos cinema system specifications may have as many as 7, 9 or even 11 audio beds of audio information. Dolby Atmos cinema system soundtracks may also include audio information representing hundreds of individual audio objects, which are “rendered” by the soundtrack playback process to generate audio signals that are particularly suited for acoustic transducers in a specified configuration. The rendering process generates audio signals to drive a specified configuration of acoustic transducers so that the sound field generated by those acoustic transducers reproduces the intended spatial characteristics of the audio objects, thereby providing listeners with a spatially diverse and immersive audio experience.
The advent of object-based audio has significantly increased the amount of audio data needed to represent the aural content of a soundtrack and has significantly increased the complexity of the process needed to process and play back this data. For example, cinematic soundtracks may comprise many sound elements corresponding to objects on and off the screen, dialog, noises, and sound effects that combine with background music and ambient effects to create the overall auditory experience. Accurate rendering requires that sounds be reproduced in such a way that listener impressions correspond as closely as possible to sound source position, intensity, movement and depth for objects appearing on the screen as well as off the screen. Object-based audio represents a significant improvement over traditional channel-based audio systems, which send audio content in the form of audio signals for individual acoustic transducers at predefined locations within a listening environment. These traditional channel-based systems are limited in the spatial impressions that they can create.
A soundtrack that contains a large number of audio objects imposes several challenges on the playback system. Each object requires a rendering process that determines how the object audio signal should be distributed among the available acoustic transducers. For example, in a so-called 5.1-channel reproduction system consisting of left-front, right-front, center, low-frequency effects, left-surround and right-surround channels, the sound of an audio object may be reproduced by any subset of these acoustic transducers. The rendering process determines which channels and acoustic transducers are used in response to the object's spatial metadata. Because the relative level or loudness of the sound reproduced by each acoustic transducer greatly influences the position perceived by listeners, the rendering process can perform its function by determining panning gains or relative levels for each acoustic transducer to create an aural impression of spatial position in listeners that closely resembles the intended audio object location as specified by its spatial metadata. If the sounds of multiple objects are to be reproduced over several acoustic transducers, the panning gains or relative levels determined by the rendering process can be represented by coefficients in a rendering matrix. These coefficients determine the gain for the aural content of each object for each acoustic transducer.
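The rendering-matrix idea described above can be sketched as a matrix multiplication in which each coefficient is the panning gain applied to one object's signal for one acoustic transducer. The gains and layout below are illustrative assumptions, not the coefficients of any particular renderer:

```python
import numpy as np

# Hypothetical sketch: rendering audio objects through a rendering matrix.
# Each coefficient is the panning gain applied to one object's signal for
# one acoustic transducer; the speaker feeds are a matrix product.

def render(objects: np.ndarray, gains: np.ndarray) -> np.ndarray:
    """objects: (num_objects, num_samples); gains: (num_speakers, num_objects).
    Returns speaker feeds with shape (num_speakers, num_samples)."""
    return gains @ objects

# Two objects rendered to a 5.1 layout (L, R, C, LFE, Ls, Rs): object 0 is
# placed hard front-left; object 1 is panned between center and right-surround
# with equal-power gains (1/sqrt(2) each). All values are illustrative.
gains = np.array([
    [1.0, 0.0],    # left-front
    [0.0, 0.0],    # right-front
    [0.0, 0.7071], # center
    [0.0, 0.0],    # low-frequency effects
    [0.0, 0.0],    # left-surround
    [0.0, 0.7071], # right-surround
])
objects = np.random.default_rng(0).standard_normal((2, 480))
feeds = render(objects, gains)
```

As the spatial metadata changes, the gains change; the same multiplication then produces speaker feeds that move the perceived source.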
The values of the coefficients in a rendering matrix vary in time to reproduce the aural effect of moving objects. The storage capacity and the bandwidth needed to store and convey the spatial metadata for all audio objects in a soundtrack may be kept within specified limits by controlling how often spatial metadata is changed, thereby controlling how often the values of the coefficients in a rendering matrix are changed. In typical implementations, the matrix coefficients are changed once in a period between 10 and 500 milliseconds in length, depending on a number of factors including the speed of the object, the required positional accuracy, and the capacity available to store and transmit the spatial metadata.
When a playback system performs discontinuous rendering matrix updates, the demands for accurate spatial impressions may require some form of interpolation of either the spatial metadata or the updated values of the rendering matrix coefficients. Without interpolation, large changes in the rendering matrix coefficients may cause undesirable artifacts in the reproduced audio such as clicking sounds, zipper-like noises or objectionable jumps in spatial position.
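One common form of such interpolation, sketched below with hypothetical values, is to ramp the coefficients linearly from the old matrix to the new one over the update interval, so each speaker feed crossfades instead of stepping:

```python
import numpy as np

# Hypothetical sketch: linear interpolation of rendering-matrix coefficients
# between two discontinuous updates, removing the step that can cause clicks.

def interpolated_gains(m_old: np.ndarray, m_new: np.ndarray,
                       num_samples: int) -> np.ndarray:
    """Per-sample matrices ramping linearly from m_old to m_new.
    Returns shape (num_samples, num_speakers, num_objects)."""
    t = np.linspace(0.0, 1.0, num_samples)[:, None, None]
    return (1.0 - t) * m_old + t * m_new

# One object panned from the left to the right speaker of a stereo pair.
m_old = np.array([[1.0], [0.0]])
m_new = np.array([[0.0], [1.0]])
ramp = interpolated_gains(m_old, m_new, 40)

# With interpolation the per-sample coefficient changes are small and bounded;
# without it, the left gain would jump from 1.0 to 0.0 in a single sample.
max_step = np.abs(np.diff(ramp, axis=0)).max()
```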
The need for interpolation causes problems for existing or “legacy” systems that play back distribution media such as the Blu-ray disc, which support lossless codecs such as those that conform to specifications for Meridian Lossless Packing (MLP). Additional details for MLP may be obtained from Gerzon et al., “The MLP Lossless Compression System for PCM Audio,” J. AES, vol. 52, no. 3, pp. 243-260, March 2004.
An implementation of the MLP coding technique allows several user-specified options for encoding multiple presentations of the input audio. In one option, a medium can store up to 16 discrete audio channels. A reproduction of all 16 channels is referred to as a “top-level presentation.” These 16 channels may be downmixed into any of several other presentations using a smaller number of channels by means of downmixing matrices whose coefficients are invariant during specified intervals of time. When used for legacy Blu-ray streams, for example, up to three downmix presentations can be generated. These downmix presentations may have up to 8, 6 or 2 channels, respectively, which are often used for 7.1-channel, 5.1-channel and 2-channel stereo formats. The audio information needed for the top-level presentation is encoded and decoded losslessly by exploiting correlations between the various presentations. The downmix presentations are constructed from a cascade of matrices that give bit-for-bit reproducible downmixes and offer the benefit of requiring only 2-channel decoders to decode presentations with no more than two channels, only 6-channel decoders to decode presentations with no more than six channels, and only 8-channel decoders to decode presentations with no more than eight channels.
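The cascade can be pictured as successive constant-coefficient matrix multiplications. The sketch below uses hypothetical, uniform coefficients purely to show the structure; the actual MLP matrixing is designed for bit-exact lossless operation and is not reproduced here:

```python
import numpy as np

# Hypothetical cascade of downmix matrices: a 16-channel top-level
# presentation folded to 8-, 6- and 2-channel presentations in turn.
# Coefficient values are placeholders, not real downmix coefficients.
rng = np.random.default_rng(1)
top16 = rng.standard_normal((16, 1024))   # 16-channel top-level audio

A = np.full((8, 16), 1.0 / 16.0)  # 16 -> 8 channels (e.g. 7.1)
B = np.full((6, 8), 1.0 / 8.0)    # 8 -> 6 channels (e.g. 5.1)
C = np.full((2, 6), 1.0 / 6.0)    # 6 -> 2 channels (stereo)

mix8 = A @ top16
mix6 = B @ mix8
mix2 = C @ mix6

# Because matrix multiplication is associative, the cascade is equivalent
# to a single combined matrix applied to the top-level channels.
assert np.allclose(mix2, (C @ B @ A) @ top16)
```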
For object-based content, however, this multi-level presentation approach is problematic. If the top-level presentation consists of objects, or clusters of objects, augmented with spatial metadata, the downmix presentations require interpretation and interpolation of the spatial metadata used to create 2-channel stereo, 5.1 or 7.1 backward-compatible mixes. These backward-compatible mixes are required for legacy Blu-ray players that do not support object-based audio information. Unfortunately, matrix interpolation is not implemented in legacy players, and the rate of matrix updates in the implementation described above is limited to only once in a 40-sample interval or integer multiples thereof. Updating rendering matrix coefficients without interpolation between updates is referred to herein as discontinuous rendering matrix updates. The discontinuous matrix updates that occur at the rates permitted by existing or legacy systems may generate unacceptable artifacts such as zipper noise, clicks and spatial discontinuities.
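A discontinuous update of this kind can be modeled as holding the matrix constant within each block and jumping to new values at block boundaries. The block size and signal below are illustrative assumptions used only to show the step discontinuity that produces a click:

```python
import numpy as np

# Hypothetical model of discontinuous rendering-matrix updates on a
# 40-sample grid: coefficients are held constant within each block and
# jump to new values at block boundaries, with no interpolation.
BLOCK = 40

def render_discontinuous(obj, gains_per_block):
    """obj: (num_objects, num_samples); gains_per_block: one
    (num_speakers, num_objects) matrix per 40-sample block."""
    blocks = [g @ obj[:, i * BLOCK:(i + 1) * BLOCK]
              for i, g in enumerate(gains_per_block)]
    return np.concatenate(blocks, axis=1)

# A sine panned from left to right in a single jump: the left feed drops
# from full level to silence between samples 39 and 40, a step
# discontinuity that is reproduced as an audible click.
sig = np.sin(2.0 * np.pi * 997.0 * np.arange(80) / 48000.0)[None, :]
g_left = np.array([[1.0], [0.0]])   # block 0: object sent to left only
g_right = np.array([[0.0], [1.0]])  # block 1: object sent to right only
feeds = render_discontinuous(sig, [g_left, g_right])
```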
One potential solution to this problem is to limit the magnitude of the changes in rendering matrix coefficients so that the changes do not generate audible artifacts for critical content. Unfortunately, this solution would limit coefficient changes to be on the order of just a few decibels per second, which is generally too slow for accurate rendering of dynamic content in many motion picture soundtracks.
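To make the scale of this limitation concrete, the following sketch (a hypothetical helper, assuming non-negative panning gains) clamps each coefficient update to a decibel-per-second budget; at a few dB/s, traversing a 20 dB gain change takes several seconds:

```python
import math

def limit_update(old_gain: float, new_gain: float, interval_s: float,
                 max_db_per_s: float = 3.0, floor: float = 1e-6) -> float:
    """Clamp a non-negative panning-gain update so the level change over
    one update interval stays within max_db_per_s (hypothetical limiter
    illustrating the 'few decibels per second' constraint)."""
    budget_db = max_db_per_s * interval_s
    delta_db = 20.0 * math.log10(max(new_gain, floor) / max(old_gain, floor))
    delta_db = max(-budget_db, min(budget_db, delta_db))
    return max(old_gain, floor) * 10.0 ** (delta_db / 20.0)

# Requesting a jump from 0.1 to 1.0 (+20 dB) with 100 ms updates at 3 dB/s
# moves the gain by only 0.3 dB per update; reaching the target would take
# several seconds, far too slow for a fast-moving object.
stepped = limit_update(0.1, 1.0, 0.1)
```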