The advent of object-based audio has significantly increased the amount of audio data and the complexity of rendering this data within high-end playback systems. For example, cinema sound tracks may comprise many different sound elements corresponding to images on the screen, dialog, noises, and sound effects that emanate from different places on the screen and combine with background music and ambient effects to create the overall auditory experience. Accurate playback requires that sounds be reproduced in a way that corresponds as closely as possible to what is shown on screen with respect to sound source position, intensity, movement, and depth. Object-based audio represents a significant improvement over traditional channel-based audio systems that send audio content in the form of speaker feeds to individual speakers in a listening environment, and are thus relatively limited with respect to spatial playback of specific audio objects.
The introduction of digital cinema and the development of three-dimensional (“3D”) content has created new standards for sound, such as the incorporation of multiple channels of audio to allow for greater creativity for content creators, and a more enveloping and realistic auditory experience for audiences. Expanding beyond traditional speaker feeds and channel-based audio as a means for distributing spatial audio is critical, and there has been considerable interest in a model-based audio description that allows the listener to select a desired playback configuration with the audio rendered specifically for their chosen configuration. The spatial presentation of sound utilizes audio objects, which are audio signals with associated parametric source descriptions of apparent source position (e.g., 3D coordinates), apparent source width, and other parameters. Further advancements include a next generation spatial audio (also referred to as “adaptive audio”) format that comprises a mix of audio objects and traditional channel-based speaker feeds (beds) along with positional metadata for the audio objects.
New professional and consumer-level cinema systems (such as the Dolby® Atmos™ system) have been developed to further the concept of hybrid audio authoring, which is a distribution and playback format that includes both audio beds (channels) and audio objects. Audio beds refer to audio channels that are meant to be reproduced in predefined, fixed speaker locations while audio objects refer to individual audio elements that may exist for a defined duration in time but also have spatial information describing the position, velocity, and size (as examples) of each object. During transmission beds and objects can be sent separately and then used by a spatial reproduction system to recreate the artistic intent using a variable number of speakers in known physical locations. In some soundtracks, there may be up to 7, 9 or even 11 bed channels containing audio. Additionally, based on the capabilities of an authoring system there may be tens or even hundreds of individual audio objects that are combined during rendering to create a spatially diverse and immersive audio experience.
FIG. 1A illustrates the combination of channel and object-based data to produce an adaptive audio mix, under an embodiment. As shown in process 100, the channel-based data 102, which, for example, may be 5.1 or 7.1 surround sound data provided in the form of pulse-code modulated (PCM) data is combined with audio object data 104 to produce an adaptive audio mix 108. The audio object data 104 is produced by combining the elements of the original channel-based data with associated metadata that specifies certain parameters pertaining to the location of the audio objects. As shown conceptually in FIG. 1A, the authoring tools provide the ability to create audio programs that contain a combination of speaker channel groups and object channels simultaneously. For example, an audio program could contain one or more speaker channels optionally organized into groups (or tracks, e.g., a stereo or 5.1 track), descriptive metadata for one or more speaker channels, one or more object channels, and descriptive metadata for one or more object channels.
The large number of audio signals present in object-based content poses new challenges for the rendering of such content. Each object requires a rendering process, which determines how the object signal should be distributed over the available reproduction channels. For example, in a loudspeaker reproduction system consisting of a 5.1 setup with left front, right front, center, low-frequency effects, left surround, right surround channels, an object may be reproduced by any subset of these loudspeakers, depending on their spatial information. The (relative) level of each loudspeaker greatly influences the perceived position by the listener. In practical systems, a panning law or panning system is used to determine the so-called panning gains or relative level of each loudspeaker to result in a perceived object location that closely resembles the intended object location as indicated by its spatial information or metadata. If multiple objects are to be distributed over several loudspeakers, the process of panning can be represented by a panning or rendering matrix, which determines the gain (or signal proportion) of each object to each loudspeaker. In practical cases, such rendering matrix will be time varying to allow for variable object positions.
Besides position metadata, other, more advanced metadata may be associated with objects as well. For example, a speaker mask may be included in an object's metadata, which indicates a subset of loudspeakers that should be used for rendering. Alternatively, certain loudspeakers may be excluded for rendering an object. For example, an object may be associated with a speaker mask that excludes the surround channels or ceiling channels for rendering that object. Alternatively, or additionally, an object may have metadata that signal the rendering of an object by a speaker array rather than a single speaker or pair of loudspeakers. For practical and efficiency reasons, such metadata are often of binary nature (e.g., a certain loudspeaker is, or is not used to render a certain object). In practical systems, the use of such advanced metadata influences the coefficients present in the rendering matrix.
In object-based audio systems, object metadata is typically updated relatively infrequently (sparsely) in time to limit the associated data rate. Typical update intervals for object positions can range between 10 and 500 milliseconds, depending on the speed of the object, the required position accuracy, the available bandwidth to store or transmit metadata, and so on. Such sparse, or even irregular metadata updates require interpolation of metadata and/or rendering matrices for audio samples in-between two subsequent metadata instances. Without interpolation, the consequential step-wise changes in the rendering matrix may cause undesirable switching artifacts, clicking sounds, zipper noises, or other undesirable artifacts as a result of spectral splatter introduced by step-wise matrix updates.
FIG. 1B illustrates a typical known process to compute a rendering matrix for a set of metadata instances. As shown in FIG. 1B, a set of metadata instances (m1 to m4) 120 correspond to a set of time instances (t1 to t4) which are indicated by their position along the time axis 124. Subsequently, each metadata instance is converted to a respective rendering matrix (c1 to c4) 122, or a complete rendering matrix that is valid at that same time instance. Thus, as shown, metadata instance m1 creates rendering matrix c1 at time t1, metadata instance m2 creates rendering matrix c2 at time t2, and so on. For simplicity, FIG. 1B shows only one rendering matrix for each metadata instance m1 to m4. In practical systems, however, a rendering matrix may comprise a set of rendering matrix coefficients or gain coefficients c1,i,j to be applied to object signal with index j to create output signal with index i:
            y      j        ⁡          (      t      )        =            ∑      i        ⁢                  ⁢                            x          i                ⁡                  (          t          )                    ⁢              c                  1          ,          i          ,          j                    In the above equation xi(t) represents the signal of object i, and yi(t) represents output signal with index j.
The rendering matrices generally comprise coefficients that represent gain values at different instances in time. Metadata instances are defined at certain discrete times, and for audio samples in-between the metadata time stamps, the rendering matrix is interpolated, as indicated by the dashed line 126 connecting the rendering matrices 122. Such interpolation can be performed linearly, but also other interpolation methods can be used (such as band-limited interpolation, sine/cosine interpolation, and so on). The time interval between the metadata instances (and corresponding rendering matrices) is referred to as an “interpolation duration,” and such intervals may be uniform or they may be different, such as the longer interpolation duration between times t3 and t4 as compared to the interpolation duration between times t2 and t3.
In general, present metadata update and interpolation systems are sufficient for relatively simple objects in which the metadata definitions dictate object position and/or gain values for speakers. The change of such values can usually be adequately be interpolated in present systems by interpolation of metadata instances. For complex objects and cases in which the metadata instances are limited to certain possible values, present interpolation methods operating on metadata directly are typically unsatisfactory. For example, if a metadata instance is limited to one of two values (binary metadata), standard interpolation techniques would derive the incorrect value about half the time.
In many cases, the calculation of rendering matrix coefficients from metadata instances is well defined, but the reverse process of calculating metadata instances given a (interpolated) rendering matrix, is often difficult, or even impossible. In this respect, the process of generating a rendering matrix from metadata can sometimes be regarded as a cryptographic one-way function. The process of calculating new metadata instances between existing metadata instances is referred to as “resampling” of the metadata. Resampling of metadata is often required during certain audio processing tasks. For example, when audio content is edited, by cutting/merging/mixing and so on, such edits may occur in between metadata instances. In this case, resampling of the metadata is required. Another such case is when audio and associated metadata are encoded with a frame-based audio coder. In this case, it is desirable to have at least one metadata instance for each audio codec frame, preferably with a time stamp at the start of that codec frame, to improve resilience of frame losses during transmission. As stated above, interpolation of metadata is also ineffective for certain types of metadata, such as binary-valued metadata. For example, if binary flags such as zone exclusion masks are used, it is virtually impossible to estimate a valid set of metadata from the rendering matrix coefficients or from neighboring instances of metadata. This is shown in FIG. 1B as a failed attempt to extrapolate or derive a metadata instance m3a from the rendering matrix coefficients in the interpolation duration between times t3 and t4.
Thus, in present metadata processing for adaptive audio, any metadata resampling or upsampling process by means of interpolation is practically impossible without introducing inaccuracies in the resulting rendering matrix coefficients, and hence a loss in spatial audio quality.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.