The introduction of digital cinema and the development of true three-dimensional (“3D”) or virtual 3D content have created new standards for sound, such as the incorporation of multiple channels of audio to allow for greater creativity for content creators and a more enveloping and realistic auditory experience for audiences. Expanding beyond traditional speaker feeds and channel-based audio as a means for distributing spatial audio is critical, and there has been considerable interest in a model-based audio description that allows the listener to select a desired playback configuration with the audio rendered specifically for their chosen configuration. The spatial presentation of sound utilizes audio objects, which are audio signals with associated parametric source descriptions of apparent source position (e.g., 3D coordinates), apparent source width, and other parameters. As a further advancement, a next-generation spatial audio (also referred to as “adaptive audio”) format has been developed that comprises a mix of audio objects and traditional channel-based speaker feeds along with positional metadata for the audio objects. In a spatial audio decoder, the channels are sent directly to their associated speakers or down-mixed to an existing speaker set, and audio objects are rendered by the decoder in a flexible (adaptive) manner. The parametric source description associated with each object, such as a positional trajectory in 3D space, is taken as an input along with the number and position of speakers connected to the decoder. The renderer then utilizes certain algorithms, such as a panning law, to distribute the audio associated with each object (“object-based audio”) across the attached set of speakers. The authored spatial intent of each object is thus optimally presented over the specific speaker configuration that is present in the listening room.
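By way of illustration only, the distribution of an object's audio across speakers by a panning law may be sketched as follows. The sketch assumes a two-speaker layout and a sine/cosine constant-power law, which is one commonly used panning law and is not necessarily the one employed by any particular renderer; the function names and the normalized position parameter are hypothetical.

```python
import math

def constant_power_pan(position):
    """Return (left_gain, right_gain) for a normalized position.

    position is an assumed parameter in [0, 1]: 0 is fully left,
    1 is fully right. The squared gains always sum to 1, so the
    perceived power stays constant as the object moves.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be within [0, 1]")
    angle = position * math.pi / 2.0
    return math.cos(angle), math.sin(angle)

def render_object(samples, position):
    """Distribute one object's samples over a left/right speaker pair."""
    left_gain, right_gain = constant_power_pan(position)
    left = [s * left_gain for s in samples]
    right = [s * right_gain for s in samples]
    return left, right
```

A renderer for a larger speaker set would generalize the same idea by computing one gain per attached speaker from the object position and the speaker positions.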
In traditional channel-based audio systems, audio post-processing does not change over time due to changes in bitstream content. Since audio carried throughout the system is always identified using static channel identifiers (such as Left, Right, Center, etc.), individual audio post-processing technology may always remain active. An object-based audio system, however, uses new audio post-processing mechanisms that use specialized metadata to render object-based audio to a channel-based speaker layout. In practice, an object-based audio system must also support and handle channel-based audio, in part to support legacy audio content. Since channel-based audio lacks the specialized metadata that enables audio rendering, certain audio post-processing technologies may be different when the coded audio source contains object-based or channel-based audio. For example, an upmixer may be used to generate content for speakers that are not present in the incoming channel-based audio, and such an upmixer would not be applied to object-based audio.
In most present systems, an audio program generally contains only one type of audio, either object-based or channel-based, and thus the processing chain (rendering or upmixing) may be chosen at initialization time. With the advent of new audio formats, however, the audio type (channel or object) in a program may change over time, due to transmission medium, creative choice, user interaction, or other similar factors. In a hybrid audio system, it is possible for audio to switch between object-based and channel-based audio without changing the codec. In this case, the system ideally does not exhibit muting or audio delay, but rather provides a continuous audio stream to all of its speaker outputs by switching between rendered object output and upmixed channel output. One problem in present audio systems is that they may mute or glitch on such a change in the bitstream.
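The desired behavior, in which every incoming frame produces output regardless of a change in audio type, may be sketched as follows. This is a minimal illustration under stated assumptions and not a description of any actual decoder: the Frame structure, the audio_type field, and the placeholder render_objects and upmix_channels functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    audio_type: str        # assumed values: "object" or "channel"
    samples: List[float]

def render_objects(frame):
    # Placeholder for object rendering; a real renderer would use metadata.
    return {"chain": "renderer", "samples": list(frame.samples)}

def upmix_channels(frame):
    # Placeholder for upmixing channel-based audio to the speaker layout.
    return {"chain": "upmixer", "samples": list(frame.samples)}

def process_stream(frames):
    """Select the processing chain per frame so that no frame is dropped."""
    outputs = []
    for frame in frames:
        if frame.audio_type == "object":
            outputs.append(render_objects(frame))
        elif frame.audio_type == "channel":
            outputs.append(upmix_channels(frame))
        else:
            raise ValueError("unknown audio type: " + frame.audio_type)
    return outputs
```

In this sketch the chain is chosen from information carried with each frame, so a change in audio type between consecutive frames yields one output per input frame with no gap.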
For adaptive audio content having both objects and channels, modern Audio/Video Receiver (AVR) systems, such as those that may utilize Dolby® Atmos® technology or other adaptive audio standards, generally consist of one or more Digital Signal Processor (DSP) chips, and one or more microcontroller chips or cores of a single chip (e.g., a System on Chip, SoC). The microcontroller is responsible for managing the processing on the DSP and interacting with the user, while the DSP is optimized specifically to perform audio processing. When switching between object-based and channel-based audio, it may be possible for the DSP to signal the change to the microcontroller, which then uses logic to reconfigure the DSP to handle the new audio type. This type of signaling is referred to as “out-of-band” signaling since it occurs between the DSP and microcontroller. Such out-of-band signaling necessarily takes some amount of time due to factors such as processing overhead, transmission latencies, and data switching overhead, and this often leads to unnecessary muting, or possible glitching of the audio if the DSP incorrectly processes the audio data.
What is needed, therefore, is a way to switch between object-based and channel-based content that provides a continuous or smooth audio stream without gaps, mutes, or glitches. What is further needed is a mechanism that allows an audio-processing DSP to select the correct processing chain for the incoming audio, without needing to communicate externally to other processors or microcontrollers.
With respect to object audio rendering systems having an object audio renderer, object-based audio comprises portions of digital audio data (e.g., samples of PCM audio) along with metadata that defines how the associated samples are to be rendered. The proper timing of the metadata updates with the corresponding samples of audio data is therefore important for accurate rendering of the audio objects. In a dynamic audio program with many objects and/or with objects that may move quickly around the sound space, the metadata updates may occur very quickly with respect to the audio frame rate. Present object-based audio processing systems are generally capable of handling metadata updates that occur regularly and at a rate that is within the processing capabilities of the decoder and rendering processors. Such systems often rely on audio frames that are of a set size and metadata updates that are applied at a uniformly periodic rate. However, as updates occur more quickly or in a non-uniformly periodic manner, processing the updates becomes much more challenging. Often, an update may not be properly aligned with the audio samples to which it applies, either because updates occur too quickly or because synchronization slips between metadata updates and the corresponding audio samples. In this case, audio samples may be rendered according to improper metadata definitions.
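The alignment requirement may be illustrated by a simplified sketch in which each metadata update carries the sample offset at which it takes effect, and the offsets need not be uniformly spaced. The representation of an update as an (offset, gain) pair and the function name are assumptions made for illustration only; actual object metadata carries considerably more than a single gain.

```python
def apply_metadata_updates(samples, updates, initial_gain):
    """Render one frame, applying each update exactly at its sample offset.

    samples:      list of PCM sample values for one frame
    updates:      list of (sample_offset, gain) pairs, in any order
                  (hypothetical, simplified metadata)
    initial_gain: gain in effect before the first update
    """
    pending = sorted(updates)      # order updates by sample offset
    next_update = 0
    gain = initial_gain
    output = []
    for n, sample in enumerate(samples):
        # Activate every update whose offset has been reached.
        while next_update < len(pending) and pending[next_update][0] <= n:
            gain = pending[next_update][1]
            next_update += 1
        output.append(sample * gain)
    return output
```

If an update were instead applied at the wrong offset, for example only at the next frame boundary, the samples between the intended offset and the boundary would be rendered with stale metadata, which is the misalignment described above.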
What is further needed is a mechanism to adapt a codec decoded output to properly buffer and deserialize the metadata for adaptive audio systems in the most efficient way possible. What is further needed is an object audio renderer interface that is configured to ensure that object audio is rendered with the least amount of processing power and with high accuracy, and that is also adjustable to customer needs, depending on their chip architecture.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. Dolby, Dolby Digital Plus, Dolby TrueHD, and Atmos are trademarks of Dolby Laboratories Licensing Corporation.