New professional and consumer-level audio-visual (AV) systems (such as the Dolby® Atmos™ system) have been developed to render hybrid audio content using a format that includes both audio beds (channels) and audio objects. Audio beds refer to audio channels that are meant to be reproduced in predefined, fixed speaker locations (e.g., 5.1 or 7.1 surround) while audio objects refer to individual audio elements that exist for a defined duration in time and have spatial information describing the position, velocity, and size (as examples) of each object. During transmission beds and objects can be sent separately and then used by a spatial reproduction system to recreate the artistic intent using a variable number of speakers in known physical locations. Based on the capabilities of an authoring system there may be tens or even hundreds of individual audio objects (static and/or time-varying) that are combined during rendering to create a spatially diverse and immersive audio experience. In an embodiment, the audio processed by the system may comprise channel-based audio, object-based audio or object and channel-based audio. The audio comprises or is associated with metadata that dictates how the audio is rendered for playback on specific devices and listening environments. In general, the terms “hybrid audio” or “adaptive audio” are used to mean channel-based and/or object-based audio signals plus metadata that renders the audio signals using an audio stream plus metadata in which the object positions are coded as a 3D position in space.
Adaptive audio systems thus represent the sound scene as a set of audio objects in which each object is comprised of an audio signal (waveform) and time varying metadata indicating the position of the sound source. Playback over a traditional speaker set-up such as a 7.1 arrangement (or other surround sound format) is achieved by rendering the objects to a set of speaker feeds. The process of rendering comprises in large part (or solely) a conversion of the spatial metadata at each time instant into a corresponding gain matrix, which represents how much of each of the object feeds into a particular speaker. Thus, rendering “N” audio objects to “M” speakers at time “t” (t) can be represented by the multiplication of a vector x(t) of length “N”, comprised of the audio sample at time t from each object, by an “M-by-N” matrix A(t) constructed by appropriately interpreting the associated position metadata (and any other metadata such as object gains) at time t. The resultant samples of the speaker feeds at time t are represented by the vector y(t). This is shown below in Eq. 1:
                                                                        [                                                                                                                              y                          0                                                ⁡                                                  (                          t                          )                                                                                                                                                                                                  y                          1                                                ⁡                                                  (                          t                          )                                                                                                                                                ⋮                                                                                                                                                    y                                                      M                            -                            1                                                                          ⁡                                                  (                          t                          )                                                                                                                    ]                                                                                        y                ⁡                                  (                  t                  )                                                                    =                                                                              [                                                                                                                                          a                            00                                                    ⁡                                                      (                            t                            )                                                                                                                                                                            a                            01                                                    ⁡                                                      (                            t                            )                                                                                                                                                                            a                            02                                                    ⁡                                                      (                            t                            )                                                                                                                      …                                                                                                                          a                                                          0                              ,                                                              N                                -                                1                                                                                                              ⁡                                                      (                            t                            )                                                                                                                                                                                                                    a                            10                                                    ⁡                                                      (                            t                            )                                                                                                                      ⋮                                                                    ⋮                                                                    ⋮                                                                    ⋮                                                                                                            ⋮                                                                    ⋮                                                                    ⋮                                                                    ⋮                                                                    ⋮                                                                                                                                                                  a                                                                                          M                                -                                1                                                            ,                              0                                                                                ⁡                                                      (                            t                            )                                                                                                                      ⋮                                                                    ⋮                                                                    ⋮                                                                                                                          a                                                                                          M                                -                                1                                                            ,                                                              N                                -                                1                                                                                                              ⁡                                                      (                            t                            )                                                                                                                                ]                                                                                                      A                  ⁡                                      (                    t                    )                                                                                ⁢                                                                      [                                                                                                                                          x                            0                                                    ⁡                                                      (                            t                            )                                                                                                                                                                                                                    x                            1                                                    ⁡                                                      (                            t                            )                                                                                                                                                                                                                    x                            2                                                    ⁡                                                      (                            t                            )                                                                                                                                                              ⋮                                                                                                                                                                  x                                                          N                              -                              1                                                                                ⁡                                                      (                            t                            )                                                                                                                                ]                                                                                                      x                  ⁡                                      (                    t                    )                                                                                                          (                  Eq          .                                          ⁢          1                )            
The matrix equation of Eq. 1 above represents an adaptive audio (e.g., Atmos) rendering perspective, but it can also represent a generic set of scenarios where one set of audio samples is converted to another set by linear operations. In an extreme case A(t) is a static matrix and may represent a conventional downmix of a set of audio channels x(t) to a fewer set of channels y(t). For instance, x(t) could be a set of audio channels that describe a spatial scene in an Ambisonics format, and the conversion to speaker feeds y(t) may be prescribed as multiplication by a static downmix matrix. Alternatively, x(t) could be a set of speaker feeds for a 7.1 channel layout, and the conversion to a 5.1 channel layout may be prescribed as multiplication by a static downmix matrix.
To provide audio reproduction that is as accurate as possible, adaptive audio systems are often used with high-definition audio codecs (coder-decoder) systems, such as Dolby TrueHD. As an example of such codecs, Dolby TrueHD is an audio codec that supports lossless and scalable transmission of audio signals. The source audio is encoded into a hierarchy of substreams where only a subset of the substreams need to be retrieved from the bitstream and decoded, in order to obtain a lower dimensional (or downmix) presentation of the spatial scene, and when all the substreams are decoded the resultant audio is identical to the source audio. Although embodiments may be described and illustrated with respect to TrueHD systems, it should be noted that any other similar HD audio codec system may also be used. The term “TrueHD” is thus meant to include all possible HD type codecs. Technical details of Dolby TrueHD, and the Meridian Lossless Packing (MLP) technology on which it is based, are well known. Aspects of TrueHD and MLP technology are described in U.S. Pat. No. 6,611,212, issued Aug. 26, 2003, and assigned to Dolby Laboratories Licensing Corp., and the paper by Gerzon, et al., entitled “The MLP Lossless Compression System for PCM Audio,” J. AES, Vol. 52, No. 3, pp. 243-260 (March 2004).
TrueHD supports specification of downmix matrices. In typical use, the content creator of a 7.1 channel audio program specifies a static matrix to downmix the 7.1 channel program to a 5.1 channel mix, and another static matrix to downmix the 5.1 channel downmix to a 2 channel (stereo) downmix Each static downmix matrix may be converted to a sequence of downmix matrices (each matrix in the sequence for downmixing a different interval in the program) in order to achieve clip-protection. However, each matrix in the sequence is transmitted (or metadata determining each matrix in the sequence is transmitted) to the decoder, and the decoder does not perform interpolation on any previously specified downmix matrix to determine a subsequent matrix in a sequence of downmix matrices for a program.
The TrueHD bitstream carries a set of output primitive matrices and channel assignments that are applied to the appropriate subset of the internal channels to derive the required downmix/lossless presentation. At the TrueHD encoder the primitive matrices are designed so that the specified downmix matrices can be achieved (or closely achieved) by the cascade of input channel assignment, input primitive matrices, output primitive, matrices and output channel assignment. If the specified matrix is static, i.e., time-invariant, it is possible to design the primitive matrices and channel assignments just once and employ the same decomposition throughout the audio signal. However when it is desired that the adaptive audio content be transmitted via TrueHD, such that the bitstream is hierarchical and supports deriving a number of downmixes by accessing only an appropriate subset of the internal channels, the specified downmix matrix/matrices evolve over time as the objects move. In this case a time-varying decomposition is needed and a single set of channel assignments will not work at all time (a set of channel assignments at a given time corresponds to the channel assignment for all the substreams in the bitstream at that time).
A “restart interval” in a TrueHD bitstream is a segment of audio that has been encoded such that it can be decoded independently of any segment that appears before or after it, i.e., it is a possible random access point. The TrueHD encoder divides up the audio signal into consecutive sub-segments, each of which is encoded as a restart interval. A restart interval is typically constrained to be 8 to 128 access units (AUs) in length. An access unit (defined for a particular audio sampling frequency) is a segment of a fixed number of consecutive samples. At 48 kHz sampling frequency a TrueHD AU is of length 40 samples or spans 0.833 milliseconds. The channel assignment for each substream can only be specified once every restart interval as per constraints in the bitstream syntax. The rationale behind this is to group audio associated with similarly decomposable downmix matrices together into a restart interval, and benefit from bitrate savings associated with not having to send the channel assignment each time the downmix matrix is updated (within the restart).
In legacy TrueHD systems, the downmix specification generally static, and hence it is conceivable that a prototype decomposition/channel assignment could be employed for encoding the entire length of the audio signal. Thus, restart intervals could be made as large as possible (128 AUs), and the audio signal was divided uniformly into restart intervals of this maximum size. This is no more feasible in the case where adaptive audio content has to be transmitted via TrueHD since the downmix matrices are dynamic. In other words, it is necessary to examine the evolution of downmix matrices over time and divide the audio signal into intervals over which a single channel assignment could be employed to decompose the specified downmix matrices throughout that sub-segment. Therefore, it is advantageous to segment the audio into restart intervals of potentially varying length while accounting for the dynamics of the downmix matrix trajectory.
Current systems also do not utilize spatial cues of objects in adaptive audio content when segmenting the audio. Thus, it would also be advantageous to partition the audio into segments based on the spatial metadata associated with adaptive audio objects and that describes the continuing motion of these objects for rendering through discrete speaker channels.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. Dolby, Dolby TrueHD, and Atmos are trademarks of Dolby Laboratories Licensing Corporation.