Dolby, Dolby TrueHD, and Atmos are trademarks of Dolby Laboratories Licensing Corporation.
The complexity, and financial and computational cost, of rendering audio programs increases with the number of channels to be rendered. During rendering and playback of object based audio programs, the audio content has a number of channels (e.g., object channels and speaker channels) which is typically much larger (e.g., by an order of magnitude) than the number occurring during rendering and playback of conventional speaker-channel based programs. Typically also, the speaker system used for playback includes a much larger number of speakers than the number employed for playback of conventional speaker-channel based programs.
Although embodiments of the invention are useful for rendering channels of any multichannel audio program, many embodiments of the invention are especially useful for rendering channels of object-based audio programs having a large number of channels.
It is known to employ playback systems (e.g., in movie theaters) to render object based audio programs. Object based audio programs may be indicative of many different audio objects corresponding to images on a screen, dialog, noises, and sound effects that emanate from different places on (or relative to) the screen, as well as background music and ambient effects (which may be indicated by speaker channels of the program) to create the intended overall auditory experience. Accurate playback of such programs requires that sounds be reproduced in a way that corresponds as closely as possible to what is intended by the content creator with respect to audio object size, position, intensity, movement, and depth.
During generation of object based audio programs, it is typically assumed that the loudspeakers to be employed for rendering are located in arbitrary locations in the playback environment, not necessarily in a predetermined arrangement in a (nominally) horizontal plane or in any other predetermined arrangement known at the time of program generation. Typically, metadata included in the program indicates rendering parameters for rendering at least one object of the program at an apparent spatial location or along a trajectory (in a three dimensional volume), e.g., using a three-dimensional array of speakers. For example, an object channel of the program may have corresponding metadata indicating a three-dimensional trajectory of apparent spatial positions at which the object (indicated by the object channel) is to be rendered. The trajectory may include a sequence of “floor” locations (in the plane of a subset of speakers which are assumed to be located on the floor, or in another horizontal plane, of the playback environment), and a sequence of “above-floor” locations (each determined by driving a subset of the speakers which are assumed to be located in at least one other horizontal plane of the playback environment).
Object based audio programs represent a significant improvement in many respects over traditional speaker channel-based audio programs, since speaker-channel based audio is more limited with respect to spatial playback of specific audio objects than is object channel based audio. Speaker channel-based audio programs consist of speaker channels only (not object channels), and each speaker channel typically determines a speaker feed for a specific, individual speaker in a listening environment.
Various methods and systems for generating and rendering object based audio programs have been proposed. Examples of rendering of object based audio programs are described, for example, in PCT International Application No. PCT/US2011/028783, published under International Publication No. WO 2011/119401 A2 on Sep. 29, 2011, and assigned to the assignee of the present application.
An object-based audio program may include “bed” channels. A bed channel may be an object channel indicative of an object whose position does not change over the relevant time interval (and so is typically rendered using a set of playback system speakers having static speaker locations), or it may be a speaker channel (to be rendered by a specific speaker of a playback system). Bed channels do not have corresponding time varying position metadata (though they may be considered to have time-invariant position metadata). They may be indicative of audio elements that are dispersed in space, for instance, audio indicative of ambience.
Professional and consumer-level audio-visual (AV) systems (e.g., the Dolby® Atmos™ system) have been developed to render hybrid audio content of object-based audio programs that include both bed channels and object channels that are not bed channels.
Playback of an object-based audio program over a traditional speaker set-up (e.g., a 7.1 playback system) is achieved by rendering channels of the program (including object channels) to a set of speaker feeds. In typical embodiments of the invention, the process of rendering object channels (sometimes referred to herein as objects) and other channels of an object-based audio program (or channels of an audio program of another type) comprises in large part (or solely) a conversion of spatial metadata (for the channels to be rendered) at each time instant into a corresponding gain matrix (referred to herein as a “rendering matrix”) which represents how much each of the channels (e.g., object channels and speaker channels) contributes to a mix of audio content (at the instant) indicated by the speaker feed for a particular speaker (i.e., the relative weight of each of the channels of the program in the mix indicated by the speaker feed).
An “object channel” of an object-based audio program is indicative of a sequence of samples indicative of an audio object, and the program typically includes a sequence of spatial position metadata values indicative of object position or trajectory for each object channel. In typical embodiments of the invention, sequences of metadata values (e.g., position metadata values) corresponding to a number (N) of object channels (and/or speaker channels) of a program are used to determine an M×N rendering matrix, A(t), indicative of a time-varying gain specification for the program, and the rendering matrix is applied to render the channels for playback by a number (“M”) of speakers.
Rendering of “N” channels (e.g., object channels, or object channels and speaker channels, or speaker channels which may but need not be indicative of mixed and otherwise processed content of a greater number of object channels) of an audio program to “M” speakers (speaker feeds) at time “t” of the program can be represented by multiplication of a vector x(t) of length “N” (i.e., an N×1 matrix, x(t)), comprised of an audio sample at time “t” from each channel, by an M×N matrix A(t) determined from associated metadata (e.g., position metadata and optionally other metadata corresponding to the audio content to be rendered, e.g., object gains) at time “t”. The resultant values (e.g., gains or levels) of the speaker feeds at time t can be represented as a vector y(t), as in the following equation (1):
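$$
\underbrace{\begin{bmatrix} y_0(t) \\ y_1(t) \\ \vdots \\ y_{M-1}(t) \end{bmatrix}}_{y(t)}
=
\underbrace{\begin{bmatrix}
a_{00}(t) & a_{01}(t) & a_{02}(t) & \cdots & a_{0,N-1}(t) \\
a_{10}(t) & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
a_{M-1,0}(t) & \cdots & \cdots & \cdots & a_{M-1,N-1}(t)
\end{bmatrix}}_{A(t)}
\underbrace{\begin{bmatrix} x_0(t) \\ x_1(t) \\ x_2(t) \\ \vdots \\ x_{N-1}(t) \end{bmatrix}}_{x(t)}.
\tag{1}
$$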
Although equation (1) describes the rendering of N channels of an audio program (e.g., an object-based audio program, or an encoded version of an object-based audio program) into M output channels (e.g., M speaker feeds), it also represents a generic set of scenarios in which a set of N audio samples is converted to a set of M values (e.g., M samples) by linear operations. For example, A(t) could be a static matrix, “A”, whose coefficients do not vary with different values of time “t”. For another example, A(t) (which could be a static matrix, A) could represent a conventional downmix of a set of speaker channels x(t) to a smaller set of speaker channels y(t), or x(t) could be a set of audio channels that describe a spatial scene in an Ambisonics format, and the conversion to speaker feeds y(t) could be prescribed as multiplication by the rendering matrix A(t). In this context, the M×N rendering matrix is sometimes referred to as a “downmix matrix” (although in general, M need not satisfy M<N in equation (1)). Even in an application employing a nominally static downmix matrix, the actual linear transformation (matrix multiplication) applied may be dynamic in order to ensure clip-protection of the downmix (i.e., a static transformation A may be converted to a time-varying transformation A(t), to ensure clip-protection).
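The matrix multiplication of equation (1) can be sketched for a single time instant as follows. This is a minimal illustration only: the function name `render`, the 2×3 matrix, and the sample values are hypothetical and are not taken from any actual program or renderer.

```python
def render(A, x):
    """Apply an M x N rendering matrix A to a length-N sample vector x,
    returning the length-M vector y of speaker-feed values (equation (1))."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("each row of A must have N entries")
    return [sum(a * s for a, s in zip(row, x)) for row in A]


# Hypothetical rendering matrix at one instant t: M = 2 speakers, N = 3 channels.
A_t = [
    [1.0, 0.0, 0.5],  # weights of channels 0..2 in speaker feed 0
    [0.0, 1.0, 0.5],  # weights of channels 0..2 in speaker feed 1
]
x_t = [0.25, -0.5, 0.5]  # one audio sample from each of the N channels

y_t = render(A_t, x_t)
print(y_t)
```

Here the third channel contributes equally to both speaker feeds, which is how a rendering matrix would place that channel midway between the two speakers.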
Dolby TrueHD is a conventional audio codec format that supports lossless and scalable transmission of audio signals. The source audio is encoded into a hierarchy of substreams of channels, and a selected subset of the substreams (rather than all of the substreams) may be retrieved from the bitstream and decoded, in order to obtain a lower dimensional (downmix) presentation of the spatial scene. Typically, when all the substreams (sometimes referred to herein collectively as a “top” substream) are decoded and rendered, the resultant audio is identical to the source audio (i.e., the encoding, followed by the decoding, is lossless).
In a commercially available version of TrueHD, the source audio is typically a 7.1 channel mix which is encoded into a sequence of three substreams, including a first substream which can be decoded to determine a two channel downmix of the 7.1 channel original audio. The first two substreams may be decoded to determine a 5.1 channel downmix of the original audio. All three substreams (i.e., a top substream of the encoded bitstream) may be decoded to determine the original 7.1 channel audio. Technical details of Dolby TrueHD, and the Meridian Lossless Packing (MLP) technology on which it is based, are well known. Aspects of TrueHD and MLP technology are described in U.S. Pat. No. 6,611,212, issued Aug. 26, 2003, and assigned to Dolby Laboratories Licensing Corp., and the paper by Gerzon, et al., entitled “The MLP Lossless Compression System for PCM Audio,” J. AES, Vol. 52, No. 3, pp. 243-260 (March 2004).
TrueHD supports specification of downmix matrices. In typical use, the content creator of a 7.1 channel audio program specifies a static matrix to downmix the 7.1 channel program to a 5.1 channel mix, and another static matrix to downmix the 5.1 channel downmix to a 2 channel downmix. Each static downmix matrix may be converted to a sequence of downmix matrices (each matrix in the sequence for downmixing a different interval in the program) in order to achieve clip-protection.
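One conceivable way of converting a static downmix matrix into a per-interval sequence of matrices for clip-protection is sketched below. The scaling rule (attenuating the whole matrix for any interval whose downmix peak exceeds full scale) is an assumption made for illustration, not the procedure of any particular TrueHD encoder, and the names `clip_protected_matrices` and `full_scale` are hypothetical.

```python
def apply_matrix(A, x):
    """Multiply matrix A (list of rows) by sample vector x."""
    return [sum(a * s for a, s in zip(row, x)) for row in A]


def clip_protected_matrices(A, blocks, full_scale=1.0):
    """Return one downmix matrix per block (interval) of the program.

    Each block is a list of length-N sample vectors. If the static downmix
    of a block would exceed full scale, the matrix used for that block is
    attenuated so that the block's peak equals full scale.
    """
    matrices = []
    for block in blocks:
        peak = max((abs(v) for x in block for v in apply_matrix(A, x)),
                   default=0.0)
        gain = 1.0 if peak <= full_scale else full_scale / peak
        matrices.append([[gain * a for a in row] for row in A])
    return matrices


# Hypothetical static 2-to-1 downmix and two intervals of one sample each.
A = [[1.0, 1.0]]
blocks = [
    [[0.25, 0.25]],  # downmix peak 0.5: no attenuation needed
    [[1.0, 1.0]],    # downmix peak 2.0: would clip, so attenuate by 0.5
]
sequence = clip_protected_matrices(A, blocks)
print(sequence)
```

The static matrix is thus replaced by a sequence in which only the second interval is attenuated.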
A program encoded in accordance with the Dolby TrueHD format may be indicative of N channels (e.g., N object channels) and also at least one downmix presentation. Each downmix presentation comprises M downmix channels (where, in this context, M is an integer less than N), and its audio content is a mix of audio content of all or some of the content of the N channels. The program (as delivered to a decoder) includes internally coded channels, and metadata indicative of matrix operations to be performed by a decoder on all or some of the internally coded channels. Some such matrix operations are performed by the decoder on all the internally coded channels such that combined operation of both the encoder and decoder implements a multiplication by a matrix A(t) (of the type indicated in equation (1)) on the full set of N channels (corresponding to the vector x(t) of equation (1)). Other ones of such matrix operations are performed by the decoder on a subset of the internally coded channels such that combined operation of both the encoder and decoder implements a multiplication by an M×N matrix A(t) (of the type indicated in equation (1), where M is less than N, and N is the number of channels in the full set of input channels) on the original N input channels.
A legacy device (e.g., a device configured to decode and render at least one downmix presentation embedded in a TrueHD program instead of decoding and rendering (e.g., losslessly) the full set of N channels indicated by the program, which may be N object channels) may in fact be an older device that is unable to decode and render the full set of N channels (e.g., object channels) indicated by the program, or it may be another device configured (e.g., to implement a conscious choice by a user) to decode and render at least one such downmix presentation. Legacy content of a TrueHD bitstream may be characterized by a well-structured time-invariant downmix matrix (e.g., a standard 7.1 ch to 5.1 ch downmix matrix). In such a case, the metadata (included in the TrueHD bitstream by the encoder) indicative of a matrix operation to be implemented by a legacy decoder to render a downmix presentation needs to be determined only once by the encoder for the entire audio signal. Alternatively, legacy content of a TrueHD bitstream may be adaptive audio content characterized by a sequence of different downmix matrices (or a continuously varying downmix matrix) that may also be quite arbitrary, and the full set of N channels (e.g., N object channels) indicated by the bitstream may be large (e.g., N may be as large as 16 in the Atmos version of Dolby TrueHD). Thus a static downmix matrix (i.e., a static version of the rendering matrix, A, of equation (1)) may not suffice to enable a legacy decoder to render a downmix presentation indicated by a TrueHD program, and instead a sequence of downmix matrices (or a continuously varying downmix matrix) may be required.
In accordance with the TrueHD format, each rendering matrix A(t) applied to audio content of a TrueHD bitstream is decomposed into a cascade of matrices, some of which are typically applied at the encoder and others of which are typically applied at the decoder/renderer: $A(t) = Q\,Q_{s-1}(t) \cdots Q_0(t)\,I\,P_{r-1}(t) \cdots P_0(t)\,P$, where r and s are numbers, each of the r matrices $P_{r-1}(t), \ldots, P_0(t)$ is a primitive matrix of size N×N (these r matrices are referred to herein as input primitive matrices), each of the s matrices $Q_{s-1}(t), \ldots, Q_0(t)$ is a primitive matrix of size M×M (these s matrices are referred to herein as output primitive matrices), and matrices P and Q determine input and output channel assignments, respectively. We will use the notation A(t) in a generic sense to refer to a true downmix (M<N), an upmix (M>N), or an N-to-N transformation (M=N). In the above equation, I is the M×N row selector matrix whose first M columns are the same as the M×M identity matrix and whose last N−M columns are zeros if M<N. On the other hand, if M>N, I is a “tall” matrix whose first N rows are the N×N identity matrix and whose last M−N rows are zeros. Specifically, when M≤N, the effect of the I matrix is to select the first M rows of the product $P_{r-1}(t) \cdots P_0(t)\,P$, and hence we refer to the matrix as a “row selector” matrix. Henceforth in this description we will simply refer to I as a row selector matrix irrespective of the relation between M and N, with it being assumed that it has the appropriate structure based on the relation between M and N. The product (cascade) of matrices applied at the decoder/renderer is sometimes referred to as $U = Q\,Q_{s-1}(t) \cdots Q_0(t)$, and the product (cascade) of matrices applied at the encoder is denoted by $V = P_{r-1}(t) \cdots P_0(t)\,P$. The purpose of the row selector matrix I indicated above in this paragraph is to specify the specific internal channels of the encoded program to which the decoder must apply U.
In cases in which an N-to-N transformation is to be implemented, the decoder will need to apply U to all N internal channels of the encoded program and the row selector matrix will be an identity matrix.
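The decomposition of A(t) into the cascades U and V and the row selector matrix I can be illustrated with small integer matrices. The sketch below assumes that a primitive matrix is an identity matrix in which a single row has been replaced (a definition that is not given in the passage above), and every matrix value and helper name in it is hypothetical.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def apply(A, x):
    """Product of matrix A and column vector x."""
    return [sum(a * s for a, s in zip(row, x)) for row in A]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def primitive(n, row, coeffs):
    """n x n identity with one row replaced by coeffs (assumed definition)."""
    P = identity(n)
    P[row] = list(coeffs)
    return P


def row_selector(m, n):
    """M x N matrix with ones on the main diagonal and zeros elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(m)]


def cascade(matrices):
    """Left-to-right product, e.g. cascade([Q, Q0]) is Q Q0."""
    out = matrices[0]
    for mat in matrices[1:]:
        out = matmul(out, mat)
    return out


N, M = 3, 2
P = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]  # input channel assignment
P0 = primitive(N, 0, [1, 0, 2])        # input primitive matrices (r = 2)
P1 = primitive(N, 1, [1, 1, 0])
Q0 = primitive(M, 1, [3, 1])           # output primitive matrix (s = 1)
Q = identity(M)                        # output channel assignment

V = cascade([P1, P0, P])               # applied at the encoder
U = cascade([Q, Q0])                   # applied at the decoder/renderer
A = cascade([U, row_selector(M, N), V])

x = [1, 2, 3]
internal = apply(V, x)                 # N internal channels
staged = apply(U, internal[:M])        # decoder uses only the first M of them
direct = apply(A, x)
print(A, staged, direct)
```

Applying V at the encoder, keeping the first M internal channels, and applying U at the decoder gives the same result as multiplying the input by A directly.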
Some embodiments of the present invention include, in an encoded audio program, metadata indicative of a cascade of primitive matrices (e.g., a cascade U as required by the TrueHD format) which may be used by a decoder (e.g., a legacy decoder) to render a downmix presentation indicated by the program.
An audio program rendering system (e.g., a decoder implementing such a system) may receive metadata (of an encoded audio program) which determine (e.g., with a time-varying cascade of matrices applied by an encoder) a time-varying rendering matrix A(t) only intermittently, and not at every instant “t” during the program. This could be due to any of a variety of reasons, e.g., low time resolution of the system that actually outputs the metadata or the need to limit the bit rate of transmission of the program. It may be desirable for a rendering system to interpolate between rendering matrices A(t1) and A(t2), at time instants “t1” and “t2” during a program, respectively, to obtain a rendering matrix A(t3) for an intermediate time instant “t3.” Interpolation ensures that the perceived position of objects in the rendered speaker feeds varies smoothly over time, and may eliminate undesirable artifacts such as zipper noise that stem from discontinuous (piece-wise constant) matrix updates. The interpolation may be linear (or nonlinear), and typically should ensure a continuous path in time from A(t1) to A(t2).
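The linear case of such interpolation can be sketched as follows. The function name `interpolate_matrix` and the example matrices are hypothetical, and element-wise linear interpolation is only one of the interpolation choices that the description allows.

```python
def interpolate_matrix(A1, A2, t1, t2, t3):
    """Element-wise linear interpolation between A(t1) = A1 and A(t2) = A2,
    evaluated at an intermediate instant t3 with t1 <= t3 <= t2."""
    if t1 >= t2 or not (t1 <= t3 <= t2):
        raise ValueError("require t1 < t2 and t1 <= t3 <= t2")
    w = (t3 - t1) / (t2 - t1)  # 0 at t1, 1 at t2
    return [[(1 - w) * a + w * b for a, b in zip(row1, row2)]
            for row1, row2 in zip(A1, A2)]


# Hypothetical 1 x 2 rendering matrices specified at t1 = 0 and t2 = 4.
A_t1 = [[1.0, 0.0]]
A_t2 = [[0.0, 1.0]]

A_t3 = interpolate_matrix(A_t1, A_t2, 0, 4, 1)
print(A_t3)
```

Because the weight w varies continuously from 0 to 1, the interpolated matrix coincides with A(t1) at t1 and with A(t2) at t2, giving a continuous path between the two specified matrices.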
FIG. 1 is a schematic diagram of elements of a conventional TrueHD system, in which the encoder (30) and decoder (32) are configured to implement matrixing operations on audio samples. In the FIG. 1 system, encoder 30 is configured to encode 8 audio input channels as an encoded bitstream indicative of an 8-channel audio program (e.g., a traditional set of 7.1 speaker feeds), said encoded bitstream including two substreams, and decoder 32 is configured to decode the encoded bitstream to render either the 8-channel program (losslessly) or a 2-channel downmix of the original 8-channel program. Encoder 30 is coupled and configured to generate the encoded bitstream and to assert the encoded bitstream to delivery system 31.
In variations on the FIG. 1 system, N audio input signals are asserted to the encoder (where N is not equal to 8), the encoded bitstream output from the encoder has N channels (internal channels), and the decoder is configured to render at least one mix (e.g., a downmix comprising 2 channels) of content of the N input signals. The structure and operation of such variations will be apparent from the description herein of the FIG. 1 system (e.g., by generalizing the description herein by replacing the specific value “eight” with the general value “N”).
Delivery system 31 is coupled and configured to deliver (e.g., by storing and/or transmitting) the encoded bitstream to decoder 32. In some embodiments, system 31 implements delivery of (e.g., transmits) an encoded multichannel audio program over a broadcast system or a network (e.g., the internet) to decoder 32. In some embodiments, system 31 stores an encoded multichannel audio program in a storage medium (e.g., a disk or set of disks), and decoder 32 is configured to read the program from the storage medium.
The block labeled “InvChAssign1” in encoder 30 is configured to perform channel permutation (equivalent to multiplication by a permutation matrix) on the input channels. The permutated channels then undergo encoding in stage 33, which outputs eight encoded signal channels. The encoded signal channels may (but need not) correspond to playback speaker channels. The encoded signal channels are sometimes referred to as “internal” channels since a decoder (and/or rendering system) typically decodes and renders the content of the encoded signal channels to recover the input audio, so that the encoded signal channels are “internal” to the encoding/decoding system. The encoding performed in stage 33 is equivalent to multiplication of each set of samples of the permutated channels by an encoding matrix (implemented as a cascade of n+1=r matrix multiplications, identified in FIG. 1 as $P_n^{-1}, \ldots, P_1^{-1}, P_0^{-1}$, to be described below in greater detail). The matrices identified in FIG. 1 as $P_n^{-1}, \ldots, P_1^{-1}, P_0^{-1}$ are primitive matrices which correspond, respectively, to primitive matrices $P_{r-1}(t), \ldots, P_0(t)$ in the above-described cascade $V = P_{r-1}(t) \cdots P_0(t)\,P$, in which P corresponds to the permutation matrix applied by block “InvChAssign1” of encoder 30.
Matrix determination subsystem 34 is configured to generate data indicative of the coefficients of two sets of output matrices (each of these sets corresponding to a different one of two substreams of the encoded channels). One set of output matrices consists of two matrices, $P_0^2$, $P_1^2$, each of which is a primitive matrix (defined below) of dimension 2×2, and is for rendering a first substream (a downmix substream) comprising two of the encoded audio channels of the encoded bitstream (to render a two-channel downmix of an eight-channel audio program). The matrices identified in FIG. 1 as $P_1^2$, $P_0^2$ are primitive matrices which correspond, respectively, to primitive matrices $Q_1(t)$ and $Q_0(t)$ in the above-described cascade $U = Q\,Q_{s-1}(t) \cdots Q_0(t)$, with index s=2, and Q corresponding to a permutation matrix to be applied by the block “Ch Assign 0” of decoder 32. The other set of output matrices determined by subsystem 34 consists of rendering matrices, $P_0, P_1, \ldots, P_n$, each of which is a primitive matrix, and is for rendering a top substream comprising all eight of the encoded audio channels of the encoded bitstream (for recovery of the eight-channel audio program). The matrices identified in FIG. 1 as $P_n, \ldots, P_1, P_0$ correspond, respectively, to primitive matrices $Q_{s-1}(t), \ldots, Q_0(t)$ in the above-described cascade $U = Q\,Q_{s-1}(t) \cdots Q_0(t)$, with index s=n+1, and Q corresponding to a permutation matrix to be applied by the block “Ch Assign 1” of decoder 32.
A cascade of the matrices $P_0^{-1}, P_1^{-1}, \ldots, P_n^{-1}$ of FIG. 1, applied to the audio (with the necessary permutation matrix P) at the encoder, together with a cascade of the matrices $P_0^2$, $P_1^2$, applied to the audio at the decoder (with the necessary permutation matrix Q), is equal to the downmix matrix specification that transforms the 8 audio channels indicated by the encoded bitstream to the 2-channel downmix. A cascade of the matrices $P_0^{-1}, P_1^{-1}, \ldots, P_n^{-1}$ of FIG. 1, applied to the audio (with the necessary permutation matrix P) at the encoder, together with a cascade of the matrices $P_0, P_1, \ldots, P_n$ of FIG. 1, applied to the audio at the decoder (with the necessary permutation matrix Q), renders the full set of 8 encoded channels of the encoded bitstream.
The coefficients (of each matrix) that are output from subsystem 34 of encoder 30 to packing subsystem 35 are metadata indicating relative or absolute gain of each channel to be included in a corresponding mix of channels of the program. The coefficients of each rendering matrix (for an instant of time during the program) represent how much each of the channels of a mix should contribute to the mix of audio content (at the corresponding instant of the rendered mix) indicated by the speaker feed for a particular playback system speaker.
The eight encoded audio channels (output from encoding stage 33), the output matrix coefficients (generated by subsystem 34), and typically also additional data are asserted to packing subsystem 35, which assembles them into the encoded bitstream which is then asserted to delivery system 31. Encoder 30 may also include in the encoded bitstream (to be asserted to delivery system 31) values indicative of the permutation matrix Q to be applied by block “Ch Assign 0” of decoder 32.
The encoded bitstream includes data indicative of the eight encoded audio channels, the two sets of output matrices, and typically also additional data (e.g., metadata regarding the audio content).
Parsing subsystem 36 of decoder 32 is configured to accept (read or receive) the encoded bitstream from delivery system 31 and to parse the encoded bitstream. Subsystem 36 is operable to assert the substreams of the encoded bitstream, including a “first” substream comprising only two of the encoded channels of the encoded bitstream, and output matrices ($P_0^2$, $P_1^2$) corresponding to the first substream, to matrix multiplication stage 38 (for processing which results in a 2-channel downmix presentation of content of the full 8-channel program). Subsystem 36 is also operable to assert all the substreams (i.e., a top substream) of the encoded bitstream (comprising all eight encoded channels of the encoded bitstream) and corresponding output matrices ($P_0, P_1, \ldots, P_n$) to matrix multiplication stage 37 for processing which results in recovery and rendering of the full 8-channel program.
More specifically, stage 38 multiplies each vector of two audio samples of the two channels of the first substream by a cascade of the matrices P02, P12, and each resulting set of two linearly transformed samples undergoes channel permutation (equivalent to multiplication by a permutation matrix) represented by the block titled “ChAssign0” to yield each pair of samples of the required 2 channel downmix of the 8 original audio channels. The cascade of matrixing operations performed in encoder 30 and decoder 32 is equivalent to application of a downmix matrix specification that transforms the 8 input audio channels to the 2-channel downmix.
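The equivalence between channel permutation and multiplication by a permutation matrix can be sketched as follows. The two 2×2 matrices stand in for P02 and P12; their coefficients, the channel order, and all function names are hypothetical.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def permute(v, order):
    """Channel reassignment: output channel i takes input channel order[i]."""
    return [v[i] for i in order]


def permutation_matrix(order):
    """Build the matrix whose product with v equals permute(v, order)."""
    n = len(order)
    return [[1 if j == order[i] else 0 for j in range(n)] for i in range(n)]


# Hypothetical stand-ins for the downmix output matrices. Each differs from
# the identity in one row only; the diagonal need not be 1 for a downmix.
P02 = [[0.5, 0.25], [0, 1]]
P12 = [[1, 0], [0.25, 0.5]]

order = [1, 0]          # hypothetical channel assignment (swap the channels)
x = [8, 4]              # one sample from each of the two substream channels

y = matvec(P12, matvec(P02, x))          # cascade: P02 first, then P12
out_reindexed = permute(y, order)
out_by_matrix = matvec(permutation_matrix(order), y)
print(out_reindexed, out_by_matrix)
```

Both routes give the same pair of output samples, so a decoder is free to implement the channel assignment as a simple re-indexing rather than a matrix product.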
Stage 37 multiplies each vector of eight audio samples (one from each of the full set of eight channels of the encoded bitstream) by a cascade of the matrices P0, P1, . . . , Pn, and each resulting set of eight linearly transformed samples undergoes channel permutation (equivalent to multiplication by a permutation matrix) represented by the block titled “ChAssign1” to yield each set of eight samples of the recovered 8-channel program. In order that the output 8-channel audio is exactly the same as the 8-channel audio originally input to encoder 30 (to achieve the “lossless” characteristic of the system), the matrixing operations performed in encoder 30 should be exactly (including quantization effects) the inverse of the matrixing operations performed in decoder 32 on all substreams of the encoded bitstream (i.e., multiplication by the cascade of matrices P0, P1, . . . , Pn). Thus, in FIG. 1, the matrixing operations in stage 33 of encoder 30 are identified as a cascade of the inverse matrices of the matrices P0, P1, . . . , Pn, in the opposite sequence to that applied in stage 37 of decoder 32, namely: Pn^{-1}, . . . , P1^{-1}, P0^{-1}.
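The need for the encoder to apply the inverse matrices in the opposite sequence can be illustrated with a small exact-arithmetic sketch. It assumes that a cascade is applied in the order listed (first-listed matrix first), and it uses two hypothetical 3×3 matrices in place of the eight-channel matrices of FIG. 1.

```python
from fractions import Fraction as F


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def apply_cascade(mats, v):
    """Apply the matrices in list order: mats[0] first, mats[-1] last."""
    for m in mats:
        v = matvec(m, v)
    return v


# Hypothetical decoder matrices: each alters a single channel.
P0 = [[1, F(1, 2), F(-1, 4)], [0, 1, 0], [0, 0, 1]]
P1 = [[1, 0, 0], [F(3, 4), 1, F(1, 8)], [0, 0, 1]]

# Their inverses (off-diagonal coefficients of the altered row negated).
P0_inv = [[1, F(-1, 2), F(1, 4)], [0, 1, 0], [0, 0, 1]]
P1_inv = [[1, 0, 0], [F(-3, 4), 1, F(-1, 8)], [0, 0, 1]]

x = [F(10), F(-3), F(7)]

# Encoder: inverses in the opposite sequence (P1_inv first, then P0_inv).
encoded = apply_cascade([P1_inv, P0_inv], x)
# Decoder: P0 first, then P1.
decoded = apply_cascade([P0, P1], encoded)

# Same inverses applied in the same sequence as the decoder: not recovered.
wrong = apply_cascade([P0, P1], apply_cascade([P0_inv, P1_inv], x))

print(decoded == x, wrong == x)
```

With the reversed sequence each decoder step undoes the most recent encoder step, so the cascade collapses to the identity; with the same sequence it does not.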
When reconstructing the original N input signals losslessly, decoder 32 applies the inverse of the channel permutation applied by encoder 30 (i.e., the permutation matrix represented by element “ChAssign1” of decoder 32 is the inverse of that represented by element “InvChAssign1” of encoder 30).
Given a downmix matrix specification (e.g., specification of a static matrix A that is 2×8 in dimension), an objective of a conventional TrueHD encoder implementation of encoder 30 is to design output matrices (e.g., P0, P1, . . . , Pn and P02, P12 of FIG. 1), and input matrices (Pn^{-1}, . . . , P1^{-1}, P0^{-1}) and output (and input) channel assignments so that:
1. the encoded bitstream is hierarchical (i.e., in the example, the first two encoded channels are sufficient to derive the 2 channel downmix presentation, and the full set of eight encoded channels is sufficient to recover the original 8 input signals); and
2. the matrices for the topmost stream (P0, P1, . . . , Pn in the example) are exactly invertible so that the input audio is exactly retrievable by the decoder.
Typical computing systems work with finite precision and inverting an arbitrary invertible matrix exactly could require very large precision. TrueHD solves this problem by constraining the output matrices and input matrices (i.e., P0, P1, . . . , Pn and Pn^{-1}, . . . , P1^{-1}, P0^{-1}) to be square matrices of the type known as “primitive matrices”.
A primitive matrix P of dimension N×N is of the form:
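\[
P = \begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 \\
0 & 1 & 0 & \cdots & \cdots \\
\alpha_0 & \alpha_1 & \alpha_2 & \cdots & \alpha_{N-1} \\
\vdots & \ddots & \ddots & \ddots & \ddots \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}.
\]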
A primitive matrix is always a square matrix. A primitive matrix of dimension N×N is identical to the identity matrix of dimension N×N except for one (non-trivial) row (i.e., the row comprising elements α0, α1, α2, . . . αN−1 in the example). In all other rows, the off-diagonal elements are zeros and the element shared with the diagonal has an absolute value of 1 (i.e., either +1 or −1). To simplify language in this disclosure, the drawings and descriptions will always assume that a primitive matrix has diagonal elements that are equal to +1 with the possible exception of the diagonal element in the non-trivial row. However, we note that this is without loss of generality, and ideas presented in this disclosure pertain to the general class of primitive matrices where diagonal elements may be +1 or −1.
When a primitive matrix, P, operates on (i.e., multiplies) a vector x(t), the result is the product Px(t), which is another N-dimensional vector that is exactly the same as x(t) in all elements except one. Thus each primitive matrix can be associated with a unique channel which it manipulates (or on which it operates).
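The property that a primitive matrix alters exactly one element of the vector it multiplies can be checked with a short sketch. The helper `primitive_matrix`, the 4×4 size, and the coefficient values are hypothetical.

```python
def primitive_matrix(n, row, coeffs):
    """Return the n x n identity with row `row` replaced by `coeffs`."""
    m = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    m[row] = list(coeffs)
    return m


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Non-trivial row is row 2; its diagonal element (index 2) is 1.
P = primitive_matrix(4, 2, [0.5, -0.25, 1, 0.125])
x = [4, 8, 16, 32]

y = matvec(P, x)
changed = [i for i in range(len(x)) if y[i] != x[i]]
print(y, changed)   # y == [4, 8, 20.0, 32], changed == [2]
```

Only channel 2 differs between `x` and `y`, which is the channel associated with the non-trivial row.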
We will use the term “unit primitive matrix” herein to denote a primitive matrix in which the element shared with the diagonal (by the non-trivial row of the primitive matrix) has an absolute value of 1 (i.e., either +1 or −1). Thus, the diagonal of a unit primitive matrix consists of all positive ones, +1, or all negative ones, −1, or some positive ones and some negative ones. A primitive matrix only alters one channel of a set (vector) of samples of audio program channels, and a unit primitive matrix is also losslessly invertible due to the unit values on the diagonal. Again, to simplify the discussion herein, we will use the term unit primitive matrix to refer to a primitive matrix whose non-trivial row has a diagonal element of +1. However, all references to unit primitive matrices herein, including in the claims, are intended to cover the more generic case where a unit primitive matrix can have a non-trivial row whose shared element with the diagonal is +1 or −1.
If α2=1 (resulting in a unit primitive matrix having a diagonal consisting of positive ones) in the above example of primitive matrix, P, it is seen that the inverse of P is exactly:
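\[
P^{-1} = \begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 \\
0 & 1 & 0 & \cdots & \cdots \\
-\alpha_0 & -\alpha_1 & 1 & \cdots & -\alpha_{N-1} \\
\vdots & \ddots & \ddots & \ddots & \ddots \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}.
\]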
It is true in general that the inverse of a unit primitive matrix is simply determined by inverting (multiplying by −1) each of its non-trivial α coefficients which does not lie along the diagonal.
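This inversion rule can be verified numerically. The sketch below assumes a unit primitive matrix whose diagonal elements are all +1 (the simplified convention adopted above), uses exact rational arithmetic, and relies on hypothetical coefficient values and function names.

```python
from fractions import Fraction as F


def unit_primitive_inverse(p, row):
    """Negate the off-diagonal coefficients of the non-trivial row.

    Assumes p is a unit primitive matrix with +1 on the whole diagonal.
    """
    inv = [list(r) for r in p]
    inv[row] = [c if j == row else -c for j, c in enumerate(p[row])]
    return inv


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical 4x4 unit primitive matrix; the non-trivial row is row 2.
P = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [F(1, 3), F(-2, 5), 1, F(7, 8)],
     [0, 0, 0, 1]]

P_inv = unit_primitive_inverse(P, 2)
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]

print(matmul(P_inv, P) == identity, matmul(P, P_inv) == identity)
```

No division is needed to form the inverse, which is why the coefficients of the inverse are representable with the same precision as the original coefficients.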
If the matrices P0, P1, . . . , Pn employed in decoder 32 of FIG. 1 are unit primitive matrices (having unit diagonals), the sequence of matrixing operations Pn^{-1}, . . . , P1^{-1}, P0^{-1} in encoder 30 and P0, P1, . . . , Pn in decoder 32 can be implemented by finite precision circuits. Details of typical implementations of such finite precision circuits are described in above-cited U.S. Pat. No. 6,611,212, issued Aug. 26, 2003.
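One way to see why unit primitive matrices suit finite precision arithmetic is the following sketch, which is an illustration of the general idea and not the circuit described in the cited patent. It assumes integer samples and a hypothetical quantizer (`math.floor`); because the other channels are unchanged by the operation, the encoder and decoder compute the identical quantized contribution, so the decoder can add back exactly what the encoder subtracted.

```python
import math


def quantized_contribution(samples, row, coeffs):
    """Quantized weighted sum of all channels except the one being altered."""
    total = sum(c * s for j, (c, s) in enumerate(zip(coeffs, samples))
                if j != row)
    return math.floor(total)   # hypothetical quantizer; must match on both sides


def encode_channel(samples, row, coeffs):
    """Apply the inverse unit primitive matrix in integer arithmetic."""
    out = list(samples)
    out[row] = samples[row] - quantized_contribution(samples, row, coeffs)
    return out


def decode_channel(samples, row, coeffs):
    """Apply the unit primitive matrix in integer arithmetic."""
    out = list(samples)
    out[row] = samples[row] + quantized_contribution(samples, row, coeffs)
    return out


# Hypothetical non-trivial row (row 2, diagonal element 1) and integer samples.
coeffs = [0.3, -0.7, 1, 0.11]
x = [1234, -567, 89, 4000]

encoded = encode_channel(x, 2, coeffs)
decoded = decode_channel(encoded, 2, coeffs)
print(decoded == x)
```

The round trip is exact even though the coefficients are not integers, since the rounding error is the same on both sides and cancels.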
Application of a sequence of cascades (i.e., two or more cascades, each corresponding to a different time, or just one cascade) of primitive matrices (e.g., a sequence of cascades of primitive N×N matrices P0^{-1}, P1^{-1}, . . . , Pn^{-1} of FIG. 1, each cascade P0^{-1}, P1^{-1}, . . . , Pn^{-1} in the sequence corresponding to a specific time) and a permutation matrix P (together corresponding to application of above-mentioned matrix cascade V=P_{r-1}(t) . . . P_0(t)P) by an encoder (e.g., encoder 30 of FIG. 1), with application of a row selector matrix (corresponding to above-mentioned matrix I, where the row selector matrix is the N×N identity matrix for an N-to-N transformation), and application by a decoder (e.g., decoder 32 of FIG. 1) of a sequence of cascades of primitive matrices (e.g., a sequence of primitive matrix cascades P0, P1, . . . , Pn of FIG. 1) and another permutation matrix Q (together corresponding to application of above-mentioned matrix cascade U=QQ_{s-1}(t) . . . Q_0(t)), to a vector of N audio input samples (each of which is a sample of a different channel of a first set of N channels) can implement any linear transformation of the samples into a new set of N samples. Such an operation on the input samples can implement the linear transformation performed at a time t by multiplying samples of N channels of an audio program (e.g., an object-based audio program) by an implementation of matrix A(t) of equation (1) having the above-described form, A(t)=QQ_{s-1}(t) . . . Q_0(t) I P_{r-1}(t) . . . P_0(t)P, during rendering of the channels into N speaker feeds, where the transformation is achieved by manipulating one channel at a time. Multiplication of a set of N audio samples by the sequence of matrices in this implementation of A(t) represents a generic set of scenarios in which the set of N input samples is converted to another set of N samples by linear operations.
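As a small illustration of building a linear transformation by manipulating one channel at a time, the sketch below uses a well-known factorization that is not taken from this disclosure: a 2×2 rotation written as a cascade of three unit primitive (shear) matrices. The function names and the test values are hypothetical.

```python
import math


def shear(v, row, coeff):
    """Apply a 2x2 unit primitive matrix: add coeff * other channel to `row`."""
    out = list(v)
    out[row] = v[row] + coeff * v[1 - row]
    return out


def rotate_by_shears(v, theta):
    """Rotate v by theta using three single-channel operations."""
    t = -math.tan(theta / 2)
    v = shear(v, 0, t)                 # alters channel 0 only
    v = shear(v, 1, math.sin(theta))   # alters channel 1 only
    v = shear(v, 0, t)                 # alters channel 0 only
    return v


def rotate_direct(v, theta):
    """Reference: multiply by the full 2x2 rotation matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


v = [1.0, 2.0]
theta = 0.7
a = rotate_by_shears(v, theta)
b = rotate_direct(v, theta)
print(max(abs(p - q) for p, q in zip(a, b)) < 1e-12)
```

Each of the three steps changes only one channel, yet their cascade equals a matrix in which every coefficient is non-trivial.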
With reference again to a TrueHD implementation of decoder 32 of FIG. 1, in order to maintain uniformity of TrueHD decoder architecture, the output matrices of the downmix substream (P02, P12 in FIG. 1) are also implemented as primitive matrices although they need not be invertible (or have a unit diagonal) since they are not associated with achieving losslessness.
The input and output primitive matrices employed in a TrueHD encoder and decoder to render a downmix depend on each particular downmix specification to be implemented. The function of a TrueHD decoder is to apply the appropriate cascade of primitive matrices to the received encoded audio bitstream. Thus, the TrueHD decoder of FIG. 1 decodes a subset of the 8 channels of the encoded bitstream (delivered by system 31), and generates a 2-channel downmix by applying a cascade of two output primitive matrices P02, P12 to a subset of the channels of the decoded bitstream. A TrueHD implementation of decoder 32 of FIG. 1 is also operable to decode the 8 channels of the encoded bitstream (delivered by system 31) to recover the full 8-channel program by applying a cascade of eight output primitive matrices P0, P1, . . . , Pn to the channels of the encoded bitstream.
If an object-based audio program (e.g., comprising more than eight channels) were encoded by a conventional TrueHD encoder, the encoder might generate one or more downmix substreams which carry presentations compatible with legacy playback devices (e.g., presentations which could be decoded to downmixed speaker feeds for playback on a traditional 7.1 channel or 5.1 channel or other traditional speaker array) and a top substream (indicative of all channels of the input program). A TrueHD decoder might recover the full object-based audio program losslessly for rendering by a playback system. Application of each input matrix sequence by the encoder in this case to generate a substream (e.g., a downmix substream), with application by a decoder/renderer of a corresponding output matrix sequence (determined by the encoder for application by the decoder/renderer), typically corresponds to application of a time-varying rendering matrix, A(t), which linearly transforms samples of input channels to generate a mix (e.g., a 7.1 channel or 5.1 channel downmix, or another downmix) of content of the original input channels. However, such a matrix A(t) would typically vary rapidly in time (e.g., as audio objects move around in the spatial scene), and bit-rate and processing limitations of a conventional TrueHD system (or other conventional encoding and decoding system) would typically constrain the system to be able at most to accommodate a piece-wise constant approximation to such a continuously (and typically rapidly) varying matrix specification (with a higher matrix update rate achieved at the cost of increased bit-rate for transmission of the encoded program). 
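The trade-off between matrix update rate and approximation error can be sketched with a single hypothetical coefficient of A(t). The linear trajectory, the sample count, and the function names below are assumptions made only for illustration.

```python
TOTAL_SAMPLES = 1000   # hypothetical length of the segment, in samples


def gain(n):
    """Hypothetical continuously varying coefficient: a linear ramp from 0 to 1."""
    return n / TOTAL_SAMPLES


def held_gain(n, period):
    """Piece-wise constant approximation: hold the value of the latest update."""
    last_update = (n // period) * period
    return gain(last_update)


def max_error(period):
    """Largest deviation of the held value from the desired value."""
    return max(abs(gain(n) - held_gain(n, period)) for n in range(TOTAL_SAMPLES))


def updates_needed(period):
    """Number of matrices that must be transmitted for the segment."""
    return -(-TOTAL_SAMPLES // period)   # ceiling division


for period in (100, 10):
    print(period, updates_needed(period), max_error(period))
```

Shortening the update period from 100 samples to 10 reduces the worst-case error by roughly a factor of ten, but requires ten times as many matrix updates to be carried in the bitstream.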
In order to support rendering of content of object-based multichannel audio programs (and other multichannel audio programs) into speaker feeds indicative of a rapidly varying mix of audio content of the programs, the inventors have recognized that it is desirable to apply a sequence of rendering matrices to input audio data (i.e., to apply an initial rendering matrix, and then a sequence of updated rendering matrices), with discontinuity compensation corresponding to at least one rendering matrix update (e.g., each rendering matrix update). The discontinuity compensation for each relevant update should compensate for a discontinuity in the rendered audio that would otherwise result (i.e., if the compensation were not performed) from the matrix update. It is contemplated that the rendering matrix updates are typically infrequent and that a desired trajectory (i.e., a desired sequence of mixes of content of input channels) between updates is specified parametrically. In some embodiments, in response to an update of a rendering matrix (e.g., in response to each transition between consecutive rendering matrices of a sequence of rendering matrices), a set of correction values is determined for application to audio data during encoding and/or rendering.