Dolby, Dolby Digital, Dolby Digital Plus, and Dolby E are trademarks of Dolby Laboratories Licensing Corporation. Dolby Laboratories provides proprietary implementations of AC-3 and E-AC-3 known as Dolby Digital and Dolby Digital Plus, respectively.
Although the invention is not limited to use in encoding audio data in accordance with the E-AC-3 (or AC-3 or Dolby E) format, or delivering, decoding or rendering E-AC-3, AC-3, or Dolby E encoded data, for convenience it will be described in embodiments in which it encodes an audio bitstream in accordance with the E-AC-3 or AC-3 or Dolby E format, and delivers, decodes, and renders such a bitstream.
A typical stream of audio data includes both audio content (e.g., one or more channels of audio content) and metadata indicative of at least one characteristic of the audio content. For example, in an AC-3 bitstream there are several audio metadata parameters that are specifically intended for use in changing the sound of the program delivered to a listening environment.
An AC-3 or E-AC-3 encoded bitstream comprises metadata and can comprise one to six channels of audio content. The audio content is audio data that has been compressed using perceptual audio coding. Details of AC-3 coding are well known and are set forth in many published references including the following:
ATSC Standard A52/A: Digital Audio Compression Standard (AC-3), Revision A, Advanced Television Systems Committee, 20 Aug. 2001; and
U.S. Pat. Nos. 5,583,962; 5,632,005; 5,633,981; 5,727,119; and 6,021,386.
Details of Dolby Digital Plus (E-AC-3) coding are set forth in, for example, “Introduction to Dolby Digital Plus, an Enhancement to the Dolby Digital Coding System,” AES Convention Paper 6196, 117th AES Convention, Oct. 28, 2004.
Details of Dolby E coding are set forth in “Efficient Bit Allocation, Quantization, and Coding in an Audio Distribution System,” AES Preprint 5068, 107th AES Conference, August 1999, and “Professional Audio Coder Optimized for Use with Video,” AES Preprint 5033, 107th AES Conference, August 1999.
Each frame of an AC-3 encoded audio bitstream contains audio content and metadata for 1536 samples of digital audio. For a sampling rate of 48 kHz, this represents 32 milliseconds of digital audio or a rate of 31.25 frames per second of audio.
Each frame of an E-AC-3 encoded audio bitstream contains audio content and metadata for 256, 512, 768 or 1536 samples of digital audio, depending on whether the frame contains one, two, three or six blocks of audio data respectively. For a sampling rate of 48 kHz, this represents 5.333, 10.667, 16 or 32 milliseconds of digital audio respectively or a rate of 187.5, 93.75, 62.5 or 31.25 frames per second of audio respectively.
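The frame durations and frame rates stated above follow directly from the number of samples per frame and the sampling rate. The following is a minimal sketch of that arithmetic, assuming 256 samples per audio block (as implied by the one-block case above); the function name frame_timing is hypothetical and is used only for illustration.

```python
SAMPLES_PER_BLOCK = 256  # assumed: one audio block carries 256 samples per channel

def frame_timing(num_blocks, sample_rate_hz=48000):
    """Return (samples, duration_ms, frames_per_second) for one frame."""
    samples = num_blocks * SAMPLES_PER_BLOCK
    duration_ms = 1000.0 * samples / sample_rate_hz
    frames_per_second = sample_rate_hz / samples
    return samples, duration_ms, frames_per_second

# An AC-3 frame always has six blocks; an E-AC-3 frame has one, two, three or six.
for blocks in (1, 2, 3, 6):
    samples, ms, fps = frame_timing(blocks)
    print(f"{blocks} block(s): {samples} samples, {ms:.3f} ms, {fps:.2f} frames/s")
```

At 48 kHz a one-block frame works out to 48000/256 = 187.5 frames per second, and a six-block frame to 31.25 frames per second.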
As indicated in FIG. 1, each AC-3 frame is divided into sections (segments), including: a Synchronization Information (SI) section which contains (as shown in FIG. 2) a synchronization word (SW) and the first of two error correction words (CRC1); a Bitstream Information (BSI) section which contains most of the metadata; six Audio Blocks (AB0 to AB5) which contain data compressed audio content (and can also include metadata); waste bits (W) which contain any unused bits left over after the audio content is compressed; an Auxiliary (AUX) information section which may contain more metadata; and the second of two error correction words (CRC2).
As indicated in FIG. 4, each E-AC-3 frame is divided into sections (segments), including: a Synchronization Information (SI) section which contains (as shown in FIG. 2) a synchronization word (SW); a Bitstream Information (BSI) section which contains most of the metadata; between one and six Audio Blocks (AB0 to AB5) which contain data compressed audio content (and can also include metadata); waste bits (W) which contain any unused bits left over after the audio content is compressed; an Auxiliary (AUX) information section which may contain more metadata; and an error correction word (CRC).
In an AC-3 (or E-AC-3) bitstream there are several audio metadata parameters that are specifically intended for use in changing the sound of the program delivered to a listening environment. One of the metadata parameters is the DIALNORM parameter, which is included in the BSI segment.
As shown in FIG. 3, the BSI segment of an AC-3 frame (or an E-AC-3 frame) includes a five-bit parameter (“DIALNORM”) indicating the DIALNORM value for the program. A five-bit parameter (“DIALNORM2”) indicating the DIALNORM value for a second audio program carried in the same AC-3 frame is included if the audio coding mode (“acmod”) of the AC-3 frame is “0”, indicating that a dual-mono or “1+1” channel configuration is in use.
The BSI segment also includes a flag (“addbsie”) indicating the presence (or absence) of additional bit stream information following the “addbsie” bit, a parameter (“addbsil”) indicating the length of any additional bit stream information following the “addbsil” value, and up to 64 bytes of additional bit stream information (“addbsi”) following the “addbsil” value.
The BSI segment includes other metadata values not specifically shown in FIG. 3.
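As an illustration of how a decoder might locate the DIALNORM parameter, the following is a minimal sketch that reads the leading fields of an AC-3 frame (not an E-AC-3 frame, whose BSI layout differs). The field order and widths preceding DIALNORM (fscod, frmsizecod, bsid, bsmod, cmixlev, surmixlev, dsurmod, lfeon) are not given in this description; they are assumptions taken from the published AC-3 syntax and should be checked against the standard. The names BitReader and read_ac3_dialnorm are hypothetical.

```python
class BitReader:
    """Reads MSB-first bit fields from a bytes object."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current bit offset from the start of the data

    def read(self, nbits):
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def read_ac3_dialnorm(frame):
    """Return (acmod, dialnorm) read from the start of an AC-3 frame."""
    r = BitReader(frame)
    # SI section: synchronization word, CRC1, then sample-rate and frame-size codes.
    if r.read(16) != 0x0B77:
        raise ValueError("missing AC-3 synchronization word")
    r.read(16)  # CRC1
    r.read(2)   # fscod (assumed field)
    r.read(6)   # frmsizecod (assumed field)
    # BSI section.
    r.read(5)   # bsid (assumed field)
    r.read(3)   # bsmod (assumed field)
    acmod = r.read(3)
    if (acmod & 0x1) and acmod != 0x1:
        r.read(2)  # cmixlev, present when three front channels are in use
    if acmod & 0x4:
        r.read(2)  # surmixlev, present when surround channels are in use
    if acmod == 0x2:
        r.read(2)  # dsurmod, present only for two-channel (stereo) mode
    r.read(1)  # lfeon
    dialnorm = r.read(5)  # the five-bit DIALNORM parameter
    return acmod, dialnorm
```

The sketch stops at DIALNORM. Reading DIALNORM2 when acmod is 0 would require skipping the further optional fields that precede it, which are omitted here.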
It has been proposed to include metadata of other types in audio bitstreams. For example, methods and systems for generating, decoding, and processing audio bitstreams including metadata indicative of the processing state (e.g., the loudness processing state) and characteristics (e.g., loudness) of audio content are described in PCT International Application Publication Number WO 2012/075246 A2, having international filing date Dec. 1, 2011, and assigned to the assignee of the present application. This reference also describes adaptive processing of the audio content of the bitstreams using the metadata, and verification of validity of the loudness processing state and loudness of audio content of the bitstreams using the metadata.
Methods for generating and rendering object based audio programs are also known. During generation of such programs, it may be assumed that the loudspeakers to be employed for rendering are located in arbitrary locations in the playback environment (or that the speakers are in a symmetric configuration in a unit circle). It need not be assumed that the speakers are necessarily in a (nominally) horizontal plane or in any other predetermined arrangements known at the time of program generation. Typically, metadata included in the program indicates rendering parameters for rendering at least one object of the program at an apparent spatial location or along a trajectory (in a three dimensional volume), e.g., using a three-dimensional array of speakers. For example, an object channel of the program may have corresponding metadata indicating a three-dimensional trajectory of apparent spatial positions at which the object (indicated by the object channel) is to be rendered. The trajectory may include a sequence of “floor” locations (in the plane of a subset of speakers which are assumed to be located on the floor, or in another horizontal plane, of the playback environment), and a sequence of “above-floor” locations (each determined by driving a subset of the speakers which are assumed to be located in at least one other horizontal plane of the playback environment). Examples of rendering of object based audio programs are described, for example, in PCT International Application No. PCT/US2011/028783, published under International Publication No. WO 2011/119401 A2 on Sep. 29, 2011, and assigned to the assignee of the present application.
Above-cited U.S. Provisional Patent Application No. 61/807,922 and above-cited U.S. Provisional Patent Application No. 61/832,397 describe object based audio programs which are rendered so as to provide an immersive, personalizable perception of the program's audio content. The content may be indicative of the atmosphere at (i.e., sound occurring in or at) and/or commentary on a spectator event (e.g., a soccer or rugby game, or another sporting event). The audio content of the program may be indicative of multiple audio object channels (e.g., indicative of user-selectable objects or object sets, and typically also a default set of objects to be rendered in the absence of object selection by the user) and at least one bed of speaker channels. The bed of speaker channels may be a conventional mix (e.g., a 5.1 channel mix) of speaker channels of a type that might be included in a conventional broadcast program which does not include an object channel.
Above-cited U.S. Provisional Patent Applications No. 61/807,922 and No. 61/832,397 describe object related metadata delivered as part of an object based audio program which provides mixing interactivity on the playback side, including by allowing an end user to select a mix of audio content of the program for rendering, instead of merely allowing playback of a pre-mixed soundfield. For example, a user may select among rendering options provided by metadata of a typical embodiment of the inventive program to select a subset of available object channels for rendering, and optionally also the playback level of at least one audio object (sound source) indicated by the object channel(s) to be rendered. The spatial location at which each selected sound source is rendered may be predetermined by metadata included in the program, but in some embodiments can be selected by the user (e.g., subject to predetermined rules or constraints). In some embodiments, metadata included in the program allows user selection from among a menu of rendering options (e.g., a small number of rendering options, for example, a “home team crowd noise” object, a “home team crowd noise” and a “home team commentary” object set, an “away team crowd noise” object, and an “away team crowd noise” and “away team commentary” object set). The menu may be presented to the user by a user interface of a controller, and the controller may be coupled to a set top device (or other device) configured to decode and render (at least partially) the object based program. Metadata included in the program may otherwise allow user selection from among a set of options as to which object(s) indicated by the object channels should be rendered, and as to how the object(s) to be rendered should be configured.
U.S. Provisional Patent Applications No. 61/807,922 and No. 61/832,397 describe an object based audio program which is an encoded audio bitstream indicative of at least some of the program's audio content (e.g., a bed of speaker channels and at least some of the program's object channels) and object related metadata. At least one additional bitstream or file may be indicative of some of the program's audio content (e.g., at least some of the object channels) and/or object related metadata. In some embodiments, object related metadata provides a default mix of object content and bed (speaker channel) content, with default rendering parameters (e.g., default spatial locations of rendered objects). In some embodiments, object related metadata provides a set of selectable “preset” mixes of object channel and speaker channel content, each preset mix having a predetermined set of rendering parameters (e.g., spatial locations of rendered objects). In some embodiments, object related metadata of a program (or a preconfiguration of the playback or rendering system, not indicated by metadata delivered with the program) provides constraints or conditions on selectable mixes of object channel and speaker channel content.
U.S. Provisional Patent Applications No. 61/807,922 and No. 61/832,397 also describe an object based audio program including a set of bitstreams (sometimes referred to as “substreams”) which are generated and transmitted in parallel. Multiple decoders may be employed to decode them (e.g., if the program includes multiple E-AC-3 substreams the playback system may employ multiple E-AC-3 decoders to decode the substreams). Each substream may include synchronization words (e.g., time codes) to allow the substreams to be synchronized or time aligned with each other.
U.S. Provisional Patent Applications No. 61/807,922 and No. 61/832,397 also describe an object based audio program which is or includes at least one AC-3 (or E-AC-3) bitstream, and includes one or more data structures referred to as containers. Each container which includes object channel content (and/or object related metadata) is included in an auxdata field (e.g., the AUX segment shown in FIG. 1 or FIG. 4) at the end of a frame of the bitstream, or in a “skip fields” segment of the bitstream. Also described is an object based audio program which is or includes a Dolby E bitstream, in which the object channel content and object related metadata (e.g., each container of the program which includes object channel content and/or object related metadata) is included in bit locations of the Dolby E bitstream that conventionally do not carry useful information.
U.S. Provisional Application No. 61/832,397 also describes an object based audio program including at least one set of speaker channels, at least one object channel, and metadata indicative of a layered graph (a layered “mix graph”) indicative of selectable mixes (e.g., all selectable mixes) of the speaker channels and object channel(s). The mix graph may be indicative of each rule applicable to selection of subsets of the speaker and object channels, and is indicative of nodes (each of which may be indicative of a selectable channel or set of channels, or a category of selectable channels or set of channels) and connections between the nodes (e.g., control interfaces to the nodes and/or rules for selecting channels). The mix graph may indicate essential data (a “base” layer) and optional data (at least one “extension” layer), and where the mix graph is representable as a tree graph, the base layer can be a branch (or two or more branches) of the tree graph, and each extension layer can be another branch (or set of branches) of the tree graph.
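For illustration only, a layered mix graph that is representable as a tree might be modeled as sketched below. The cited applications do not prescribe this representation: the class MixNode, the function selectable_channels, the layer label "extension1", and the channel names are all hypothetical, and the rule that an entire branch is skipped when its layer is not enabled is an assumption made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class MixNode:
    """One node of a hypothetical tree-shaped mix graph."""
    name: str
    layer: str = "base"  # "base", or the label of an extension layer
    channels: list = field(default_factory=list)
    children: list = field(default_factory=list)


def selectable_channels(node, enabled_layers):
    """Collect channels reachable through branches whose layer is enabled."""
    if node.layer not in enabled_layers:
        return []  # assumed rule: a disabled layer hides its whole branch
    result = list(node.channels)
    for child in node.children:
        result.extend(selectable_channels(child, enabled_layers))
    return result


# Base layer: a bed of speaker channels plus one commentary object.
# Extension layer: an additional crowd-noise object.
program = MixNode(
    "program",
    channels=["L", "R", "C", "Ls", "Rs"],
    children=[
        MixNode("home team commentary", channels=["obj_home_commentary"]),
        MixNode("away team crowd noise", layer="extension1",
                channels=["obj_away_crowd"]),
    ],
)
```

A playback system that handles only essential data would call selectable_channels(program, {"base"}), while one that also handles the optional data would pass {"base", "extension1"}.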
As noted, it has been proposed to include, in an object based audio program, object related metadata which indicates rendering parameters for rendering at least one object (indicated by an object channel of the program) at an apparent spatial location or along an apparent trajectory (in a three dimensional volume) using an array of speakers. For example, an object channel of the program may have corresponding metadata indicating a three-dimensional trajectory of apparent spatial positions at which the corresponding object is to be rendered. The trajectory may include a sequence of “floor” locations in a “floor” plane (where the “floor” plane is a horizontal plane which nominally includes the expected positions of the listener's ears) of the playback environment, and a sequence of “above-floor” locations above the floor plane. It has been proposed to render an object of an object based program at above-floor locations, including by generating at least one speaker feed for driving at least one “above-floor” speaker (of a playback speaker array) which is assumed to be located above the floor plane in the playback environment. Such an above-floor speaker is sometimes referred to herein as a “height” speaker.
Traditionally, audio downmixing of a multi-channel audio program is performed in accordance with a predetermined formula, to collapse (downmix) a first set of channels of the program (N channels indicative of a first soundfield, where N is an integer) down to a second set of channels (M channels indicative of a downmixed soundfield, where M is an integer less than N) for playback by an available speaker array comprising M speakers (e.g., a stereo television speaker array consisting of two speakers). During playback after downmixing, the available speaker array emits sound indicative of the downmixed soundfield. Typically, traditional downmixing of this type includes in the second set of channels (i.e., the downmix) audio content of all the channels in the first set.
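A traditional downmix of this kind can be viewed as applying a fixed M-by-N gain matrix to each multichannel sample. The following is a minimal sketch for N = 5 and M = 2. The gain of 1/sqrt(2) applied to the center and surround channels is an assumed, commonly used value and is not taken from this description; the names DOWNMIX_5_TO_STEREO, downmix, Lo and Ro are hypothetical.

```python
import math

G = 1 / math.sqrt(2)  # assumed gain for center and surround channels

# Each output channel is a weighted sum of input channels (one matrix row each).
DOWNMIX_5_TO_STEREO = {
    "Lo": {"L": 1.0, "C": G, "Ls": G},
    "Ro": {"R": 1.0, "C": G, "Rs": G},
}


def downmix(sample, matrix):
    """Apply a fixed downmix matrix to one multichannel sample.

    sample maps input channel name -> sample value; channels missing
    from the sample are treated as silent.
    """
    return {
        out: sum(gain * sample.get(ch, 0.0) for ch, gain in row.items())
        for out, row in matrix.items()
    }


# One sample of a five-channel program with signal in L and C only.
stereo = downmix({"L": 1.0, "R": 0.0, "C": 1.0, "Ls": 0.0, "Rs": 0.0},
                 DOWNMIX_5_TO_STEREO)
```

Every input channel contributes to at least one output channel, consistent with the observation that the traditional downmix includes content of all channels in the first set. The matrix is fixed in advance and does not depend on which channels carry which kind of content.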
If no above-floor (“height”) speaker is present in the playback system speaker array, a traditional downmixing technique (of the type mentioned above) could be employed to downmix content of an object channel with content of speaker channels of the program (where the speaker channel content is intended to be played by floor speakers of the playback speaker array), so that the resulting downmixed sound is emitted only from floor speakers of the playback speaker array. However, the inventors have recognized that because the content of the above-floor object channel would be downmixed into the content of the original speaker channels, the traditional downmixing would undesirably result in cacophonous sound upon playback of the resulting downmix (e.g., the above-floor content would be perceived as interfering with original speaker channel content).
The inventors have also recognized that a traditional downmixing technique (of the above-mentioned type) has other limitations and disadvantages, not necessarily related to the presence or absence of height speakers in the playback speaker array. For example, the inventors have recognized that, even in traditional 5.1 channel audio production, compromises are often made to preserve a reasonable soundfield for a stereo downmix. For example, a broadcaster may want to place a commentary (or other dialog element) in the surround channels of a 5.1 channel program, but choose not to do so since a traditionally implemented stereo downmix of the desired representation does not provide a pleasing or representative experience to stereo television viewers.
Until the present invention, it had not been known how to render downmixes of selected channels (e.g., object and speaker channels) of an object based audio program, in a manner which ensures that the downmixes are compliant with predetermined downmixing constraints (e.g., one or more downmixing constraints specified by the entity which generates and broadcasts the program, or by the program content creator) based on playback speaker array configuration (e.g., to avoid cacophonous or otherwise undesirable downmixed sound upon playback). Different embodiments of the invention apply to any and all conditions where the program is indicative of more audio channels than those available in the final reproduction environment (i.e., all conditions in which the program includes more channels (object channels and/or speaker channels) than the number of speakers, of the playback speaker array, to be driven).