Dolby, Dolby Digital, Dolby Digital Plus, and Dolby E are trademarks of Dolby Laboratories Licensing Corporation. Dolby Laboratories provides proprietary implementations of AC-3 and E-AC-3 known as Dolby Digital and Dolby Digital Plus, respectively.
Although the invention is not limited to use in encoding audio data in accordance with the E-AC-3 (or AC-3 or Dolby E) format, or delivering, decoding or rendering E-AC-3, AC-3, or Dolby E encoded data, for convenience it will be described in embodiments in which it encodes an audio bitstream in accordance with the E-AC-3 or AC-3 or Dolby E format, and delivers, decodes, and renders such a bitstream.
A typical stream of audio data includes both audio content (e.g., one or more channels of audio content) and metadata indicative of at least one characteristic of the audio content. For example, in an AC-3 bitstream there are several audio metadata parameters that are specifically intended for use in changing the sound of the program delivered to a listening environment.
An AC-3 or E-AC-3 encoded bitstream comprises metadata and can comprise one to six channels of audio content. The audio content is audio data that has been compressed using perceptual audio coding. Details of AC-3 coding are well known and are set forth in many published references including the following:
ATSC Standard A/52A: Digital Audio Compression Standard (AC-3), Revision A, Advanced Television Systems Committee, 20 Aug. 2001; and
U.S. Pat. Nos. 5,583,962; 5,632,005; 5,633,981; 5,727,119; and 6,021,386.
Details of Dolby Digital Plus (E-AC-3) coding are set forth in, for example, “Introduction to Dolby Digital Plus, an Enhancement to the Dolby Digital Coding System,” AES Convention Paper 6196, 117th AES Convention, Oct. 28, 2004.
Details of Dolby E coding are set forth in “Efficient Bit Allocation, Quantization, and Coding in an Audio Distribution System,” AES Preprint 5068, 107th AES Conference, August 1999, and “Professional Audio Coder Optimized for Use with Video,” AES Preprint 5033, 107th AES Conference, August 1999.
Each frame of an AC-3 encoded audio bitstream contains audio content and metadata for 1536 samples of digital audio. For a sampling rate of 48 kHz, this represents 32 milliseconds of digital audio or a rate of 31.25 frames per second of audio.
Each frame of an E-AC-3 encoded audio bitstream contains audio content and metadata for 256, 512, 768 or 1536 samples of digital audio, depending on whether the frame contains one, two, three or six blocks of audio data respectively. For a sampling rate of 48 kHz, this represents 5.333, 10.667, 16 or 32 milliseconds of digital audio respectively or a rate of 187.5, 93.75, 62.5 or 31.25 frames per second of audio respectively.
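The frame durations and frame rates above follow directly from the block count and the sampling rate. The following is a minimal sketch of that arithmetic; the function name frame_timing and the constant SAMPLES_PER_BLOCK are illustrative names chosen here, not part of any AC-3 or E-AC-3 implementation.

```python
SAMPLES_PER_BLOCK = 256  # one audio block corresponds to 256 samples


def frame_timing(num_blocks: int, sample_rate_hz: int = 48_000):
    """Return (samples per frame, frame duration in ms, frames per second)."""
    if num_blocks not in (1, 2, 3, 6):
        raise ValueError("a frame carries 1, 2, 3 or 6 audio blocks")
    samples = num_blocks * SAMPLES_PER_BLOCK
    duration_ms = 1000.0 * samples / sample_rate_hz
    frames_per_second = sample_rate_hz / samples
    return samples, duration_ms, frames_per_second


if __name__ == "__main__":
    for blocks in (1, 2, 3, 6):
        samples, ms, fps = frame_timing(blocks)
        print(f"{blocks} block(s): {samples} samples, {ms:.3f} ms, {fps:.2f} frames/s")
```

With six blocks the result matches the AC-3 case (1536 samples, 32 milliseconds, 31.25 frames per second); with one block the rate is 48000/256 = 187.5 frames per second.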
As indicated in FIG. 1, each AC-3 frame is divided into sections (segments), including: a Synchronization Information (SI) section which contains (as shown in FIG. 2) a synchronization word (SW) and the first of two error correction words (CRC1); a Bitstream Information (BSI) section which contains most of the metadata; six Audio Blocks (AB0 to AB5) which contain data compressed audio content (and can also include metadata); waste bits (W) which contain any unused bits left over after the audio content is compressed; an Auxiliary (AUX) information section which may contain more metadata; and the second of two error correction words (CRC2).
As indicated in FIG. 4, each E-AC-3 frame is divided into sections (segments), including: a Synchronization Information (SI) section which contains (as shown in FIG. 2) a synchronization word (SW); a Bitstream Information (BSI) section which contains most of the metadata; between one and six Audio Blocks (AB0 to AB5) which contain data compressed audio content (and can also include metadata); waste bits (W) which contain any unused bits left over after the audio content is compressed; an Auxiliary (AUX) information section which may contain more metadata; and an error correction word (CRC).
In an AC-3 (or E-AC-3) bitstream there are several audio metadata parameters that are specifically intended for use in changing the sound of the program delivered to a listening environment. One of the metadata parameters is the DIALNORM parameter, which is included in the BSI segment.
As shown in FIG. 3, the BSI segment of an AC-3 frame (or an E-AC-3 frame) includes a five-bit parameter (“DIALNORM”) indicating the DIALNORM value for the program. A five-bit parameter (“DIALNORM2”) indicating the DIALNORM value for a second audio program carried in the same AC-3 frame is included if the audio coding mode (“acmod”) of the AC-3 frame is “0”, indicating that a dual-mono or “1+1” channel configuration is in use.
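As an illustration of how a decoder might locate these parameters, the following sketch reads the DIALNORM value, and the DIALNORM2 value when acmod is 0, from the start of an AC-3 frame. The order and widths of the fields that precede DIALNORM are not given in this text; they are assumed here from a reading of the ATSC A/52 syntax for frames with a bsid of 8 or lower and should be checked against the standard. BitReader, parse_dialnorm and dialnorm_to_db are hypothetical names, and the mapping of a value of 0 to -31 dB is likewise an assumption taken from that standard.

```python
class BitReader:
    """Reads MSB-first bit fields from a bytes object."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, in bits

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos >> 3]
            value = (value << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return value


def parse_dialnorm(frame: bytes) -> dict:
    """Return acmod, dialnorm and (for acmod 0) dialnorm2 of an AC-3 frame."""
    r = BitReader(frame)
    if r.read(16) != 0x0B77:          # synchronization word (SW)
        raise ValueError("synchronization word not found")
    r.read(16)                        # CRC1
    r.read(2)                         # fscod (sample rate code)
    r.read(6)                         # frmsizecod (frame size code)
    if r.read(5) > 8:                 # bsid
        raise ValueError("not an AC-3 (bsid <= 8) frame")
    r.read(3)                         # bsmod
    acmod = r.read(3)
    if (acmod & 0x1) and acmod != 0x1:
        r.read(2)                     # cmixlev
    if acmod & 0x4:
        r.read(2)                     # surmixlev
    if acmod == 0x2:
        r.read(2)                     # dsurmod
    r.read(1)                         # lfeon
    result = {"acmod": acmod, "dialnorm": r.read(5), "dialnorm2": None}
    if acmod == 0:                    # dual-mono ("1+1"): second program
        if r.read(1):                 # compre
            r.read(8)                 # compr
        if r.read(1):                 # langcode
            r.read(8)                 # langcod
        if r.read(1):                 # audprodie
            r.read(7)                 # mixlevel (5 bits) + roomtyp (2 bits)
        result["dialnorm2"] = r.read(5)
    return result


def dialnorm_to_db(value: int) -> int:
    """Assumed mapping: 1..31 means -1..-31 dB; 0 is treated as -31 dB."""
    return -(value or 31)
```

The sketch stops after DIALNORM2 and does not validate CRC1 or parse the remainder of the BSI segment.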
The BSI segment also includes a flag (“addbsie”) indicating the presence (or absence) of additional bit stream information following the “addbsie” bit, a parameter (“addbsil”) indicating the length of any additional bit stream information following the “addbsil” value, and up to 64 bytes of additional bit stream information (“addbsi”) following the “addbsil” value.
The BSI segment includes other metadata values not specifically shown in FIG. 3.
It has been proposed to include metadata of other types in audio bitstreams. For example, methods and systems for generating, decoding, and processing audio bitstreams including metadata indicative of the processing state (e.g., the loudness processing state) and characteristics (e.g., loudness) of audio content are described in PCT International Application Publication Number WO 2012/075246 A2, having international filing date Dec. 1, 2011, and assigned to the assignee of the present application. This reference also describes adaptive processing of the audio content of the bitstreams using the metadata, and verification of validity of the loudness processing state and loudness of audio content of the bitstreams using the metadata.
Methods for generating and rendering object based audio programs are also known. During generation of such programs, it may be assumed that the loudspeakers to be employed for rendering are located in arbitrary locations in the playback environment (or that the speakers are in a symmetric configuration in a unit circle). It need not be assumed that the speakers are necessarily in a (nominally) horizontal plane or in any other predetermined arrangements known at the time of program generation. Typically, metadata included in the program indicates rendering parameters for rendering at least one object of the program at an apparent spatial location or along a trajectory (in a three dimensional volume), e.g., using a three-dimensional array of speakers. For example, an object channel of the program may have corresponding metadata indicating a three-dimensional trajectory of apparent spatial positions at which the object (indicated by the object channel) is to be rendered. The trajectory may include a sequence of “floor” locations (in the plane of a subset of speakers which are assumed to be located on the floor, or in another horizontal plane, of the playback environment), and a sequence of “above-floor” locations (each determined by driving a subset of the speakers which are assumed to be located in at least one other horizontal plane of the playback environment). Examples of rendering of object based audio programs are described, for example, in PCT International Application No. PCT/US2011/028783, published under International Publication No. WO 2011/119401 A2 on Sep. 29, 2011, and assigned to the assignee of the present application.
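One way to picture trajectory metadata of this kind is as a time-ordered list of apparent positions from which a renderer derives the position of the object at any instant. The following sketch is illustrative only: the Waypoint structure, the position_at function, the coordinate convention (z equal to 0.0 for “floor” locations) and the use of linear interpolation between waypoints are all assumptions made here and are not taken from the cited references.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Waypoint:
    time_s: float      # time at which the object is at this position
    position: tuple    # hypothetical (x, y, z); z == 0.0 is a "floor" location


def position_at(trajectory, t):
    """Apparent position at time t, linearly interpolated between waypoints.

    `trajectory` must be a non-empty list of Waypoint sorted by time_s.
    Before the first or after the last waypoint the position is held.
    """
    if not trajectory:
        raise ValueError("trajectory must contain at least one waypoint")
    times = [w.time_s for w in trajectory]
    i = bisect_right(times, t)
    if i == 0:
        return trajectory[0].position
    if i == len(trajectory):
        return trajectory[-1].position
    a, b = trajectory[i - 1], trajectory[i]
    frac = (t - a.time_s) / (b.time_s - a.time_s)
    return tuple(pa + frac * (pb - pa) for pa, pb in zip(a.position, b.position))
```

For example, a trajectory whose waypoints have z equal to 0.0 followed by waypoints with z greater than 0.0 would move the object along “floor” locations and then through “above-floor” locations.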
Above-cited U.S. Provisional Patent Application No. 61/807,922 and above-cited U.S. Provisional Patent Application No. 61/832,397 describe object based audio programs which are rendered so as to provide an immersive, personalizable perception of the program's audio content. The content may be indicative of the atmosphere at (i.e., sound occurring in or at) and/or commentary on a spectator event (e.g., a soccer or rugby game, or another sporting event). The audio content of the program may be indicative of multiple audio object channels (e.g., indicative of user-selectable objects or object sets, and typically also a default set of objects to be rendered in the absence of object selection by the user) and at least one bed of speaker channels. The bed of speaker channels may be a conventional mix (e.g., a 5.1 channel mix) of speaker channels of a type that might be included in a conventional broadcast program which does not include an object channel.
Above-cited U.S. Provisional Patent Applications No. 61/807,922 and No. 61/832,397 describe object related metadata delivered as part of an object based audio program which provides mixing interactivity (e.g., a large degree of mixing interactivity) on the playback side, including by allowing an end user to select a mix of audio content of the program for rendering, instead of merely allowing playback of a pre-mixed sound field. For example, a user may select among rendering options provided by metadata of a typical embodiment of the inventive program to select a subset of available object channels for rendering, and optionally also the playback level of at least one audio object (sound source) indicated by the object channel(s) to be rendered. The spatial location at which each selected sound source is rendered may be predetermined by metadata included in the program, but in some embodiments can be selected by the user (e.g., subject to predetermined rules or constraints). In some embodiments, metadata included in the program allows user selection from among a menu of rendering options (e.g., a small number of rendering options, for example, a “home team crowd noise” object, a “home team crowd noise” and a “home team commentary” object set, an “away team crowd noise” object, and an “away team crowd noise” and “away team commentary” object set). The menu may be presented to the user by a user interface of a controller, and the controller may be coupled to a set top device (or other device) configured to decode and render (at least partially) the object based program. Metadata included in the program may otherwise allow user selection from among a set of options as to which object(s) indicated by the object channels should be rendered, and as to how the object(s) to be rendered should be configured.
U.S. Provisional Patent Applications No. 61/807,922 and No. 61/832,397 describe an object based audio program which is an encoded audio bitstream indicative of at least some of the program's audio content (e.g., a bed of speaker channels and at least some of the program's object channels) and object related metadata. At least one additional bitstream or file may be indicative of some of the program's audio content (e.g., at least some of the object channels) and/or object related metadata. In some embodiments, object related metadata provides a default mix of object content and bed (speaker channel) content, with default rendering parameters (e.g., default spatial locations of rendered objects). In some embodiments, object related metadata provides a set of selectable “preset” mixes of object channel and speaker channel content, each preset mix having a predetermined set of rendering parameters (e.g., spatial locations of rendered objects). In some embodiments, object related metadata of a program (or a preconfiguration of the playback or rendering system, not indicated by metadata delivered with the program) provides constraints or conditions on selectable mixes of object channel and speaker channel content.
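The relationship between a default mix, selectable preset mixes, and constraints on selection can be sketched as a small data structure. The sketch below is a hypothetical model, not the metadata format of the cited applications: the names PresetMix, ProgramMetadata and choose_mix, the max_objects constraint, and the example program are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PresetMix:
    name: str
    object_channels: tuple                          # object channels to render
    positions: dict = field(default_factory=dict)   # object name -> (x, y, z)


@dataclass
class ProgramMetadata:
    bed: str            # bed of speaker channels, e.g. a 5.1 channel mix
    presets: dict       # preset name -> PresetMix
    default_preset: str  # mix rendered in the absence of a user selection
    max_objects: int    # hypothetical constraint on any selectable mix


def choose_mix(meta: ProgramMetadata, requested=None) -> PresetMix:
    """Return the preset to render, falling back to the default mix."""
    if requested is None:
        return meta.presets[meta.default_preset]
    if requested not in meta.presets:
        raise KeyError(f"no such preset: {requested}")
    preset = meta.presets[requested]
    if len(preset.object_channels) > meta.max_objects:
        raise ValueError("preset violates the program's constraints")
    return preset


def example_program() -> ProgramMetadata:
    """Hypothetical program with two preset mixes over a 5.1 bed."""
    home = PresetMix("home", ("home team crowd noise", "home team commentary"))
    away = PresetMix("away", ("away team crowd noise", "away team commentary"))
    return ProgramMetadata("5.1", {"home": home, "away": away}, "home", 2)
```

Calling choose_mix(example_program()) returns the default “home” preset, while choose_mix(example_program(), "away") returns the mix selected by the user.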
U.S. Provisional Patent Applications No. 61/807,922 and No. 61/832,397 also describe an object based audio program including a set of bitstreams (sometimes referred to as “substreams”) which are generated and transmitted in parallel. Multiple decoders may be employed to decode them (e.g., if the program includes multiple E-AC-3 substreams the playback system may employ multiple E-AC-3 decoders to decode the substreams). Each substream may include synchronization words (e.g., time codes) to allow the substreams to be synchronized or time aligned with each other.
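Time alignment of parallel substreams can be illustrated by grouping decoded frames that carry the same time code. The following is a minimal sketch under the assumption that each substream is available as a list of (time code, frame) pairs; the function name align_substreams and this representation are hypothetical and do not reflect the actual substream syntax.

```python
def align_substreams(substreams):
    """Group frames that share a time code across all substreams.

    `substreams` maps a substream name to a list of (time_code, frame)
    pairs. Returns a list of (time_code, {name: frame}) tuples, in time
    code order, for the time codes present in every substream.
    """
    if not substreams:
        return []
    indexed = {name: dict(frames) for name, frames in substreams.items()}
    common = set.intersection(*(set(d) for d in indexed.values()))
    return [
        (tc, {name: indexed[name][tc] for name in indexed})
        for tc in sorted(common)
    ]
```

Frames whose time code is missing from any substream are dropped in this sketch; a practical system would more likely buffer them until the matching frames arrive.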
U.S. Provisional Patent Applications No. 61/807,922 and No. 61/832,397 also describe an object based audio program which is or includes at least one AC-3 (or E-AC-3) bitstream, and includes one or more data structures referred to as containers. Each container which includes object channel content (and/or object related metadata) is included in an auxdata field (e.g., the AUX segment shown in FIG. 1 or FIG. 4) at the end of a frame of the bitstream, or in a “skip fields” segment of the bitstream. Also described is an object based audio program which is or includes a Dolby E bitstream, in which the object channel content and object related metadata (e.g., each container of the program which includes object channel content and/or object related metadata) is included in bit locations of the Dolby E bitstream that conventionally do not carry useful information. U.S. Provisional Application No. 61/832,397 also describes an object based audio program including at least one set of speaker channels, at least one object channel, and metadata indicative of a layered graph (a layered “mix graph”) indicative of selectable mixes (e.g., all selectable mixes) of the speaker channels and object channel(s). The mix graph may be indicative of each rule applicable to selection of subsets of the speaker and object channels, and is indicative of nodes (each of which may be indicative of a selectable channel or set of channels, or a category of selectable channels or set of channels) and connections between the nodes (e.g., control interfaces to the nodes and/or rules for selecting channels). The mix graph may indicate essential data (a “base” layer) and optional data (at least one “extension” layer), and where the mix graph is representable as a tree graph, the base layer can be a branch (or two or more branches) of the tree graph, and each extension layer can be another branch (or set of branches) of the tree graph.
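A mix graph that is representable as a tree, with a base layer and extension layers occupying different branches, can be sketched as follows. This is an illustrative model only: the MixNode structure, the layer labels, and the functions prune_to_layers and leaf_names are assumptions made here and do not describe the encoding used in the cited application.

```python
from dataclasses import dataclass, field


@dataclass
class MixNode:
    name: str                  # a selectable channel, channel set, or category
    layer: str = "base"        # "base" or a hypothetical extension label
    children: list = field(default_factory=list)


def prune_to_layers(node, layers):
    """Copy of the tree keeping only branches whose layer is in `layers`.

    Returns None when `node` itself belongs to an excluded layer.
    """
    if node.layer not in layers:
        return None
    kept = []
    for child in node.children:
        pruned = prune_to_layers(child, layers)
        if pruned is not None:
            kept.append(pruned)
    return MixNode(node.name, node.layer, kept)


def leaf_names(node):
    """Names of the leaves below `node`, in tree order."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaf_names(child))
    return names


def example_graph() -> MixNode:
    """Hypothetical graph: base branches plus one extension branch."""
    return MixNode("program", "base", [
        MixNode("5.1 bed", "base"),
        MixNode("home team commentary", "base"),
        MixNode("extras", "ext1", [
            MixNode("away team commentary", "ext1"),
            MixNode("away team crowd noise", "ext1"),
        ]),
    ])
```

A system that handles only the essential data would work with prune_to_layers(example_graph(), {"base"}), whose leaves are the bed and the home team commentary; adding "ext1" to the set of layers exposes the additional branch.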
U.S. Provisional Applications Nos. 61/807,922 and 61/832,397 also teach that an object based audio program may be decodable, and speaker channel content thereof may be renderable, by a legacy decoder and rendering system (which is not configured to parse object channels and object related metadata of the program). The same program may be rendered by a set top device (or other decoding and rendering system) which is configured to parse the program's object channels and object related metadata and render a mix of speaker channel and object channel content indicated by the program. However, neither U.S. Provisional Application No. 61/807,922 nor U.S. Provisional Application No. 61/832,397 teaches or suggests how to generate a personalizable object based audio program which can be rendered by a legacy decoding and rendering system (which is not configured to parse object channels and object related metadata of the program) to provide a full range audio experience (e.g., audio intended to be perceived as non-ambient sound from at least one discrete audio object, mixed with ambient sound), but so that a decoding and rendering system which is configured to parse the program's object channels and object related metadata may render a selected mix (also providing a full range audio experience) of content of at least one speaker channel and at least one object channel of the program, or that it would be desirable to do so.