Dolby, Dolby Digital, Dolby Digital Plus, and Dolby E are trademarks of Dolby Laboratories Licensing Corporation. Dolby Laboratories provides proprietary implementations of AC-3 and E-AC-3 known as Dolby Digital and Dolby Digital Plus, respectively.
A typical stream of audio data includes both audio content (e.g., one or more channels of audio content) and metadata indicative of at least one characteristic of the audio content. For example, in an AC-3 bitstream there are several audio metadata parameters that are specifically intended for use in changing the sound of the program delivered to a listening environment. One of the metadata parameters is the DIALNORM parameter, which is intended to indicate the mean level of dialog occurring in an audio program, and is used to determine audio playback signal level.
Although the present invention is not limited to use with an AC-3 bitstream, an E-AC-3 bitstream, or a Dolby E bitstream, for convenience it will be described in embodiments in which it generates, decodes, or otherwise processes such a bitstream which includes loudness processing state metadata.
An AC-3 encoded bitstream comprises metadata and one to six channels of audio content. The audio content is audio data that has been compressed using perceptual audio coding. The metadata includes several audio metadata parameters that are intended for use in changing the sound of a program delivered to a listening environment.
Details of AC-3 (also known as Dolby Digital) coding are well known and are set forth in many published references including in ATSC Standard A52/A: Digital Audio Compression Standard (AC-3), Revision A, Advanced Television Systems Committee, 20 Aug. 2001.
Details of Dolby Digital Plus (E-AC-3) coding are set forth in “Introduction to Dolby Digital Plus, an Enhancement to the Dolby Digital Coding System,” AES Convention Paper 6196, 117th AES Convention, Oct. 28, 2004.
Details of Dolby E coding are set forth in “Efficient Bit Allocation, Quantization, and Coding in an Audio Distribution System”, AES Preprint 5068, 107th AES Conference, August 1999 and “Professional Audio Coder Optimized for Use with Video”, AES Preprint 5033, 107th AES Conference August 1999.
Each frame of an AC-3 encoded audio bitstream contains audio content and metadata for 1536 samples of digital audio. For a sampling rate of 48 kHz, this represents 32 milliseconds of digital audio or a rate of 31.25 frames per second of audio.
Each frame of an E-AC-3 encoded audio bitstream contains audio content and metadata for 256, 512, 768 or 1536 samples of digital audio, depending on whether the frame contains one, two, three or six blocks of audio data respectively. For a sampling rate of 48 kHz, this represents 5.333, 10.667, 16 or 32 milliseconds of digital audio respectively or a rate of 189.9, 93.75, 62.5 or 31.25 frames per second of audio respectively.
As indicated in FIG. 1, each AC-3 frame is divided into sections (segments), including: a Synchronization Information (SI) section which contains (as shown in FIG. 2) a synchronization word (SW) and the first of two error correction words (CRC1); a Bitstream Information (BSI) section which contains most of the metadata; six Audio Blocks (AB0 to AB5) which contain data compressed audio content (and can also include metadata); waste bits (W) which contain any unused bits left over after the audio content is compressed; an Auxiliary (AUX) information section which may contain more metadata; and the second of two error correction words (CRC2).
As indicated in FIG. 4, each E-AC-3 frame is divided into sections (segments), including: a Synchronization Information (SI) section which contains (as shown in FIG. 2) a synchronization word (SW); a Bitstream Information (BSI) section which contains most of the metadata; between one and six Audio Blocks (AB0 to AB5) which contain data compressed audio content (and can also include metadata); waste bits (W) which contain any unused bits left over after the audio content is compressed; an Auxiliary (AUX) information section which may contain more metadata; and an error correction word (CRC).
In an AC-3 (or E-AC-3) bitstream there are several audio metadata parameters that are specifically intended for use in changing the sound of the program delivered to a listening environment. One of the metadata parameters is the DIALNORM parameter, which is included in the BSI segment.
As shown in FIG. 3, the BSI segment of an AC-3 frame (or an E-AC-3 frame) includes a five-bit parameter (“DIALNORM”) indicating the DIALNORM value for the program. A five-bit parameter (“DIALNORM2”) indicating the DIALNORM value for a second audio program carried in the same AC-3 frame is included if the audio coding mode (“acmod”) of the AC-3 frame is “0”, indicating that a dual-mono or “1+1” channel configuration is in use.
The BSI segment also includes a flag (“addbsie”) indicating the presence (or absence) of additional bit stream information following the “addbsie” bit, a parameter (“addbsil”) indicating the length of any additional bit stream information following the “addbsil” value, and up to 64 bits of additional bit stream information (“addbsi”) following the “addbsil” value.
The BSI segment includes other metadata values not specifically shown in FIG. 3.
It has been proposed to include metadata of other types in audio bitstreams. For example, methods and systems for generating, decoding, and processing audio bitstreams including metadata indicative of the processing state (e.g., the loudness processing state) and characteristics (e.g., loudness) of audio content are described in PCT International Application Publication Number WO 2012/075246 A2, having international filing date Dec. 1, 2011, and assigned to the assignee of the present application. This reference also describes adaptive processing of the audio content of the bitstreams using the metadata, and verification of validity of the loudness processing state and loudness of audio content of the bitstreams using the metadata.
Methods for generating and rendering object based audio programs are also known. During generation of such programs, it is typically assumed that the loudspeakers to be employed for rendering are located in arbitrary locations in the playback environment; not necessarily in a (nominally) horizontal plane or in any other predetermined arrangements known at the time of program generation. Typically, metadata included in the program indicates rendering parameters for rendering at least one object of the program at an apparent spatial location or along a trajectory (in a three dimensional volume), e.g., using a three-dimensional array of speakers. For example, an object channel of the program may have corresponding metadata indicating a three-dimensional trajectory of apparent spatial positions at which the object (indicated by the object channel) is to be rendered. The trajectory may include a sequence of “floor” locations (in the plane of a subset of speakers which are assumed to be located on the floor, or in another horizontal plane, of the playback environment), and a sequence of “above-floor” locations (each determined by driving a subset of the speakers which are assumed to be located in at least one other horizontal plane of the playback environment). Examples of rendering of object based audio programs are described, for example, in PCT International Application No. PCT/US2001/028783, published under International Publication No. WO 2011/119401 A2 on Sep. 29, 2011, and assigned to the assignee of the present application.