The present invention relates to a decoder for decoding a media signal and an encoder for encoding secondary media data comprising metadata or control data for primary media data.
In other words, the present invention shows a method and an apparatus for distribution of control data or metadata over a digital audio channel. An embodiment shows the convenient and reliable transmission of control data or metadata to accompany an audio signal, particularly in television plants, systems, or networks using standard AES3 (AES: Audio Engineering Society) PCM (pulse code modulation) audio bitstreams embedded in HD-SDI (high definition serial digital interface) video signals.
In the production and transmission of music, video, and other multimedia content, the reproduction of the content can be enhanced or made more useful or valuable by including metadata describing characteristics of the content. For example, music encoded in the MP3 format has been made more useful by including ID3 tags in the MP3 file to provide information about the title or artist of the content.
In video content, it is common to include not only descriptive metadata, but data for controlling the reproduction of the content depending on the consumer's equipment and environment. For example, television broadcasts and video discs such as DVD and Blu-ray include dynamic range control data that are used to modify the loudness range of the content and downmix gains that are used to control the conversion of a surround sound multichannel audio signal for reproduction on a stereo device. In the case of dynamic range control data, gains are sent for each few milliseconds of content in order to compress the dynamic range of the content for playback in a noisy environment or where a smaller range of loudness in the program is advantageous, by optionally multiplying the final audio signal by the gains.
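As an illustration of how such gains act on the signal, the following minimal Python sketch multiplies each block of samples by a gain transmitted for that block. The function name `apply_drc`, the block size, and the gain values are hypothetical and chosen only for illustration. Practical decoders typically also smooth the gain between blocks, which is omitted here.

```python
def apply_drc(samples, gains_db, block_size):
    """Scale each block of PCM samples by its dynamic range control gain (in dB)."""
    out = []
    for i, s in enumerate(samples):
        # Pick the gain of the block this sample belongs to (last gain is held).
        block = min(i // block_size, len(gains_db) - 1)
        linear = 10.0 ** (gains_db[block] / 20.0)  # dB to linear amplitude factor
        out.append(s * linear)
    return out


# Two blocks of four samples: the first is left unchanged, the second is
# attenuated by 6 dB (hypothetical values).
pcm = [0.5, -0.5, 0.25, -0.25, 0.8, -0.8, 0.4, -0.4]
compressed = apply_drc(pcm, gains_db=[0.0, -6.0], block_size=4)
```

Because the multiplication is optional, a receiver in a quiet listening environment can simply skip it and reproduce the full dynamic range.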
The means of inclusion of such metadata or control data in a digital bitstream or file for delivery to consumers is well established and specified in audio coding standards such as ATSC A/52 (standardized in Advanced Television Systems Committee, Inc. Audio Compression Standard A/52) or MPEG HE-AAC (standardized in ISO/IEC 14496-3 and ETSI TS 101 154).
However, the transmission of metadata or control data in the professional or creative environment, before the content is encoded into a final bitstream, is much less standardized. Until now this information has been primarily static in nature, remaining constant over the duration of the content. Although loudness control gains are dynamic, standard “encoding profiles” may be established in content production to control the generation of the gains during the final audio encoding process. In this manner, no dynamic metadata needs to be recorded or transmitted in the content creation environment.
The development of object-oriented audio systems, where sounds in two or three dimensions are described not by levels in traditional speaker channels or Ambisonic components, but by spatial coordinates or other data describing their position and size, now involves the transmission of dynamic metadata that changes continuously, if such sounds move over time. Also, static objects are used to allow the creation of content with disparate additional audio elements, such as alternate languages, audio description for the visually impaired, or home or away team commentary for sporting events. Content with such static objects no longer fits into a uniform model of channels, such as stereo or 5.1 surround, which professional facilities are currently designed to accommodate. Thus, descriptive metadata may accompany each item of content during production or distribution so that the metadata may be encoded into the audio bitstreams for emission or delivery to the consumer.
Ideally, professional content formats would simply include provisions for such position or descriptive metadata in their structure or schema. Indeed, new formats or extensions to existing formats, such as MDA or BWF-ADM, have been developed for this purpose. However, such formats are not understood in most cases by legacy equipment, particularly for distribution in systems designed for live or real-time use.
In such systems, legacy standards such as AES3, MADI, or embedded audio over SDI are common. The use of these standards is gradually being augmented or replaced by IP-based standards such as Ravenna, Dante, or AES67. All of these standards or techniques are designed to transmit channels of PCM audio and make no provisions for sending dynamic or descriptive metadata.
One technique considered for solving this problem was to encode the audio in a “mezzanine” format using transparent-bitrate audio coding so an appropriately formatted digital bitstream also containing static metadata could be included. This bitstream was then formatted such that it could be sent as PCM coded audio data over the traditional television plant or professional infrastructure. A common implementation of this technique in the television industry is the Dolby E system, carried in a PCM AES3 audio channel according to SMPTE standard ST 337.
Dolby E allowed legacy equipment designed with four PCM audio channels to be used for the 5.1 channels needed for surround sound, and also included provisions for transmitting the “dialnorm” or integrated loudness value of the program.
Use of the Dolby E system revealed several operational shortcomings. One issue was the inclusion of sample rate conversion in many devices used to embed the PCM audio signals in the SDI infrastructure of production or distribution facilities. Sample rate conversion or resampling of the audio signal is commonly performed to ensure correct phase and frequency synchronization of the audio data sampling clock with the video sampling clock and video synchronization signals used in the facility. Such resampling has a normally inaudible effect on a PCM audio signal, but changes the PCM sample values. Thus, an audio channel used for transmitting a Dolby E bitstream would have the bitstream corrupted by resampling.
In such cases, the resampling may be disabled and other means used to ensure synchronism of the sample clocks within the facility.
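The effect of resampling on a bitstream carried as PCM sample values can be sketched as follows. This is only an illustration: linear interpolation stands in for a real sample rate converter, and the function name `resample_linear`, the data words, and the rate ratio are hypothetical.

```python
def resample_linear(samples, ratio):
    """Resample by linear interpolation; ratio = output rate / input rate."""
    n_out = int((len(samples) - 1) * ratio) + 1
    out = []
    for k in range(n_out):
        pos = k / ratio            # position of output sample k on the input grid
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(round(samples[i] * (1 - frac) + nxt * frac))
    return out


# Data words of a bitstream placed directly into PCM samples (hypothetical).
words = [0x1234, 0x00FF, 0x7F00, 0x0001]

# A rate change of only 0.1 % is enough to alter the sample values.
resampled = resample_linear(words, 1.001)
corrupted = resampled != words
```

For ordinary audio, the interpolated values are a close approximation of the waveform. For a bitstream, every altered sample is a corrupted data word.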
Another issue was the delay introduced by the block-transform nature of the audio codec employed. The Dolby E codec used one video frame (approximately 1/30 second for interlaced ATSC video) for encoding and one video frame for decoding the signal, resulting in a two-frame delay of the audio relative to the video. Maintaining lip-sync therefore requires delaying the video signal by the same amount, introducing additional delay in the distribution infrastructure.
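The size of this delay can be estimated with a short calculation. The sketch below assumes a frame rate of 30000/1001 frames per second (the common value behind "approximately 1/30 second") and a 48 kHz audio sampling rate. Both values, and the helper name `codec_delay`, are assumptions made for illustration and are not stated in the text above.

```python
from fractions import Fraction


def codec_delay(frames, frame_rate, sample_rate=48000):
    """Return the delay of `frames` video frames in milliseconds and in audio samples."""
    seconds = Fraction(frames) / frame_rate
    return float(seconds * 1000), float(seconds * sample_rate)


# One frame for encoding plus one frame for decoding.
ms, samples = codec_delay(2, Fraction(30000, 1001))
```

Under these assumptions the two-frame delay is roughly 67 ms, which is the amount by which the video would have to be delayed.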
A third issue is the need to program SDI routing switchers to treat inputs carrying Dolby E bitstreams as data channels instead of audio signals. Although Dolby E contains a “guard band” around the video signal's vertical interval to allow routing switchers to switch to another input without loss of the Dolby E data, many routing switchers perform a crossfade of the audio signals during such a switch to prevent audible pops or transients in normal PCM audio signals. These crossfades last 5 to 20 ms and corrupt the Dolby E bitstream around the switch point.
These operational limitations resulted in most TV facilities abandoning the use of Dolby E in favor of a strategy of normalizing the dialnorm level of all content upon ingest to their network, so that fixed dialnorm values and dynamic range profiles could be programmed into their emission audio encoders.
An alternative technique sometimes used in TV facilities is to insert metadata information into the SDI video signal itself in the VANC data as standardized in SMPTE standard ST 2020. Often this is combined with carriage of the metadata using the user bits of AES3. However, ordinary SDI embedding equipment does not support the extraction of this metadata from the AES stream for insertion into VANC bits.
An additional technique sometimes used is to encode dynamic control data within a PCM audio signal by inserting it into the LSBs of the audio signal. Such a technique is described in the paper “A Variable-Bit-Rate Buried-Data Channel for Compact Disc” by Oomen and has been employed in implementations of the MPEG Surround audio coding standard. However, such buried data does not survive sample rate conversion or truncation of the LSB.
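The fragility of LSB-buried data can be illustrated with a minimal sketch. It uses one data bit per sample and integer PCM values. The function names and sample values are hypothetical, and the sketch does not reproduce the variable-bit-rate scheme of the cited paper.

```python
def embed_lsb(samples, bits):
    """Overwrite the least significant bit of each integer PCM sample with a data bit."""
    return [(s & ~1) | b for s, b in zip(samples, bits)]


def extract_lsb(samples):
    """Read back the least significant bit of each sample."""
    return [s & 1 for s in samples]


pcm = [1000, -2000, 3000, -4000, 5000, -6000, 7000, -8000]
payload = [1, 0, 1, 1, 0, 0, 1, 0]

carrier = embed_lsb(pcm, payload)
recovered = extract_lsb(carrier)          # bit-exact path: payload survives

# A device that truncates the word length by one bit clears every LSB.
truncated = [(s >> 1) << 1 for s in carrier]
after_truncation = extract_lsb(truncated)  # payload is lost
```

On a bit-exact path the payload is recovered unchanged, while a single truncation step erases it completely even though the audio itself is changed by at most one quantization step.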
A related technique is to use extra bits such as User Bits or Auxiliary Sample Bits specified in the AES3 standard as a side data channel suitable for dynamic control data. Unfortunately, many implementations of the AES3 standard discard this information.
A further limitation of the aforementioned techniques is that they are intended for use only in a technical transmission environment. If they were routed through creative equipment, such as an audio console or digital audio workstation, even if no operations were performed on the containing PCM channel, it could not be guaranteed that the data path through the console was bit-exact, as such equipment is not designed for such purposes. Even if such bit-exactness could be assured, the mere accident of touching a control fader, and thus inducing a slight gain change in the PCM channel, would corrupt the signal.
Common to all these techniques are the limitations imposed by creation and transport equipment that is designed solely for the purpose of carrying PCM audio signals, without consideration for the embedding of digital control data.
Therefore, there is a need for an improved approach.