The invention relates to the control of the loudness of audio, video, and multimedia content played back in digital form on electronic reproduction devices, specifically but not exclusively to the control of the playback loudness with content that is prepared both with and without embedded loudness metadata as commonly occurs in new media devices.
In the production and transmission of music, video, and other multimedia content, the process of loudness normalization is carried out to ensure that the consumer hears the audio signal with an appropriate loudness from song to song or program to program.
Since the early days of recording and films, this has been done during the production process or through reproduction standards for theaters. The common practice today in the music and radio broadcasting industries is to adjust the loudness to a value near the maximum peak level of the medium, while the practice in the film and television industries is to use one of several standard loudness levels that may be 20 to 31 dB below the maximum peak level. In the era before media convergence, this difference went unnoticed by consumers, as separate devices or volume settings were used to play back each type of content.
With the advent of mobile devices such as mobile phones or portable media players that are intended to play back both music and film content, this difference in production practices leads to loudness differences that may be as much as 30 dB if the content is transmitted to the device without modification. This can lead to movies that are too quiet, or music that is too loud, when switching from one type of content to another.
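To give a sense of scale for the figures above, the following minimal sketch converts a level difference in dB into linear amplitude and power ratios. The two loudness values are illustrative assumptions chosen to produce a 30 dB gap; they are not taken from any standard, and the function names are hypothetical.

```python
def db_to_amplitude_ratio(level_difference_db: float) -> float:
    """Convert a level difference in dB to a linear amplitude ratio."""
    return 10.0 ** (level_difference_db / 20.0)


def db_to_power_ratio(level_difference_db: float) -> float:
    """Convert a level difference in dB to a linear power ratio."""
    return 10.0 ** (level_difference_db / 10.0)


# Assumed, illustrative levels relative to the maximum peak level.
music_loudness_db = -1.0   # music mastered close to the peak level
film_loudness_db = -31.0   # film mastered at a low standard level

gap_db = music_loudness_db - film_loudness_db
amplitude_ratio = db_to_amplitude_ratio(gap_db)
power_ratio = db_to_power_ratio(gap_db)

print(f"gap: {gap_db:.1f} dB")
print(f"amplitude ratio: {amplitude_ratio:.2f}")
print(f"power ratio: {power_ratio:.0f}")
```

Under these assumed levels, the music signal has roughly 32 times the amplitude, or 1000 times the power, of the film signal at the same volume setting.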
A related trend is the increase in loudness of many genres of recorded music through the use of strong dynamic range compression, limiting, and clipping during the mastering of a recording. Such mastering is done considering only lossless recording media such as Compact Discs, though the majority of music sold today is in lossy data-compressed formats such as MPEG AAC and MP3. The data compression process may introduce changes in the time-domain waveform reconstructed in the decoder during playback that cause overshoots in the waveform above the full-scale limits or maximum peak value of the signal. In a fixed-point decoder (or saturating floating-point decoder) typically used in mobile devices, this can lead to clipping of the overshoot to the full-scale limit, causing additional audible clipping in the reproduced signal.
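The clipping mechanism described above can be sketched as follows. This is a simplified illustration, assuming decoded samples are represented as floating-point values with full scale at ±1.0 and are converted to 16-bit integers with saturation; the function names and the sample values are hypothetical and do not correspond to any particular decoder.

```python
import math
from typing import Sequence


def saturate_to_int16(sample: float) -> int:
    """Scale a sample (full scale = +/-1.0) to 16 bits, saturating overshoots."""
    scaled = round(sample * 32767.0)
    return max(-32768, min(32767, scaled))


def count_overshoots(samples: Sequence[float]) -> int:
    """Count samples whose magnitude exceeds full scale."""
    return sum(1 for s in samples if abs(s) > 1.0)


def headroom_gain_db(samples: Sequence[float]) -> float:
    """Gain in dB (zero or negative) that would bring the peak to full scale."""
    peak = max(abs(s) for s in samples)
    if peak <= 1.0:
        return 0.0
    return -20.0 * math.log10(peak)


# Assumed decoder output in which lossy coding has produced overshoots.
decoded = [0.5, 0.98, 1.25, -1.1, 0.3]

clipped_output = [saturate_to_int16(s) for s in decoded]
print(clipped_output)
print(count_overshoots(decoded))
print(round(headroom_gain_db(decoded), 2))
```

In this sketch, applying the computed negative gain before conversion would avoid saturation for the given samples, at the cost of a lower playback level.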
This strong compression and clipping of music is done in some cases for artistic purposes, but is more commonly done either as an attempt to increase the commercial appeal of a recording by making it “sound louder” than others, or to provide content that can be understood in all listening circumstances, such as in airports or noisy places as well as quiet environments.
In the film and video industries, wide audio dynamic range is used in some genres for dramatic effect and to create a more engaging experience. When conveyed to a consumer through the Dolby Digital or MPEG-4 AAC codecs, audio dynamic range control metadata is often included to allow the dynamic range to be optionally reduced at the receiver or player for cases where there is a noisy environment or where loud scenes would be too disturbing.
The traditional metadata included in DVD or Blu-ray content encoded with Dolby Digital or transmitted in TV signals encoded with Dolby Digital (standardized in Advanced Television Systems Committee, Inc. Audio Compression Standard A/52) or MPEG-4 AAC (standardized in ISO/IEC 14496-3 and ETSI TS 101 154) includes the following components:
1. A single, static metadata value indicating the overall long-term integrated loudness of the program, termed program reference level in the MPEG standards.
2. Static metadata values for downmix gains used to control the down-mixing of multi-channel content for output through a stereo or monophonic device.
3. Two sets of dynamic range control gains or scaling factors, sent for each data-compressed bitstream frame for a plurality of frequency bands or regions in the audio signal. One is used for “light” compression in the industry vernacular and the other for “heavy” compression. The use of these light and heavy DRC values is typically tied to operation at decoder loudness target levels established for the operating modes “Line Mode” and “RF Mode”. The naming conventions and operation points for these modes were established in the early days of digital media when it might have been necessary to convert digital audio to analog signals sent over baseband cables to line inputs on a succeeding device or transmitted over an RF carrier to an analog television set.
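The three metadata components listed above can be pictured with the following sketch. It is not the bitstream syntax of any standard: the class names, field names, and numeric values are hypothetical, and the assumption that the normalization gain and the selected DRC gain simply add in dB is a simplification made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FrameDrcGains:
    """Per-frame DRC gains in dB, one entry per frequency band (assumed layout)."""
    light_db: List[float]
    heavy_db: List[float]


@dataclass
class ProgramMetadata:
    """Static, per-program metadata values (hypothetical field names)."""
    program_reference_level_db: float
    downmix_gains_db: Dict[str, float] = field(default_factory=dict)


def frame_band_gain_db(meta: ProgramMetadata, drc: FrameDrcGains,
                       target_level_db: float, use_heavy: bool,
                       band: int) -> float:
    """Total gain in dB for one band of one frame.

    Normalization moves the program from its reference level to the
    decoder target level; the selected DRC gain is then added in dB.
    """
    normalization_db = target_level_db - meta.program_reference_level_db
    selected = drc.heavy_db if use_heavy else drc.light_db
    return normalization_db + selected[band]


# Illustrative values only.
meta = ProgramMetadata(program_reference_level_db=-31.0,
                       downmix_gains_db={"center": -3.0, "surround": -3.0})
loud_frame = FrameDrcGains(light_db=[-2.0], heavy_db=[-6.0])

print(frame_band_gain_db(meta, loud_frame, -20.0, use_heavy=True, band=0))
print(frame_band_gain_db(meta, loud_frame, -31.0, use_heavy=False, band=0))
```

With these assumed values, a higher target level combined with the heavy gains yields a net boost of 5 dB on a loud frame, while the lower target level with the light gains yields a 2 dB cut.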
The use of this metadata allows the reproduction to be tailored to the listening environment in a non-destructive manner during playback. The same stream or file may be played back with a different set of metadata, or no metadata used at all, to produce a different dynamic range. Unlike the use of a compressor that resides solely in the playback device, dynamic range control using metadata allows monitoring and control of the nature of the compression by creative artists during the production process, if desired.
Unfortunately, dynamic range control metadata as commonly implemented in lossy codecs such as MPEG AAC or the Dolby Digital family cannot compress a signal strongly enough to match the loudness of contemporary music, as the metadata affects the average power of the signal (potentially in several frequency bands) on an audio compression frame basis, with common frame periods of 20 to 40 ms. This frame-by-frame gain control is not quick enough to reduce the peak-to-average ratio of the signal to that of highly processed contemporary music.
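The limitation described above can be illustrated with a short sketch. It assumes, as a simplification, a single gain value held constant across one frame and one frequency band; practical decoders may smooth or interpolate gains between frames. The function names and sample values are hypothetical.

```python
import math
from typing import List, Sequence


def crest_factor_db(frame: Sequence[float]) -> float:
    """Peak-to-average (peak-to-RMS) ratio of a frame, in dB."""
    peak = max(abs(x) for x in frame)
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(peak / rms)


def apply_frame_gain(frame: Sequence[float], gain_db: float) -> List[float]:
    """Scale every sample of the frame by one constant gain."""
    gain = 10.0 ** (gain_db / 20.0)
    return [x * gain for x in frame]


# A frame with one short transient among quieter samples (assumed values).
frame = [0.0, 0.1, 0.9, -0.1]

before_db = crest_factor_db(frame)
after_db = crest_factor_db(apply_frame_gain(frame, -6.0))

print(round(before_db, 3), round(after_db, 3))
```

Because the same gain scales both the peak and the RMS value, the peak-to-average ratio within the frame is unchanged; only a processor acting on a shorter time scale, such as a limiter, reduces it.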
The approach taken by Wolters et al. as described in [5] to solve this problem is to employ an audio limiter following the decoder in a playback device to increase the average loudness. This solves the loudness matching issue, so that music and film content have equal loudness, but it has several disadvantages. When a consumer is playing content in a quiet environment, perhaps with the mobile device connected to speakers in a quiet room or using headphones or earphones with strong acoustic isolation, the film content will be undesirably compressed as strongly as the music. Also, the limiter introduces additional workload on the device CPU or DSP, shortening battery life.
A different approach is described by Camerer et al. in [6], which proposes encoding a loudness measurement such as that described in ITU Standard BS.1770-2 as metadata in music files and normalizing the playback of each file to a target level set by the device's volume control. This builds upon previous systems of music loudness normalization such as SoundCheck (www.apple.com) and ReplayGain (www.replaygain.org), which have been optional features of some music players such as the iPod. In their approach, they advocate mandating that loudness normalization be on by default; however, they do not specify what is to happen when a user turns off the loudness normalization, or more importantly, what happens when content that has not been encoded with loudness metadata is played back. Their assumption is that all content will be analyzed by the playback device or by a secure trusted distributor such as iTunes before playback. Additionally, there is no provision for adjusting the overall dynamic range of the content to tailor it to the listening environment.
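The general idea of metadata-based loudness normalization, and the gap noted above for content without metadata, can be sketched as follows. This is not the method of [6], SoundCheck, or ReplayGain; the function name and loudness values are hypothetical, and returning no gain for files without metadata merely marks the unspecified case.

```python
from typing import Optional


def normalization_gain_db(measured_loudness_db: Optional[float],
                          target_loudness_db: float) -> Optional[float]:
    """Gain in dB that moves a file from its measured loudness to the target.

    Returns None when the file carries no loudness metadata, since the
    behavior in that case is left unspecified.
    """
    if measured_loudness_db is None:
        return None
    return target_loudness_db - measured_loudness_db


# Assumed, illustrative loudness values and target level.
target_db = -16.0
print(normalization_gain_db(-9.0, target_db))   # heavily compressed music
print(normalization_gain_db(-27.0, target_db))  # wide dynamic range film
print(normalization_gain_db(None, target_db))   # legacy file, no metadata
```

With these assumed values, the loud music file is attenuated by 7 dB and the film file is boosted by 11 dB, while the file without metadata receives no defined gain.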
Therefore, it is an object of the invention to provide a unified approach to the problem of normalizing the playback loudness of two kinds of content: film/video style content, with potentially wide dynamic range and possibly embedded loudness metadata; and music or radio/podcast content, with potentially extremely narrow dynamic range and strong compression, limiting, and clipping, which may but likely does not contain embedded loudness metadata, given the vast amount of prior music content already held or exchanged by consumers.
It is another object of this invention to allow the dynamic range of content containing dynamic range control metadata to be adjusted to the consumer's listening environment or taste.
A further object of this invention is to prevent potential clipping in lossy data-compression audio decoders, such as an AAC, MP3, or Dolby Digital decoder, caused by the changes in signal components introduced by the data compression process.
A further object of this invention is to provide a mild incentive for the music recording industry to abandon pursuit of ever-stronger dynamic range compression, limiting, and clipping in their content.
Still another object of this invention is to limit the additional workload on the device CPU or DSP caused by loudness processing or clipping prevention.