The present invention relates to audio signal encoding, processing and decoding, and, in particular, to a decoder, an encoder and method for informed loudness estimation in object-based audio coding systems.
Recently, parametric techniques for bitrate-efficient transmission/storage of audio scenes comprising multiple audio object signals have been proposed in the field of audio coding [BCC, JSC, SAOC, SAOC1, SAOC2] and informed source separation [ISS1, ISS2, ISS3, ISS4, ISS5, ISS6]. These techniques aim at reconstructing a desired output audio scene or audio source object based on additional side information describing the transmitted/stored audio scene and/or source objects in the audio scene. This reconstruction takes place in the decoder using an informed source separation scheme. The reconstructed objects may be combined to produce the output audio scene. Depending on the way the objects are combined, the perceptual loudness of the output scene may vary.
In TV and radio broadcast, the volume levels of the audio tracks of various programs may be normalized based on various aspects, such as the peak signal level or the loudness level. Depending on the dynamic properties of the signals, two signals with the same peak level may have widely differing levels of perceived loudness. When switching between programs or channels, the differences in signal loudness are very annoying and have been a major source of end-user complaints in broadcast.
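As an illustration of why peak normalization does not equalize loudness, the following minimal sketch compares two synthetic signals that share the same peak level. The signals, the sampling rate and the helper names `peak_db` and `rms_db` are assumptions made only for this example, and the mean-square level is used as a crude stand-in for perceptual loudness, not as a loudness model.

```python
import math

def peak_db(x):
    """Peak level in dB relative to full scale (1.0)."""
    return 20.0 * math.log10(max(abs(v) for v in x))

def rms_db(x):
    """Mean-square level in dB; a rough proxy, not a perceptual loudness measure."""
    mean_square = sum(v * v for v in x) / len(x)
    return 10.0 * math.log10(mean_square)

FS = 48000          # assumed sampling rate in Hz
N = FS              # one second of audio

# A steady 1 kHz tone with peak amplitude 1.0.
sine = [math.sin(2.0 * math.pi * 1000.0 * i / FS) for i in range(N)]

# A sparse click train: ten unit impulses per second, same peak amplitude 1.0.
clicks = [1.0 if i % 4800 == 0 else 0.0 for i in range(N)]

for name, sig in (("sine", sine), ("clicks", clicks)):
    print(f"{name}: peak {peak_db(sig):.2f} dB, mean-square level {rms_db(sig):.2f} dB")
```

Both signals peak at full scale, yet their mean-square levels differ by roughly 34 dB, so normalizing them to a common peak would leave a large level difference between them.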
In the known technology, it has been proposed to normalize all the programs on all channels similarly to a common reference level using a measure based on perceptual signal loudness. One such recommendation in Europe is the EBU Recommendation R128 [EBU] (later referred to as R128).
The recommendation states that the “program loudness”, e.g., the average loudness over one program (or one commercial, or some other meaningful program entity), should equal a specified level (with small allowed deviations). As more and more broadcasters comply with this recommendation and the required normalization, the differences in the average loudness between programs and channels should be minimized.
Loudness estimation can be performed in several ways. There exist several mathematical models for estimating the perceptual loudness of an audio signal. The EBU recommendation R128 relies on the model presented in ITU-R BS.1770 (later referred to as BS.1770) (see [ITU]) for the loudness estimation.
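The core of the BS.1770 measure is a weighted sum of per-channel mean-square energies expressed on a logarithmic scale. The sketch below shows only that summation step under simplifying assumptions: the input channels are assumed to be K-weighted already, and gating is omitted, so the returned values are not compliant loudness readings. The function name `ungated_loudness` and the dictionary layout are hypothetical choices for this example.

```python
import math

# Channel weights for a 5.0 layout as specified in BS.1770 (the LFE channel is excluded).
WEIGHTS = {"L": 1.0, "R": 1.0, "C": 1.0, "Ls": 1.41, "Rs": 1.41}

def mean_square(x):
    """Mean-square energy of one channel."""
    return sum(v * v for v in x) / len(x)

def ungated_loudness(channels):
    """Weighted log-energy sum over channels.

    `channels` maps a channel name to its samples. The samples are ASSUMED
    to be K-weighted already; the pre-filter and the gating stages of
    BS.1770 are deliberately left out of this sketch.
    """
    total = sum(WEIGHTS[name] * mean_square(sig) for name, sig in channels.items())
    return -0.691 + 10.0 * math.log10(total)

FS = 48000  # assumed sampling rate in Hz
tone = [math.sin(2.0 * math.pi * 1000.0 * i / FS) for i in range(FS)]

mono = ungated_loudness({"L": tone})
stereo = ungated_loudness({"L": tone, "R": tone})
print(f"tone in L only: {mono:.2f}")
print(f"tone in L and R: {stereo:.2f}")
```

Because the channel energies add before the logarithm is taken, feeding the same tone to a second front channel raises the result by about 3 dB.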
As stated before, e.g., according to the EBU Recommendation R128, the program loudness, e.g., the average loudness over one program, should equal a specified level with small allowed deviations. However, this leads to significant problems when audio rendering is conducted, and these problems have remained unsolved in the known technology. Conducting audio rendering on the decoder side has a significant effect on the overall/total loudness of the received audio input signal. Nevertheless, even though scene rendering is conducted, the total loudness of the received audio signal shall remain the same.
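The effect of rendering on the output level can be illustrated with a toy example. In the sketch below, two synthetic object signals are mixed once with unity gains and once with user-chosen rendering gains, and the resulting level change is computed together with the single gain that would undo it. The signals, the gain values and the helper names are assumptions for illustration only; the mean-square level again stands in for a real loudness model, and the sketch has direct access to the object signals, which a decoder working from a downmix would not have.

```python
import math

def level_db(x):
    """Mean-square level in dB; a rough proxy for loudness."""
    return 10.0 * math.log10(sum(v * v for v in x) / len(x))

def render(objects, gains):
    """Mix object signals into one output channel using per-object gains."""
    return [sum(g * obj[i] for g, obj in zip(gains, objects))
            for i in range(len(objects[0]))]

FS = 48000  # assumed sampling rate in Hz
obj_a = [math.sin(2.0 * math.pi * 440.0 * i / FS) for i in range(FS)]
obj_b = [math.sin(2.0 * math.pi * 1000.0 * i / FS) for i in range(FS)]

default_mix = render([obj_a, obj_b], [1.0, 1.0])   # reference scene
user_mix = render([obj_a, obj_b], [2.0, 0.5])      # hypothetical user rendering

# Level change caused purely by the rendering gains.
delta_db = level_db(user_mix) - level_db(default_mix)

# A single output gain that would restore the reference level.
compensation = 10.0 ** (-delta_db / 20.0)
compensated_mix = [compensation * v for v in user_mix]

print(f"level change due to rendering: {delta_db:.2f} dB")
print(f"compensation gain: {compensation:.3f}")
```

Here the changed rendering gains raise the output level by roughly 3.3 dB even though the program content is unchanged, which is the kind of deviation a loudness-normalized system would have to detect and compensate.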
Currently, no specific decoder-side solution exists for this problem.
EP 2 146 522 A1 ([EP]) relates to concepts for generating audio output signals using object-based metadata. At least one audio output signal is generated representing a superposition of at least two different audio object signals. However, the document does not provide a solution for this problem.
WO 2008/035275 A2 ([BRE]) describes an audio system comprising an encoder which encodes audio objects in an encoding unit that generates a down-mix audio signal and parametric data representing the plurality of audio objects. The down-mix audio signal and the parametric data are transmitted to a decoder which comprises a decoding unit which generates approximate replicas of the audio objects and a rendering unit which generates an output signal from the audio objects. The decoder furthermore contains a processor for generating encoding modification data which is sent to the encoder. The encoder then modifies the encoding of the audio objects, and in particular modifies the parametric data, in response to the encoding modification data. The approach allows manipulation of the audio objects to be controlled by the decoder but performed fully or partly by the encoder. Thus, the manipulation may be performed on the actual independent audio objects rather than on approximate replicas, thereby providing improved performance.
EP 2 146 522 A1 ([SCH]) discloses an apparatus for generating at least one audio output signal representing a superposition of at least two different audio objects. The apparatus comprises a processor for processing an audio input signal to provide an object representation of the audio input signal, where this object representation can be generated by a parametrically guided approximation of original objects using an object downmix signal. An object manipulator individually manipulates objects using audio object based metadata referring to the individual audio objects to obtain manipulated audio objects. The manipulated audio objects are mixed using an object mixer to finally obtain an audio output signal having one or several channel signals depending on a specific rendering setup.
WO 2008/046531 A1 ([ENG]) describes an audio object coder for generating an encoded object signal using a plurality of audio objects. The audio object coder includes a downmix information generator for generating downmix information indicating a distribution of the plurality of audio objects into at least two downmix channels, an audio object parameter generator for generating object parameters for the audio objects, and an output interface for generating the imported audio output signal using the downmix information and the object parameters. An audio synthesizer uses the downmix information for generating output data usable for creating a plurality of output channels of the predefined audio output configuration.
It would be desirable to have an accurate estimate of the output average loudness, or of the change in the average loudness, without a delay. Moreover, when the program does not change or the rendering scene is not changed, the average loudness estimate should also remain static.