The present invention relates to audio signal encoding, processing and decoding, and, in particular, to a decoder, an encoder and method for informed loudness estimation in object-based audio coding systems.
Recently, parametric techniques for bitrate-efficient transmission/storage of audio scenes comprising multiple audio object signals have been proposed in the field of audio coding (see C. Faller and F. Baumgarte, “Binaural Cue Coding—Part II: Schemes and applications,” IEEE Trans. on Speech and Audio Proc., vol. 11, no. 6, November 2003; C. Faller, “Parametric Joint-Coding of Audio Sources,” 120th AES Convention, Paris, 2006; ISO/IEC, “MPEG audio technologies—Part 2: Spatial Audio Object Coding (SAOC),” ISO/IEC JTC1/SC29/WG11 (MPEG) International Standard 23003-2; J. Herre, S. Disch, J. Hilpert, O. Hellmuth: “From SAC To SAOC—Recent Developments in Parametric Coding of Spatial Audio,” 22nd Regional UK AES Conference, Cambridge, UK, April 2007; and J. Engdegård, B. Resch, C. Falch, O. Hellmuth, J. Hilpert, A. Hölzer, L. Terentiev, J. Breebaart, J. Koppens, E. Schuijers and W. Oomen: “Spatial Audio Object Coding (SAOC)—The Upcoming MPEG Standard on Parametric Object Based Audio Coding,” 124th AES Convention, Amsterdam 2008) and informed source separation (M. Parvaix and L. Girin: “Informed Source Separation of underdetermined instantaneous Stereo Mixtures using Source Index Embedding,” IEEE ICASSP, 2010; M. Parvaix, L. Girin, J.-M. Brossier: “A watermarking-based method for informed source separation of audio signals with a single sensor,” IEEE Transactions on Audio, Speech and Language Processing, 2010; A. Liutkus and J. Pinel and R. Badeau and L. Girin and G. Richard: “Informed source separation through spectrogram coding and data embedding,” Signal Processing Journal, 2011; A. Ozerov, A. Liutkus, R. Badeau, G. Richard: “Informed source separation: source coding meets source separation,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2011; S. Zhang and L. Girin: “An Informed Source Separation System for Speech Signals,” INTERSPEECH, 2011; and L. Girin and J. Pinel: “Informed Audio Source Separation from Compressed Linear Stereo Mixtures,” AES 42nd International Conference: Semantic Audio, 2011). These techniques aim at reconstructing a desired output audio scene or audio source object based on additional side information describing the transmitted/stored audio scene and/or source objects in the audio scene. This reconstruction takes place in the decoder using an informed source separation scheme. The reconstructed objects may be combined to produce the output audio scene. Depending on the way the objects are combined, the perceptual loudness of the output scene may vary.
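The dependence of the output level on the way the objects are combined can be illustrated with a minimal sketch, which is not the method of any of the cited systems. It assumes that the objects are mutually uncorrelated, so that the power of the mix is the gain-weighted sum of the object powers, and it uses plain signal power as a crude stand-in for perceptual loudness. The function name `mix_level_change_db` and the example values are hypothetical.

```python
import math

def mix_level_change_db(object_powers, gains_ref, gains_new):
    """Level change (dB) of a mix when the rendering gains change.

    Assumption: objects are mutually uncorrelated, so the mix power is
    sum(g_i^2 * p_i). Plain power is only a rough proxy for loudness.
    """
    p_ref = sum(g * g * p for g, p in zip(gains_ref, object_powers))
    p_new = sum(g * g * p for g, p in zip(gains_new, object_powers))
    return 10.0 * math.log10(p_new / p_ref)

# Hypothetical scene: two objects of equal power.
powers = [1.0, 1.0]
# Halving the amplitude of the second object lowers the level of the mix.
change = mix_level_change_db(powers, [1.0, 1.0], [1.0, 0.5])
print(f"{change:.2f} dB")
```

With these example values the mix power drops from 2.0 to 1.25, a change of about -2.04 dB, even though only one of the two objects was attenuated.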
In TV and radio broadcast, the volume levels of the audio tracks of various programs may be normalized based on various aspects, such as the peak signal level or the loudness level. Depending on the dynamic properties of the signals, two signals with the same peak level may have widely differing levels of perceived loudness. When switching between programs or channels, the differences in signal loudness are very annoying and have been found to be a major source of end-user complaints in broadcast.
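The gap between peak level and loudness can be made concrete with a small sketch. It compares two synthetic signals that share the same peak value but differ strongly in RMS level. The RMS level is used here only as a simple proxy for perceived loudness, and the 48 kHz sampling rate, the test signals and the helper names `peak` and `rms_db` are assumptions made for illustration.

```python
import math

def peak(x):
    """Largest absolute sample value."""
    return max(abs(v) for v in x)

def rms_db(x):
    """RMS level in dB relative to full scale (a crude loudness proxy)."""
    mean_square = sum(v * v for v in x) / len(x)
    return 10.0 * math.log10(mean_square)

n = 48000  # one second at an assumed 48 kHz sampling rate
# Full-scale 1 kHz sine: energy is spread evenly over time.
sine = [math.sin(2.0 * math.pi * 1000.0 * i / n) for i in range(n)]
# Sparse click train with the same peak: one full-scale sample in every 100.
clicks = [1.0 if i % 100 == 0 else 0.0 for i in range(n)]

print(f"peak: sine={peak(sine):.2f}, clicks={peak(clicks):.2f}")
print(f"RMS:  sine={rms_db(sine):.1f} dB, clicks={rms_db(clicks):.1f} dB")
```

Both signals peak at full scale, yet the sine sits at roughly -3 dB RMS and the click train at -20 dB RMS, so normalizing them to a common peak level would leave a large difference in level.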
In conventional technology, it has been proposed to normalize all the programs on all channels similarly to a common reference level using a measure based on perceptual signal loudness. One such recommendation in Europe is the EBU Recommendation R128 (EBU Recommendation R 128 “Loudness normalization and permitted maximum level of audio signals,” Geneva, 2011—later referred to as “R128”).
The recommendation says that the “program loudness”, i.e., the average loudness over one program (or one commercial, or some other meaningful program entity), should equal a specified level (with small allowed deviations). As more and more broadcasters comply with this recommendation and the normalization it requires, the differences in the average loudness between programs and channels should be minimized.
Loudness estimation can be performed in several ways, and several mathematical models exist for estimating the perceptual loudness of an audio signal. For the loudness estimation, the EBU Recommendation R128 relies on the model presented in ITU-R BS.1770 (see International Telecommunication Union: “Recommendation ITU-R BS.1770-3—Algorithms to measure audio programme loudness and true-peak audio level,” Geneva, 2012; later referred to as “BS.1770”).
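As a rough illustration of the kind of model BS.1770 describes, the sketch below applies the two-stage K-weighting filter to each channel, takes the mean square, and combines the channels as -0.691 + 10·log10(Σ G_i·z_i). It is a simplified sketch under stated assumptions rather than a conforming meter: the gating stage of the recommendation is omitted, the filter coefficients are the ones tabulated for a 48 kHz sampling rate, and the function names `biquad` and `ungated_loudness` are hypothetical.

```python
import math

# K-weighting biquad coefficients as tabulated in BS.1770 for 48 kHz sampling.
PRE_B = (1.53512485958697, -2.69169618940638, 1.19839281085285)
PRE_A = (1.0, -1.69065929318241, 0.73248077421585)
RLB_B = (1.0, -2.0, 1.0)
RLB_A = (1.0, -1.99004745483398, 0.99007225036621)

def biquad(x, b, a):
    """Direct-form I second-order IIR filter (a[0] is assumed to be 1)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in x:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        y.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return y

def ungated_loudness(channels, weights):
    """Ungated loudness in LKFS; the gating stage of BS.1770 is omitted."""
    total = 0.0
    for x, g in zip(channels, weights):
        k = biquad(biquad(x, PRE_B, PRE_A), RLB_B, RLB_A)
        z = sum(v * v for v in k) / len(k)  # mean square of the weighted channel
        total += g * z
    return -0.691 + 10.0 * math.log10(total)

fs = 48000
# One second of a full-scale 997 Hz sine in a single front channel (weight 1.0).
sine = [math.sin(2.0 * math.pi * 997.0 * n / fs) for n in range(fs)]
loudness = ungated_loudness([sine], [1.0])
print(f"{loudness:.2f} LKFS")
```

For this test tone the result should land close to -3.01 LKFS, the calibration point BS.1770 gives for a 0 dB FS, 997 Hz sine applied to a single front channel.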
As stated before, according to the EBU Recommendation R128, the program loudness, i.e., the average loudness over one program, should equal a specified level with small allowed deviations. However, this leads to significant problems when audio rendering is conducted, and these problems have remained unsolved in conventional technology. Conducting audio rendering on the decoder side has a significant effect on the overall/total loudness of the received audio input signal. Nevertheless, even though scene rendering is conducted, the total loudness of the received audio signal should remain the same.
Currently, no specific decoder-side solution exists for this problem.
EP Patent No. 2 146 522 A1 relates to concepts for generating audio output signals using object-based metadata. At least one audio output signal is generated representing a superposition of at least two different audio object signals, but the document does not provide a solution for this problem.
PCT Publication No. WO 2008/035275 A2 describes an audio system comprising an encoder which encodes audio objects in an encoding unit that generates a down-mix audio signal and parametric data representing the plurality of audio objects. The down-mix audio signal and parametric data are transmitted to a decoder which comprises a decoding unit which generates approximate replicas of the audio objects and a rendering unit which generates an output signal from the audio objects. The decoder furthermore contains a processor for generating encoding modification data which is sent to the encoder. The encoder then modifies the encoding of the audio objects, and in particular modifies the parametric data, in response to the encoding modification data. The approach allows manipulation of the audio objects to be controlled by the decoder but performed fully or partly by the encoder. Thus, the manipulation may be performed on the actual independent audio objects rather than on approximate replicas, thereby providing improved performance.
EP Patent No. 2 146 522 A1 discloses an apparatus for generating at least one audio output signal representing a superposition of at least two different audio objects. The apparatus comprises a processor for processing an audio input signal to provide an object representation of the audio input signal, where this object representation can be generated by a parametrically guided approximation of original objects using an object downmix signal. An object manipulator individually manipulates objects using audio object based metadata referring to the individual audio objects to obtain manipulated audio objects. The manipulated audio objects are mixed using an object mixer for finally obtaining an audio output signal having one or several channel signals depending on a specific rendering setup.
PCT Publication No. WO 2008/046531 A1 describes an audio object coder for generating an encoded object signal using a plurality of audio objects. The coder includes a downmix information generator for generating downmix information indicating a distribution of the plurality of audio objects into at least two downmix channels, an audio object parameter generator for generating object parameters for the audio objects, and an output interface for generating the imported audio output signal using the downmix information and the object parameters. An audio synthesizer uses the downmix information for generating output data usable for creating a plurality of output channels of the predefined audio output configuration.
It would be desirable to have an accurate estimate of the output average loudness, or of the change in the average loudness, without a delay. Moreover, when the program does not change and the rendering scene is not changed, the average loudness estimate should remain static.