The present invention relates to audio signal processing and, more specifically, to an apparatus and a method for realizing an enhanced downmix, in particular enhanced guided downmix capabilities for 3D audio.
An increasing number of loudspeakers is used for a spatial reproduction of sound. While legacy surround sound reproduction (e.g. 5.1) was limited to a single plane, new channel formats with elevated speakers have been introduced in the context of 3D audio reproduction.
The signals to be reproduced over the loudspeakers used to be directly related to the particular speakers and were stored and transmitted discretely or parametrically. Such formats are tied to a clearly defined number and position of loudspeakers in the sound reproduction system. Accordingly, a particular reproduction format has to be considered before an audio signal is transmitted or stored.
Nevertheless, there are already some exceptions to this principle. For example, multi-channel audio signals (e.g., five surround audio channels or 5.1 surround audio channels) have to be downmixed for reproduction over two-channel stereo loudspeaker setups. Rules exist for reproducing five surround channels on the two loudspeakers of a stereo system.
Moreover, when stereo channels were introduced, a rule existed for reproducing the audio content of the two stereo channels on a single mono loudspeaker.
Since the number of formats, and thus the number of possible loudspeaker positions, has increased, it will be nearly impossible to consider the loudspeaker setup of the reproduction system before transmission or storage. Accordingly, the incoming audio signals will have to be adapted to the actual loudspeaker setup.
Different methods can be used for downmixing from surround sound to two-channel stereo. The still widely used time-domain downmix with static downmix coefficients is often referred to as the ITU downmix [5]. Other time-domain downmixing approaches, some of which adjust the downmix coefficients dynamically, are employed in the encoders of matrix surround techniques [6], [7].
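A static time-domain downmix of this kind can be sketched as follows. This is only an illustration under stated assumptions: the text above does not list the coefficients, so the gain of 1/sqrt(2) (about -3 dB) for the center and surround channels is taken from the values commonly cited for the ITU downmix, and the function and argument names are hypothetical.

```python
import math

# Assumption: 1/sqrt(2) (about -3 dB) for center and surround channels,
# as commonly cited for the ITU downmix; not stated in the text above.
ITU_GAIN = 1.0 / math.sqrt(2.0)


def itu_downmix(left, right, center, left_surround, right_surround):
    """Fold five time-domain channels (equal-length sample lists) into stereo.

    The coefficients are static: they do not depend on the signal content.
    """
    out_left = [
        l + ITU_GAIN * c + ITU_GAIN * ls
        for l, c, ls in zip(left, center, left_surround)
    ]
    out_right = [
        r + ITU_GAIN * c + ITU_GAIN * rs
        for r, c, rs in zip(right, center, right_surround)
    ]
    return out_left, out_right
```

Because the gains are fixed, a rear channel is attenuated by the same amount whether it carries ambience or a direct sound source.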
In [3], it is disclosed that direct sound sources mixed to the rear channels might, when folded down into the two-channel stereo panorama, become indistinguishable due to masking, or might themselves mask other sound sources.
In the course of the development of spatial audio coding (SAC) technologies, frequency-selective downmix algorithms were introduced as part of the encoder [8], [9]. In particular, sound colorations can be reduced, and the level balance and the stability of sound source localization can be maintained, by applying energy equalization to the resulting audio channels. Energy equalization is also performed in other downmixing systems [9], [10], [12].
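The idea of energy equalization can be illustrated with a minimal sketch for a single frequency band. It assumes one common formulation, in which the downmix band is scaled so that its energy equals the summed energy of the contributing input channels. The cited systems may define the target energy differently, and all names here are hypothetical.

```python
import math


def equalize_band(input_bands, downmix_band, eps=1e-12):
    """Scale one downmix band so its energy matches the summed input energy.

    input_bands:  list of channels, each a list of (complex) spectral
                  coefficients belonging to this band
    downmix_band: list of (complex) coefficients of the plain sum in this band
    eps:          small constant that avoids division by zero
    """
    target = sum(abs(x) ** 2 for channel in input_bands for x in channel)
    actual = sum(abs(y) ** 2 for y in downmix_band)
    gain = math.sqrt(target / (actual + eps))
    return [gain * y for y in downmix_band]


# Two partially out-of-phase channels lose energy when simply added.
channels = [[1.0 + 0j], [-0.5 + 0j]]
plain_sum = [a + b for a, b in zip(*channels)]
equalized = equalize_band(channels, plain_sum)
```

In this sketch the gain grows without bound when the inputs cancel almost completely, so a practical system would presumably limit it.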
For the case that the rear channels contain only ambient sound such as reverberance, the ITU downmix [5] reduces ambience (reverberance, spaciousness) by attenuating the rear channels of the multi-channel signal. If the rear channels also contain direct sound, this attenuation is not appropriate, since the direct parts of the rear channels would be attenuated in the downmix as well. Therefore, a more sophisticated ambience attenuation algorithm would be desirable.
Audio codecs such as AC-3 and HE-AAC provide means to transmit so-called metadata alongside the audio stream, including downmix coefficients for the downmix from five to two audio channels (stereo). The contribution of selected audio channels (center, rear channels) to the resulting stereo signal is controlled by transmitted gain values. Although these coefficients can be time-variant, they usually remain constant for the duration of one program item.
The solution used in the “Logic7” matrix system introduced a signal-adaptive approach which attenuates the rear channels only if they are considered to be fully ambient. This is achieved by comparing the power of the front channels to the power of the rear channels. The assumption of this approach is that if the rear channels solely contain ambience, they have significantly less power than the front channels. The more power the front channels have compared to the rear channels, the more the rear channels are attenuated in the downmixing process. This assumption may be true for some surround productions, especially those with classical content, but it does not hold for various other signals.
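The power comparison described above can be sketched as follows. This is not the actual “Logic7” algorithm, whose details are not given here. The linear mapping from the front-to-rear power ratio (in dB) to a rear gain, and the parameters max_ratio_db and min_gain, are hypothetical choices made only to illustrate the principle.

```python
import math


def mean_power(samples):
    """Mean power of a non-empty block of time-domain samples."""
    return sum(s * s for s in samples) / len(samples)


def rear_gain(front, rear, max_ratio_db=12.0, min_gain=0.5, eps=1e-12):
    """Return a gain for the rear channels based on a front/rear power comparison.

    The more the front power exceeds the rear power, the lower the gain.
    max_ratio_db and min_gain are hypothetical tuning parameters.
    """
    ratio_db = 10.0 * math.log10(
        (mean_power(front) + eps) / (mean_power(rear) + eps)
    )
    # Map 0 dB (or less) to no attenuation and max_ratio_db (or more)
    # to the strongest attenuation, with linear interpolation in between.
    t = min(max(ratio_db / max_ratio_db, 0.0), 1.0)
    return 1.0 - t * (1.0 - min_gain)
```

A quiet direct source in the rear channels would be attenuated by such a rule just like ambience, which is the case in which the underlying assumption fails.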
It would therefore be highly appreciated if improved concepts for audio signal processing were provided.