An audio scene comprises a multi dimensional environment in which different sounds occur at various times and positions. An example of an audio scene may be a crowded room, a restaurant, a forest scene, a busy street or any indoor or outdoor environment where sound occurs at different positions and times.
Audio scenes can be recorded as audio data, using directional microphone arrays or other like means. FIG. 1 provides an example of a recording arrangement for an audio scene, wherein the audio space consists of N devices that are arbitrarily positioned within the audio space to record the audio scene. The captured signals are then transmitted (or alternatively stored for later consumption) to the rendering side where the end user can select the listening point based on his/her preference from the reconstructed audio space. The rendering part then provides a downmixed signal from the multiple recordings that correspond to the selected listening point. In FIG. 1, the microphones of the devices are shown to have a directional beam, but the concept is not restricted to this and embodiments of the invention may use microphones having any form of suitable beam. Furthermore, the microphones do not necessarily employ a similar beam, but microphones with different beams may be used. The downmixed signal may be a mono, stereo, binaural signal or it may consist of multiple channels.
Audio zooming refers to a concept, where an end-user has the possibility to select a listening position within an audio scene and listen to the audio related to the selected position instead of listening to the whole audio scene. However, throughout a typical audio scene the audio signals from the plurality of audio sources are more or less mixed up with each other, possibly resulting in noise-like sound effect, while on the other hand there are typically only a few listening positions in an audio scene, wherein a meaningful listening experience with distinctive audio sources can be achieved. Unfortunately, so far there has been no technical solution for identifying these listening positions, and therefore the end-user has to find a listening position providing a meaningful listening experience on trial-and-error basis, thus possibly giving a compromised user experience.