The present invention relates to audio signal processing, and, in particular, to a system, an apparatus and a method for consistent acoustic scene reproduction based on informed spatial filtering.
In spatial sound reproduction, the sound at the recording location (near-end side) is captured with multiple microphones and then reproduced at the reproduction side (far-end side) using multiple loudspeakers or headphones. In many applications, it is desired to reproduce the recorded sound such that the spatial image recreated at the far-end side is consistent with the original spatial image at the near-end side. This means, for instance, that the sound of the sound sources is reproduced from the directions where the sources were present in the original recording scenario. Alternatively, when for instance a video is complementing the recorded audio, it is desirable that the sound is reproduced such that the recreated acoustical image is consistent with the video image. This means, for instance, that the sound of a sound source is reproduced from the direction where the source is visible in the video. Additionally, the video camera may be equipped with a visual zoom function, or the user at the far-end side may apply a digital zoom to the video, which would change the visual image. In this case, the acoustical image of the reproduced spatial sound should change accordingly. In many cases, the spatial image with which the reproduced sound should be consistent is determined either at the far-end side or during playback, for instance when a video image is involved. Consequently, the spatial sound at the near-end side has to be recorded, processed, and transmitted such that the recreated acoustical image can still be controlled at the far-end side.
The possibility to reproduce a recorded acoustical scene consistently with a desired spatial image is required in many modern applications. For instance, modern consumer devices such as digital cameras or mobile phones are often equipped with a video camera and multiple microphones. This makes it possible to record videos together with spatial sound, e.g., stereo sound. When reproducing the recorded audio together with the video, it is desired that the visual and acoustical images are consistent. When the user zooms in with the camera, it is desirable to recreate the visual zooming effect acoustically so that the visual and acoustical images are aligned when watching the video. For instance, when the user zooms in on a person, the voice of this person should become less reverberant as the person appears to be closer to the camera. Moreover, the voice of the person should be reproduced from the same direction where the person appears in the visual image. Mimicking the visual zoom of a camera acoustically is referred to as acoustical zoom in the following and represents one example of a consistent audio-video reproduction. The consistent audio-video reproduction, which may involve an acoustical zoom, is also useful in teleconferencing, where the spatial sound at the near-end side is reproduced at the far-end side together with a visual image. Here, too, it is desirable to recreate the visual zooming effect acoustically so that the visual and acoustical images are aligned.
The first implementation of an acoustical zoom was presented in [1], where the zooming effect was obtained by increasing the directivity of a second-order directional microphone, whose signal was generated based on the signals of a linear microphone array. This approach was extended in [2] to a stereo zoom. A more recent approach for a mono or stereo zoom was presented in [3], which changes the sound source levels such that the source from the frontal direction is preserved, whereas the sources coming from other directions and the diffuse sound are attenuated. The approaches proposed in [1,2] result in an increase of the direct-to-reverberation ratio (DRR), and the approach in [3] additionally allows for the suppression of undesired sources. The aforementioned approaches assume that the sound source is located in front of a camera, and do not aim to capture an acoustical image that is consistent with the video image.
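To illustrate why attenuating the diffuse sound raises the DRR, the following minimal Python sketch computes the DRR in decibels from a direct-sound power and a diffuse-sound power. The function names `drr_db` and `attenuate_diffuse`, the example powers, and the amplitude gain of 0.5 are assumptions chosen for illustration only; they are not taken from [1], [2] or [3].

```python
import math


def drr_db(direct_power: float, diffuse_power: float) -> float:
    """Return the direct-to-reverberation ratio in dB for two positive powers."""
    if direct_power <= 0 or diffuse_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(direct_power / diffuse_power)


def attenuate_diffuse(diffuse_power: float, amplitude_gain: float) -> float:
    """Scale a power by the square of an amplitude gain (hypothetical helper)."""
    return diffuse_power * amplitude_gain ** 2


# Assumed example: equal direct and diffuse power before the zoom.
drr_before = drr_db(1.0, 1.0)
# Halving the diffuse amplitude while leaving the direct sound untouched.
drr_after = drr_db(1.0, attenuate_diffuse(1.0, 0.5))
```

Under these assumed values, halving the amplitude of the diffuse sound quarters its power, so the DRR rises by about 6 dB while the direct sound is left unchanged.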
A well-known approach for a flexible spatial sound recording and reproduction is represented by directional audio coding (DirAC) [4]. In DirAC, the spatial sound at the near-end side is described in terms of an audio signal and parametric side information, namely the direction-of-arrival (DOA) and diffuseness of the sound. The parametric description enables the reproduction of the original spatial image with arbitrary loudspeaker setups. This means that the recreated spatial image at the far-end side is consistent with the spatial image during recording at the near-end side. However, if for instance a video is complementing the recorded audio, then the reproduced spatial sound is not necessarily aligned with the video image. Moreover, the recreated acoustical image cannot be adjusted when the visual image changes, e.g., when the look direction and zoom of the camera are changed. This means that DirAC provides no possibility to adjust the recreated acoustical image to an arbitrary desired spatial image.
In [5], an acoustical zoom was realized based on DirAC. DirAC represents a reasonable basis for realizing an acoustical zoom, as it is based on a simple yet powerful signal model assuming that the sound field in the time-frequency domain is composed of a single plane wave plus diffuse sound. The underlying model parameters, e.g., the DOA and diffuseness, are exploited to separate the direct sound and diffuse sound and to create the acoustical zoom effect. The parametric description of the spatial sound enables an efficient transmission of the sound scene to the far-end side while still giving the user full control over the zoom effect and spatial sound reproduction. Even though DirAC employs multiple microphones to estimate the model parameters, only single-channel filters are applied to extract the direct sound and diffuse sound, which limits the quality of the reproduced sound. Moreover, all sources in the sound scene are assumed to be positioned on a circle, and the spatial sound reproduction is performed with reference to a changing position of an audio-visual camera, which is inconsistent with the visual zoom. In fact, zooming changes the view angle of the camera while the distance to the visual objects and their relative positions in the image remain unchanged, which is in contrast to moving a camera.
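The single-channel extraction mentioned above can be sketched for one time-frequency bin as follows. This is a minimal Python sketch under stated assumptions, not the method of [5]: it assumes the square-root gains commonly used in DirAC-style synthesis, where the direct part is weighted by the square root of one minus the diffuseness and the diffuse part by the square root of the diffuseness. The function name `split_direct_diffuse` and the example values are hypothetical.

```python
import math
from typing import Tuple


def split_direct_diffuse(x: complex, diffuseness: float) -> Tuple[complex, complex]:
    """Split one time-frequency coefficient into direct and diffuse parts.

    Assumed single-channel gains (not confirmed by the text above):
    sqrt(1 - diffuseness) for the direct sound, sqrt(diffuseness) for the
    diffuse sound. Both parts are scaled copies of the same input signal.
    """
    if not 0.0 <= diffuseness <= 1.0:
        raise ValueError("diffuseness must lie in [0, 1]")
    direct = math.sqrt(1.0 - diffuseness) * x
    diffuse = math.sqrt(diffuseness) * x
    return direct, diffuse


# Assumed example bin: coefficient 2+0j with a diffuseness of 0.25.
x_bin = 2 + 0j
direct_part, diffuse_part = split_direct_diffuse(x_bin, 0.25)
```

With these gains the powers of the two parts add up to the power of the input coefficient. Because both outputs are merely scaled copies of one microphone signal, the direct part still contains diffuse sound and vice versa, which is one way to see why single-channel filtering limits the achievable quality.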
A related approach is the so-called virtual microphone (VM) technique [6,7], which considers the same signal model as DirAC but allows the signal of a non-existing (virtual) microphone to be synthesized at an arbitrary position in the sound scene. Moving the VM towards a sound source is analogous to the movement of the camera to a new position. The VM was realized using multi-channel filters to improve the sound quality, but requires several distributed microphone arrays to estimate the model parameters.
However, it would be highly appreciated if further improved concepts for audio signal processing were provided.