Embodiments of the present invention create a spatial audio processor for providing spatial parameters based on an acoustic input signal. Further embodiments of the present invention create a method for providing spatial parameters based on an acoustic input signal. Embodiments of the present invention may relate to the field of acoustic analysis, parametric description, and reproduction of spatial sound, for example based on microphone recordings.
Spatial sound recording aims at capturing a sound field with multiple microphones such that, at the reproduction side, a listener perceives the sound image as it was present at the recording location. Standard approaches for spatial sound recording use simple stereo microphones or more sophisticated combinations of directional microphones, such as the B-format microphones used in Ambisonics. Commonly, these methods are referred to as coincident-microphone techniques.
Alternatively, methods based on a parametric representation of sound fields can be applied, which are referred to as parametric spatial audio processors. Recently, several techniques for the analysis, parametric description, and reproduction of spatial audio have been proposed. Each system has unique advantages and disadvantages with respect to the type of the parametric description, the type of the needed input signals, the dependence on or independence from a specific loudspeaker setup, etc.
An example of an efficient parametric description of spatial sound is given by Directional Audio Coding (DirAC) (V. Pulkki: Spatial Sound Reproduction with Directional Audio Coding, Journal of the AES, Vol. 55, No. 6, 2007). DirAC represents an approach to the acoustic analysis and parametric description of spatial sound (DirAC analysis), as well as to its reproduction (DirAC synthesis). The DirAC analysis takes multiple microphone signals as input. The description of spatial sound is provided for a number of frequency subbands in terms of one or several downmix audio signals and parametric side information containing direction of the sound and diffuseness. The latter parameter describes how diffuse the recorded sound field is. Moreover, diffuseness can be used as a reliability measure for the direction estimate. Another application consists of direction-dependent processing of the spatial audio signal (M. Kallinger et al.: A Spatial Filtering Approach for Directional Audio Coding, 126th AES Convention, Munich, May 2009). On the basis of the parametric representation, spatial audio can be reproduced with arbitrary loudspeaker setups. Moreover, the DirAC analysis can be regarded as an acoustic front-end for parametric coding systems that are capable of coding, transmitting, and reproducing multi-channel spatial audio, for instance MPEG Surround.
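For illustration only, the following minimal sketch indicates how a direction estimate and a diffuseness value may be derived for a single frequency subband from an intensity-like vector. It is not taken from the cited references. The function name analyze_bin, the restriction to two dimensions, and the signal normalization (a plane wave from azimuth a yields x = p·cos(a) and y = p·sin(a)) are assumptions made for this example. Actual B-format scaling and sign conventions may differ.

```python
import cmath
import math


def analyze_bin(p, x, y):
    """Return (azimuth_rad, diffuseness) for one frequency subband.

    p, x, y: equal-length sequences of complex subband coefficients over
    several time frames (omnidirectional and two dipole-like signals).
    Assumed normalization: a plane wave from azimuth a gives
    x = p*cos(a), y = p*sin(a). This is an illustrative convention.
    """
    n = len(p)
    # Time-averaged intensity-like vector: Re{conj(p) * [x, y]}
    ix = sum((pk.conjugate() * xk).real for pk, xk in zip(p, x)) / n
    iy = sum((pk.conjugate() * yk).real for pk, yk in zip(p, y)) / n
    # Time-averaged energy-like term under the assumed normalization
    energy = sum(
        0.5 * (abs(pk) ** 2 + abs(xk) ** 2 + abs(yk) ** 2)
        for pk, xk, yk in zip(p, x, y)
    ) / n
    azimuth = math.atan2(iy, ix)
    if energy == 0.0:
        # Assumption: silence carries no reliable direction
        return azimuth, 1.0
    diffuseness = 1.0 - math.hypot(ix, iy) / energy
    # Clamp against small numerical overshoot
    return azimuth, min(1.0, max(0.0, diffuseness))


# Single plane wave from 60 degrees, observed over eight frames
angle = math.radians(60.0)
p_frames = [cmath.rect(1.0, 0.3 * k) for k in range(8)]
x_frames = [pk * math.cos(angle) for pk in p_frames]
y_frames = [pk * math.sin(angle) for pk in p_frames]
plane_azimuth, plane_diffuseness = analyze_bin(p_frames, x_frames, y_frames)

# Two frames with opposite directions: the averaged vector cancels
_, cancel_diffuseness = analyze_bin([1 + 0j, 1 + 0j], [1 + 0j, -1 + 0j], [0j, 0j])
```

In this sketch, a single plane wave yields a diffuseness close to zero, whereas sound arriving from opposing directions with equal power yields a diffuseness close to one. This is consistent with the use of diffuseness as a reliability measure for the direction estimate.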
Another approach to spatial sound field analysis is represented by the so-called Spatial Audio Microphone (SAM) (C. Faller: Microphone Front-Ends for Spatial Audio Coders, in Proceedings of the AES 125th International Convention, San Francisco, October 2008). SAM takes the signals of coincident directional microphones as input. Similar to DirAC, SAM determines the direction of arrival (DOA) of the sound for a parametric description of the sound field, together with an estimate of the diffuse sound components.
Parametric techniques for the recording and analysis of spatial audio, such as DirAC and SAM, rely on estimates of specific sound field parameters. The performance of these approaches is, thus, strongly dependent on the estimation performance of the spatial cue parameters, such as the direction of arrival of the sound or the diffuseness of the sound field.
Generally, when estimating spatial cue parameters, specific assumptions on the acoustic input signals can be made (e.g., on the stationarity or on the tonality) in order to employ the best (i.e., the most efficient or most accurate) algorithm for the audio processing. Traditionally, a single time-invariant signal model is defined for this purpose. However, a problem that commonly arises is that audio signals can vary significantly over time, such that a general time-invariant model describing the audio input is often inadequate. In particular, when considering a single time-invariant signal model for processing audio, model mismatches can occur which degrade the performance of the applied algorithm.