Higher Order Ambisonics (HOA) offers the advantage of capturing a complete sound field in the vicinity of a specific location in three-dimensional space, which location is called the ‘sweet spot’. Such an HOA representation is independent of a specific loudspeaker set-up, in contrast to channel-based techniques like stereo or surround. However, this flexibility comes at the expense of a decoding process that is required for playback of the HOA representation on a particular loudspeaker set-up.
HOA is based on the description of the complex amplitudes of the air pressure for individual angular wave numbers k for positions x in the vicinity of a desired listener position, using a truncated Spherical Harmonics (SH) expansion. Without loss of generality, the listener position may be assumed to be the origin of a spherical coordinate system. The spatial resolution of this representation improves with a growing maximum order N of the expansion. Unfortunately, the number of expansion coefficients O grows quadratically with the order N, i.e. O=(N+1)². For example, typical HOA representations using order N=4 require O=25 HOA coefficients. Given a desired sampling rate fs and the number Nb of bits per sample, the total bit rate for the transmission of an HOA signal representation is determined by O·fs·Nb. Transmission of an HOA signal representation of order N=4 with a sampling rate of fs=48 kHz and Nb=16 bits per sample thus results in a bit rate of 19.2 Mbit/s. Compression of HOA signal representations is therefore highly desirable.
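The bit rate quoted above follows directly from the product O·fs·Nb. The following minimal Python sketch reproduces that arithmetic. The function names are illustrative only and do not belong to any HOA library.

```python
def hoa_num_coefficients(order: int) -> int:
    """Number of HOA coefficient sequences, O = (N+1)^2, for order N."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


def hoa_bit_rate(order: int, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Uncompressed bit rate O * fs * Nb in bits per second."""
    return hoa_num_coefficients(order) * sample_rate_hz * bits_per_sample


if __name__ == "__main__":
    # Order N=4, fs=48 kHz, Nb=16 bits per sample, as in the text.
    rate = hoa_bit_rate(order=4, sample_rate_hz=48_000, bits_per_sample=16)
    print(f"{rate / 1e6:.1f} Mbit/s")  # prints "19.2 Mbit/s"
```

Because O grows with (N+1)², raising the order from N=4 to N=5 in this sketch already increases the uncompressed rate from 25 to 36 coefficient sequences at the same sampling rate and word length.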
An overview of existing spatial audio compression approaches can be found in patent application EP 10306472.1 or in I. Elfitri, B. Günel, A. M. Kondoz, “Multichannel Audio Coding Based on Analysis by Synthesis”, Proceedings of the IEEE, vol. 99, no. 4, pp. 657-670, April 2011.
The following techniques are more relevant with respect to the invention.
B-format signals, which are equivalent to Ambisonics representations of first order, can be compressed using Directional Audio Coding (DirAC) as described in V. Pulkki, “Spatial Sound Reproduction with Directional Audio Coding”, Journal of Audio Eng. Society, vol. 55(6), pp. 503-516, 2007. In one version proposed for teleconference applications, the B-format signal is coded into a single omni-directional signal as well as side information in the form of a single direction and a diffuseness parameter per frequency band. However, the resulting drastic reduction of the data rate comes at the price of a reduced signal quality at reproduction. Further, DirAC is limited to the compression of Ambisonics representations of first order, which suffer from a very low spatial resolution.
Known methods for compression of HOA representations with N>1 are quite rare. One of them performs direct encoding of individual HOA coefficient sequences employing the perceptual Advanced Audio Coding (AAC) codec, cf. E. Hellerud, I. Burnett, A. Solvang, U. Peter Svensson, “Encoding Higher Order Ambisonics with AAC”, 124th AES Convention, Amsterdam, 2008. However, the inherent problem with such an approach is the perceptual coding of signals that are never listened to. The reconstructed playback signals are usually obtained by a weighted sum of the HOA coefficient sequences. That is why there is a high probability of unmasking of perceptual coding noise when the decompressed HOA representation is rendered on a particular loudspeaker set-up. In more technical terms, the major cause of perceptual coding noise unmasking is the high cross-correlation between the individual HOA coefficient sequences. Because the coding noise signals in the individual HOA coefficient sequences are usually uncorrelated with each other, a constructive superposition of the perceptual coding noise may occur while at the same time the noise-free HOA coefficient sequences cancel each other at superposition. A further problem is that the mentioned cross-correlations lead to a reduced efficiency of the perceptual coders.
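The unmasking effect can be illustrated with a deliberately simplified toy example. The sketch below is not taken from the cited works: it assumes two identical (fully correlated) coefficient sequences, two mutually uncorrelated noise sequences, and hypothetical rendering weights of +1 and -1. All names and numbers are chosen for illustration only.

```python
def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def weighted_sum(sequences, weights):
    """Sample-wise weighted sum of several equally long sequences."""
    length = len(sequences[0])
    return [
        sum(w * seq[t] for w, seq in zip(weights, sequences))
        for t in range(length)
    ]


# Two fully correlated noise-free coefficient sequences (toy assumption).
signal = [8.0, 4.0, -4.0, -8.0]
# Two coding-noise sequences with zero cross-correlation, power 1.0 each.
noise_a = [1.0, -1.0, 1.0, -1.0]
noise_b = [1.0, 1.0, -1.0, -1.0]
# Hypothetical rendering weights for one loudspeaker signal.
weights = [1.0, -1.0]

clean_out = weighted_sum([signal, signal], weights)
noise_out = weighted_sum([noise_a, noise_b], weights)

if __name__ == "__main__":
    print("signal power at output:", mean_power(clean_out))  # 0.0
    print("noise power at output:", mean_power(noise_out))   # 2.0
```

In this toy case the noise-free sequences cancel completely in the weighted sum, while the powers of the two uncorrelated noise sequences add, so the noise that was masked in each individual sequence is left without any masking signal.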
In order to minimise the extent of these effects, it is proposed in EP 10306472.1 to transform the HOA representation to an equivalent representation in the spatial domain before perceptual coding. The spatial domain signals correspond to conventional directional signals, and would correspond to the loudspeaker signals if the loudspeakers were positioned in exactly the same directions as those assumed for the spatial domain transform.
The transform to the spatial domain reduces the cross-correlations between the individual spatial domain signals. However, the cross-correlations are not completely eliminated. An example of relatively high cross-correlations is a directional signal whose direction falls between the adjacent directions covered by the spatial domain signals.
A further disadvantage of EP 10306472.1 and the above-mentioned Hellerud et al. article is that the number of perceptually coded signals is (N+1)², where N is the order of the HOA representation. Therefore the data rate for the compressed HOA representation grows quadratically with the Ambisonics order.
The inventive compression processing performs a decomposition of an HOA sound field representation into a directional component and an ambient component. In particular for the computation of the directional sound field component a new processing is described below for the estimation of several dominant sound directions.
Regarding existing methods for direction estimation based on Ambisonics, the above-mentioned Pulkki article describes one method in connection with DirAC coding for the estimation of the direction, based on the B-format sound field representation. The direction is obtained from the average intensity vector, which points to the direction of flow of the sound field energy. An alternative based on the B-format is proposed in D. Levin, S. Gannot, E. A. P. Habets, “Direction-of-Arrival Estimation using Acoustic Vector Sensors in the Presence of Noise”, IEEE Proc. of the ICASSP, pp. 105-108, 2011. The direction estimation is performed iteratively by searching for that direction which provides the maximum power of a beam former output signal steered into that direction.
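The intensity-vector idea can be sketched in a few lines of Python. The sketch makes several simplifying assumptions that are not taken from the cited articles: a single broadband plane wave, a B-format encoding of the form W=s, X=s·ux, Y=s·uy, Z=s·uz without the usual normalisation factors, and a sign convention in which the averaged product of W with (X, Y, Z) points towards the source, i.e. opposite to the direction of energy flow. The function names are hypothetical.

```python
import math


def encode_plane_wave(samples, azimuth, elevation):
    """Toy B-format encoding of a plane wave (assumed, unnormalised)."""
    ux = math.cos(azimuth) * math.cos(elevation)
    uy = math.sin(azimuth) * math.cos(elevation)
    uz = math.sin(elevation)
    w = list(samples)
    x = [s * ux for s in samples]
    y = [s * uy for s in samples]
    z = [s * uz for s in samples]
    return w, x, y, z


def estimate_direction(w, x, y, z):
    """Estimate (azimuth, elevation) in radians from the time-averaged
    product of W with (X, Y, Z), used here as an intensity-like vector."""
    n = len(w)
    ix = sum(a * b for a, b in zip(w, x)) / n
    iy = sum(a * b for a, b in zip(w, y)) / n
    iz = sum(a * b for a, b in zip(w, z)) / n
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    if norm == 0.0:
        raise ValueError("no directional energy in the analysed frame")
    azimuth = math.atan2(iy, ix)
    elevation = math.asin(max(-1.0, min(1.0, iz / norm)))
    return azimuth, elevation


if __name__ == "__main__":
    source = [math.sin(0.3 * t) for t in range(64)]
    w, x, y, z = encode_plane_wave(source, math.radians(60), math.radians(20))
    az, el = estimate_direction(w, x, y, z)
    print(round(math.degrees(az), 3), round(math.degrees(el), 3))
```

Since the averaged vector collapses the whole frame into one direction, a sketch of this kind can return only a single estimate per frame, however many sources are active.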
However, both approaches are constrained to the B-format for the direction estimation, which suffers from a relatively low spatial resolution. An additional disadvantage is that the estimation is restricted to only a single dominant direction.
HOA representations offer an improved spatial resolution and thus allow an improved estimation of several dominant directions. Existing methods performing an estimation of several directions based on HOA sound field representations are quite rare. An approach based on compressive sensing is proposed in N. Epain, C. Jin, A. van Schaik, “The Application of Compressive Sampling to the Analysis and Synthesis of Spatial Sound Fields”, 127th Convention of the Audio Eng. Soc., New York, 2009, and in A. Wabnitz, N. Epain, A. van Schaik, C. Jin, “Time Domain Reconstruction of Spatial Sound Fields Using Compressed Sensing”, IEEE Proc. of the ICASSP, pp. 465-468, 2011. The main idea is to assume the sound field to be spatially sparse, i.e. to consist of only a small number of directional signals. Following allocation of a high number of test directions on the sphere, an optimisation algorithm is employed in order to find as few test directions as possible together with the corresponding directional signals, such that they are well described by the given HOA representation. This method provides an improved spatial resolution compared to that which is actually provided by the given HOA representation, since it circumvents the spatial dispersion resulting from a limited order of the given HOA representation. However, the performance of the algorithm heavily depends on whether the sparsity assumption is satisfied. In particular, the approach fails if the sound field contains any minor additional ambient components, or if the HOA representation is affected by noise, which occurs when it is computed from multi-channel recordings.
A further, rather intuitive method is to transform the given HOA representation to the spatial domain as described in B. Rafaely, “Plane-wave decomposition of the sound field on a sphere by spherical convolution”, J. Acoust. Soc. Am., vol. 116, no. 4, pp. 2149-2157, October 2004, and then to search for maxima in the directional powers. The disadvantage of this approach is that the presence of ambient components leads to a blurring of the directional power distribution and to a displacement of the maxima of the directional powers compared to the case in which no ambient component is present.