Higher Order Ambisonics (HOA) offers one possibility for representing three-dimensional sound. Other techniques are wave field synthesis (WFS) and channel-based approaches such as 22.2. In contrast to channel-based methods, the HOA representation offers the advantage of being independent of a specific loudspeaker set-up. However, this flexibility comes at the expense of a decoding process, which is required for playback of the HOA representation on a particular loudspeaker set-up. Compared to the WFS approach, where the number of required loudspeakers is usually very large, HOA may also be rendered to set-ups consisting of only a few loudspeakers. A further advantage of HOA is that the same representation can also be employed without any modification for binaural rendering to headphones.
HOA is based on the representation of the spatial density of complex harmonic plane wave amplitudes by a truncated Spherical Harmonics (SH) expansion. Each expansion coefficient is a function of angular frequency, which can be equivalently represented by a time domain function. Hence, without loss of generality, the complete HOA sound field representation actually can be assumed to consist of O time domain functions, where O denotes the number of expansion coefficients. These time domain functions will be equivalently referred to as HOA coefficient sequences or as HOA channels in the following.
The spatial resolution of the HOA representation improves with a growing maximum order N of the expansion. Unfortunately, the number of expansion coefficients O grows quadratically with the order N, in particular O=(N+1)². For example, typical HOA representations using order N=4 require O=25 HOA (expansion) coefficients. The total bit rate for the transmission of an HOA representation, given a desired single-channel sampling rate fS and the number of bits Nb per sample, is determined by O·fS·Nb. Transmitting an HOA representation of order N=4 with a sampling rate of fS=48 kHz employing Nb=16 bits per sample results in a bit rate of 19.2 Mbit/s, which is very high for many practical applications, e.g. streaming. Thus, compression of HOA representations is highly desirable.
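The arithmetic behind the 19.2 Mbit/s figure can be checked with a short sketch. The function names below are illustrative only and do not come from any of the cited documents; the formulas O=(N+1)² and O·fS·Nb are the ones stated above.

```python
def hoa_coefficient_count(order: int) -> int:
    """Number of HOA coefficient sequences O for a maximum order N: O = (N + 1)^2."""
    return (order + 1) ** 2


def uncompressed_bit_rate(order: int, sampling_rate_hz: int, bits_per_sample: int) -> int:
    """Uncompressed bit rate in bit/s: O * fS * Nb."""
    return hoa_coefficient_count(order) * sampling_rate_hz * bits_per_sample


# Example from the text: N = 4, fS = 48 kHz, Nb = 16 bits per sample.
num_channels = hoa_coefficient_count(4)            # 25 coefficient sequences
rate_bps = uncompressed_bit_rate(4, 48_000, 16)    # 19_200_000 bit/s
print(num_channels, rate_bps / 1e6, "Mbit/s")
```

Because O grows quadratically with N, raising the order from 4 to 6 under the same assumptions roughly doubles the rate (49 instead of 25 coefficient sequences).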
Previously, the compression of HOA sound field representations was proposed in EP 2665208 A1, EP 2743922 A1, EP 2800401 A1, cf. ISO/IEC JTC1/SC29/WG11, N14264, WD1-HOA Text of MPEG-H 3D Audio, January 2014. These approaches have in common that they perform a sound field analysis and decompose the given HOA representation into a directional component and a residual ambient component. The final compressed representation is, on the one hand, assumed to consist of a number of quantised signals, resulting from the perceptual coding of directional and vector-based signals as well as relevant coefficient sequences of the ambient HOA component. On the other hand, it comprises additional side information related to the quantised signals, which side information is required for the reconstruction of the HOA representation from its compressed version.
Before being passed to the perceptual encoder, these intermediate time-domain signals are required to have a maximum amplitude within the value range [−1,1[, which is a requirement arising from the implementation of currently available perceptual encoders. In order to satisfy this requirement when compressing HOA representations, a gain control processing unit (see EP 2824661 A1 and the above-mentioned ISO/IEC JTC1/SC29/WG11 N14264 document) is used ahead of the perceptual encoders, which smoothly attenuates or amplifies the input signals. The resulting signal modification is assumed to be invertible and to be applied frame-wise, where in particular the change of the signal amplitudes between successive frames is assumed to be a power of ‘2’. To facilitate inversion of this signal modification in the HOA decompressor, corresponding normalisation side information is included in the total side information. This normalisation side information can consist of exponents to base ‘2’, which describe the relative amplitude change between two successive frames. These exponents are coded using a run length code according to the above-mentioned ISO/IEC JTC1/SC29/WG11 N14264 document, since minor amplitude changes between successive frames are more probable than greater ones.
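The principle of an invertible, frame-wise gain described by base-2 exponents can be illustrated with a deliberately simplified sketch. It is not the gain control of EP 2824661 A1 or of the N14264 document: it assumes one constant gain per frame (no smooth transition within a frame), attenuation only, and no run length coding of the exponents. All function names are hypothetical.

```python
import math


def apply_gain_control(frames):
    """Scale each frame by 2**e so that its peak magnitude lies below 1.

    Returns the scaled frames and, per frame, the exponent change relative
    to the previous frame (the normalisation side information in this sketch).
    Simplifying assumptions: constant gain within a frame, attenuation only.
    """
    scaled, exponent_deltas = [], []
    previous_exponent = 0
    for frame in frames:
        peak = max((abs(s) for s in frame), default=0.0)
        # frexp gives peak = m * 2**k with 0.5 <= m < 1, so peak * 2**(-k) < 1.
        exponent = min(0, -math.frexp(peak)[1])
        scaled.append([s * 2.0 ** exponent for s in frame])
        exponent_deltas.append(exponent - previous_exponent)
        previous_exponent = exponent
    return scaled, exponent_deltas


def invert_gain_control(scaled_frames, exponent_deltas):
    """Undo the gain by accumulating the exponent changes frame by frame."""
    restored = []
    exponent = 0
    for frame, delta in zip(scaled_frames, exponent_deltas):
        exponent += delta
        restored.append([s * 2.0 ** (-exponent) for s in frame])
    return restored


frames = [[0.25, -0.5], [3.5, -1.0], [0.75, 0.1]]
scaled, deltas = apply_gain_control(frames)
restored = invert_gain_control(scaled, deltas)
print(deltas)
```

Restricting the gains to powers of two keeps the side information down to small integers. In this floating-point sketch it also makes the scaling exactly reversible, barring underflow or overflow.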