Ambisonics uses specific coefficients based on spherical harmonics for providing a sound field description that in general is independent of any specific loudspeaker or microphone set-up. This leads to a description which does not require information about loudspeaker positions during sound field recording or generation of synthetic scenes. The reproduction accuracy in an Ambisonics system can be modified by its order N. For a 3D system, that order determines the number of audio information channels required for describing the sound field, because this number equals the number of spherical harmonic basis functions. The number O of coefficients or channels is O = (N+1)².
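The channel count stated above can be checked by enumerating the spherical harmonics directly. The following is a minimal sketch, assuming the usual indexing in which each order n contributes the 2n+1 degrees m = -n, ..., n; the symbols n and m and the function names are illustrative and do not appear in the text above.

```python
def hoa_indices(order):
    """List the (n, m) index pairs of all spherical harmonics up to `order`.

    Assumption: each order n contributes the degrees m = -n .. n.
    """
    return [(n, m) for n in range(order + 1) for m in range(-n, n + 1)]


def hoa_channel_count(order):
    """Number O of coefficients of a 3D Ambisonics signal of order N."""
    return (order + 1) ** 2


# The enumeration and the closed form agree for every order.
for N in range(5):
    print(N, hoa_channel_count(N), len(hoa_indices(N)))
```

Summing 2n+1 over n = 0, ..., N gives (N+1)², which is why the closed form and the enumeration agree.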
Representations of complex spatial audio scenes using higher-order Ambisonics (HOA) technology (i.e. an order of 2 or higher) typically require a large number of coefficients per time instant. Each coefficient should have a considerable resolution, typically 24 bit/coefficient or more. Accordingly, the data rate required for transmitting an audio scene in raw HOA format is high. As an example, a 3rd order HOA signal, e.g. recorded with an EigenMike recording system, requires a bandwidth of (3+1)² coefficients × 44100 Hz × 24 bit/coefficient = 16.15 Mbit/s (with 1 Mbit taken as 2²⁰ bit). As of today, this data rate is too high for most practical applications that require real-time transmission of audio signals. Hence, compression techniques are desired for practically relevant HOA-related audio processing systems.
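The data rate example above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name is illustrative. The product is 16,934,400 bit/s, which corresponds to the quoted 16.15 Mbit/s when one Mbit is counted as 2^20 bit, and to about 16.93 Mbit/s when it is counted as 10^6 bit.

```python
def raw_hoa_bit_rate(order, sample_rate_hz, bits_per_coefficient):
    """Raw (uncompressed) bit rate in bit/s of a 3D HOA signal."""
    channels = (order + 1) ** 2
    return channels * sample_rate_hz * bits_per_coefficient


rate = raw_hoa_bit_rate(order=3, sample_rate_hz=44100, bits_per_coefficient=24)
print(rate, "bit/s")
print(round(rate / 2**20, 2), "Mbit/s (1 Mbit = 2**20 bit)")
print(round(rate / 10**6, 2), "Mbit/s (1 Mbit = 10**6 bit)")
```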
Higher-order Ambisonics is a mathematical paradigm that allows capture, manipulation and storage of audio scenes. The sound field is approximated at and around a reference point in space by a Fourier-Bessel series. Because HOA coefficients have this specific underlying mathematics, specific compression techniques have to be applied in order to obtain optimal coding efficiencies. Aspects of both redundancy and psycho-acoustics have to be accounted for, and can be expected to function differently for a complex spatial audio scene than for conventional mono or multi-channel signals. A particular difference from established audio formats is that all ‘channels’ in an HOA representation are computed with respect to the same reference location in space. Hence, considerable coherence between HOA coefficients can be expected, at least for audio scenes with few, dominant sound objects.
Only a few techniques for lossy compression of HOA signals have been published. Most of them cannot be classified as perceptual coding, because typically no psycho-acoustic model is used for controlling the compression. Instead, several existing schemes decompose the audio scene into parameters of an underlying model.
Early Approaches for 1st to 3rd-Order Ambisonics Transmission
The theory of Ambisonics has been in use for audio production and consumption since the 1960's, although up to now the applications were mostly limited to 1st or 2nd order content. A number of distribution formats have been in use, in particular:
    B-format: This format is the standard professional, raw signal format used for exchange of content among researchers, producers and enthusiasts. Typically, it relates to 1st order Ambisonics with specific normalization of the coefficients, but there also exist specifications up to order 3.
    In recent higher-order variants of the B-format, modified normalization schemes like SN3D, and special weighting rules, e.g. the Furse-Malham aka FuMa or FMH set, typically result in a downscaling of the amplitudes of parts of the Ambisonics coefficient data. The reverse upscaling operation is performed by table lookup before decoding at receiver side.
    UHJ-format (aka C-format): This is a hierarchical encoded signal format that is applicable for delivering 1st order Ambisonics content to consumers via existing mono or two-channel stereo paths. With two channels, left and right, a full horizontal surround representation of an audio scene is feasible, albeit not with full spatial resolution. The optional third channel improves the spatial resolution in the horizontal plane, and the optional fourth channel adds the height dimension.
    G-format: This format was created in order to make content produced in Ambisonics format available to anyone, without the need to use specific Ambisonics decoders at home. Decoding to the standard 5-channel surround setup is performed already at production side. Because the decoding operation is not standardized, a reliable reconstruction of the original B-format Ambisonics content is not possible.
    D-format: This format refers to the set of decoded loudspeaker signals as produced by an arbitrary Ambisonics decoder. The decoded signals depend on the specific loudspeaker geometry and on specifics of the decoder design. The G-format is a subset of the D-format definition, because it refers to a specific 5-channel surround setup.
None of the aforementioned approaches has been designed with compression in mind. Some of the formats have been tailored to make use of existing, low-capacity transmission paths (e.g. stereo links) and therefore implicitly reduce the data rate for transmission. However, the downmixed signal lacks a significant portion of the original input signal information. Thus, the flexibility and universality of the Ambisonics approach are lost.
Directional Audio Coding
Around 2005 the DirAC (directional audio coding) technology was developed. It is based on a scene analysis that aims to decompose the scene into one dominant sound object per time-frequency tile plus ambient sound. The scene analysis is based on an evaluation of the instantaneous intensity vector of the sound field. The two parts of the scene are transmitted together with location information on where the direct sound comes from. At the receiver, the single dominant sound source per time-frequency tile is played back using vector based amplitude panning (VBAP). In addition, de-correlated ambient sound is produced according to the ratio that has been transmitted as side information. The DirAC processing is depicted in FIG. 1, wherein the input signals have B-format.
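The intensity-based scene analysis can be illustrated with a small broadband sketch. It is not the DirAC implementation: it processes one block of time-domain samples instead of individual frequency bands, and it assumes traditional 1st order B-format signals W, X, Y, Z in which W carries a gain of 1/√2. The function names and the diffuseness estimate (one minus the ratio of net energy flow to total energy) are assumptions made for illustration.

```python
import math


def encode_plane_wave(signal, azimuth, elevation):
    """Hypothetical helper: encode a mono signal as a plane wave in
    traditional B-format (assumption: W is weighted by 1/sqrt(2))."""
    w = [s / math.sqrt(2.0) for s in signal]
    x = [s * math.cos(azimuth) * math.cos(elevation) for s in signal]
    y = [s * math.sin(azimuth) * math.cos(elevation) for s in signal]
    z = [s * math.sin(elevation) for s in signal]
    return w, x, y, z


def analyze_block(w, x, y, z):
    """Estimate direction of arrival (radians) and diffuseness of one block."""
    n = len(w)
    # Time-averaged products W*X, W*Y, W*Z. Up to a constant factor they
    # form a vector that points toward the dominant source.
    ix = sum(a * b for a, b in zip(w, x)) / n
    iy = sum(a * b for a, b in zip(w, y)) / n
    iz = sum(a * b for a, b in zip(w, z)) / n
    # Energy estimate from the pressure-like and velocity-like signals.
    energy = (sum(a * a for a in w)
              + 0.5 * sum(a * a + b * b + c * c for a, b, c in zip(x, y, z))) / n
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    if energy <= 0.0:
        return azimuth, elevation, 1.0
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    diffuseness = 1.0 - math.sqrt(2.0) * norm / energy
    return azimuth, elevation, min(max(diffuseness, 0.0), 1.0)


tone = [math.sin(2.0 * math.pi * 440.0 * t / 44100.0) for t in range(1024)]
block = encode_plane_wave(tone, math.radians(60.0), math.radians(20.0))
print(analyze_block(*block))
```

For a single plane wave the estimated direction matches the encoding direction and the diffuseness is close to zero. When several uncorrelated sources arrive from different directions, the averaged products partially cancel and the diffuseness estimate rises.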
One can interpret DirAC as a specific way of parametric coding with a single-source-plus-ambience signal model. The quality of the transmission depends strongly on whether the model assumptions are true for the particular compressed audio scene. Furthermore, any erroneous detection of direct sound and/or ambient sound in the sound analysis stage may impact the quality of the playback of the decoded audio scene. To date, DirAC has only been described for 1st order Ambisonics content.
Direct Compression of HOA Coefficients
In the late 2000s, perceptual as well as lossless compression of HOA signals was proposed.
    For lossless coding, cross-correlation between different Ambisonics coefficients is exploited for reducing the redundancy of HOA signals, as described in E. Hellerud, A. Solvang, U. P. Svensson, “Spatial Redundancy in Higher Order Ambisonics and Its Use for Low Delay Lossless Compression”, Proc. of IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), April 2009, Taipei, Taiwan, and in E. Hellerud, U. P. Svensson, “Lossless Compression of Spherical Microphone Array Recordings”, Proc. of 126th AES Convention, Paper 7668, May 2009, Munich, Germany. Backward adaptive prediction is utilized which predicts current coefficients of a specific order from a weighted combination of preceding coefficients up to the order of the coefficient to be encoded. The groups of coefficients that are expected to exhibit strong cross-correlation have been found by evaluations of characteristics of real-world content.
    This compression operates in a hierarchical manner. The neighborhood analyzed for potential cross-correlation of a coefficient comprises the coefficients only up to the same order at the same time instant as well as at preceding time instants, whereby the compression is scalable on bit stream level.
    Perceptual coding is described in T. Hirvonen, J. Ahonen, V. Pulkki, “Perceptual Compression Methods for Metadata in Directional Audio Coding Applied to Audiovisual Teleconference”, Proc. of 126th AES Convention, Paper 7706, May 2009, Munich, Germany, and in the above-mentioned “Spatial Redundancy in Higher Order Ambisonics and Its Use for Low Delay Lossless Compression” article. Existing MPEG AAC compression techniques are used for coding the individual channels (i.e. coefficients) of an HOA B-format representation. By adjusting the bit allocation depending on the order of the channel, a non-uniform spatial noise distribution has been obtained. In particular, by allocating more bits to the low-order channels and fewer bits to high-order channels, a superior precision can be obtained near the reference point. In turn, the effective quantization noise rises for increasing distances from the origin.
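The principle of backward adaptive prediction in the lossless scheme can be sketched as follows. This is an illustration only, not the method of the cited papers: it assumes integer-valued coefficients, uses the channel index as a stand-in for the order hierarchy, and adapts the weights with a normalized LMS rule chosen here for simplicity. The essential property is that the weights are updated only from values the decoder has already reconstructed, so no predictor side information is transmitted and the reconstruction is exact.

```python
def _context(current, previous, ch):
    # Lower channels at the current instant plus channels 0..ch at the
    # previous instant (hypothetical neighborhood for this sketch).
    return list(current[:ch]) + list(previous[:ch + 1])


def _predict(weights, context):
    return round(sum(w * c for w, c in zip(weights, context)))


def _adapt(weights, context, error, step=0.5):
    # Normalized LMS update using only values known to the decoder as well.
    power = sum(c * c for c in context) + 1e-9
    return [w + step * error * c / power for w, c in zip(weights, context)]


def encode(frames):
    """Turn integer coefficient frames into integer prediction residuals."""
    num_ch = len(frames[0])
    weights = [[0.0] * (2 * ch + 1) for ch in range(num_ch)]
    previous = [0] * num_ch
    residuals = []
    for frame in frames:
        row = []
        for ch in range(num_ch):
            ctx = _context(frame, previous, ch)
            error = frame[ch] - _predict(weights[ch], ctx)
            weights[ch] = _adapt(weights[ch], ctx, error)
            row.append(error)
        residuals.append(row)
        previous = list(frame)
    return residuals


def decode(residuals):
    """Mirror of encode(): rebuild the frames from the residuals."""
    num_ch = len(residuals[0])
    weights = [[0.0] * (2 * ch + 1) for ch in range(num_ch)]
    previous = [0] * num_ch
    frames = []
    for row in residuals:
        frame = []
        for ch in range(num_ch):
            ctx = _context(frame, previous, ch)
            frame.append(row[ch] + _predict(weights[ch], ctx))
            weights[ch] = _adapt(weights[ch], ctx, row[ch])
        frames.append(frame)
        previous = frame
    return frames


demo = [[t % 7, 2 * (t % 7) + 1, 3 - (t % 7)] for t in range(50)]
print(decode(encode(demo)) == demo)
```

In a complete codec the residuals would be passed to an entropy coder. Because channel k is predicted only from channels up to k, the higher channels can be dropped from the stream without affecting the decoding of the lower ones.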
FIG. 2 shows the principle of such direct encoding and decoding of B-format audio signals, wherein the upper path shows the above Hellerud et al. compression and the lower path shows compression to conventional D-format signals. In both cases the decoded receiver output signals have D-format.
A problem with seeking redundancy and irrelevancy directly in the HOA domain is that any spatial information is, in general, ‘smeared’ across several HOA coefficients. In other words, information that is well localized and concentrated in the spatial domain is spread across many coefficients. This makes it very challenging to perform a consistent noise allocation that reliably adheres to psycho-acoustic masking constraints. Furthermore, important information is captured in a differential fashion in the HOA domain, and subtle differences between large-scale coefficients may have a strong impact in the spatial domain. Therefore a high data rate may be required in order to preserve such differential details.
Spatial Squeezing
More recently, B. Cheng, Ch. Ritz, I. Burnett have developed the ‘spatial squeezing’ technology:
    B. Cheng, Ch. Ritz, I. Burnett, “Spatial Audio Coding by Squeezing: Analysis and Application to Compressing Multiple Soundfields”, Proc. of European Signal Processing Conf. (EUSIPCO), 2009,
    B. Cheng, Ch. Ritz, I. Burnett, “A Spatial Squeezing Approach to Ambisonic Audio Compression”, Proc. of IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), April 2008,
    B. Cheng, Ch. Ritz, I. Burnett, “Principles and Analysis of the Squeezing Approach to Low Bit Rate Spatial Audio Coding”, Proc. of IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), April 2007.
An audio scene analysis is carried out which decomposes the sound field and selects the most dominant sound objects for each time-frequency tile. Then a 2-channel stereo downmix is created which contains these dominant sound objects at new positions, in between the positions of the left and right channels. Because the same analysis can be done with the stereo signal, the operation can be partially reversed by re-mapping the objects detected in the 2-channel stereo downmix to the 360° of the full sound field.
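The re-mapping idea can be illustrated with a toy example for a single dominant object. The linear azimuth mapping and the constant-power panning law used here are assumptions chosen for illustration, not the mapping of the cited papers: the azimuth in [-180°, 180°] is squeezed onto a position between the two stereo channels, and the decoder recovers it from the ratio of the channel gains.

```python
import math


def squeeze(azimuth_deg):
    """Map an azimuth in [-180, 180] degrees to (left, right) stereo gains.

    Assumptions: linear mapping of the azimuth onto the stereo stage,
    constant-power panning, positive azimuths toward the left channel.
    """
    position = azimuth_deg / 180.0           # -1 (right end) .. +1 (left end)
    phi = (position + 1.0) * math.pi / 4.0   # 0 .. pi/2
    return math.sin(phi), math.cos(phi)


def unsqueeze(gain_left, gain_right):
    """Recover the original azimuth in degrees from the stereo gains."""
    phi = math.atan2(gain_left, gain_right)
    position = phi * 4.0 / math.pi - 1.0
    return position * 180.0


for azimuth in (-135.0, -30.0, 0.0, 90.0):
    left, right = squeeze(azimuth)
    print(azimuth, round(left, 3), round(right, 3), round(unsqueeze(left, right), 3))
```

In this sketch the directions -180° and +180°, which coincide physically, land on opposite ends of the stereo stage. A practical mapping would have to handle this wrap-around, as well as tiles that contain more than one object.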
FIG. 3 depicts the principle of spatial squeezing. FIG. 4 shows the related encoding processing.
The concept is strongly related to DirAC because it relies on the same kind of audio scene analysis. However, in contrast to DirAC the downmix always creates two channels, and it is not necessary to transmit side information about the location of dominant sound objects.
Although psycho-acoustic principles are not explicitly utilized, the scheme exploits the assumption that a decent quality can already be achieved by transmitting only the most prominent sound object for each time-frequency tile. In that respect, there are further strong parallels to the assumptions of DirAC. Analogous to DirAC, any error in the parameterization of the audio scene will result in an artifact in the decoded audio scene. Furthermore, the impact of any perceptual coding of the 2-channel stereo downmix signal on the quality of the decoded audio scene is hard to predict. Due to its generic architecture, spatial squeezing cannot be applied to 3-dimensional audio signals (i.e. signals with height dimension), and apparently it does not work for Ambisonics orders beyond one.
Ambisonics Format and Mixed-Order Representations
It has been proposed in F. Zotter, H. Pomberger, M. Noisternig, “Ambisonic Decoding with and without Mode-Matching: A Case Study Using the Hemisphere”, Proc. of 2nd Ambisonics Symposium, May 2010, Paris, France, to constrain the spatial sound information to a sub-space of the full sphere, e.g. to only cover the upper hemisphere or even smaller parts of the sphere. Ultimately, a complete scene can be composed of several such constrained ‘sectors’ on the sphere, which are rotated to specific locations for assembling the target audio scene. This creates a kind of mixed-order composition of a complex audio scene. No perceptual coding is mentioned.
Parametric Coding
The ‘classic’ approach for describing and transmitting content intended to be played back in wave-field synthesis (WFS) systems is via parametric coding of individual sound objects of the audio scene. Each sound object consists of an audio stream (mono, stereo or something else) plus meta information on the role of the sound object within the full audio scene, i.e. most importantly the location of the object. This object-oriented paradigm has been refined for WFS playback in the course of the European ‘CARROUSO’, cf. S. Brix, Th. Sporer, J. Plogsties, “CARROUSO—An European Approach to 3D-Audio”, Proc. of 110th AES Convention, Paper 5314, May 2001, Amsterdam, The Netherlands.
Besides compressing each sound object independently of the others, multiple objects can be coded jointly in a downmix scenario, as described in Ch. Faller, “Parametric Joint-Coding of Audio Sources”, Proc. of 120th AES Convention, Paper 6752, May 2006, Paris, France, in which simple psycho-acoustic cues are used in order to create a meaningful downmix signal from which, with the help of side information, the multi-object scene can be decoded at the receiver side. The rendering of the objects within the audio scene to the local loudspeaker setup also takes place at receiver side.
In object-oriented formats, recording is particularly demanding. In theory, perfectly ‘dry’ recordings of the individual sound objects would be required, i.e. recordings that exclusively capture the direct sound emitted by a sound object. The challenge of this approach is two-fold: first, dry capturing is difficult in natural ‘live’ recordings because there is considerable crosstalk between microphone signals; second, audio scenes which are assembled from dry recordings lack naturalness and the ‘atmosphere’ of the room in which the recording took place.
Parametric Coding Plus Ambisonics
Some researchers have proposed to combine an Ambisonics signal with a number of discrete sound objects. The rationale is to capture ambient sound and sound objects that are not well localizable via the Ambisonics representation and to add a number of discrete, well-placed sound objects via a parametric approach. For the object-oriented part of the scene, coding mechanisms similar to those for purely parametric representations are used (see the previous section). That is, those individual sound objects typically come with a mono sound track and information on location and potential movements, cf. the introduction of Ambisonics playback to the MPEG-4 AudioBIFS standard. In that standard, how to transmit the raw Ambisonics and object streams to the (AudioBIFS) rendering engine is left open to the producer of an audio scene. This means that any audio codec defined in MPEG-4 can be used for directly encoding the Ambisonics coefficients.
Wave Field Coding
Instead of using the object-oriented approach, wave field coding transmits the already rendered loudspeaker signals of a WFS (wave field synthesis) system. The encoder carries out all the rendering to a specific set of loudspeakers. A multi-dimensional space-time to frequency transformation is performed for windowed, quasi-linear segments of the curved line of loudspeakers. The frequency coefficients (both for time-frequency and space-frequency) are encoded with some psycho-acoustic model. In addition to the usual time-frequency masking, space-frequency masking can also be applied, i.e. it is assumed that masking phenomena are a function of spatial frequency. At decoder side the encoded loudspeaker channels are de-compressed and played back.
FIG. 5 shows the principle of Wave Field Coding with a set of microphones in the top part and a set of loudspeakers in the bottom part. FIG. 6 shows the encoding processing according to F. Pinto, M. Vetterli, “Wave Field Coding in the Spacetime Frequency Domain”, Proc. of IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), April 2008, Las Vegas, Nev., USA.
Published experiments on perceptual wave field coding show that the space-time-to-frequency transform saves about 15% of data rate compared to separate perceptual compression of the rendered loudspeaker channels for a two-source signal model. Nevertheless, this processing does not reach the compression efficiency obtainable with an object-oriented paradigm, most probably because it fails to capture sophisticated cross-correlation characteristics between loudspeaker channels, since a sound wave arrives at each loudspeaker at a different time. A further disadvantage is the tight coupling to the particular loudspeaker layout of the target system.
Universal Spatial Cues
The notion of a universal audio codec able to address different loudspeaker scenarios has also been considered, starting from classical multi-channel compression. In contrast to e.g. mp3 Surround or MPEG Surround with fixed channel assignments and relations, the representation of spatial cues is designed to be independent of the specific input loudspeaker configuration, cf. M. M. Goodwin, J.-M. Jot, “A Frequency-Domain Framework for Spatial Audio Coding Based on Universal Spatial Cues”, Proc. of 120th AES Convention, Paper 6751, May 2006, Paris, France; M. M. Goodwin, J.-M. Jot, “Analysis and Synthesis for Universal Spatial Audio Coding”, Proc. of 121st AES Convention, Paper 6874, October 2006, San Francisco, Calif., USA; M. M. Goodwin, J.-M. Jot, “Primary-Ambient Signal Decomposition and Vector-Based Localization for Spatial Audio Coding and Enhancement”, Proc. of IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), April 2007, Honolulu, Hi., USA.
Following frequency domain transformation of the discrete input channel signals, a principal component analysis is performed for each time-frequency tile in order to distinguish primary sound from ambient components. The result is the derivation of direction vectors to locations on a circle with unit radius centered at the listener, using Gerzon vectors for the scene analysis.
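The direction estimate based on Gerzon vectors can be sketched for a horizontal loudspeaker layout. The example covers only the Gerzon energy vector of one time-frequency tile, i.e. the energy-weighted mean of the unit vectors pointing to the loudspeakers; the principal component analysis and any later adjustment of the vector length are omitted. The function names and the 5-channel layout are assumptions for illustration.

```python
import math


def gerzon_energy_vector(gains, azimuths_deg):
    """Gerzon energy vector (x, y) for channel gains of one tile."""
    energies = [g * g for g in gains]
    total = sum(energies)
    if total == 0.0:
        return 0.0, 0.0
    x = sum(e * math.cos(math.radians(a)) for e, a in zip(energies, azimuths_deg)) / total
    y = sum(e * math.sin(math.radians(a)) for e, a in zip(energies, azimuths_deg)) / total
    return x, y


def direction_and_radius(vector):
    """Azimuth in degrees and length of a 2D vector."""
    x, y = vector
    return math.degrees(math.atan2(y, x)), math.hypot(x, y)


# Assumed layout: left, right, center, left surround, right surround.
layout = [30.0, -30.0, 0.0, 110.0, -110.0]
print(direction_and_radius(gerzon_energy_vector([1.0, 0.0, 0.0, 0.0, 0.0], layout)))
print(direction_and_radius(gerzon_energy_vector([1.0, 1.0, 0.0, 0.0, 0.0], layout)))
```

A tile whose energy sits in a single channel yields a vector of length one pointing at that loudspeaker. Energy spread over several channels yields a shorter vector, so the length indicates how concentrated the sound is around the estimated direction.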
FIG. 7 depicts a corresponding system for spatial audio coding with downmixing and transmission of spatial cues. A (stereo) downmix signal is composed from the separated signal components and transmitted together with meta information on the object locations. The decoder recovers the primary sound and some ambient components from the downmix signals and the side information, whereby the primary sound is panned to the local loudspeaker configuration. This can be interpreted as a multi-channel variant of the above DirAC processing because the transmitted information is very similar.