In multi-channel reproduction and listening, a listener is generally surrounded by multiple loudspeakers. One general goal in reproduction is to construct an acoustic field in which the listener can perceive the intended location of the sound sources, for example, the location of a musician in a band. Different loudspeaker setups create different spatial impressions. For example, standard stereo setups can convincingly recreate the acoustic scene in the region between the two loudspeakers, but fail to do so for directions outside the angular span of the two loudspeakers.
Setups with more loudspeakers surrounding the listener can achieve a better spatial impression over a wider range of angles. For example, one of the best-known multi-loudspeaker layout standards is 5.1 surround (ITU-R BS.775-1), consisting of 5 loudspeakers located at azimuths of −30, 0, 30, −110 and 110 degrees around the listener, where 0 degrees refers to the frontal direction. However, such a setup cannot cope with sounds above the listener's horizontal plane.
To increase the listener's sense of immersion, the current trend is to exploit setups with many loudspeakers, including loudspeakers at different heights. One example is the 22.2 system developed by Hamasaki at NHK, Japan, which consists of a total of 24 loudspeakers located at three different heights.
The present paradigm for producing spatialised audio in professional applications for such setups is to provide one audio track for each channel used in reproduction. For example, 2 audio tracks are needed for a stereo setup, 6 audio tracks for a 5.1 setup, and so on. These tracks are normally the result of the postproduction stage, although they can also be produced directly at the recording stage for broadcasting. It is worth noticing that on many occasions several loudspeakers are used to reproduce exactly the same audio channel. This is the case in most 5.1 cinema theatres, where each surround channel is played back through three or more loudspeakers. Thus, on these occasions, although the number of loudspeakers might be larger than 6, the number of different audio channels is still 6, and only 6 different signals are played back in total.
One consequence of this one-track-per-channel paradigm is that it ties the work done at the recording and postproduction stages to the setup where the content is to be exhibited. At the recording stage, for example in broadcasting, the type and position of the microphones used and the way they are mixed are decided as a function of the setups where the event is to be reproduced. Similarly, in media production, postproduction engineers need to know the details of the setup where the content will be exhibited, and then take care of every channel. Failure to correctly set up the multi-loudspeaker layout for which the content was tailored results in decreased reproduction quality. If content is to be exhibited in different setups, then different versions need to be created in postproduction, which increases costs and production time.
Another consequence of this one-track-per-channel paradigm is the amount of data required. On the one hand, without further encoding, the paradigm requires as many audio tracks as channels. On the other hand, if different versions are to be provided, they are either delivered separately, which again increases the size of the data, or some down-mix must be performed, which compromises the resulting quality.
Finally, another downside of the one-track-per-channel paradigm is that content produced in this manner is not future-proof. For example, the 6 tracks of a film produced for a 5.1 setup do not include audio sources located above the listener, and therefore cannot fully exploit setups with loudspeakers at different heights.
Currently, there exist a few technologies capable of providing spatialised audio independent of the exhibition system. Perhaps the simplest is amplitude panning, such as the so-called Vector-Based Amplitude Panning (VBAP). It is based on feeding the same mono signal to the loudspeakers closest to the position where the sound source is intended to be located, with an adjustment of the volume of each loudspeaker. Such systems can work in 2D or 3D (with height) setups, typically by selecting the two or three closest loudspeakers, respectively. One virtue of this method is that it provides a large sweet-spot, meaning that there is a wide region inside the loudspeaker setup where sound is perceived as arriving from the intended direction. However, this method is suitable neither for reproducing reverberant fields, like those present in reverberant rooms, nor for sound sources with a large spread. At most the first reflections of the sound emitted by the sources can be reproduced with these methods, and even then only as a costly, low-quality solution.
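As an illustration of the pairwise panning just described, the following sketch (in Python with NumPy; the function name and angle convention are our own, not part of any standard API) solves the basic 2D VBAP equations for a source placed between two loudspeakers:

```python
import numpy as np

def vbap_2d(source_az, spk_az_pair):
    """2D VBAP gains for a source between two loudspeakers (angles in degrees)."""
    p = np.array([np.cos(np.radians(source_az)), np.sin(np.radians(source_az))])
    # Columns of L are the unit vectors pointing at the two loudspeakers.
    L = np.array([[np.cos(np.radians(a)), np.sin(np.radians(a))]
                  for a in spk_az_pair]).T
    g = np.linalg.solve(L, p)       # solve p = L @ g for the two gains
    return g / np.linalg.norm(g)    # power normalisation: g1^2 + g2^2 = 1

# Source at +15 degrees, panned between loudspeakers at +30 and -30 degrees
g = vbap_2d(15.0, (30.0, -30.0))
```

When the source direction coincides with one loudspeaker, the solution degenerates to feeding that loudspeaker alone, which is why the sweet-spot of amplitude panning is comparatively large.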
Ambisonics is another technology capable of providing spatialised audio independent of the exhibition system. Originated in the 1970s by Michael Gerzon, it provides a complete encoding-decoding methodology. At encoding, a set of spherical harmonic components of the acoustic field at one point is stored. The zeroth-order component (W) corresponds to what an omnidirectional microphone would record at that point. The first order, consisting of 3 signals (X, Y, Z), corresponds to what three figure-of-eight microphones at that point, aligned with the Cartesian axes, would record. Higher-order signals correspond to what microphones with more complicated patterns would record. There exist mixed-order Ambisonics encodings, where only some subsets of the signals of each order are used; for example, using only the W, X, Y signals of first-order Ambisonics, thus neglecting the Z signal. Although the generation of signals beyond first order is simple in postproduction or via acoustic field simulations, it is more difficult when recording real acoustic fields with microphones; indeed, only microphones capable of measuring zeroth- and first-order signals have been available for professional applications until very recently. Examples of first-order Ambisonics microphones are the Soundfield and the more recent TetraMic. At decoding, once the multi-loudspeaker setup is specified (number and position of every loudspeaker), the signal to be fed to each loudspeaker is typically determined by requiring that the acoustic field created by the complete setup approximates as closely as possible the intended field (either the one created in postproduction, or the one from which the signals were recorded). Besides exhibition-system independence, further advantages of this technology are the high degree of manipulation it offers (basically soundscape rotation and zoom) and its capability of faithfully reproducing reverberant fields.
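The first-order encoding just described amounts to weighting a mono signal by the directional patterns of the W, X, Y, Z components. A minimal sketch, assuming the traditional B-format convention in which W carries a 1/sqrt(2) factor (other normalisation conventions exist) and with an illustrative function name:

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal into first-order B-format (W, X, Y, Z).

    Traditional B-format convention: W carries a -3 dB factor (1/sqrt(2)).
    Angles follow the text: azimuth 0 = front, positive elevation = up.
    """
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    w = mono / np.sqrt(2.0)              # omnidirectional component
    x = mono * np.cos(az) * np.cos(el)   # front-back figure-of-eight
    y = mono * np.sin(az) * np.cos(el)   # left-right figure-of-eight
    z = mono * np.sin(el)                # up-down figure-of-eight
    return np.stack([w, x, y, z])

s = np.ones(4)                                        # dummy mono signal
b = encode_foa(s, azimuth_deg=90.0, elevation_deg=0.0)  # source at the left
```

A source on the left excites Y fully while X and Z vanish, which is exactly the figure-of-eight behaviour of the first-order components.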
However, Ambisonics technology presents two main disadvantages: the inability to reproduce narrow sound sources, and the small size of the sweet-spot. The concept of narrow or spread sources is used in this context as referring to the angular width of the perceived sound image. The first problem is due to the fact that, even when trying to reproduce a very narrow sound source, Ambisonics decoding turns on more loudspeakers than just the ones closest to the intended position of the source. The second problem is due to the fact that, although at the sweet-spot the waves coming from every loudspeaker add in phase to create the desired acoustic field, outside the sweet-spot the waves do not interfere with the correct phase. This alters the colouration of the sound and, more importantly, sound tends to be perceived as arriving from the loudspeaker closest to the listener, due to the well-known psychoacoustic precedence effect. For a fixed size of the listening room, the only way to mitigate both problems is to increase the Ambisonics order used, but this implies a rapid growth in the number of channels and loudspeakers involved.
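The growth in channel count with order can be made concrete: for full 3D (periphonic) Ambisonics, an order-N encoding uses (N+1)^2 channels, a standard counting result for spherical harmonics. A minimal sketch with an illustrative function name:

```python
def ambisonics_channels(order):
    """Channel count for full 3D (periphonic) Ambisonics of a given order.

    One channel per spherical harmonic up to the given order: (N+1)^2.
    """
    return (order + 1) ** 2

counts = {n: ambisonics_channels(n) for n in range(1, 6)}
# order 1 -> 4 channels (W, X, Y, Z); order 5 already needs 36 channels
```

The quadratic growth is why sharpening sources or enlarging the sweet-spot by raising the order quickly becomes impractical.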
It is worth mentioning that another technology exists that is capable of exactly reproducing an arbitrary sound field, the so-called Wave Field Synthesis (WFS). However, this technology requires the loudspeakers to be separated by less than 15-20 cm, a constraint that forces further approximations (with a consequent loss of quality) and enormously increases the number of loudspeakers required; present applications use between 100 and 500 loudspeakers, which narrows its applicability to very high-end customised events.
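The spacing constraint comes from spatial aliasing: above roughly f = c/(2d), where d is the loudspeaker spacing and c the speed of sound, the array can no longer reconstruct the wavefront correctly. A minimal sketch of this standard estimate, with an illustrative function name:

```python
def wfs_aliasing_frequency(spacing_m, c=343.0):
    """Approximate spatial-aliasing frequency of a WFS loudspeaker array.

    Above roughly f = c / (2 * d) the array cannot reconstruct the
    wavefront correctly; d is the loudspeaker spacing in metres and
    c the speed of sound in m/s.
    """
    return c / (2.0 * spacing_m)

f = wfs_aliasing_frequency(0.17)  # spacing of 17 cm, within the cited 15-20 cm
```

Even a 17 cm spacing only keeps aliasing-free reproduction up to about 1 kHz, well inside the audible range, which is why practical WFS installations must accept approximations above that frequency.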
It is desirable to provide a technology capable of delivering spatialised audio content that can be distributed independently of the exhibition setup, be it 2D or 3D; that, once the setup is specified, can be decoded to fully exploit its capabilities; that is capable of reproducing all types of acoustic fields (narrow sources, reverberant or diffuse fields) for all listeners within the space, that is, with a large sweet-spot; and that does not require a large number of loudspeakers. This would make it possible to create future-proof content, in the sense that it would easily adapt to all present and future multi-loudspeaker setups, and it would also allow cinema theatres and home users to choose the multi-loudspeaker setup that best fits their needs and purposes, with the assurance that there will be plenty of content that fully exploits the capabilities of their chosen setup.