A well-known technique for artificially positioning a sound in a multi-channel loudspeaker playback system consists of weighting an audio signal by a set of amplifiers feeding each loudspeaker individually. This method, described e.g. in [Chowning71], is often referred to as “discrete amplitude panning” when only the loudspeakers closest to the target direction are assigned non-zero weights, as illustrated by the graph of panning functions in FIG. 1. Although FIG. 1 shows a two-dimensional loudspeaker layout, the method can be extended with no difficulty to three-dimensional loudspeaker layouts, as described e.g. in [Pulkki97]. A drawback of this technique is that it requires a high number of channels to provide a faithful reproduction of all directions. Another drawback is that the geometrical layout of the loudspeakers must be known at the encoding and mixing stage.
An alternative approach, described in [Gerzon85], consists of producing a ‘B-Format’ multi-channel signal and reproducing this signal over loudspeakers via an ‘Ambisonic’ decoder, as illustrated in FIG. 2. Instead of discrete panning functions, the B-Format uses real-valued spherical harmonics. The zero-order spherical harmonic function is named W, while the three first-order harmonics are denoted X, Y, and Z. These functions are defined as follows:
$$W(\sigma,\varphi) = 1$$
$$X(\sigma,\varphi) = \cos(\varphi)\cos(\sigma)$$
$$Y(\sigma,\varphi) = \cos(\varphi)\sin(\sigma)$$
$$Z(\sigma,\varphi) = \sin(\varphi)$$
where σ and φ denote respectively the azimuth and elevation angles of the sound source with respect to the listener, expressed in radians. An advantage of this technique over the discrete panning method is that B-Format encoding does not require knowledge of the loudspeaker layout, which is taken into account in the design of the decoder. A second advantage is that a real-world B-Format recording can be produced with practical microphone technology, known as the ‘Soundfield Microphone’ [Farrah79]. As illustrated in FIG. 2, this allows for combining microphone-encoded sounds with electronically encoded sounds to produce a single B-Format recording. First-order Ambisonic decoders do not reconstruct the acoustic pressure information at the ears of the listener except at low frequencies (below about 700 Hz). As described e.g. in [Bamford95], the frequency range can be extended by increasing the order of spherical harmonics, but only at the expense of a higher number of encoding channels and loudspeakers.
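The B-Format encoding defined by the W, X, Y and Z functions above can be sketched in a few lines. This is a minimal illustration only: the function names `encode_b_format` and `mix_b_format` and the sample-by-sample interface are assumptions made for the example, not part of any cited system.

```python
import math

def encode_b_format(sample, azimuth, elevation):
    """Weight one mono sample by the W, X, Y, Z functions.

    azimuth and elevation are in radians, as in the definitions above.
    Function name and interface are hypothetical.
    """
    w = sample * 1.0
    x = sample * math.cos(elevation) * math.cos(azimuth)
    y = sample * math.cos(elevation) * math.sin(azimuth)
    z = sample * math.sin(elevation)
    return (w, x, y, z)

def mix_b_format(sources):
    """Sum the encoded channels of several (sample, azimuth, elevation) sources."""
    mix = [0.0, 0.0, 0.0, 0.0]
    for sample, azimuth, elevation in sources:
        encoded = encode_b_format(sample, azimuth, elevation)
        for channel in range(4):
            mix[channel] += encoded[channel]
    return tuple(mix)
```

Note that nothing in the encoder refers to loudspeaker positions, and that the mix always has four channels regardless of how many sources are encoded.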
3-D audio reproduction techniques which specifically aim at reproducing the acoustic pressure at the two ears of a listener are usually termed binaural techniques. This approach is illustrated in FIG. 3 and reviewed e.g. in [Jot95]. A binaural recording can be produced by inserting miniature microphones in the ear canals of an individual or dummy head. Binaural encoding of an audio signal (also called binaural synthesis) can be performed by applying to a sound signal a pair of left and right filters modeling the head-related transfer functions (HRTFs) measured on an individual or a dummy head for a given direction. As shown in FIG. 3, an HRTF can be modeled as a cascaded combination of a delaying element and a minimum-phase filter, for each of the left and right channels. A binaurally encoded or recorded signal is suitable for playback over headphones. For playback over loudspeakers, a cross-talk canceller is used, as described e.g. in [Gardner97].
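The delay-plus-filter model of binaural synthesis can be illustrated with the following sketch. It assumes an integer-sample delay and a short FIR filter per ear; the function names and the tap values used in the usage note are hypothetical placeholders, not measured HRTF data.

```python
def delay(signal, num_samples):
    """Delay a signal by a whole number of samples, keeping its length."""
    return ([0.0] * num_samples + list(signal))[:len(signal)]

def fir_filter(signal, taps):
    """Direct-form FIR filter; output has the same length as the input."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out

def binaural_synthesize(signal, left_delay, left_taps, right_delay, right_taps):
    """Apply a delay followed by a filter to each ear's copy of the signal.

    In a real system the delays and taps would be derived from HRTF
    measurements for the desired direction; here they are caller-supplied.
    """
    left = fir_filter(delay(signal, left_delay), left_taps)
    right = fir_filter(delay(signal, right_delay), right_taps)
    return left, right
```

For example, `binaural_synthesize([1.0, 0.0, 0.0, 0.0], 0, [1.0, 0.5], 2, [0.5])` delays and attenuates the right-ear signal relative to the left, as would be expected for a source on the listener's left side.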
Conventional binaural techniques can provide a more convincing 3-D audio reproduction, over headphones or loudspeakers, than the previously described techniques. However, they are not without their own drawbacks and difficulties.
Compared to discrete amplitude panning or B-Format encoding, binaural synthesis involves a significantly larger amount of computation for each sound source. An accurate finite impulse response (FIR) model of an HRTF typically requires a 1-ms long response, i.e. approximately 100 additions and multiplies per sample period at a sample rate of 48 kHz, which amounts to 5 MIPS (million instructions per second).
The HRTF can only be measured at a set of discrete positions around the head. Designing a binaural synthesis system which can faithfully reproduce any direction and smooth dynamic movements of sounds is a challenging problem involving interpolation techniques and time-variant filters, implying an additional computational effort.
The binaurally recorded or encoded signal contains features related to the morphology of the torso, head, and pinnae. Therefore the fidelity of the reproduction is compromised if the listener's head is not identical to the head used in the recording or the HRTF measurements. In headphone playback, this can cause artifacts such as an artificial elevation of the sound, front-back confusions or inside-the-head localization.
In reproduction over two loudspeakers, the listener must be located at a specific position for lateral sound locations to be convincingly reproduced (beyond the azimuth of the loudspeakers), while rear or elevated sound locations cannot be reproduced reliably.
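The computational cost quoted for the FIR model can be checked with a short calculation. The sketch below assumes one multiply-accumulate per filter tap for each of the two ear filters, which is one plausible reading of the figure (counting one multiply and one add per tap for a single filter gives the same number). The function name is hypothetical.

```python
def fir_cost(response_ms, sample_rate_hz, num_filters=2):
    """Estimate the cost of FIR-based binaural synthesis for one source.

    Returns (taps per filter, operations per sample, operations per second),
    assuming one operation per tap per filter (an assumption of this sketch).
    """
    taps = round(response_ms * 1e-3 * sample_rate_hz)
    ops_per_sample = taps * num_filters
    ops_per_second = ops_per_sample * sample_rate_hz
    return taps, ops_per_sample, ops_per_second

taps, ops_per_sample, ops_per_second = fir_cost(1.0, 48000)
# 48 taps per filter, 96 operations per sample, 4,608,000 operations per second
```

Under these assumptions a 1-ms response at 48 kHz gives 48 taps per filter and 96 operations per sample, or about 4.6 million operations per second for each sound source, consistent with the rounded figures of 100 operations and 5 MIPS.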
[Travis96] describes a method for reducing the computational cost of the binaural synthesis and addresses the interpolation and dynamic issues. This method consists of combining a panning technique designed for N-channel loudspeaker playback and a set of N static binaural synthesis filter pairs to simulate N fixed directions (or “virtual loudspeakers”) for playback over headphones. This technique leads to the topology of FIG. 4a, where a bank of binaural synthesis filters is applied after panning and mixing of the source signals. An alternative approach, described in [Gehring96], consists of applying the binaural synthesis filters before panning and mixing, as illustrated in FIG. 4b. The filtered signals can be produced off-line and stored so that only the panning and mixing computations need to be performed in real time. In terms of reproduction fidelity, these two approaches are equivalent. Both suffer from the inherent limitations of the multi-channel positioning techniques. Namely, they require a large number of encoding channels to faithfully reproduce the localization and timbre of sound signals in any direction.
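The "virtual loudspeaker" topology of FIG. 4a can be sketched as follows: sources are panned and mixed onto N channels, and a fixed bank of N filter pairs is then applied to the mixed channels. The function names, the data layout and the use of plain FIR taps for the static filters are assumptions made for illustration; the panning gains are supplied by the caller rather than derived from any particular panning law.

```python
def convolve(signal, taps):
    """FIR filter a signal, truncating the output to the input length."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        for k, h in enumerate(taps):
            if n - k >= 0:
                out[n] += h * signal[n - k]
    return out

def virtual_loudspeaker_mix(sources, gains, filters):
    """Pan and mix sources onto N channels, then filter each channel once.

    sources: list of equal-length sample lists, one per source
    gains:   gains[s][c] is the panning gain of source s on channel c
    filters: filters[c] is a (left_taps, right_taps) pair for channel c
    """
    num_samples = len(sources[0])
    num_channels = len(filters)
    # Stage 1: panning and mixing (cost grows with the number of sources).
    bus = [[0.0] * num_samples for _ in range(num_channels)]
    for source, source_gains in zip(sources, gains):
        for c in range(num_channels):
            for n in range(num_samples):
                bus[c][n] += source_gains[c] * source[n]
    # Stage 2: static filter bank (cost independent of the number of sources).
    left = [0.0] * num_samples
    right = [0.0] * num_samples
    for c, (left_taps, right_taps) in enumerate(filters):
        for n, value in enumerate(convolve(bus[c], left_taps)):
            left[n] += value
        for n, value in enumerate(convolve(bus[c], right_taps)):
            right[n] += value
    return left, right
```

The design point this illustrates is that the filtering cost depends only on N, while each additional source adds only N gain multiplications per sample.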
[Lowe95] describes a variation of the topology of FIG. 4a, in which the directional encoder generates a set of two-channel (left and right) audio signals, with a direction-dependent time delay introduced between the left and right channels, and each two-channel signal is panned between front, back and side “azimuth placement” filters. [Chen96] uses an analysis method known as principal component analysis (PCA) to model any set of HRTFs as a sum of frequency-dependent functions weighted by functions of direction. The two sets of functions are listener-specific (uniquely associated with the head on which the HRTFs were measured) and can be used to model the left filter and the right filter applied to the source signal in the directional encoder. [Abel97] also shows the topologies of FIGS. 4a and 4b and uses a singular value decomposition (SVD) technique to model a set of HRTFs in a manner essentially equivalent to the method described in [Chen96], resulting in the simultaneous solution for a set of filters and the directional panning functions.
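The kind of model produced by the PCA and SVD approaches can be written, in a notation assumed here for illustration (the symbols H, g_k, F_k and K do not come from the cited references), as a separable expansion of the HRTF for one ear:

```latex
H(\sigma,\varphi,f) \;\approx\; \sum_{k=1}^{K} g_k(\sigma,\varphi)\, F_k(f)
```

Here the F_k(f) are the frequency-dependent functions (filters), the g_k(σ,φ) are the direction-dependent weights that play the role of panning functions, and K is the number of retained components. Both sets of functions are derived from one measured HRTF set, which is why the resulting model is listener-specific.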
There remains a need for a computationally efficient technique for high-fidelity 3-D audio encoding and mixing of multiple audio signals. It is desirable to provide an encoding technique that produces a non-listener-specific format. There is a need for a practical recording technique and suitably designed decoders to provide faithful reproduction of the pressure signals at the ears of a listener over headphones or two-channel and multi-channel loudspeaker playback systems.