Spatial audio refers to an immersive audio reproduction system that allows the audience to perceive a high degree of audio envelopment. This sense of envelopment includes the sensation of the spatial location of the audio sources, in both direction and distance, such that the audience perceives the sound scene as if they were in the natural sound environment.
There are three audio recording formats commonly used for spatial audio reproduction systems. The format depends on the recording and mixing approach used at the audio content production site. The first and most well-known format is channel-based, whereby each channel of the audio signals is designated for playback on a particular loudspeaker at the reproduction site. The second format is object-based, whereby a spatial sound scene is described by a number of virtual sources (also called objects). Each audio object is represented by a sound waveform with associated metadata. The third format is Ambisonic-based, in which the signals can be regarded as coefficient signals that represent a spherical expansion of the sound field.
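As a minimal sketch of how an object-based signal (a waveform plus metadata) relates to Ambisonic-based coefficient signals, the following encodes one audio object into first-order Ambisonic coefficients. The class `AudioObject`, the function `encode_first_order`, the choice of first order, and the channel ordering and normalization (ACN ordering with SN3D normalization) are assumptions made for illustration only; they are not taken from the formats described above.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class AudioObject:
    """Hypothetical audio object: a mono waveform plus position metadata."""
    waveform: List[float]
    azimuth_deg: float    # 0 = front, positive = towards the left (assumed)
    elevation_deg: float  # 0 = horizontal plane, positive = up (assumed)


def encode_first_order(obj: AudioObject) -> List[List[float]]:
    """Encode one object into four first-order coefficient signals.

    Assumes ACN channel order (W, Y, Z, X) and SN3D normalization.
    Each coefficient signal is the waveform scaled by a spherical
    harmonic evaluated at the object's direction.
    """
    az = math.radians(obj.azimuth_deg)
    el = math.radians(obj.elevation_deg)
    weights = [
        1.0,                          # W: omnidirectional component
        math.sin(az) * math.cos(el),  # Y: left-right component
        math.sin(el),                 # Z: up-down component
        math.cos(az) * math.cos(el),  # X: front-back component
    ]
    return [[w * s for s in obj.waveform] for w in weights]
```

Under these assumptions, an object placed directly in front contributes only to the W and X coefficient signals, while an object directly to the left contributes only to W and Y.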
With the proliferation of personal portable devices such as mobile phones and tablets, and with emerging applications of virtual/augmented reality, rendering immersive spatial audio over headphones is becoming increasingly necessary and attractive. Binauralization is the process of converting input spatial audio signals, for example, channel-based signals, object-based signals or Ambisonic-based signals, into headphone playback signals. In essence, the natural sound scene in a practical environment is perceived by a pair of human ears. This implies that the headphone playback signals should be able to render the spatial sound scene as naturally as possible if these playback signals are close to the sounds perceived by a human listener in the natural environment.
A typical example of binaural rendering is documented in the MPEG-H 3D audio standard [see NPL 1]. FIG. 1 illustrates the flow diagram for rendering channel-based and object-based input signals to binaural feeds in the MPEG-H 3D audio standard. Given the virtual loudspeaker layout configuration (e.g., 5.1, 7.1 or 22.2), the channel-based signals 1 . . . L1 and the object-based signals 1 . . . L2 are first converted to a number of virtual loudspeaker signals via a format converter (101) and a vector base amplitude panning (VBAP) renderer (102), respectively. The virtual loudspeaker signals are then converted to the binaural signals via a binaural renderer (103) using a binaural room impulse response (BRIR) database.
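The two stages of this flow can be illustrated with a simplified sketch: an object is first panned onto virtual loudspeakers, and each virtual loudspeaker signal is then convolved with a left-ear and a right-ear BRIR and summed per ear. This is not the processing specified in the standard. It assumes a horizontal-only (two-dimensional, pairwise) form of VBAP, direct time-domain convolution, and hypothetical function names (`vbap_2d_gains`, `convolve`, `binauralize`) and data layouts chosen for illustration.

```python
import math


def vbap_2d_gains(source_az_deg, spk_az_deg):
    """Pairwise 2D panning gains for one source over a horizontal ring.

    Finds the adjacent loudspeaker pair whose unit vectors reproduce the
    source direction with non-negative gains, then power-normalizes them.
    """
    src = math.radians(source_az_deg)
    px, py = math.cos(src), math.sin(src)
    order = sorted(range(len(spk_az_deg)), key=lambda i: spk_az_deg[i])
    for k in range(len(order)):
        i, j = order[k], order[(k + 1) % len(order)]
        ai, aj = math.radians(spk_az_deg[i]), math.radians(spk_az_deg[j])
        # Solve [u_i u_j] g = p by Cramer's rule (u = loudspeaker unit vector).
        det = math.cos(ai) * math.sin(aj) - math.cos(aj) * math.sin(ai)
        if abs(det) < 1e-9:
            continue  # collinear pair, cannot span the plane
        gi = (px * math.sin(aj) - py * math.cos(aj)) / det
        gj = (py * math.cos(ai) - px * math.sin(ai)) / det
        if gi >= -1e-9 and gj >= -1e-9:
            gi, gj = max(gi, 0.0), max(gj, 0.0)
            norm = math.hypot(gi, gj)
            gains = [0.0] * len(spk_az_deg)
            gains[i], gains[j] = gi / norm, gj / norm
            return gains
    raise ValueError("no loudspeaker pair encloses the source direction")


def convolve(x, h):
    """Direct-form linear convolution of two sample lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for m, hm in enumerate(h):
            y[n + m] += xn * hm
    return y


def binauralize(spk_signals, brirs):
    """Sum BRIR-filtered virtual loudspeaker signals into two ear feeds.

    spk_signals: one sample list per virtual loudspeaker (equal lengths).
    brirs: one (left_ir, right_ir) pair per loudspeaker (equal lengths).
    Returns [left_feed, right_feed].
    """
    out_len = len(spk_signals[0]) + len(brirs[0][0]) - 1
    out = [[0.0] * out_len, [0.0] * out_len]
    for sig, (h_left, h_right) in zip(spk_signals, brirs):
        for ear, h in enumerate((h_left, h_right)):
            for n, v in enumerate(convolve(sig, h)):
                out[ear][n] += v
    return out
```

In this sketch, an object waveform `x` at azimuth `az` would be rendered by scaling `x` with each gain returned by `vbap_2d_gains(az, layout)` to obtain the virtual loudspeaker signals, and then passing those signals with the matching BRIR pairs to `binauralize`. A practical renderer would typically use three-dimensional panning and a faster convolution method.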