Spatial audio processing relies on the effect of an audio signal emanating from an audio source arriving at the left and right ears of a listener via different propagation paths. As a consequence of this effect the signal at the left ear will typically have a different arrival time and signal level to that of the corresponding signal arriving at the right ear. The differences in arrival time and signal level are functions of the differences in the paths by which the audio signal travelled in order to reach the left and right ears respectively. The listener's brain then interprets these differences to give the perception that the received audio signal is being generated by an audio source located at a particular distance and direction relative to the listener.
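By way of illustration only, the dependence of the arrival time and level differences on the two propagation paths may be sketched as follows. The sketch assumes a free-field point source, ears modelled as two points 0.175 m apart, a speed of sound of 343 m/s and an inverse-distance amplitude law; none of these values is taken from the text above, the function name is hypothetical, and shadowing by the head is ignored.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second (assumed value)
EAR_SPACING = 0.175     # metres between the two ears (assumed value)


def interaural_differences(source_x, source_y):
    """Return (time difference in seconds, level difference in dB).

    Both values describe the left ear relative to the right ear. The
    listener is at the origin with the ears on the x axis.
    """
    left_ear = (-EAR_SPACING / 2.0, 0.0)
    right_ear = (EAR_SPACING / 2.0, 0.0)
    source = (source_x, source_y)

    # Length of each propagation path from the source to an ear.
    d_left = math.dist(source, left_ear)
    d_right = math.dist(source, right_ear)

    # A longer path to the left ear means the signal arrives there later.
    time_diff = (d_left - d_right) / SPEED_OF_SOUND

    # Amplitude is assumed to fall off as 1/distance, so the longer path
    # also gives the lower level.
    level_diff_db = 20.0 * math.log10(d_right / d_left)
    return time_diff, level_diff_db
```

Under these assumptions a source directly in front of the listener gives zero difference in both time and level, whereas a source to the listener's right gives a positive time difference (later arrival at the left ear) and a negative level difference (lower level at the left ear).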
An auditory scene may therefore be viewed as the net effect of simultaneously hearing audio signals generated by one or more audio sources located at various positions relative to the listener.
The fact that the human brain can process a binaural input signal in order to ascertain the position and direction of a sound source can be used to code and synthesise auditory scenes. A typical method of spatial auditory coding will therefore seek to model the salient features of an audio scene. This normally entails purposefully modifying audio signals from one or more different sources in order to generate left and right audio signals. In the art these signals may be collectively known as binaural signals. The resultant binaural signals may then be generated such that they give the perception of various audio sources located at different positions relative to the listener.
Recently, spatial audio techniques have been used in connection with multi-channel audio reproduction. The objective of multi-channel audio reproduction is to provide for efficient coding of multi-channel audio signals comprising five or more (a plurality of) separate audio channels or sound sources. Recent approaches to the coding of multi-channel audio signals have centred on the methods of parametric stereo (PS) and Binaural Cue Coding (BCC). BCC typically encodes the multi-channel audio signal by down mixing the various input audio signals into either a single (“sum”) channel or a smaller number of channels conveying the “sum” signal. In parallel, the most salient inter-channel cues, otherwise known as spatial cues, describing the multi-channel sound image or audio scene are extracted from the input channels and coded as side information. Together, the sum signal and the side information form the encoded parameter set, which can then either be transmitted as part of a communication chain or stored in a store-and-forward type device. Most implementations of the BCC technique employ a low bit rate audio coding scheme to further encode the sum signal. Finally, the BCC decoder generates a multi-channel output signal from the transmitted or stored sum signal and spatial cue information. Further information regarding the BCC technique can be found in the IEEE publication “Binaural Cue Coding - Part II: Schemes and Applications”, IEEE Transactions on Speech and Audio Processing, Vol. 11, No. 6, November 2003, by Baumgarte, F. and Faller, C. Typically, the down mix signals employed in spatial audio coding systems are additionally encoded using low bit rate perceptual audio coding techniques, such as the ISO/IEC Moving Picture Experts Group (MPEG) Advanced Audio Coding (AAC) standard, to further reduce the required bit rate.
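The down mixing step described above may be illustrated by the following minimal sketch, which forms a single sum channel from a plurality of input channels. The function name is hypothetical, and the sample-by-sample averaging used here is an assumption chosen for simplicity; the text does not specify how the sum signal is weighted, and practical BCC encoders may scale or equalise the sum differently.

```python
def downmix_to_sum(channels):
    """Down mix equal-length input channels into one sum channel.

    `channels` is a list of channels, each a list of samples. The sum
    channel is the sample-by-sample average of the inputs (an assumed
    weighting, used here only to keep the result within the input range).
    """
    if not channels:
        raise ValueError("at least one input channel is required")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all input channels must have the same length")

    # zip(*channels) yields, for each time index, one sample per channel.
    return [sum(samples) / len(channels) for samples in zip(*channels)]
```

In a complete encoder the sum channel produced in this way would be passed to a low bit rate audio coder, while the spatial cues would be extracted from the original input channels in parallel and conveyed as side information.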
In typical implementations of spatial audio multi-channel coding the set of spatial cues comprises: an inter-channel level difference (ICLD) parameter, which models the relative difference in audio levels between two channels; and an inter-channel time delay (ICTD) value, which represents the time difference or phase shift of the signal between the two channels. The audio level and time differences are usually determined for each channel with respect to a reference channel. Alternatively, some systems may generate the spatial audio cues with the aid of a head related transfer function (HRTF). Further information on such techniques may be found in The Psychoacoustics of Human Sound Localization by J. Blauert, published in 1983 by the MIT Press.
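The two cues may be estimated, for one channel with respect to a reference channel, along the lines of the following sketch. It is an illustration under stated assumptions rather than a definitive implementation: the function names are hypothetical, the level difference is expressed in decibels as a ratio of mean powers, the time delay is taken as the lag in samples that maximises the cross-correlation, and the cues are computed over a whole block of samples rather than per frequency band.

```python
import math


def channel_power(samples):
    """Mean power of a block of samples."""
    return sum(s * s for s in samples) / len(samples)


def icld_db(channel, reference):
    """Level of `channel` relative to `reference`, in decibels."""
    eps = 1e-12  # guards against taking the logarithm of zero
    ratio = (channel_power(channel) + eps) / (channel_power(reference) + eps)
    return 10.0 * math.log10(ratio)


def cross_correlation(channel, reference, lag):
    """Sum over n of reference[n] * channel[n + lag]."""
    total = 0.0
    for n, ref_sample in enumerate(reference):
        m = n + lag
        if 0 <= m < len(channel):
            total += ref_sample * channel[m]
    return total


def ictd_samples(channel, reference, max_lag):
    """Lag in samples at which `channel` best matches `reference`.

    A positive result means `channel` is delayed relative to `reference`.
    """
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(channel, reference, lag))
```

For example, if a channel is a copy of the reference channel delayed by two samples and halved in amplitude, this sketch reports an ICTD of 2 samples and an ICLD of approximately -6 dB.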
Although ICLD and ICTD parameters represent the most important spatial audio cues, spatial representations using these parameters may be further enhanced with the incorporation of an inter-channel coherence (ICC) parameter. Incorporating such a parameter into the set of spatial audio cues allows the perceived spatial “diffuseness” or, conversely, the spatial “compactness” to be represented in the reconstructed signal.
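One common way of expressing such a coherence measure is as the maximum magnitude of the normalised cross-correlation between two channels over a range of lags. The sketch below adopts that definition as an assumption, since the text does not state how the ICC parameter is calculated; the function names are hypothetical and the measure is computed over a whole block of samples rather than per frequency band.

```python
import math


def normalised_cross_correlation(channel, reference, lag):
    """Cross-correlation at `lag`, normalised by the channel energies."""
    numerator = 0.0
    for n, ref_sample in enumerate(reference):
        m = n + lag
        if 0 <= m < len(channel):
            numerator += ref_sample * channel[m]

    channel_energy = sum(s * s for s in channel)
    reference_energy = sum(s * s for s in reference)
    denominator = math.sqrt(channel_energy * reference_energy)

    # A silent channel has no defined coherence; report zero in that case.
    return numerator / denominator if denominator > 0.0 else 0.0


def icc(channel, reference, max_lag):
    """Largest magnitude of the normalised cross-correlation over all lags."""
    return max(
        abs(normalised_cross_correlation(channel, reference, lag))
        for lag in range(-max_lag, max_lag + 1)
    )
```

With this definition a value close to one indicates that the two channels are scaled and delayed copies of one another, corresponding to a spatially compact image, whereas a value close to zero indicates largely unrelated channels, corresponding to a diffuse image.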
For BCC, one of the major issues to be solved is the representation and efficient coding of the parameters associated with the coding process. As stated before, the down mix signal may be efficiently coded using conventional audio source coding techniques such as AAC, and the same efficient coding approach may also be applied to the spatial cue parameters. However, coding typically introduces errors into the spatial cue parameters, and one of the challenges is to improve the spatial audio experience for the listener without expending any more coding bandwidth than is absolutely necessary. One technique commonly used in speech and audio coding, which may be applied to BCC, is to enhance particular regions of the signal to be encoded in order to mask any errors introduced by the process of coding and to improve the overall perceived audio experience.
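The manner in which coding introduces errors into a spatial cue parameter may be illustrated by a uniform scalar quantiser applied to a single value, for instance a level difference in decibels. The quantiser, its step size and the function names below are assumptions made purely for illustration; the text does not specify how the spatial cue parameters are coded.

```python
import math


def quantise(value, step):
    """Map `value` to the index of the nearest multiple of `step`."""
    return math.floor(value / step + 0.5)


def dequantise(index, step):
    """Reconstruct the parameter value from its quantiser index."""
    return index * step


def quantisation_error(value, step):
    """Difference between the reconstructed and the original value."""
    return dequantise(quantise(value, step), step) - value
```

A larger step size requires fewer bits to convey the index but permits a larger error, the magnitude of which is bounded by half the step size. This is the trade-off between coding bandwidth and fidelity of the spatial cues referred to above.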