Spatial audio processing is the effect of an audio signal emanating from an audio source arriving at the left and right ears of a listener via different propagation paths. As a consequence of this effect the signal at the left ear will typically have a different arrival time and signal level to that of the corresponding signal arriving at the right ear. The difference between the times and signal levels are functions of the differences in the paths by which the audio signal travelled in order to reach the left and right ears respectively. The listener's brain then interprets these differences to give the perception that the received audio signal is being generated by an audio source located at a particular distance and direction relative to the listener.
An auditory scene therefore may be viewed as the net effect of simultaneously hearing audio signals generated by one or more audio sources located at various positions relative to the listener.
The mere fact that the human brain can process a binaural input signal in order to ascertain the position and direction of a sound source can be used to code and synthesise auditory scenes. A typical method of spatial auditory coding may thus attempt to model the salient features of an audio scene, by purposefully modifying audio signals from one or more different sources (channels). This may be for headphone use defined as left and right audio signals. These left and right audio signals may be collectively known as binaural signals. The resultant binaural signals may then be generated such that they give the perception of varying audio sources located at different positions relative to the listener. The binaural signal differs from a stereo signal in two respects. Firstly, a binaural signal has incorporated the time difference between left and right is and secondly the binaural signal employs the “head shadow effect” (where a reduction of volume for certain frequency bands is modelled).
Recently, spatial audio techniques have been used in connection with multi-channel audio reproduction. The objective of multichannel audio reproduction is to provide for efficient coding of multi channel audio signals comprising a plurality of separate audio channels or sound sources. Recent approaches to the coding of multichannel audio signals have centred on the methods of parametric stereo (PS) and Binaural Cue Coding (BCC). BCC typically encodes the multi-channel audio signal by down mixing the input audio signals into either a single (“sum”) channel or a smaller number of channels conveying the “sum” signal. In parallel, the most salient inter channel cues, otherwise known as spatial cues, describing the multi-channel sound image or audio scene are extracted from the input channels and coded as side information. Both the sum signal and side information form the encoded parameter set which can then either be transmitted as part of a communication chain or stored in a store and forward type device. Most implementations of the BCC technique typically employ a low bit rate audio coding scheme to further encode the sum signal. Finally, the BCC decoder generates a multi-channel output signal from the transmitted or stored sum signal and spatial cue information. Typically down mix signals employed in spatial audio coding systems are additionally encoded using low bit rate perceptual audio coding techniques such as AAC to further reduce the required bit rate.
Multi-channel audio coding where there is more than two sources have so far only been used in home theatre applications where bandwidth is not typically seen to be a major limitation. However multi-channel audio coding may be used in emerging multi-microphone implementations on many mobile devices to help exploit the full potential of these multi-microphone technologies. For example, multi-microphone systems may be used to produce better signal to noise ratios in communications in poor audio environments, by for example, enabling an audio zooming at the receiver where the receiver has the ability to focus on a specific source or direction in the received signal. This focus can then be changed dependent on the source required to be improved by the receiver.
Multi-channel systems as hinted above have an inherent problem in that an N channel/microphone source system when directly encoded produces a bit stream which requires approximately the N times the bandwidth of a single channel.
This multi-channel bandwidth requirement is typically prohibitive for wireless communication systems.
It is known that it may be possible to model a multi-channel/multi-source system by assuming that each channel has recorded the same source signals but with different time-delay and frequency dependent amplification characteristics. In some approaches used to reduce the bandwidth requirements (such as the binaural coding approached described above), it has been believed that the N channels could be joined into a single channel which is level (intensity) and time aligned. However this produces a problem in that the level and time alignment differs for different time and frequency elements. Furthermore there are typically several source signals occupying the same time-frequency location with each source signal requiring a different time and level alignment.
A separate approach that has been proposed has been to solve the problem of separating all of the audio sources (in other words the original source of the audio signal which is then detected by the microphone) from the signals and modelling the direction and acoustics of the original sources and the spaces defined by the microphones. However, this is computationally difficult and requires a large amount of processing power. Furthermore this approach may require separately encoding all of the original sources, and the number of original sources may exceed the number of original channels. In other words the number of modelled original sources may be greater than the number of microphone channels used to record the audio environment.
Currently therefore systems typically only code a multi-channel system as a single or small number of channels and code the other channels as a level or intensity difference value from the nearest channel. For example in a two (left and right) channel system typically a single mono-channel is created by averaging the left and right channels and then the signal energy level in the frequency band for both the left and right channels in a two-channel system is quantized and coded and stored/sent to the receiver. At the receiver/decoder, the mono-signal is copied to both channels and the signal levels in the left and right channels are set to match the received energy information in each frequency band in both recreated channels.
This type of system, due to the encoding, produces a less than optimal audio image and is unable to produce the depth of audio that a multi-channel system can produce