Spatial audio processing is the effect of an audio signal originating from an audio source arriving at the left and right ears of a listener via different propagation paths. As a consequence of this effect the signal at the left ear will typically have a different arrival time and signal level from those of the corresponding signal arriving at the right ear. The differences between the arrival times and signal levels are functions of the differences in the paths by which the audio signal travelled in order to reach the left and right ears respectively. The listener's brain then interprets these differences to give the perception that the received audio signal is being generated by an audio source located at a particular distance and direction relative to the listener. An auditory scene therefore may be viewed as the net effect of simultaneously hearing audio signals generated by one or more audio sources located at various positions relative to the listener.
The mere fact that the human brain can process a binaural input signal in order to ascertain the position and direction of a sound source can be used to encode and synthesise auditory scenes. A typical method of spatial auditory coding may thus attempt to model the salient features of an audio scene, by purposefully modifying audio signals from one or more different sources (channels). This may be for headphone use defined as left and right audio signals. These left and right audio signals may be collectively known as binaural signals. The resultant binaural signals may then be generated such that they give the perception of varying audio sources located at different positions relative to the listener. A binaural signal typically exhibits two properties not necessarily present in a conventional stereo signal. Firstly, a binaural signal has incorporated the time difference between left and right and, secondly, the binaural signal models the so called “head shadow effect”, which results in a reduction of volume for certain frequency bands.
Recently, spatial audio techniques have been used in connection with multichannel audio reproduction. The objective of multichannel audio reproduction is to provide for efficient coding of multi channel audio signals comprising a plurality of separate audio channels and/or sound sources. Recent approaches to the coding of multichannel audio signals have centred on the methods of parametric stereo (PS), such as Binaural Cue Coding (BCC). BCC typically encodes the multi-channel audio signal by down mixing the input audio signals into either a single (“sum”) channel or a reduced number of channels conveying the “sum” signal. The “sum” signal may be also referred to as a downmix signal. In parallel, the most salient inter channel cues, otherwise known as spatial cues, describing the multi-channel sound image or audio scene are extracted from the input channels and encoded as side information. Both the sum signal and side information from the encoded parameter set which can then either be transmitted as part of a communication chain or stored in a store and forward type device. Many implementations of the BCC technique employ a low bit rate audio coding scheme to further encode the sum signal. Subsequently, the BCC decoder generates a multi-channel output signal from the transmitted or stored sum signal and spatial cue information. Typically “sum” signals (i.e. downmix signals) employed in spatial audio coding systems are additionally encoded using low bit rate perceptual audio coding techniques, such as Advanced Audio Coding (AAC) or ITU-T Recommendation G.718 to further reduce the required bit rate.
In stereo coding of audio signals two audio channels are encoded. In many cases the audio channels may have rather similar content at least part of a time. Therefore, compression of the audio signals can be performed efficiently by coding the channels together. This results in overall bit rate which can be lower than the bit rate required for coding channels independently.
A commonly used low bit rate stereo coding method is known as the parametric stereo coding. In parametric stereo coding a stereo signal is encoded using a mono coder and parametric representation of the stereo signal. The parametric stereo encoder computes a mono signal as a linear combination of the input signals. The mono signal may be encoded using conventional mono audio encoder. In addition to creating and coding the mono signal, the encoder extracts parametric representation of the stereo signal. Parameters may include information on level differences, phase (or time) differences and coherence between input channels. In the decoder side this parametric information is utilized to recreate stereo signal from the decoded mono signal. Parametric stereo is an improved version of the intensity stereo coding, in which only the level differences between channels are extracted.
Another common stereo coding method, especially for higher bit rates, is known as mid-side stereo, which can be abbreviated as M/S stereo. Mid-side stereo coding transforms the left and right channels into a mid channel and a side channel. The mid channel is the sum of the left and right channels, whereas the side channel is the difference of the left and right channels. These two channels are encoded independently. With accurate enough quantization mid-side stereo retains the original audio image relatively well without introducing severe artifacts. On the other hand, for good quality reproduced audio the required bit rate remains at quite a high level.
In many cases stereo signals are generated artificially by panning different sound sources to two channels. In these cases there typically are not time delays between channels, and the signals can be efficiently encoded using for example parametric or mid-side coding.
A special case of a stereo signal is a binaural signal. A binaural audio signal may be recorded for example by using microphones mounted in an artificial head or with a real user wearing a head set with microphones in the close proximity of his/her ears, or by using other real recording arrangement with two microphones close to each other. These kind of signals can also be artificially generated. For example, binaural signals can be generated by applying suitable head related transfer functions (HRTF) or corresponding head related impulse responses (HRIR) to a source signal. All these discussed signals have one special feature not typically present in generic two-channel audio: both channels contain in principle the same source signals with a different time delay and frequency dependent amplification. Time delay is dependent on the direction of arrival of the sound. In the following, all these kinds of signals are referred as binaural audio.
One problem is how to reduce the number of bits needed to encode good quality binaural audio. Mid-side stereo coding and parametric stereo coding techniques do not perform well, as they may not take into consideration time delays between channels. In case of parametric stereo, the time delay information may be totally lost. Mid-side stereo, on the other hand, may require high bit rate for binaural signals for good quality. For maximum compression with good quality, binaural audio specific coding method should be used.
It is feasible to think that two binaural channels can be efficiently joined into one channel, such as in parametric stereo coding, if the signals can first be time aligned, i.e. the time delays between channels are removed. Similarly, the time differences can be restored in the decoder. Alternatively, the time aligned signals can be used for improving the efficiency of mid-side stereo coding.
One difficulty in time alignment lies in the fact that the time differences between channels of an input signal may be different for different time and frequency locations. In addition, there may be several source signals occupying the same time-frequency location. Further, the time alignment has to be performed carefully because if time shifts are not performed cautiously, perceptual problems may arise.