Parametric stereo or multi-channel audio coding, as described e.g. in C. Faller and F. Baumgarte, “Efficient representation of spatial audio using perceptual parametrization,” in Proc. IEEE Workshop on Appl. of Sig. Proc. to Audio and Acoust., October 2001, pp. 199-202, uses spatial cues to synthesize multi-channel audio signals from down-mix audio signals (usually mono or stereo), the multi-channel audio signals having more channels than the down-mix audio signals. Usually, the down-mix audio signals result from a superposition of a plurality of audio channel signals of a multi-channel audio signal, e.g. of a stereo audio signal. This smaller number of channels is waveform coded, and side information, i.e. the spatial cues, describing the relations between the original signal channels is added as encoding parameters to the coded audio channels. The decoder uses this side information to re-generate the original number of audio channels based on the decoded waveform coded audio channels.
A basic parametric stereo coder may use inter-channel level differences (ILD) as a cue needed for generating the stereo signal from the mono down-mix audio signal. More sophisticated coders may also use the inter-channel coherence (ICC), which may represent a degree of similarity between the audio channel signals, i.e. audio channels. Furthermore, when coding binaural stereo signals, e.g. for 3D audio or headphone-based surround rendering, an inter-channel phase difference (IPD) may also play a role in reproducing phase/delay differences between the channels.
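The role of the ILD cue can be illustrated with a minimal sketch. It assumes a down-mix formed as the average of the two channels and two channels that are fully coherent and in phase, so that the ILD alone is enough to re-generate them; the function names are hypothetical and the sketch does not reproduce any particular codec.

```python
import math

def encode_mono_with_ild(left, right):
    """Return (mono down-mix, ILD in dB) for one frame of a stereo signal.

    Assumes both channels have non-zero energy in the frame.
    """
    # Down-mix as a superposition (here: the average) of the two channels.
    downmix = [(l + r) / 2.0 for l, r in zip(left, right)]
    energy_left = sum(l * l for l in left)
    energy_right = sum(r * r for r in right)
    ild_db = 10.0 * math.log10(energy_left / energy_right)
    return downmix, ild_db

def decode_stereo_from_ild(downmix, ild_db):
    """Re-generate two channels from the mono down-mix and the ILD cue."""
    gain_ratio = 10.0 ** (ild_db / 20.0)  # amplitude ratio left/right
    # Gains chosen so that (left + right) / 2 equals the down-mix again.
    gain_left = 2.0 * gain_ratio / (1.0 + gain_ratio)
    gain_right = 2.0 / (1.0 + gain_ratio)
    return [gain_left * m for m in downmix], [gain_right * m for m in downmix]
```

Only the mono down-mix and one ILD value per frame would be transmitted in this sketch; a practical coder would typically compute such cues per frequency band.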
The inter-aural time difference (ITD) is the difference in arrival time of a sound 701 between two ears 703, 705 as can be seen from FIG. 7. It is important for the localization of sounds, as it provides a cue to identify the direction 707 or angle (theta) of incidence of the sound source 701 (relative to the head 709). If a signal arrives at the ears 703, 705 from one side, the signal has a longer path 711 to reach the far ear 703 (contralateral) and a shorter path 713 to reach the near ear 705 (ipsilateral). This path length difference results in a time difference 715 between the arrivals of the sound at the ears 703, 705, which is detected and aids the process of identifying the direction 707 of the sound source 701.
FIG. 7 gives an example of ITD (denoted as Δt or time difference 715). Differences in time of arrival at the two ears 703, 705 are indicated by a delay of the sound waveform. If the waveform reaches the left ear 703 first, the ITD 715 is positive; otherwise, it is negative. If the sound source 701 is directly in front of the listener, the waveform arrives at both ears 703, 705 at the same time and the ITD 715 is thus zero.
ITD cues are important for most stereo recordings. For instance, a binaural audio signal, which can be obtained from a real recording using e.g. a dummy head or from binaural synthesis based on Head Related Transfer Function (HRTF) processing, is used for music recording or audio conferencing. The ITD is therefore a very important parameter for low bitrate parametric stereo codecs, especially for codecs targeting conversational applications. A low-complexity and stable ITD estimation algorithm is needed for such codecs. Furthermore, the use of ITD parameters, e.g. in addition to other parameters such as inter-channel level differences (CLDs or ILDs) and inter-channel coherence (ICC), may increase the bitrate overhead. In the very low bitrate scenario, only one full band ITD parameter can be transmitted. When only one full band ITD is estimated, the stability constraint becomes even more difficult to meet.
In the prior art, ITD estimation methods can be classified into three main categories.
ITD estimation may be based on time domain methods. The ITD is estimated from the time domain cross-correlation between the channels: the ITD corresponds to the delay n at which the time domain cross-correlation
$$(f \star g)[n] = \sum_{m=-\infty}^{\infty} f^{*}[m]\, g[n+m]$$
is maximum. This method provides an unstable estimation of the delay over several frames. This is particularly true when the input signals f and g are wide-band signals with a complex sound scene, as different sub-band signals may have different ITD values. An unstable ITD may introduce a click (noise) when the delay is switched between consecutive frames in the decoder. When this time domain analysis is performed on the full band signal, the bitrate of time domain ITD estimation is low, since only one ITD is estimated, coded and transmitted. However, the complexity is very high, due to the cross-correlation calculation on signals with a high sampling frequency.
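The time domain approach can be sketched as follows for real-valued signals of one frame. The function name, the restriction of the search to a range of ±max_lag samples and the sign convention (a positive lag means the left channel leads, in line with the convention of FIG. 7) are assumptions made for illustration only.

```python
def estimate_itd_time_domain(left, right, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means the left channel leads, i.e. right[n] is
    approximately left[n - lag].
    """
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation sum over m of left[m] * right[m + lag],
        # restricted to the samples available in the frame.
        value = 0.0
        for m in range(len(left)):
            k = m + lag
            if 0 <= k < len(right):
                value += left[m] * right[k]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag

# Example: the right channel is the left channel delayed by 3 samples.
left = [0.0] * 16
left[4], left[5] = 1.0, -0.5
right = [0.0] * 3 + left[:-3]
lag = estimate_itd_time_domain(left, right, max_lag=8)
itd_seconds = lag / 48000.0  # assuming a 48 kHz sampling rate
```

The nested loops make the cost grow with the product of the frame length and the number of candidate lags, which illustrates why this approach becomes expensive at high sampling frequencies.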
The second category of ITD estimation method is based on a combination of frequency and time domain approaches. In Marple, S. L., Jr.; “Estimating group delay and phase delay via discrete-time “analytic” cross-correlation,” Signal Processing, IEEE Transactions on, vol. 47, no. 9, pp. 2604-2607, September 1999, the frequency and time domain ITD estimation contains the following steps:
1. Fast Fourier Transform (FFT) analysis is applied to the input signals in order to get frequency coefficients.
2. Cross-correlation is calculated in the frequency domain.
3. Frequency domain cross correlation is converted to time domain using an inverse FFT.
4. The ITD is estimated in complex time domain.
This method can also meet the low bitrate constraint, since only one full band ITD is estimated, coded and transmitted. However, the complexity is very high, due to the cross-correlation calculation and the inverse FFT, which makes this method not applicable when the computational complexity is limited.
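The four steps of the combined frequency and time domain approach can be sketched as below. This is a simplified illustration, not the method of the cited paper: it uses a naive discrete Fourier transform in place of an FFT to stay self-contained, and it searches the real part of the ordinary cross-correlation rather than the "analytic" cross-correlation. The function names are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def estimate_itd_fft(left, right, max_lag):
    """Return the lag (in samples) at the peak of the cross-correlation.

    Assumes len(left) == len(right) and max_lag < len(left).
    A positive result means the left channel leads.
    """
    n = len(left)
    size = 2 * n  # zero-padding so that circular lags equal linear lags
    # Step 1: transform both channels to the frequency domain.
    spec_left = dft(list(left) + [0.0] * (size - n))
    spec_right = dft(list(right) + [0.0] * (size - n))
    # Step 2: cross-correlation in the frequency domain (cross-spectrum).
    cross = [a.conjugate() * b for a, b in zip(spec_left, spec_right)]
    # Step 3: back to the time domain with the inverse transform.
    corr = dft(cross, inverse=True)
    # Step 4: pick the lag with the largest correlation value.
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: corr[lag % size].real)
```

Negative lags are read from the end of the inverse transform output, which is why the index is taken modulo the padded length.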
Finally, the last category performs the ITD estimation directly in the frequency domain. In Baumgarte, F.; Faller, C.; “Binaural cue coding-Part I: psychoacoustic fundamentals and design principles,” Speech and Audio Processing, IEEE Transactions on, vol. 11, no. 6, pp. 509-519, November 2003 and in Faller, C.; Baumgarte, F.; “Binaural cue coding-Part II: Schemes and applications,” Speech and Audio Processing, IEEE Transactions on, vol. 11, no. 6, pp. 520-531, November 2003, the ITD is estimated in the frequency domain, and for each frequency band, an ITD is coded and transmitted. The complexity of this solution is limited, but the required bitrate for this method is high, as one ITD per sub-band has to be transmitted.
Moreover, the reliability and stability of the estimated ITD depend on the frequency bandwidth of the sub-band signal, since for a wide sub-band the ITD might not be consistent (different audio sources at different positions might be present in the band-limited audio signal).
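One possible way to obtain a per-band ITD directly in the frequency domain is sketched below. It is an assumption made for illustration and not the procedure of the cited papers: the delay of each frequency bin is derived from the phase of the cross-spectrum, and the bin delays are averaged within each band. The band layout and the function names are hypothetical, and the sketch is only valid while the phase difference stays within ±π, i.e. for low bins or small delays.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def estimate_itd_per_band(left, right, bands):
    """Return one ITD (in samples) per band.

    bands is a list of (first_bin, end_bin) pairs, end_bin exclusive,
    with first_bin >= 1. A positive ITD means the left channel leads.
    """
    n = len(left)
    spec_left, spec_right = dft(left), dft(right)
    itds = []
    for first_bin, end_bin in bands:
        delays = []
        for k in range(first_bin, end_bin):
            # Phase difference between the channels at bin k.
            cross = spec_left[k].conjugate() * spec_right[k]
            ipd = cmath.phase(cross)
            # A delay of d samples gives a phase of -2*pi*k*d/n at bin k.
            delays.append(-ipd * n / (2.0 * math.pi * k))
        itds.append(sum(delays) / len(delays))
    return itds
```

With a single source the bands agree on one delay; with several sources at different positions inside a wide band, the bin delays being averaged would differ, which is the consistency problem noted above.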
Very low bitrate parametric multichannel audio coding schemes are constrained not only in bitrate but also in available complexity, especially for codecs targeting implementation in mobile terminals, where battery life must be preserved. The state of the art ITD estimation algorithms cannot meet both the low bitrate and the low complexity requirements at the same time while maintaining good quality in terms of stability of the ITD estimation.