The present disclosure relates to audio coding and, in particular, to parametric multi-channel or stereo audio coding, also known as parametric spatial audio coding.
Parametric stereo or multi-channel audio coding as described e.g. in C. Faller and F. Baumgarte, “Efficient representation of spatial audio using perceptual parametrization,” in Proc. IEEE Workshop on Appl. of Sig. Proc. to Audio and Acoust., October 2001, pp. 199-202, uses spatial cues to synthesize multi-channel audio signals from down-mix audio signals (usually mono or stereo), the multi-channel audio signals having more channels than the down-mix audio signals. Usually, the down-mix audio signals result from a superposition of a plurality of audio channel signals of a multi-channel audio signal, e.g. of a stereo audio signal. These fewer channels are waveform coded, and side information, i.e. the spatial cues, related to the original signal channel relations is added as encoding parameters to the coded audio channels. The decoder uses this side information to re-generate the original number of audio channels based on the decoded waveform coded audio channels.
A basic parametric stereo coder may use inter-channel level differences (ILD or CLD) as a cue needed for generating the stereo signal from the mono down-mix audio signal. More sophisticated coders may also use the inter-channel coherence (ICC), which may represent a degree of similarity between the audio channel signals, i.e. audio channels. Furthermore, when coding binaural stereo signals, e.g. for 3D audio or headphone based surround rendering by using head-related transfer function (HRTF) filtering, an inter-aural time difference (ITD) may play a role in reproducing delay differences between the channels.
The inter-aural time difference (ITD) is the difference in arrival time of a sound 801 between two ears 803, 805 as can be seen from FIG. 8. It is important for the localization of sounds, as it provides a cue to identify the direction 807 or angle of incidence of the sound source 801 (relative to the head 809). If a signal arrives at the ears 803, 805 from one side, the signal has a longer path 811 to reach the far ear 803 (contralateral) and a shorter path 813 to reach the near ear 805 (ipsilateral). This path length difference results in a time difference 815 between the arrivals of the sound at the ears 803, 805, which is detected and aids the process of identifying the direction 807 of the sound source 801.
FIG. 8 gives an example of ITD (denoted as Δt or time difference 815). Differences in time of arrival at the two ears 803, 805 are indicated by a delay of the sound waveform. If the waveform arrives at the left ear 803 first, the ITD 815 is positive; otherwise, it is negative. If the sound source 801 is directly in front of the listener, the waveform arrives at both ears 803, 805 at the same time and the ITD 815 is thus zero.
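The sign convention described above can be illustrated with a minimal sketch. It assumes a plain time-domain cross-correlation whose peak lag is taken as the ITD in samples; the function names `estimate_itd` and `delay`, the 12-sample delay and the white-noise test signal are hypothetical choices made for illustration and are not taken from the disclosure.

```python
import random

def estimate_itd(left, right, max_lag):
    """Return the lag d (in samples) that maximizes sum_n left[n] * right[n + d].

    A positive result means the waveform reaches the left channel first.
    """
    n = len(left)
    best_lag, best_val = 0, float("-inf")
    for d in range(-max_lag, max_lag + 1):
        # Only sum over indices where both left[i] and right[i + d] exist.
        val = sum(left[i] * right[i + d] for i in range(max(0, -d), min(n, n - d)))
        if val > best_val:
            best_lag, best_val = d, val
    return best_lag

def delay(signal, d):
    """Delay a signal by d >= 0 samples, zero-padding the start."""
    return [0.0] * d + signal[:len(signal) - d]

# Hypothetical test signal: white noise standing in for the source waveform.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(512)]

itd_left_first = estimate_itd(source, delay(source, 12), 32)   # right ear delayed
itd_right_first = estimate_itd(delay(source, 12), source, 32)  # left ear delayed
itd_front = estimate_itd(source, source, 32)                   # no delay
```

In this sketch the lag is expressed in samples; dividing it by the sampling rate would give the time difference Δt in seconds.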
ITD cues are important for most stereo recordings. For instance, a binaural audio signal, which can be obtained from a real recording using, for instance, a dummy head, or from binaural synthesis based on Head Related Transfer Function (HRTF) processing, is used for music recording or audio conferencing. Therefore, the ITD is a very important parameter for low bitrate parametric stereo codecs, and especially for codecs targeting conversational applications. A low-complexity and stable ITD estimation algorithm is needed for low bitrate parametric stereo codecs. Furthermore, the use of ITD parameters, e.g. in addition to other parameters, such as inter-channel level differences (CLDs or ILDs) and inter-channel coherence (ICC), may increase the bitrate overhead. In a very low bitrate scenario, only one full band ITD parameter can be transmitted. When only one full band ITD is estimated, the constraint on stability becomes even more difficult to satisfy.
When a parameter is estimated by using a cross-correlation, a cross spectrum or an energy, a rapid change of the estimation function may lead to an unstable estimation of the parameter. The estimated parameter might change too quickly and too frequently from frame to frame, which is usually not wanted. This can be the case if the size of the frame is small, which can lead to an unreliable estimate of the cross-correlation. The instability problem will be perceived as a source which seems to be jumping from the left side to the right side and/or vice versa although the actual source does not change its position. The instability problem can also be detected by a listener even if the source position does not jump from the left side to the right side. Small source position changes over time are easily perceived by a listener and should therefore be avoided when the actual source is fixed.
For example, the inter-aural time difference (ITD) is an important parameter for parametric stereo codecs. If the ITD is estimated in the frequency domain based on the computation of a cross-correlation function, the estimated ITD is usually not stable over consecutive frames, even if the position of the sound source is fixed and the real ITD is stable. Stability problems can be solved by applying a smoothing function to the cross-correlation before using it for the ITD estimation. However, when the cross-correlation is smoothed, rapid changes of the actual ITD cannot be followed: smoothing that is strong enough to give a stable estimate degrades the ability to track quick ITD changes when the sound source and the listening position move with respect to each other.
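The trade-off between stability and tracking can be made concrete with a small sketch. It assumes a first-order recursive (exponential) smoothing of the cross-correlation across frames, which is only one possible smoothing function. For brevity and to stay within the standard library, the circular cross-correlation is computed directly in the time domain rather than via a cross spectrum and an inverse FFT. The frame length `N`, the smoothing factor `ALPHA`, the white-noise frames and the jump of the true ITD at frame 10 are hypothetical values chosen for illustration.

```python
import random

N = 128       # frame length in samples (hypothetical)
MAX_LAG = 16  # largest lag searched, in samples (hypothetical)
ALPHA = 0.9   # smoothing factor (hypothetical); closer to 1 means heavier smoothing

def circular_xcorr(left, right, max_lag):
    """c[d] = sum_n left[n] * right[(n + d) mod N] for d in [-max_lag, max_lag]."""
    n = len(left)
    return [sum(left[i] * right[(i + d) % n] for i in range(n))
            for d in range(-max_lag, max_lag + 1)]

def peak_lag(xcorr, max_lag):
    """Lag (in samples) at which the cross-correlation is largest."""
    return max(range(len(xcorr)), key=lambda k: xcorr[k]) - max_lag

rng = random.Random(0)
raw_itd, smoothed_itd = [], []
memory = [0.0] * (2 * MAX_LAG + 1)  # smoothed cross-correlation state

for frame in range(30):
    true_itd = 5 if frame < 10 else -5  # the source jumps sides at frame 10
    left = [rng.gauss(0.0, 1.0) for _ in range(N)]
    right = [left[(i - true_itd) % N] for i in range(N)]  # circular delay
    xcorr = circular_xcorr(left, right, MAX_LAG)
    # Recursive smoothing of the cross-correlation over consecutive frames.
    memory = [ALPHA * m + (1.0 - ALPHA) * c for m, c in zip(memory, xcorr)]
    raw_itd.append(peak_lag(xcorr, MAX_LAG))
    smoothed_itd.append(peak_lag(memory, MAX_LAG))
```

With these assumed values, the estimate taken from the unsmoothed cross-correlation follows the jump in the very frame where it occurs, whereas the estimate taken from the smoothed cross-correlation keeps reporting the old ITD for several frames before it switches. A smaller `ALPHA` would shorten that lag at the cost of a less stable estimate on noisier input.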
Another example is channel level difference (CLD) estimation. The CLD is an important parameter for parametric stereo codecs. If the CLD is estimated in the frequency domain based on the computation of the energy of each bin or sub-band, the estimated CLD is usually not stable over consecutive frames, even if the position of the sound source is fixed and the real level difference is stable. Stability problems can be solved by applying a smoothing function to the energy before using it for the CLD estimation. However, when the energy is smoothed, rapid changes of the actual CLD cannot be followed, which degrades the ability to track quick CLD changes when the sound source and the listening position move with respect to each other.
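The same effect can be sketched for the CLD. The example below assumes a single full-band energy per channel and frame (rather than per bin or sub-band), a CLD defined as ten times the base-10 logarithm of the left-to-right energy ratio, and a first-order recursive smoothing of the energies. The frame length `N`, the smoothing factor `ALPHA`, the independent white-noise channels and the level flip at frame `CHANGE` are hypothetical values chosen for illustration.

```python
import math
import random
import statistics

N = 64        # frame length in samples (hypothetical)
ALPHA = 0.9   # smoothing factor (hypothetical)
CHANGE = 50   # frame at which the true level difference flips (hypothetical)

def energy(x):
    """Full-band energy of one frame."""
    return sum(v * v for v in x)

def cld_db(e_left, e_right):
    """Channel level difference in dB; positive when the left channel is louder."""
    return 10.0 * math.log10(e_left / e_right)

rng = random.Random(1)
raw_cld, smoothed_cld = [], []
mem_left = mem_right = None  # smoothed energy state

for frame in range(100):
    # True CLD is about +6 dB before the change and about -6 dB after it.
    right_gain = 0.5 if frame < CHANGE else 2.0
    left = [rng.gauss(0.0, 1.0) for _ in range(N)]
    right = [right_gain * rng.gauss(0.0, 1.0) for _ in range(N)]
    e_left, e_right = energy(left), energy(right)
    if mem_left is None:
        mem_left, mem_right = e_left, e_right  # initialize with the first frame
    else:
        mem_left = ALPHA * mem_left + (1.0 - ALPHA) * e_left
        mem_right = ALPHA * mem_right + (1.0 - ALPHA) * e_right
    raw_cld.append(cld_db(e_left, e_right))
    smoothed_cld.append(cld_db(mem_left, mem_right))

# Frame-to-frame spread while the true level difference is constant.
raw_spread = statistics.pstdev(raw_cld[20:CHANGE])
smoothed_spread = statistics.pstdev(smoothed_cld[20:CHANGE])
```

Under these assumptions the smoothed CLD fluctuates less than the raw CLD while the level difference is constant, but directly after the flip the raw CLD already has the new sign while the smoothed CLD still has the old one.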
Finding smoothing coefficients that allow ITD or CLD changes to be followed quickly while keeping the ITD or CLD stable has proven to be impossible, especially when the correlation function has a poor resolution, for instance when it is limited by the frequency resolution of an FFT.