The present application is related to stereo processing or, generally, multi-channel processing, where a multi-channel signal has two channels such as a left channel and a right channel in the case of a stereo signal or more than two channels, such as three, four, five or any other number of channels.
Stereo speech, and particularly conversational stereo speech, has received much less scientific attention than the storage and broadcasting of stereophonic music. Indeed, monophonic transmission is still predominantly used in speech communications today. However, with the increase of network bandwidth and capacity, it is envisioned that communications based on stereophonic technologies will become more popular and bring a better listening experience.
Efficient coding of stereophonic audio material has long been studied in perceptual audio coding of music for efficient storage or broadcasting. At high bit-rates, where waveform preservation is crucial, sum-difference stereo, known as mid/side (M/S) stereo, has been employed for a long time. For low bit-rates, intensity stereo and, more recently, parametric stereo coding have been introduced. The latter technique was adopted in different standards such as HE-AAC v2 and MPEG USAC. It generates a down-mix of the two-channel signal and associates compact spatial side information with it.
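As an illustration of the sum-difference (M/S) stereo mentioned above, the following minimal sketch converts a left/right pair into mid and side signals and back. The function names and the 1/2 scaling are assumptions chosen for illustration (other scaling conventions exist); the sketch shows only the waveform-preserving transform, not any quantization or coding stage.

```python
def ms_encode(left, right):
    """Convert left/right samples to mid (sum) and side (difference) samples."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Recover left/right samples from mid/side samples."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Hypothetical sample values, chosen only to exercise the round trip.
left = [0.5, 0.25, -0.75, 1.0]
right = [0.5, -0.25, -0.25, 0.0]
mid, side = ms_encode(left, right)
rec_left, rec_right = ms_decode(mid, side)
```

When the two channels are similar, the side signal carries little energy, which is what makes this representation attractive for waveform-preserving coding.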
Joint stereo coding is usually built on a time-frequency transformation of the signal with high frequency resolution, i.e., low time resolution, and is therefore not compatible with the low-delay, time-domain processing performed in most speech coders. Moreover, the resulting bit-rate is usually high.
On the other hand, parametric stereo employs an extra filter-bank positioned in the front-end of the encoder as a pre-processor and in the back-end of the decoder as a post-processor. Therefore, parametric stereo can be used with conventional speech coders like ACELP, as is done in MPEG USAC. Moreover, the parametrization of the auditory scene can be achieved with a minimum amount of side information, which is suitable for low bit-rates. However, parametric stereo, as for example in MPEG USAC, is not specifically designed for low delay and does not deliver consistent quality across different conversational scenarios. In the conventional parametric representation of the spatial scene, the width of the stereo image is artificially reproduced by a decorrelator applied to the two synthesized channels and controlled by inter-channel coherence (IC) parameters computed and transmitted by the encoder. For most stereo speech, this way of widening the stereo image is not appropriate for recreating the natural ambience of speech, which is a fairly direct sound, since it is produced by a single source located at a specific position in space (sometimes with some reverberation from the room). By contrast, musical instruments have much more natural width than speech, which can be better imitated by decorrelating the channels.
Problems also occur when speech is recorded with non-coincident microphones, as in an A-B configuration where the microphones are distant from each other, or in binaural recording or rendering. Such scenarios can be envisioned for capturing speech in teleconferences or for creating a virtual auditory scene with distant speakers in a multipoint control unit (MCU). The time of arrival of the signal then differs from one channel to the other, unlike in recordings made with coincident microphones such as X-Y (intensity recording) or M-S (mid-side recording). The coherence of two such non-time-aligned channels can then be wrongly estimated, which causes the artificial ambience synthesis to fail.
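The effect of a time offset on a coherence estimate can be illustrated with a minimal sketch. It assumes a noise-like source, a pure integer delay between the channels, and a broadband, time-domain normalized correlation as a stand-in for a coherence measure (practical systems typically estimate coherence per frequency band, and speech is more self-correlated than noise, so the drop would be less extreme). All names and values are hypothetical.

```python
import math
import random


def normalized_correlation(x, y, lag=0):
    """Normalized cross-correlation of x[n] and y[n + lag] over the overlapping samples."""
    pairs = [(x[n], y[n + lag]) for n in range(len(x)) if 0 <= n + lag < len(y)]
    num = sum(a * b for a, b in pairs)
    den = math.sqrt(sum(a * a for a, _ in pairs) * sum(b * b for _, b in pairs))
    return num / den if den > 0 else 0.0


# Assumption: one noise-like source reaches the second microphone 8 samples later.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(512)]
delay = 8
first = source
second = [0.0] * delay + source[:-delay]

# Measured without compensating the delay, the channels look almost unrelated.
coherence_unaligned = normalized_correlation(first, second, lag=0)
# Measured at the true delay, the channels are recognized as the same source.
coherence_aligned = normalized_correlation(first, second, lag=delay)
```

Under these assumptions, the unaligned estimate is close to zero although both channels carry the same source, so a decoder driven by it would add decorrelation that the original scene does not contain.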
Conventional technology references related to stereo processing are U.S. Pat. Nos. 5,434,948 and 8,811,621.
Document WO 2006/089570 A1 discloses a near-transparent or transparent multi-channel encoder/decoder scheme. The multi-channel encoder/decoder scheme additionally generates a waveform-type residual signal. This residual signal is transmitted together with one or more multi-channel parameters to a decoder. In contrast to a purely parametric multi-channel decoder, the enhanced decoder generates a multi-channel output signal having an improved output quality because of the additional residual signal. On the encoder side, a left channel and a right channel are both filtered by an analysis filterbank. Then, for each subband signal, an alignment value and a gain value are calculated. Such an alignment is then performed before further processing. On the decoder side, a de-alignment and a gain processing are performed, and the corresponding signals are then synthesized by a synthesis filterbank in order to generate a decoded left signal and a decoded right signal.
In such stereo processing applications, the calculation of an inter-channel time difference between a first channel signal and a second channel signal is useful, typically in order to perform a broadband time alignment procedure. However, other applications for an inter-channel time difference between a first channel and a second channel do exist, such as the storage or transmission of parametric data, stereo/multi-channel processing comprising a time alignment of two channels, a time difference of arrival estimation for determining a speaker position in a room, beamforming spatial filtering, foreground/background decomposition, or the location of a sound source by, for example, acoustic triangulation, to name only a few.
For all such applications, an efficient, accurate and robust determination of an inter-channel time difference between a first and a second channel signal is desirable.
Such determinations already exist and are known under the term “GCC-PHAT” or, stated differently, generalized cross-correlation phase transform. Typically, a cross-correlation spectrum is calculated between the two channel signals and, then, a weighting function is applied to the cross-correlation spectrum in order to obtain a so-called generalized cross-correlation spectrum, before an inverse spectral transform such as an inverse DFT is applied to the generalized cross-correlation spectrum in order to find a time-domain representation. This time-domain representation represents values for certain time lags, and the highest peak of the time-domain representation then typically corresponds to the time delay or time difference, i.e., the inter-channel time delay or difference between the two channel signals.
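The steps described above can be sketched as follows. This is a minimal illustration of the general GCC-PHAT technique, not of any particular implementation: the function names, the zero-padding, the lag search range and the test signal are assumptions, the phase transform weighting is taken as the reciprocal of the magnitude of the cross-correlation spectrum, and a naive O(N²) DFT from the standard library stands in for the FFT that would be used in practice.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; the inverse includes the 1/N factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def gcc_phat(first, second, max_lag):
    """Return the lag in samples by which `second` trails `first` (positive: later)."""
    n = len(first) + len(second)  # zero-pad so the circular correlation does not wrap
    spec1 = dft(list(first) + [0.0] * (n - len(first)))
    spec2 = dft(list(second) + [0.0] * (n - len(second)))
    # Cross-correlation spectrum between the two channel signals.
    cross = [b * a.conjugate() for a, b in zip(spec1, spec2)]
    # PHAT weighting: discard the magnitude and keep only the phase.
    weighted = [c / max(abs(c), 1e-12) for c in cross]
    # Inverse transform gives the time-domain representation over time lags.
    corr = dft(weighted, inverse=True)
    # The highest peak within the search range is taken as the time difference.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: corr[lag % n].real)


# Assumption: a noise-like source that reaches the second channel 5 samples later.
rng = random.Random(1)
source = [rng.gauss(0.0, 1.0) for _ in range(64)]
delay = 5
first = source + [0.0] * delay
second = [0.0] * delay + source

estimated_lag = gcc_phat(first, second, max_lag=20)
reversed_lag = gcc_phat(second, first, max_lag=20)
```

In this idealized case, a pure delay without noise or reverberation, the weighted spectrum is a pure phase ramp and the time-domain representation has a single sharp peak at the true lag; swapping the channels flips the sign of the estimate.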
However, it has been shown that, particularly in signals that are different from, for example, clean speech without any reverberation or background noise, the robustness of this general technique is not optimum.