Stereo speech and particularly conversational stereo speech has received much less scientific attention than storage and broadcasting of stereophonic music. Indeed in speech communications monophonic transmission is still nowadays mostly used. However with the increase of network bandwidth and capacity, it is envisioned that communications based on stereophonic technologies will become more popular and bring a better listening experience.
Efficient coding of stereophonic audio material has been for a long time studied in perceptual audio coding of music for efficient storage or broadcasting. At high bitrates, where waveform preserving is crucial, sum-difference stereo, known as mid/side (M/S) stereo, has been employed for a long time. For low bit-rates, intensity stereo and more recently parametric stereo coding has been introduced. The latest technique was adopted in different standards as HeAACv2 and Mpeg USAC. It generates a downmix of the two-channel signal and associates compact spatial side information.
Joint stereo coding are usually built over a high frequency resolution, i.e. low time resolution, time-frequency transformation of the signal and is then not compatible to low delay and time domain processing performed in most speech coders. Moreover the engendered bit-rate is usually high.
On the other hand, parametric stereo employs an extra filter-bank positioned in the front-end of the encoder as pre-processor and in the back-end of the decoder as post-processor. Therefore, parametric stereo can be used with conventional speech coders like ACELP as it is done in MPEG USAC. Moreover, the parametrization of the auditory scene can be achieved with minimum amount of side information, which is suitable for low bit-rates. However, parametric stereo is as for example in MPEG USAC not specifically designed for low delay and does not deliver consistent quality for different conversational scenarios. In conventional parametric representation of the spatial scene, the width of the stereo image is artificially reproduced by a decorrelator applied on the two synthesized channels and controlled by Inter-channel Coherence (ICs) parameters computed and transmitted by the encoder. For most stereo speech, this way of widening the stereo image is not appropriate for the recreating the natural ambience of speech which is a pretty direct sound since it is produced by a single source located at a specific position in the space (with sometimes some reverberation from the room). By contrast, music instruments have much more natural width than speech, which can be better imitated by decorrelating the channels.
Problems also occur when speech is recorded with non-coincident microphones, like in A-B configuration when microphones are distant from each other or for binaural recording or rendering. Those scenarios can be envisioned for capturing speech in teleconferences or for creating a virtually auditory scene with distant speakers in the multipoint control unit (MCU). The time of arrival of the signal is then different from one channel to the other unlike recordings done on coincident microphones like X-Y (intensity recording) or M-S (Mid-Side recording). The computation of the coherence of such non time-aligned two channels can then be wrongly estimated which makes fail the artificial ambience synthesis.
Conventional-technology references related to stereo processing are U.S. Pat. No. 5,434,948 or 8,811,621.
Document WO 2006/089570 A1 discloses a near-transparent or transparent multi-channel encoder/decoder scheme. A multi-channel encoder/decoder scheme additionally generates a waveform-type residual signal. This residual signal is transmitted together with one or more multi-channel parameters to a decoder. In contrast to a purely parametric multi-channel decoder, the enhanced decoder generates a multi-channel output signal having an improved output quality because of the additional residual signal. On the encoder-side, a left channel and a right channel are both filtered by an analysis filter-bank. Then, for each subband signal, an alignment value and a gain value are calculated for a subband. Such an alignment is then performed before further processing. On the decoder-side, a de-alignment and a gain processing is performed and the corresponding signals are then synthesized by a synthesis filter-bank in order to generate a decoded left signal and a decoded right signal.
On the other hand, parametric stereo employs an extra filter-bank positioned in the front-end of the encoder as pre-processor and in the back-end of the decoder as post-processor. Therefore, parametric stereo can be used with conventional speech coders like ACELP as it is done in MPEG USAC. Moreover, the parametrization of the auditory scene can be achieved with minimum amount of side information, which is suitable for low bit-rates. However, parametric stereo is as for example in MPEG USAC not specifically designed for low delay and the overall system shows a very high algorithmic delay.