Spatial or 3D audio is a generic term that denotes various kinds of multi-channel audio signals. Depending on the capturing and rendering methods, the audio scene is represented by a spatial audio format. Typical spatial audio formats defined by the capturing method (microphones) are, for example, stereo, binaural and ambisonics. Spatial audio rendering systems (headphones or loudspeakers), often denoted as surround systems, are able to render spatial audio scenes with stereo (left and right channels, 2.0) or more advanced multi-channel audio signals (2.1, 5.1, 7.1, etc.).
Recently developed technologies for the transmission and manipulation of such audio signals allow the end user to have an enhanced audio experience with higher spatial quality, often resulting in better intelligibility as well as an augmented reality. Spatial audio coding techniques generate a compact representation of spatial audio signals which is compatible with data-rate-constrained applications such as streaming over the Internet. The transmission of spatial audio signals is however limited when the data rate constraint is too strong, and therefore post-processing of the decoded audio channels is also used to enhance the spatial audio playback. Commonly used techniques are, for example, able to blindly up-mix decoded mono or stereo signals into multi-channel audio (5.1 channels or more).
In order to efficiently render spatial audio scenes, these spatial audio coding and processing technologies make use of the spatial characteristics of the multi-channel audio signal.
In particular, the time and level differences between the channels of the spatial audio capture, such as the Inter-Channel Time Difference (ICTD) and the Inter-Channel Level Difference (ICLD), are used to approximate the interaural cues such as the Interaural Time Difference (ITD) and the Interaural Level Difference (ILD), which characterize our perception of sound in space. The term “cue” is used in the field of sound localization, and normally means parameter or descriptor. The human auditory system uses several cues for sound source localization, including time and level differences between the ears, spectral information, as well as parameters of timing analysis, correlation analysis and pattern matching.
FIG. 1 illustrates the underlying difficulty of modeling spatial audio signals with a parametric approach. The Inter-Channel Time and Level Differences (ICTD and ICLD) are commonly used to model the directional components of multi-channel audio signals, while the Inter-Channel Correlation (ICC), which models the Interaural Cross-Correlation (IACC), is used to characterize the width of the audio image. Inter-channel parameters such as ICTD, ICLD and ICC are thus extracted from the audio channels in order to approximate the ITD, ILD and IACC, which model our perception of sound in space. Since the ICTD and ICLD are only an approximation of what our auditory system is able to detect (ITD and ILD at the ear entrances), it is of high importance that the ICTD cue is perceptually relevant.
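By way of a non-limiting illustration, the extraction of the three inter-channel parameters from one analysis frame may be sketched as follows. This is a minimal full-band sketch, not a definitive implementation: the function names `normalized_ccf` and `inter_channel_cues`, the sign convention for the lag, the search range `max_lag` and the small constant `eps` are all assumptions introduced for the example. The ICTD is taken as the lag that maximizes the normalized cross-correlation, the ICC as the value of that maximum, and the ICLD as the energy ratio of the channels in dB.

```python
import math


def normalized_ccf(left, right, max_lag):
    """Return {lag: value} of the normalized cross-correlation (values in [-1, 1]).

    Assumed convention: a positive lag means the right channel is delayed
    relative to the left channel.
    """
    norm = math.sqrt(sum(x * x for x in left) * sum(y * y for y in right))
    n = len(left)
    ccf = {}
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += left[i] * right[j]
        ccf[lag] = acc / norm if norm > 0.0 else 0.0
    return ccf


def inter_channel_cues(left, right, max_lag, eps=1e-12):
    """Return (ICTD in samples, ICLD in dB, ICC) for one frame."""
    ccf = normalized_ccf(left, right, max_lag)
    ictd = max(ccf, key=ccf.get)   # lag of the CCF maximum
    icc = ccf[ictd]                # value of the CCF maximum
    e_left = sum(x * x for x in left)
    e_right = sum(y * y for y in right)
    # eps (hypothetical) only guards against a silent channel.
    icld_db = 10.0 * math.log10((e_left + eps) / (e_right + eps))
    return ictd, icld_db, icc


# Toy frame: the right channel is the left channel delayed by 3 samples
# and attenuated by a factor of 0.5.
left = [0.0] * 32
right = [0.0] * 32
for k, v in enumerate([1.0, 0.8, -0.5, 0.3]):
    left[5 + k] = v
    right[8 + k] = 0.5 * v

ictd, icld_db, icc = inter_channel_cues(left, right, max_lag=8)
print(ictd, round(icld_db, 2), round(icc, 3))
```

In a codec of the kind discussed here, the same computation would typically be carried out per sub-band rather than on the full-band frame.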
FIG. 2 is a schematic block diagram showing parametric stereo encoding/decoding as an illustrative example of multi-channel audio encoding/decoding. The encoder 10 basically comprises a downmix unit 12, a mono encoder 14 and a parameters extraction unit 16. The decoder 20 basically comprises a mono decoder 22, a decorrelator 24 and a parametric synthesis unit 26. In this particular example, the stereo channels are down-mixed by the downmix unit 12 into a sum signal, which is encoded by the mono encoder 14 and transmitted to the mono decoder 22 of the decoder 20, together with the spatial (sub-band) parameters extracted by the parameters extraction unit 16 and quantized by the quantizer Q. The spatial parameters may be estimated based on the sub-band decomposition of the frequency transforms of the left and right input channels. Each sub-band is normally defined according to a perceptual scale such as the Equivalent Rectangular Bandwidth (ERB) scale. The decoder 20, and the parametric synthesis unit 26 in particular, performs a spatial synthesis (in the same sub-band domain) based on the decoded mono signal from the mono decoder 22, the quantized (sub-band) parameters transmitted from the encoder 10 and a decorrelated version of the mono signal generated by the decorrelator 24. The reconstruction of the stereo image is then controlled by the quantized sub-band parameters. Since these quantized sub-band parameters are meant to approximate the spatial or interaural cues, it is very important that the inter-channel parameters (ICTD, ICLD and ICC) are extracted and transmitted according to perceptual considerations, so that the approximation is acceptable for the auditory system.
Stereo and multi-channel audio signals are often complex signals that are difficult to model, especially when the environment is noisy or when various audio components of the mixture overlap in time and frequency, e.g. noisy speech, speech over music or simultaneous talkers.
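The data flow of FIG. 2 may be illustrated, in a heavily simplified form, by the following sketch. It is given under several assumptions that are not taken from the description above: the processing is full-band rather than per sub-band, the mono encoder 14 and mono decoder 22 are omitted (the downmix is passed through unchanged), only the ICLD is transmitted, no decorrelator is used, and the synthesis gains are one possible choice that is exact only for in-phase channels. The names `encode`, `quantize`, `decode` and the quantizer step size are hypothetical.

```python
import math


def encode(left, right, eps=1e-12):
    """Downmix to a mono sum signal and extract one full-band ICLD (dB)."""
    mono = [0.5 * (l + r) for l, r in zip(left, right)]
    e_left = sum(x * x for x in left)
    e_right = sum(y * y for y in right)
    icld_db = 10.0 * math.log10((e_left + eps) / (e_right + eps))
    return mono, icld_db


def quantize(value, step=0.5):
    """Uniform scalar quantizer standing in for the quantizer Q (assumed step)."""
    return step * round(value / step)


def decode(mono, icld_db):
    """Re-create a stereo pair from the mono signal and the quantized ICLD.

    Assumed gain rule: g_left / g_right equals the amplitude ratio implied
    by the ICLD, and g_left + g_right = 2 so that the 0.5 * (L + R) downmix
    is inverted when the channels differ only by a level factor.
    """
    ratio = 10.0 ** (icld_db / 20.0)
    g_left = 2.0 * ratio / (1.0 + ratio)
    g_right = 2.0 / (1.0 + ratio)
    return [g_left * m for m in mono], [g_right * m for m in mono]


# Toy input: the left channel is the right channel scaled by 2 (about 6 dB).
right = [1.0, -0.5, 0.25, 0.0]
left = [2.0 * x for x in right]

mono, icld_db = encode(left, right)
icld_q = quantize(icld_db)
left_hat, right_hat = decode(mono, icld_q)
print(icld_q, left_hat[0], right_hat[0])
```

The small reconstruction error in this toy case comes only from quantizing the ICLD; a real system would additionally use the ICTD and ICC parameters and the decorrelated signal to restore the time alignment and width of the stereo image.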
Reference can for example be made to FIGS. 3A-B (clean speech analysis) and FIGS. 4A-B (noisy speech analysis) showing the decrease of the Cross-Correlation Function (CCF), which is typically normalized to the interval between −1 and 1, when interfering noise is mixed with the speech signal.
FIG. 3A illustrates an example of the waveforms for the left and right channels for “clean speech”. FIG. 3B illustrates a corresponding example of the Cross-Correlation Function between a portion of the left and right channels.
FIG. 4A illustrates an example of the waveforms for the left and right channels made up of a mixture of clean speech and artificial noise. FIG. 4B illustrates a corresponding example of the Cross-Correlation Function between a portion of the left and right channels.
The background noise has comparable energy to the speech signal as well as low correlation between the left and the right channels, and therefore the maximum of the CCF is not necessarily related to the speech content in such environmental conditions. This results in an inaccurate modeling of the speech signal, which generates instability in the stream of extracted parameters. In that case, the time shift or delay that maximizes the CCF, i.e. the ICTD, is not meaningful in view of the value of that maximum, i.e. the Inter-Channel Correlation or Coherence (ICC). Such environmental conditions are frequently observed outdoors, in a car or even in an office environment with computer fans and so forth. This phenomenon requires extra precautions in order to provide a reliable and stable estimation of the Inter-Channel Time Difference (ICTD).
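The decrease of the CCF maximum under interfering noise can be reproduced with a small synthetic experiment. The sketch below is illustrative only and rests on stated assumptions: a white Gaussian sequence stands in for the speech signal, the noise added to each channel is independent and has the same energy as the signal, and the helper name `ccf_peak`, the frame length, the delay and the random seed are arbitrary choices made for the example.

```python
import math
import random


def ccf_peak(left, right, max_lag):
    """Return (lag, value) of the maximum of the normalized CCF.

    Assumed convention: a positive lag means the right channel is delayed
    relative to the left channel.
    """
    norm = math.sqrt(sum(x * x for x in left) * sum(y * y for y in right))
    n = len(left)
    best_lag, best_val = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        # Sum over the indices i for which both i and i + lag are in the frame.
        acc = sum(left[i] * right[i + lag]
                  for i in range(max(0, -lag), min(n, n - lag)))
        val = acc / norm
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag, best_val


rng = random.Random(0)      # fixed seed so the experiment is repeatable
n, delay = 512, 4

# "Clean" case: the right channel is the left channel delayed by `delay`.
source = [rng.gauss(0.0, 1.0) for _ in range(n + delay)]
left_clean = source[delay:]
right_clean = source[:n]

# "Noisy" case: independent noise of comparable energy on each channel.
left_noisy = [x + rng.gauss(0.0, 1.0) for x in left_clean]
right_noisy = [y + rng.gauss(0.0, 1.0) for y in right_clean]

clean_lag, clean_peak = ccf_peak(left_clean, right_clean, max_lag=16)
noisy_lag, noisy_peak = ccf_peak(left_noisy, right_noisy, max_lag=16)
print(clean_lag, round(clean_peak, 3))
print(noisy_lag, round(noisy_peak, 3))
```

With uncorrelated noise of equal energy, the normalized CCF maximum is expected to drop to roughly half of its clean value, which narrows the margin between the true peak and spurious peaks at other lags.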
Voice activity detection, or more precisely the detection of tonal components within the stereo channels, is used in [1] to adapt the update rate of the ICTD over time. The ICTD is extracted on a time-frequency grid, i.e. using a sliding analysis window and sub-band frequency decomposition. The ICTD is smoothed over time according to the combination of the tonality measure and the level of correlation between the channels according to the ICC cue. The algorithm allows for a strong smoothing of the ICTD when the signal is detected as tonal, and an adaptive smoothing of the ICTD using the ICC as a forgetting factor when the tonality measure is low. While the smoothing of the ICTD for exactly tonal components is acceptable, the use of a forgetting factor when the signals are not exactly tonal is questionable. Indeed, the lower the ICC cue, the stronger the smoothing of the ICTD, which makes the ICTD extraction very approximate and problematic, especially when one or more sources are moving in space. The assumption that a “low” ICC allows for a smoothing of the ICTD is not always true and is highly dependent on the environmental conditions, e.g. level of noise, reverberation, background components, etc. In other words, the algorithm described in [1], which smooths the ICTD over time, does not allow for a precise tracking of the ICTD, especially when the signal characteristics (ICC, ICTD and ICLD) evolve quickly in time.
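The tracking problem can be made concrete with a simple first-order recursive smoother. The sketch below is not the algorithm of [1], whose details are not reproduced here; it is only one possible reading of "using the ICC as a forgetting factor", in which the update weight equals the ICC for non-tonal frames and a small fixed value for tonal frames. The function name `smooth_ictd` and the parameter `tonal_factor` are hypothetical.

```python
def smooth_ictd(ictd_frames, icc_frames, tonal_frames, tonal_factor=0.05):
    """Smooth per-frame ICTD values with a first-order recursion.

    Assumed update rule:
        smoothed[t] = (1 - a) * smoothed[t - 1] + a * ictd[t]
    with a = tonal_factor for tonal frames (strong smoothing) and
    a = ICC (clamped to [0, 1]) otherwise, so a low ICC means a slow update.
    """
    smoothed = []
    state = None
    for ictd, icc, tonal in zip(ictd_frames, icc_frames, tonal_frames):
        weight = tonal_factor if tonal else min(max(icc, 0.0), 1.0)
        state = ictd if state is None else (1.0 - weight) * state + weight * ictd
        smoothed.append(state)
    return smoothed


# A source moves: the measured ICTD jumps from 0 to 10 samples at frame 2.
ictd_frames = [0.0, 0.0, 10.0, 10.0, 10.0]
not_tonal = [False] * 5

low_icc = smooth_ictd(ictd_frames, [0.2] * 5, not_tonal)
high_icc = smooth_ictd(ictd_frames, [0.9] * 5, not_tonal)
print(low_icc)
print(high_icc)
```

Under this assumed rule, three frames after the jump the smoothed ICTD has covered less than half of the change when the ICC is 0.2, whereas it has essentially converged when the ICC is 0.9, which illustrates why a low ICC combined with a moving source leads to imprecise tracking.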
There is a general need for an improved extraction or determination of the inter-channel time difference ICTD.