Spatial or 3D audio is a generic term denoting various kinds of multi-channel audio signals. Depending on the capturing and rendering methods, the audio scene is represented by a spatial audio format. Typical spatial audio formats defined by the capturing method (microphones) are, for example, stereo, binaural and ambisonics. Spatial audio rendering systems (headphones or loudspeakers), often denoted as surround systems, are able to render spatial audio scenes with stereo (left and right channels, 2.0) or more advanced multi-channel audio signals (2.1, 5.1, 7.1, etc.).
Recently developed technologies for the transmission and manipulation of such audio signals allow the end user to have an enhanced audio experience with higher spatial quality, often resulting in better intelligibility as well as augmented reality. Spatial audio coding techniques generate a compact representation of spatial audio signals that is compatible with data-rate-constrained applications such as streaming over the internet. The transmission of spatial audio signals is however limited when the data rate constraint is too strong, and therefore post-processing of the decoded audio channels is also used to enhance the spatial audio playback. Commonly used techniques are, for example, able to blindly up-mix decoded mono or stereo signals into multi-channel audio (5.1 channels or more).
In order to efficiently render spatial audio scenes, these spatial audio coding and processing technologies make use of the spatial characteristics of the multi-channel audio signal.
In particular, the time and level differences between the channels of the spatial audio capture, such as the Inter-Channel Time Difference (ICTD) and the Inter-Channel Level Difference (ICLD), are used to approximate the interaural cues, such as the Interaural Time Difference (ITD) and the Interaural Level Difference (ILD), which characterize our perception of sound in space. The term “cue” is used in the field of sound localization, and normally means parameter or descriptor. The human auditory system uses several cues for sound source localization, including time and level differences between the ears, spectral information, as well as parameters of timing analysis, correlation analysis and pattern matching.
FIG. 1 illustrates the underlying difficulty of modeling spatial audio signals with a parametric approach. The Inter-Channel Time and Level Differences (ICTD and ICLD) are commonly used to model the directional components of multi-channel audio signals, while the Inter-Channel Correlation (ICC), which models the InterAural Cross-Correlation (IACC), is used to characterize the width of the audio image. Inter-channel parameters such as ICTD, ICLD and ICC are thus extracted from the audio channels in order to approximate the ITD, ILD and IACC, which model our perception of sound in space. Since the ICTD and ICLD are only an approximation of what our auditory system is able to detect (ITD and ILD at the ear entrances), it is highly important that the ICTD cue be perceptually relevant.
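By way of illustration only, the extraction of the inter-channel cues discussed above may be sketched as follows for a single full-band frame of a two-channel signal. The function name `inter_channel_cues`, the search range `max_lag_ms`, the sign convention (a positive ICTD meaning that the right channel lags the left channel) and the choice of the peak of the normalized cross-correlation as estimator are assumptions made for this sketch and are not taken from the description.

```python
import numpy as np

def inter_channel_cues(left, right, fs, max_lag_ms=1.0):
    """Estimate ICTD (seconds), ICLD (dB) and ICC for one stereo frame.

    Illustrative full-band estimator; all parameter names are hypothetical.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    n = len(left)
    eps = 1e-12  # guards against division by zero on silent frames

    # ICLD: ratio of channel energies, expressed in dB.
    e_left = np.sum(left ** 2)
    e_right = np.sum(right ** 2)
    icld_db = 10.0 * np.log10((e_left + eps) / (e_right + eps))

    # Normalized cross-correlation over a limited range of lags.
    max_lag = int(round(max_lag_ms * 1e-3 * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    norm = np.sqrt(e_left * e_right) + eps
    ncc = np.empty(len(lags))
    for i, lag in enumerate(lags):
        if lag >= 0:
            # left[k] is compared with right[k + lag]
            ncc[i] = np.dot(left[:n - lag], right[lag:]) / norm
        else:
            # left[k - lag] is compared with right[k]
            ncc[i] = np.dot(left[-lag:], right[:n + lag]) / norm

    # ICTD: lag of the correlation peak; ICC: magnitude of that peak.
    best = int(np.argmax(np.abs(ncc)))
    ictd = lags[best] / fs
    icc = float(abs(ncc[best]))
    return ictd, icld_db, icc
```

In a parametric coder, such an estimate would typically be computed per frame and per sub-band rather than on the full band; the full-band form is used here only to keep the sketch short.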
FIG. 2 is a schematic block diagram showing parametric stereo encoding/decoding as an illustrative example of multi-channel audio encoding/decoding. The encoder 10 basically comprises a downmix unit 12, a mono encoder 14 and a parameters extraction unit 16. The decoder 20 basically comprises a mono decoder 22, a decorrelator 24 and a parametric synthesis unit 26. In this particular example, the stereo channels are down-mixed by the downmix unit 12 into a sum signal, which is encoded by the mono encoder 14 and transmitted to the decoder 20, 22 together with the spatial (sub-band) parameters extracted by the parameters extraction unit 16 and quantized by the quantizer Q. The spatial parameters may be estimated based on the sub-band decomposition of the input frequency transforms for the left and the right channel. Each sub-band is normally defined according to a perceptual scale such as the Equivalent Rectangular Bandwidth (ERB). The decoder 20, and the parametric synthesis unit 26 in particular, performs a spatial synthesis (in the same sub-band domain) based on the decoded mono signal from the mono decoder 22, the quantized (sub-band) parameters transmitted from the encoder 10 and a decorrelated version of the mono signal generated by the decorrelator 24. The reconstruction of the stereo image is then controlled by the quantized sub-band parameters. Since these quantized sub-band parameters are meant to approximate the spatial or binaural cues, it is very important that the inter-channel parameters (ICTD, ICLD and ICC) are extracted and transmitted according to perceptual considerations so that the approximation is acceptable for the auditory system.
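The encoder/decoder chain described above may be illustrated, in heavily simplified form, by the following sketch. It covers only the downmix, a per-band ICLD extraction with a uniform quantizer, and a level-only synthesis; the mono codec, the decorrelator 24 and the ICTD and ICC parameters are omitted. The names `band_edges`, `encode`, `decode`, `n_bands` and `step_db` are hypothetical, the logarithmic band layout is merely a crude stand-in for an ERB scale, and the synthesis assumes that the two channels are in phase within each band.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero in empty bands

def band_edges(n_bins, n_bands):
    """Roughly logarithmic band edges (assumed stand-in for an ERB scale)."""
    edges = np.unique(np.round(np.geomspace(1, n_bins, n_bands + 1)).astype(int))
    edges[0] = 0  # include the DC bin in the first band
    return edges

def encode(left, right, n_bands=8, step_db=2.0):
    """Downmix to a sum signal and extract quantized per-band ICLD indices."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    mono = 0.5 * (left + right)  # sum signal (a mono codec would go here)
    L, R = np.fft.rfft(left), np.fft.rfft(right)
    edges = band_edges(len(L), n_bands)
    indices = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        e_l = np.sum(np.abs(L[lo:hi]) ** 2)
        e_r = np.sum(np.abs(R[lo:hi]) ** 2)
        icld_db = 10.0 * np.log10((e_l + EPS) / (e_r + EPS))
        indices.append(int(round(icld_db / step_db)))  # uniform quantizer
    return mono, indices

def decode(mono, indices, n_bands=8, step_db=2.0):
    """Rebuild left/right from the sum signal and the per-band ICLD indices."""
    M = np.fft.rfft(mono)
    edges = band_edges(len(M), n_bands)
    if len(edges) - 1 != len(indices):
        raise ValueError("band layout does not match the received parameters")
    L_hat = np.zeros_like(M)
    R_hat = np.zeros_like(M)
    for idx, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        c = 10.0 ** (indices[idx] * step_db / 20.0)  # amplitude ratio L/R
        # In-phase assumption: M = 0.5 * (1 + c) * R and L = c * R.
        R_hat[lo:hi] = 2.0 / (1.0 + c) * M[lo:hi]
        L_hat[lo:hi] = 2.0 * c / (1.0 + c) * M[lo:hi]
    n = len(mono)
    return np.fft.irfft(L_hat, n=n), np.fft.irfft(R_hat, n=n)
```

Because only level differences are transmitted in this sketch, any time difference or decorrelation between the input channels is lost at the decoder, which is one reason why practical schemes also convey ICTD and ICC.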
Stereo and multi-channel audio signals are often complex signals that are difficult to model, especially when the environment is noisy or when various audio components of the mixture overlap in time and frequency, e.g. noisy speech, speech over music or simultaneous talkers. Multi-channel audio signals made up of only a few sound components can also be difficult to model, especially with a parametric approach.
There is thus a general need for improved extraction or determination of the inter-channel time difference ICTD.