Recently developed parametric audio coding methods such as binaural cue coding (BCC) enable multi-channel and surround (spatial) audio coding and representation. The common aim of the parametric methods for coding of spatial audio is to represent the original audio as a downmix signal comprising a reduced number of audio channels, for example as a monophonic or as two channel (stereo) sum signal, along with associated spatial audio parameters describing the relationship between the channels of an original signal in order to enable reconstruction of the signal with a spatial image similar to that of the original signal. This kind of coding scheme allows extremely efficient compression of multi-channel signals with high audio quality.
The spatial audio parameters may, for example, comprise parameters descriptive of inter-channel level difference, inter-channel time difference and inter-channel coherence between one or more channel pairs and/or in one or more frequency bands. Furthermore, further or alternative spatial audio parameters such as direction of arrival can be used in addition to or instead of the inter-channel parameters discussed
Typically, spatial audio coding and corresponding downmix to mono or stereo requires reliable level and time difference estimation or an equivalent. The estimation of time difference of input channels is a dominant spatial audio parameter at low frequencies.
Conventional inter-channel analysis mechanisms may require a high computational load, especially when high audio sampling rates (48 kHz or even higher) are employed. Inter-channel time difference estimation mechanisms based on cross-correlation are computationally very costly due to the large amount of signal data.
Furthermore, if the audio is captured using a distributed sensor network and the spatial audio encoding is performed at a central server of the network, then each data channel between sensor and server may require a significant transmission bandwidth.
It is not possible to reduce bandwidth by simply reducing the audio sampling rate without losing information required in the subsequent processing stages.