A spatial audio scene consists of audio sources and ambience around a listener. The ambience component of a spatial audio scene may comprise ambient background noise caused by the room effect, i.e. the reverberation of the audio sources due to the properties of the space in which the audio sources are located, and/or other ambient sound source(s) within the auditory space. The auditory image is perceived due to the directions of arrival of sound from the audio sources as well as the reverberation. A human being is able to capture the three-dimensional image using signals from the left and the right ear. Hence, recording the audio image using microphones placed close to the eardrums is sufficient to capture the spatial audio image.
In stereo coding of audio signals, two audio channels are encoded. In many cases the audio channels may have rather similar content at least part of the time. Therefore, compression of the audio signals can be performed efficiently by coding the channels together. This results in an overall bit rate that can be lower than the bit rate required for coding the channels independently.
A commonly used low bit rate stereo coding method is known as parametric stereo coding. In parametric stereo coding, a stereo signal is encoded using a mono coder and a parametric representation of the stereo signal. The parametric stereo encoder computes a mono signal as a linear combination of the input signals. This combination of the input signals is also referred to as a downmix signal. The mono signal may be encoded using a conventional mono audio encoder. In addition to creating and coding the mono signal, the encoder extracts a parametric representation of the stereo signal. The parameters may include information on level differences, phase (or time) differences and coherence between the input channels. On the decoder side, this parametric information is utilized to recreate the stereo signal from the decoded mono signal. Parametric stereo can be considered an improved version of intensity stereo coding, in which only the level differences between channels are extracted.
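The encoder and decoder roles described above can be illustrated with a minimal Python sketch. It is not the method of any particular codec: the function names ps_encode and ps_decode, the 0.5 downmix weights, and the use of a single full-band parameter set for the whole input are assumptions made for illustration. Practical parametric stereo coders typically estimate the parameters per time frame and per frequency band, and also handle phase or time differences, which are omitted here.

```python
import math


def ps_encode(left, right, eps=1e-12):
    """Hypothetical full-band parametric stereo analysis (illustrative only)."""
    # Downmix: a linear combination of the input channels (assumed 0.5 weights).
    mono = [0.5 * (l + r) for l, r in zip(left, right)]

    # Channel energies and cross term used for the stereo parameters.
    e_left = sum(l * l for l in left) + eps
    e_right = sum(r * r for r in right) + eps
    cross = sum(l * r for l, r in zip(left, right))

    params = {
        # Inter-channel level difference in decibels.
        "ild_db": 10.0 * math.log10(e_left / e_right),
        # Normalized cross-correlation as a simple coherence measure.
        "coherence": cross / math.sqrt(e_left * e_right),
    }
    return mono, params


def ps_decode(mono, params):
    """Recreate a stereo pair from the mono signal using the level difference only.

    A fuller decoder would also use the coherence parameter, for example to
    mix in a decorrelated signal; that step is left out of this sketch.
    """
    ratio = 10.0 ** (params["ild_db"] / 20.0)  # amplitude ratio left/right
    gain_left = 2.0 * ratio / (1.0 + ratio)
    gain_right = 2.0 / (1.0 + ratio)
    return [gain_left * m for m in mono], [gain_right * m for m in mono]


# Usage: a right channel that is an attenuated copy of the left channel.
left_in = [1.0, -0.5, 0.25, 0.0]
right_in = [0.5 * x for x in left_in]
mono_sig, stereo_params = ps_encode(left_in, right_in)
left_out, right_out = ps_decode(mono_sig, stereo_params)
```

In this sketch the two channels differ only in level, so the level-only decoder recovers them almost exactly. For channels that differ in phase or are only partly correlated, the level difference alone would not be enough, which is why the text also lists phase (or time) differences and coherence as parameters.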
Parametric stereo coding can be generalized into multi-channel coding of any number of channels. In the general case with any number of input channels, a parametric encoding process provides a downmix signal having fewer channels than the input signal, and a parametric representation providing information on (for example) level/phase differences and coherence between the input channels to enable reconstruction of a multi-channel signal based on the downmix signal.
Another common stereo coding method, especially for higher bit rates, is known as mid-side stereo, which can be abbreviated as M/S stereo. Mid-side stereo coding transforms the left and right channels into a mid channel and a side channel. The mid channel is the sum of the left and right channels, whereas the side channel is the difference of the left and right channels. These two channels are encoded independently. With sufficiently accurate quantization, mid-side stereo retains the original audio image relatively well without introducing severe artifacts. On the other hand, the bit rate required for good-quality reproduced audio remains quite high.
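The mid-side transform and its inverse can be written out as a short Python sketch. The function names ms_encode and ms_decode are hypothetical, and the 0.5 scaling factor is an assumption of this sketch: the text above only states that the mid channel is the sum and the side channel is the difference of the left and right channels. Quantization and coding of the two channels are not shown.

```python
def ms_encode(left, right):
    """Transform left/right sample sequences into mid and side channels."""
    # Mid: scaled sum of the channels (0.5 scaling is an assumed convention).
    mid = [0.5 * (l + r) for l, r in zip(left, right)]
    # Side: scaled difference of the channels.
    side = [0.5 * (l - r) for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Invert the transform: left = mid + side, right = mid - side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Usage: without quantization the round trip returns the original channels.
left_in = [0.5, 0.25, -0.5]
right_in = [0.5, -0.25, 0.0]
mid_ch, side_ch = ms_encode(left_in, right_in)
left_out, right_out = ms_decode(mid_ch, side_ch)
```

When the two channels are similar, the side channel has small values and can be coded with fewer bits, which is one reason for applying the transform before encoding.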
Like parametric coding, M/S coding can also be generalized from stereo coding into multi-channel coding of any number of channels. In the multi-channel case, M/S coding is typically applied to channel pairs. For example, in a 5.1 channel configuration, the front left and front right channels may form a first pair that is coded using an M/S scheme, and the rear left and rear right channels may form a second pair that is also coded using an M/S scheme.
There are a number of applications that benefit from efficient multi-channel audio processing and coding capability, for example “surround sound” making use of 5.1 or 7.1 channel formats. Another example that benefits from efficient multi-channel audio processing and coding is a multi-view audio processing system, which may comprise for example multi-view audio capture, analysis, encoding, decoding/reconstruction and/or rendering components. In a multi-view audio processing system, signals obtained e.g. from multiple closely spaced microphones, all of which point at different angles relative to the forward axis, are used to capture the audio scene. The captured signals are possibly processed and then transmitted (or alternatively stored for later consumption) to the rendering side, where the end user can select the aural view from the multiview audio scene based on his/her preference. The rendering part then provides the downmixed signal(s) from the multiview audio scene that correspond to the selected aural view. To enable transmission over the network or storage in a storage medium, compression schemes may need to be applied to meet the constraints of the network or the storage space requirements.
The data rates associated with the multiview audio scene are often so high that compression coding and related processing may need to be applied to the signals in order to enable transmission over a network or storage. Furthermore, a similar challenge regarding the required transmission bandwidth naturally applies to any multi-channel audio signal.
In general, multichannel audio is a subset of multiview audio. To a certain extent, multichannel audio coding solutions can be applied to the multiview audio scene, although they are optimized more towards coding of standard loudspeaker arrangements such as two-channel stereo or 5.1 or 7.1 channel formats.
For example, the following multichannel audio coding solutions have been proposed. The Advanced Audio Coding (AAC) standard defines a channel-pairwise type of coding in which the input channels are divided into channel pairs and efficient psychoacoustically guided coding is applied to each of the channel pairs. This type of coding is targeted more towards high bitrate coding. In general, psychoacoustically guided coding focuses on keeping the quantization noise below the masking threshold, that is, inaudible to the human ear. The psychoacoustic models involved are typically computationally quite complex even with single-channel signals, not to mention multi-channel signals with a relatively high number of input channels.
For low bitrate coding, many technical solutions have been tailored towards techniques in which a small amount of side information is added to the main signal. The main signal is typically the sum signal or some other linear combination of the input channels, and the side information is used to enable spatialization of the main signal back to the multichannel signal at the decoding side.
While efficient in bitrate, these methods typically reproduce too little ambience or spaciousness in the reconstructed signal. For the presence experience, that is, for the feeling of being there, it is important that the surrounding ambience is also faithfully restored for the listener at the receiving end.