Any discussion of the background art throughout the specification should in no way be considered as an admission that such art is widely known or forms part of common general knowledge in the field.
The creation, coding, distribution and reproduction of audio content are traditionally channel based. That is, one specific target playback system is envisioned for content throughout the content ecosystem. Examples of such target playback systems are mono, stereo, 5.1, 7.1, 7.1.4, and the like.
If content is to be reproduced on a playback system other than the intended one, down-mixing or up-mixing can be applied. For example, 5.1 content can be reproduced over a stereo playback system by employing specific known down-mix equations. Another example is playback of stereo content over a 7.1 speaker setup, which may involve a so-called up-mixing process that may or may not be guided by information present in the stereo signal, as is the case for so-called matrix encoders such as Dolby Pro Logic. To guide the up-mixing process, information on the original position of signals before down-mixing can be signaled implicitly by including specific phase relations in the down-mix equations, or said differently, by applying complex-valued down-mix equations. A well-known example of such a down-mix method using complex-valued down-mix coefficients for content with speakers placed in two dimensions is LtRt (Vinton et al. 2015).
The resulting (stereo) down-mix signal can be reproduced over a stereo loudspeaker system, or can be up-mixed to loudspeaker setups with surround and/or height speakers. The intended location of the signal can be derived by an up-mixer from the inter-channel phase relationships. For example, in an LtRt stereo representation, a signal that is out-of-phase (e.g., has an inter-channel waveform normalized cross-correlation coefficient close to −1) should ideally be reproduced by one or more surround speakers, while a positive correlation coefficient (close to +1) indicates that the signal should be reproduced by speakers in front of the listener.
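The phase relationships described above can be illustrated with a minimal sketch. It assumes a simplified four-input (left, right, center, surround) matrix encoder with gains of 1/√2 and a ±90 degree phase shift on the surround input, operating on single-frequency complex phasors. The function names `ltrt_encode` and `normalized_correlation` and the coefficient values are illustrative assumptions, not the equations of any particular product or of the cited reference.

```python
import math

G = 1 / math.sqrt(2)  # assumed -3 dB gain for the center and surround inputs


def ltrt_encode(left, right, center, surround):
    """Down-mix four complex phasors to a stereo pair (illustrative only).

    The surround input is rotated by +90 degrees in Lt and by -90 degrees
    in Rt, so it ends up out of phase between the two down-mix channels.
    """
    lt = left + G * center + 1j * G * surround
    rt = right + G * center - 1j * G * surround
    return lt, rt


def normalized_correlation(lt, rt):
    """Normalized inter-channel correlation of two phasors, in [-1, +1]."""
    denom = abs(lt) * abs(rt)
    if denom == 0.0:
        return 0.0
    return (lt * rt.conjugate()).real / denom


center_only = ltrt_encode(0, 0, 1 + 0j, 0)
surround_only = ltrt_encode(0, 0, 0, 1 + 0j)
print(normalized_correlation(*center_only))    # close to +1: front
print(normalized_correlation(*surround_only))  # close to -1: surround
```

Under these assumptions, a center-only input yields identical down-mix channels, whereas a surround-only input yields channels of opposite sign, which is the cue an up-mixer can use to steer the signal to the rear.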
A variety of up-mixing algorithms have been developed that differ in how they recreate a multi-channel signal from the stereo down-mix. In relatively simple up-mixers, the normalized cross-correlation coefficient of the stereo waveform signals is tracked as a function of time, and the signal(s) are steered to the front or rear speakers depending on its value. This approach works well for relatively simple content in which only one auditory object is present at a time. More advanced up-mixers use statistical information derived from specific frequency regions to control the signal flow from stereo input to multi-channel output (Gundry 2001, Vinton et al. 2015). Specifically, a signal model based on a steered or dominant component and a stereo (diffuse) residual signal can be employed in individual time/frequency tiles. Besides the dominant component and residual signals, a direction angle (in azimuth, possibly augmented with elevation) is estimated as well, and the dominant component signal is subsequently steered to one or more loudspeakers to reconstruct the (estimated) position during playback.
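The relatively simple, wide-band up-mixer described above might be sketched as follows. This is an illustration under stated assumptions rather than any product's algorithm: the block size, the power-preserving gain law, and the use of the sum signal for the front and the difference signal for the rear are all hypothetical choices, and a practical tracker would smooth the coefficient over time instead of switching per block.

```python
import math


def block_correlation(lt, rt):
    """Normalized cross-correlation of two equal-length sample blocks."""
    cross = sum(a * b for a, b in zip(lt, rt))
    energy = math.sqrt(sum(a * a for a in lt) * sum(b * b for b in rt))
    if energy == 0.0:
        return 0.0
    # Clamp to guard against rounding slightly outside [-1, +1].
    return max(-1.0, min(1.0, cross / energy))


def simple_upmix(lt, rt, block_size=256):
    """Steer a stereo down-mix to front or rear, one block at a time.

    Returns (front, rear, correlations). The gain law is an assumption:
    a correlation of +1 sends everything to the front, -1 to the rear.
    """
    front, rear, correlations = [], [], []
    for start in range(0, len(lt), block_size):
        bl = lt[start:start + block_size]
        br = rt[start:start + block_size]
        rho = block_correlation(bl, br)
        g_front = math.sqrt((1.0 + rho) / 2.0)
        g_rear = math.sqrt((1.0 - rho) / 2.0)
        front.extend(g_front * 0.5 * (a + b) for a, b in zip(bl, br))
        rear.extend(g_rear * 0.5 * (a - b) for a, b in zip(bl, br))
        correlations.append(rho)
    return front, rear, correlations


# Test signal: 440 Hz tone, in phase for one block, then out of phase.
tone = [math.sin(2 * math.pi * 440 * n / 48000) for n in range(512)]
lt = tone
rt = tone[:256] + [-x for x in tone[256:]]
front, rear, correlations = simple_upmix(lt, rt)
print(correlations)  # first block near +1, second block near -1
```

Because a single coefficient governs the whole block, two simultaneous objects with different intended positions would be steered together, which is why this approach suits only simple content.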
The use of matrix encoders and decoders/up-mixers is not limited to channel-based content. Recent developments in the audio industry are based on audio objects rather than channels, in which one or more objects each consist of an audio signal and associated metadata indicating, among other things, the object's intended position as a function of time. For such object-based audio content, matrix encoders can be used as well, as outlined in Vinton et al. 2015. In such a system, object signals are down-mixed into a stereo signal representation with down-mix coefficients that depend on the object positional metadata.
The up-mixing and reproduction of matrix-encoded content is not necessarily limited to playback on loudspeakers. The representation of a steered or dominant component consisting of a dominant component signal and (intended) position allows reproduction on headphones by means of convolution with head-related impulse responses (HRIRs) (Wightman et al. 1989). A simple schematic of a system 1 implementing this method is shown in FIG. 1. The input signal 2, in a matrix-encoded format, is first analyzed 3 to determine a dominant component direction and magnitude. The dominant component signal is convolved 4, 5 with a pair of HRIRs derived from a lookup 6 based on the dominant component direction, to compute an output signal for headphone playback 7 such that the playback signal is perceived as coming from the direction that was determined by the dominant component analysis stage 3. This scheme can be applied on wide-band signals as well as on individual subbands, and can be augmented with dedicated processing of residual (or diffuse) signals in various ways.
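The lookup and convolution stages of such a scheme can be sketched as below. The sketch assumes that the dominant component signal and its azimuth have already been produced by the analysis stage. The three-entry `HRIR_TABLE`, its three-tap responses, and the function names are hypothetical placeholders; measured HRIRs are typically far longer and sampled on a much denser grid of directions.

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution, returning the full-length result."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


# Hypothetical toy table: azimuth in degrees (positive = left) mapped to
# a (left-ear HRIR, right-ear HRIR) pair. The far ear is modeled as a
# delayed, attenuated copy; real HRIRs are measured, not constructed.
HRIR_TABLE = {
    -90: ([0.0, 0.0, 0.3], [1.0, 0.0, 0.0]),
    0: ([0.7, 0.0, 0.0], [0.7, 0.0, 0.0]),
    90: ([1.0, 0.0, 0.0], [0.0, 0.0, 0.3]),
}


def lookup_hrir(azimuth_deg):
    """Return the HRIR pair of the nearest tabulated direction."""
    nearest = min(HRIR_TABLE, key=lambda az: abs(az - azimuth_deg))
    return HRIR_TABLE[nearest]


def render_dominant(dominant, azimuth_deg):
    """Convolve the dominant component with the HRIR pair for its direction."""
    h_left, h_right = lookup_hrir(azimuth_deg)
    return convolve(dominant, h_left), convolve(dominant, h_right)


left_ear, right_ear = render_dominant([1.0, 0.5], azimuth_deg=80)
print(left_ear)
print(right_ear)
```

With the estimated direction at 80 degrees, the nearest tabulated entry (90 degrees) is selected, so the left-ear output is the louder and earlier of the two, consistent with a source on the listener's left.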
The use of matrix encoders is very suitable for distribution to and reproduction on AV receivers, but can be problematic for mobile applications requiring low transmission data rates and low power consumption.
Irrespective of whether channel or object-based content is used, matrix encoders and decoders rely on fairly accurate inter-channel phase relationships of the signals that are distributed from matrix encoder to decoder. In other words, the distribution format should be largely waveform preserving. Such dependency on waveform preservation can be problematic in bit-rate constrained conditions, in which audio codecs employ parametric methods rather than waveform coding tools to obtain better audio quality. Examples of such parametric tools, which are generally known not to be waveform preserving, include spectral band replication, parametric stereo, spatial audio coding, and the like, as implemented in MPEG-4 audio codecs (ISO/IEC 14496-3:2009).
As outlined in the previous section, the up-mixer consists of analysis and steering (or HRIR convolution) of signals. For powered devices, such as AV receivers, this generally does not cause problems, but for battery-operated devices such as mobile phones and tablets, the computational complexity and corresponding memory requirements associated with these processes are often undesirable because of their negative impact on battery life.
The aforementioned analysis typically also introduces additional audio latency. Such audio latency is undesirable because (1) it requires video delays to maintain audio-video lip sync, which in turn require a significant amount of memory and processing power, and (2) it may cause asynchrony/latency between head movements and audio rendering in the case of head tracking.
The matrix-encoded down-mix may also not sound optimal on stereo loudspeakers or headphones, due to the potential presence of strong out-of-phase signal components.