Any discussion of the background art throughout the specification should in no way be considered as an admission that such art is widely known or forms part of common general knowledge in the field.
Content creation, coding, distribution and reproduction of audio are traditionally performed in a channel-based format, that is, one specific target playback system is envisioned for content throughout the content ecosystem. Examples of such target playback system audio formats are mono, stereo, 5.1, 7.1, and the like.
If content is to be reproduced on a different playback system than the intended one, a downmixing or upmixing process can be applied. For example, 5.1 content can be reproduced over a stereo playback system by employing specific downmix equations. Another example is playback of stereo encoded content over a 7.1 speaker setup, which may comprise a so-called upmixing process that may or may not be guided by information present in the stereo signal. A system capable of upmixing is Dolby Pro Logic from Dolby Laboratories Inc (Roger Dressler, “Dolby Pro Logic Surround Decoder, Principles of Operation”, www.Dolby.com).
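By way of illustration only, a downmix of this kind can be written as a fixed matrix applied to the input channels. The sketch below assumes the channel order L, R, C, LFE, Ls, Rs, and uses -3 dB gains for the center and surround channels with the LFE channel omitted, which is one commonly used choice. The function name, the channel order and the coefficient values are assumptions made for this example and are not taken from the present specification.

```python
import numpy as np

# Assumed channel order of the 5.1 input: L, R, C, LFE, Ls, Rs.
# Assumed gains: unity for L/R, -3 dB for C and surrounds, LFE dropped.
A = 1.0 / np.sqrt(2.0)  # -3 dB as a linear gain

DOWNMIX_5_1_TO_STEREO = np.array([
    # L    R    C  LFE   Ls   Rs
    [1.0, 0.0,  A, 0.0,   A, 0.0],  # left output
    [0.0, 1.0,  A, 0.0, 0.0,   A],  # right output
])


def downmix_5_1_to_stereo(x):
    """Downmix a 5.1 signal of shape (6, N) to stereo of shape (2, N)."""
    return DOWNMIX_5_1_TO_STEREO @ x
```

With these assumed gains, a signal present only in the center channel appears equally in both stereo outputs, attenuated by 3 dB.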
When stereo or multi-channel content is to be reproduced over headphones, it is often desirable to simulate a multi-channel speaker setup by means of head-related impulse responses (HRIRs), or binaural room impulse responses (BRIRs), which simulate the acoustical pathway from each loudspeaker to the ear drums, in an anechoic or echoic (simulated) environment, respectively. In particular, audio signals can be convolved with HRIRs or BRIRs to re-instate inter-aural level differences (ILDs), inter-aural time differences (ITDs) and spectral cues that allow the listener to determine the location of each individual channel. The simulation of an acoustic environment (reverberation) also helps to achieve a certain perceived distance.
Sound Source Localization and Virtual Speaker Simulation
When stereo, multi-channel or object-based content is to be reproduced over headphones, it is often desirable to simulate a multi-channel speaker setup or a set of discrete virtual acoustic objects by means of convolution with head-related impulse responses (HRIRs), or binaural room impulse responses (BRIRs), which simulate the acoustical pathway from each loudspeaker to the ear drums, in an anechoic or echoic (simulated) environment, respectively.
In particular, audio signals are convolved with HRIRs or BRIRs to re-instate inter-aural level differences (ILDs), inter-aural time differences (ITDs) and spectral cues that allow the listener to determine the location of each individual channel or object. The simulation of an acoustic environment (early reflections and late reverberation) helps to achieve a certain perceived distance.
Turning to FIG. 1, there is illustrated a schematic overview 10 of the processing flow for rendering two object or channel signals xi 13, 11, which are read out of a content store 12 for processing by four HRIRs, e.g. 14. The HRIR outputs are then summed 15, 16, for each channel signal, so as to produce headphone speaker outputs for playback to a listener via headphones 18. The basic principle of HRIRs is, for example, explained in Wightman et al (1989).
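The processing flow of FIG. 1 can be sketched as follows, purely as an illustrative example. Each input signal is convolved with a left-ear and a right-ear impulse response, and the results are summed per ear, so that two inputs require four convolutions. The function name, the array layout and the toy impulse responses are assumptions made for this sketch; measured HRIRs would be used in practice.

```python
import numpy as np


def render_binaural(signals, hrirs):
    """Render inputs to two ear signals by direct convolution.

    signals: array of shape (I, N), one row per object or channel.
    hrirs:   array of shape (I, 2, L), left/right impulse response per input.
    Returns an array of shape (2, N + L - 1).
    """
    num_inputs, num_samples = signals.shape
    hrir_len = hrirs.shape[2]
    out = np.zeros((2, num_samples + hrir_len - 1))
    for i in range(num_inputs):          # cost grows linearly with I
        for ear in range(2):             # 0 = left, 1 = right
            out[ear] += np.convolve(signals[i], hrirs[i, ear])
    return out


# Toy example (hypothetical responses, not measured HRIRs): the far ear
# receives a delayed and attenuated copy of each input.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 64))
h = np.zeros((2, 2, 8))
h[0, 0, 0] = 1.0   # input 0 to left ear
h[0, 1, 3] = 0.5   # input 0 to right ear, 3 samples later, half amplitude
h[1, 0, 3] = 0.5   # input 1 to left ear
h[1, 1, 0] = 1.0   # input 1 to right ear
binaural = render_binaural(x, h)
```

The nested loop makes the cost structure explicit: the number of convolutions is twice the number of inputs.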
The HRIR/BRIR convolution approach comes with several drawbacks, one of them being the substantial amount of processing that is required for headphone playback. The HRIR or BRIR convolution needs to be applied for every input object or channel separately, and hence complexity typically grows linearly with the number of channels or objects. As headphones are typically used in conjunction with battery-powered portable devices, a high computational complexity is not desirable as it will substantially shorten battery life. Moreover, with the introduction of object-based audio content, which may comprise more than 100 objects active simultaneously, the complexity of HRIR convolution can be substantially higher than for traditional channel-based content.
Parametric Coding Techniques
Computational complexity is not the only problem for delivery of channel or object-based content within an ecosystem involving content authoring, distribution and reproduction. In many practical situations, and for mobile applications especially, the data rate available for content delivery is severely constrained. Consumers, broadcasters and content providers have been delivering stereo (two-channel) audio content using lossy perceptual audio codecs with typical bit rates between 48 and 192 kbits/s. These conventional channel-based audio codecs, such as MPEG-1 layer 3 (Brandenburg et al., 1994), MPEG AAC (Bosi et al., 1997) and Dolby Digital (Andersen et al., 2004) have a bit rate that scales approximately linearly with the number of channels. As a result, delivery of tens or even hundreds of objects results in bit rates that are impractical or even unavailable for consumer delivery purposes.
To allow delivery of complex, object-based content at bit rates that are comparable to the bit rate required for stereo content delivery using conventional perceptual audio codecs, so-called parametric methods have been subject to research and development over the last decade. These parametric methods allow reconstruction of a large number of channels or objects from a relatively low number of base signals. These base signals can be conveyed from sender to receiver using conventional audio codecs, augmented with additional (parametric) information to allow reconstruction of the original objects or channels. Examples of such techniques are Parametric Stereo (Schuijers et al., 2004), MPEG Surround (Herre et al., 2008), and MPEG Spatial Audio Object Coding (Herre et al., 2012).
An important aspect of techniques such as Parametric Stereo and MPEG Surround is that these methods aim at a parametric reconstruction of a single, pre-determined presentation (e.g., stereo loudspeakers in Parametric Stereo, and 5.1 loudspeakers in MPEG Surround). In the case of MPEG Surround, a headphone virtualizer can be integrated in the decoder that generates a virtual 5.1 loudspeaker setup for headphones, in which the virtual 5.1 speakers correspond to the 5.1 loudspeaker setup for loudspeaker playback. Consequently, these presentations are not independent in that the headphone presentation represents the same (virtual) loudspeaker layout as the loudspeaker presentation. MPEG Spatial Audio Object Coding, on the other hand, aims at reconstruction of objects that require subsequent rendering.
Turning now to FIG. 2, there will be described, in overview, a parametric system 20 supporting channels and objects. The system is divided into encoder 21 and decoder 22 portions. The encoder 21 receives channels and objects 23 as inputs, and generates a downmix 24 with a limited number of base signals. Additionally, a series of object/channel reconstruction parameters 25 are computed. A signal encoder 26 encodes the base signals from downmixer 24, and includes the computed parameters 25, as well as object metadata 27 indicating how objects should be rendered, in the resulting bit stream.
The decoder 22 first decodes 29 the base signals, followed by channel and/or object reconstruction 30 with the help of the transmitted reconstruction parameters 31. The resulting signals can be reproduced directly (if these are channels) or can be rendered 32 (if these are objects). For the latter, each reconstructed object signal is rendered according to its associated object metadata 33. One example of such metadata is a position vector (for example an x, y, and z coordinate of the object in a 3-dimensional coordinate system).
Decoder Matrixing
Object and/or channel reconstruction 30 can be achieved by time and frequency-varying matrix operations. If the decoded base signals 35 are denoted by zs[n], with s the base signal index, and n the sample index, the first step typically comprises transformation of the base signals by means of a transform or filter bank.
A wide variety of transforms and filter banks can be used, such as a Discrete Fourier Transform (DFT), a Modified Discrete Cosine Transform (MDCT), or a Quadrature Mirror Filter (QMF) bank. The output of such a transform or filter bank is denoted by Zs[k, b], with b the sub-band or spectral index, and k the frame, slot or sub-band time or sample index.
In most cases, the sub-bands or spectral indices are mapped to a smaller set of parameter bands p that share common object/channel reconstruction parameters. This can be denoted by b∈B(p). In other words, B(p) represents a set of consecutive sub bands b that belong to parameter band index p. Conversely, p(b) refers to the parameter band index p that sub band b was mapped to. The sub-band or transform-domain reconstructed channels or objects Ŷj are then obtained by matrixing signals Zs with matrices M[p(b)]:
$$
\begin{bmatrix} \hat{Y}_1[k,b] \\ \vdots \\ \hat{Y}_J[k,b] \end{bmatrix}
= M[p(b)]
\begin{bmatrix} Z_1[k,b] \\ \vdots \\ Z_S[k,b] \end{bmatrix}
$$
The time-domain reconstructed channel and/or object signals yj [n] are subsequently obtained by an inverse transform, or synthesis filter bank.
The above process is typically applied to a certain limited range of sub-band samples, slots or frames k. In other words, the matrices M[p(b)] are typically updated/modified over time. For simplicity of notation, these updates are not denoted here. However, it is considered that the processing of a set of samples k associated with a matrix M[p(b)] can be a time variant process.
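The matrixing step described above can be sketched as follows, as an illustrative example only. The sketch assumes that the sub-band signals are held in an array indexed by base signal, time slot and sub-band, that one matrix of size J by S is stored per parameter band, and that the mapping p(b) is given as an integer array with zero-based indices. The function and variable names, the array layout and the particular band mapping are assumptions made for this sketch, and updates of the matrices over time are not shown.

```python
import numpy as np


def apply_decoder_matrices(Z, M, band_of_subband):
    """Apply M[p(b)] to the base signals in every sub-band b.

    Z:               complex array (S, K, B): base signal, time slot, sub-band.
    M:               complex array (P, J, S): one J-by-S matrix per parameter band.
    band_of_subband: integer array (B,), entry b holds p(b) (zero-based).
    Returns the reconstructed signals as an array of shape (J, K, B).
    """
    M_per_subband = M[band_of_subband]            # shape (B, J, S)
    # For each sub-band b and slot k: Y[:, k, b] = M[p(b)] @ Z[:, k, b]
    return np.einsum("bjs,skb->jkb", M_per_subband, Z)


# Hypothetical sizes: 2 base signals, 3 reconstructed signals, 8 sub-bands
# mapped onto 6 parameter bands, with the top three sub-bands sharing a band.
rng = np.random.default_rng(0)
S, K, B, J = 2, 4, 8, 3
band_of_subband = np.array([0, 1, 2, 3, 4, 5, 5, 5])
P = int(band_of_subband.max()) + 1
Z = rng.standard_normal((S, K, B)) + 1j * rng.standard_normal((S, K, B))
M = rng.standard_normal((P, J, S)) + 1j * rng.standard_normal((P, J, S))
Y_hat = apply_decoder_matrices(Z, M, band_of_subband)
```

Sharing one matrix across all sub-bands of a parameter band is what keeps the number of transmitted parameters small relative to the number of sub-bands.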
In some cases, in which the number of reconstructed signals J is significantly larger than the number of base signals S, it is often helpful to use optional decorrelator outputs Dm [k, b] operating on one or more base signals that can be included in the reconstructed output signals:
$$
\begin{bmatrix} \hat{Y}_1[k,b] \\ \vdots \\ \hat{Y}_J[k,b] \end{bmatrix}
= M[p(b)]
\begin{bmatrix} Z_1[k,b] \\ \vdots \\ Z_S[k,b] \\ D_1[k,b] \\ \vdots \\ D_M[k,b] \end{bmatrix}
$$
FIG. 3 illustrates schematically one form of channel or object reconstruction unit 30 of FIG. 2 in more detail. The input signals 35 are first processed by analysis filter banks 41, followed by optional decorrelation (D1, D2) 44 and matrixing 42, and a synthesis filter bank 43. The matrix M[p(b)] manipulation is controlled by reconstruction parameters 31.
Minimum Mean Square Error (MMSE) Prediction for Object/Channel Reconstruction
Although different strategies and methods exist to reconstruct objects or channels from a set of base signals Zs[k, b], one particular method is often referred to as a minimum mean square error (MMSE) predictor which uses correlations and covariance matrices to derive matrix coefficients M that minimize the L2 norm between a desired and reconstructed signal. For this method, it is assumed that the base signals zs[n] are generated in the downmixer 24 of the encoder as a linear combination of input object or channel signals xi [n]:
$$
z_s[n] = \sum_i g_{i,s}\, x_i[n]
$$
For channel-based input content, the amplitude panning gains gi,s are typically constant, while for object-based content, in which the intended position of an object is provided by time-varying object metadata, the gains gi,s can consequently be time variant. This equation can also be formulated in the transform or sub band domain, in which case a set of gains gi,s[k] is used for every frequency bin/band k, and as such, the gains gi,s[k] can be made frequency variant:
$$
Z_s[k,b] = \sum_i g_{i,s}[k]\, X_i[k,b]
$$
The decoder matrix 42, ignoring the decorrelators for now, produces:
$$
\begin{bmatrix} \hat{Y}_1[k,b] \\ \vdots \\ \hat{Y}_J[k,b] \end{bmatrix}^{T}
=
\begin{bmatrix} Z_1[k,b] \\ \vdots \\ Z_S[k,b] \end{bmatrix}^{T}
M[p(b)]
$$
or in matrix formulation, omitting the sub-band index b and parameter band index p for clarity:
$$
Y = ZM, \qquad Z = XG
$$
The criterion for computing the matrix coefficients M by the encoder is to minimize the mean-square error E which represents the square error between decoder outputs Ŷj and original input objects/channels Xj:
$$
E = \sum_{j,k,b} \left( \hat{Y}_j[k,b] - X_j[k,b] \right)^2
$$
The matrix coefficients that minimize E are then given in matrix notation by:
$$
M = (Z^{*} Z + \epsilon I)^{-1} Z^{*} X
$$
with ε being a regularization constant, and (*) the complex conjugate transpose operator. This operation can be performed for each parameter band p independently, producing a matrix M[p(b)].
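The closed-form expression above can be evaluated numerically as in the following sketch, given as an illustrative example only. The sketch assumes that all sub-band samples belonging to one parameter band are stacked as rows, so that Z has one column per base signal and X one column per original object or channel, consistent with the convention Y = ZM. The function name, the array layout, the random test signals and the value of the regularization constant are assumptions made for this sketch.

```python
import numpy as np


def mmse_matrix(Z, X, eps=1e-9):
    """Regularized least-squares estimate of M such that Z @ M approximates X.

    Z: complex array (N, S), rows are sub-band samples of the base signals.
    X: complex array (N, J), rows are sub-band samples of the target signals.
    Returns M with shape (S, J).
    """
    num_base = Z.shape[1]
    gram = Z.conj().T @ Z + eps * np.eye(num_base)   # Z*Z + eps*I
    cross = Z.conj().T @ X                           # Z*X
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(gram, cross)


# Hypothetical example: 5 objects downmixed to 2 base signals.
rng = np.random.default_rng(0)
N, I, S = 256, 5, 2
X = rng.standard_normal((N, I)) + 1j * rng.standard_normal((N, I))
G = rng.uniform(0.0, 1.0, size=(I, S))   # assumed downmix gains
Z = X @ G
M = mmse_matrix(Z, X)
Y_hat = Z @ M
residual = Y_hat - X
```

For a small regularization constant, the residual is, to within numerical precision, orthogonal to the base signals, which is the defining property of a least-squares solution.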
Minimum Mean Square Error (MMSE) Prediction for Representation Transformation
Besides reconstruction of objects and/or channels, parametric techniques can be used to transform one representation into another representation. An example of such representation transformation is to convert a stereo mix intended for loudspeaker playback into a binaural representation for headphones, or vice versa.
FIG. 4 illustrates the control flow for a method 50 for one such representation transformation. Object or channel audio is first processed in an encoder 52 by a hybrid Quadrature Mirror Filter analysis bank 54. A loudspeaker rendering matrix G is computed and applied 55 to the object signals Xi stored in storage medium 51 based on the object metadata using amplitude panning techniques, to result in a stereo loudspeaker presentation Z. This loudspeaker presentation can be encoded with an audio coder 57.
Additionally, a binaural rendering matrix H is generated and applied 58 using an HRTF database 59. This matrix H is used to compute binaural signals Yj which allow reconstruction of a binaural mix using the stereo loudspeaker mix as input. The matrix coefficients M are encoded by audio encoder 57.
The encoded information is transmitted from encoder 52 to decoder 53, where it is unpacked 61 into components M and Zs. If loudspeakers are used as a reproduction system, the loudspeaker presentation is reproduced using channel information Zs and hence the matrix coefficients M are discarded. For headphone playback, on the other hand, the loudspeaker presentation is first transformed 62 into a binaural presentation by applying the time and frequency-varying matrix M prior to hybrid QMF synthesis and reproduction 60.
If the desired binaural output from matrixing element 62 is written in matrix notation as:
$$
Y = XH
$$
then the matrix coefficients M can be obtained in encoder 52 by:
$$
M = (G^{*} X^{*} X G + \epsilon I)^{-1} G^{*} X^{*} X H
$$
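The computation of the transformation matrix can be sketched as follows, as an illustrative example only. The sketch assumes that the object signals of one parameter band are stacked as rows of X, that G is a real-valued loudspeaker rendering matrix with two output columns, and that H is a complex-valued binaural rendering matrix with two output columns. The function name and the random placeholder values for X, G and H are assumptions made for this sketch; in particular, H is not derived from an HRTF database here.

```python
import numpy as np


def transform_matrix(X, G, H, eps=1e-9):
    """Estimate M such that (X @ G) @ M approximates X @ H.

    X: complex array (N, I), sub-band samples of the I objects.
    G: array (I, 2), loudspeaker rendering matrix.
    H: complex array (I, 2), binaural rendering matrix.
    Returns the complex 2-by-2 transformation matrix M.
    """
    cov = X.conj().T @ X                                    # X*X
    lhs = G.conj().T @ cov @ G + eps * np.eye(G.shape[1])   # G*X*XG + eps*I
    rhs = G.conj().T @ cov @ H                              # G*X*XH
    return np.linalg.solve(lhs, rhs)


# Hypothetical example with 4 objects and placeholder rendering matrices.
rng = np.random.default_rng(1)
N, I = 256, 4
X = rng.standard_normal((N, I)) + 1j * rng.standard_normal((N, I))
G = rng.uniform(0.0, 1.0, size=(I, 2))
H = rng.standard_normal((I, 2)) + 1j * rng.standard_normal((I, 2))
M = transform_matrix(X, G, H)

Z = X @ G        # stereo loudspeaker presentation
Y = X @ H        # desired binaural presentation
Y_hat = Z @ M    # binaural presentation predicted from the stereo mix
```

Written this way, the object signals enter only through the covariance matrix X*X, and the result is a single 2-by-2 complex matrix per parameter band regardless of the number of objects.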
In this application, the coefficients of encoder matrix H applied in 58 are typically complex-valued, e.g. having a delay or phase modification element, to allow reinstatement of inter-aural time differences which are perceptually very relevant for sound source localization on headphones. In other words, the binaural rendering matrix H is complex valued, and therefore the transformation matrix M is complex valued. For perceptually transparent reinstatement of sound source localization cues, it has been shown that a frequency resolution that mimics the frequency resolution of the human auditory system is desired (Breebaart 2010).
In the sections above, a minimum mean-square error criterion is employed to determine the matrix coefficients M. Without loss of generality, other well-known criteria or methods to compute the matrix coefficients can be used similarly to replace or augment the minimum mean-square error principle. For example, the matrix coefficients M can be computed using higher-order error terms, or by minimization of an L1 norm (e.g., least absolute deviation criterion). Furthermore, various methods can be employed including non-negative factorization or optimization techniques, non-parametric estimators, maximum-likelihood estimators, and the like. Additionally, the matrix coefficients may be computed using iterative or gradient-descent processes, interpolation methods, heuristic methods, dynamic programming, machine learning, fuzzy optimization, simulated annealing, or closed-form solutions, and analysis-by-synthesis techniques may be used. Last but not least, the matrix coefficient estimation may be constrained in various ways, for example by limiting the range of values, regularization terms, superposition of energy-preservation requirements and the like.
Transform and Filter-Bank Requirements
Depending on the application, and whether objects or channels are to be reconstructed, certain requirements can be imposed on the transform or filter bank frequency resolution for filter bank unit 41 of FIG. 3. In most practical applications, the frequency resolution is matched to the assumed resolution of the human hearing system to give best perceived audio quality for a given bit rate (determined by the number of parameters) and complexity. It is known that the human auditory system can be thought of as a filter bank with a non-linear frequency resolution. These filters are referred to as critical bands (Zwicker, 1961) and are approximately logarithmic in nature. At low frequencies, the critical bands are less than 100 Hz wide, while at high frequencies, the critical bands can be found to be wider than 1 kHz.
This non-linear behavior can pose challenges when it comes to filter bank design. Transforms and filter banks can be implemented very efficiently using symmetries in their processing structure, provided that the frequency resolution is constant across frequency.
This implies that the transform length, or number of sub-bands, will be determined by the critical bandwidth at low frequencies, and mapping of DFT bins onto so-called parameter bands can be employed to mimic a non-linear frequency resolution. Such a mapping process is for example explained in Breebaart et al. (2005) and Breebaart et al. (2010). One drawback of this approach is that a very long transform is required to meet the low-frequency critical bandwidth constraint, while the transform is relatively long (or inefficient) at high frequencies. An alternative solution to enhance the frequency resolution at low frequencies is to use a hybrid filter bank structure. In such a structure, a cascade of two filter banks is employed, in which the second filter bank enhances the resolution of the first, but only in a few of the lowest sub bands (Schuijers et al., 2004).
FIG. 5 illustrates one form of hybrid filter bank structure 41 similar to that set out in Schuijers et al. The input signal z[n] is first processed by a complex-valued Quadrature Mirror Filter analysis bank (CQMF) 71. Subsequently, the signals are down-sampled by a factor Q, e.g. 72, resulting in sub-band signals Z[k, b] with k the sub-band sample index, and b the sub band frequency index. Furthermore, at least one of the resulting sub-band signals is processed by a second (Nyquist) filter bank 74, while the remaining sub-band signals are delayed 75 to compensate for the delay introduced by the Nyquist filter bank. In this particular example, the cascade of filter banks results in 8 sub bands (b=1, . . . , 8) which are mapped onto 6 parameter bands p=(1, . . . , 6) with a non-linear frequency resolution. The bands 76 are merged together to form a single parameter band (p=6).
The benefit of this approach is a lower complexity compared to using a single filter bank with many more (narrower) sub bands. The disadvantage, however, is that the delay of the overall system increases significantly, and consequently, the memory usage is also significantly higher which causes an increase in power consumption.