In digital signal processing, filters are typically represented by and stored as series or sets of filter coefficients. These coefficients may in turn represent e.g. the sampled impulse response of the filter in either the time or frequency domain or the coefficients of a difference equation such as
$$y[t] = \sum_{n=0}^{N_b} b_n\, x[t-n] - \sum_{n=0}^{N_a} a_n\, y[t-n] \qquad (1)$$
where t is the index of the time sample, y is the output of the filter, x is its input, and a and b are the sets of coefficients representing the filter. In the case where the filter is represented by its impulse response, the number of coefficients stored is dependent on which block length is necessary to describe the impulse response of the filter. In the other case where the coefficients in the difference equation (Eq. (1)) are used the number of coefficients is determined by the filter order. It should be noted that for FIR filters (all $a_n = 0$), the time domain impulse response coincides with the coefficients $b_n$.
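As an illustration of Eq. (1), the following minimal Python sketch evaluates the difference equation sample by sample. The function name `difference_filter` is hypothetical. The sketch assumes that the feedback sum starts at n = 1, since a term with n = 0 would refer to the output sample currently being computed, and that all samples before t = 0 are zero.

```python
def difference_filter(b, a, x):
    """Evaluate y[t] = sum_n b[n]*x[t-n] - sum_{n>=1} a[n]*y[t-n].

    Assumption: a[0] is ignored (feedback starts at n = 1) and the signal
    is zero before t = 0.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for n, bn in enumerate(b):          # feed-forward part
            if t - n >= 0:
                acc += bn * x[t - n]
        for n in range(1, len(a)):          # feedback part
            if t - n >= 0:
                acc -= a[n] * y[t - n]
        y.append(acc)
    return y


impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
# FIR case (no feedback): the impulse response reproduces the b coefficients.
fir = difference_filter([0.5, 0.3, 0.2], [], impulse)
# First-order recursive case: y[t] = x[t] + 0.5*y[t-1].
iir = difference_filter([1.0], [1.0, -0.5], impulse)
```

In the FIR case the stored coefficients and the impulse response are the same numbers, whereas the recursive filter needs only two coefficients to describe an impulse response that never ends.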
One application in which filters represented by their frequency domain impulse response are stored and used, and in which the current invention is useful, is the binaural decoding of surround sound.
In film theaters around the world, multi-channel surround audio systems have long placed film audiences in the center of the audio spaces of the film scenes being played before them, giving them a realistic and convincing feeling of “being there”. This audio technology has moved into the homes of ordinary people as home surround sound theatre systems and now provides them with the sense of “being there” in their own living rooms.
The next field where this audio technology will be used includes mobile wireless units or terminals, in particular small units such as cellular telephones and PDAs (Personal Digital Assistants). There the immersive nature of the surround sound is even more important because of the small sizes of the displays. Moving this technology to mobile units is, however, not a trivial matter. The main obstacles include that:
    1. The available bit-rate is in many cases low in wireless mobile channels.
    2. The processing power of mobile terminals is often limited.
    3. Small mobile terminals generally have only two micro speakers and earplugs or headphones.
This means, in particular for mobile terminals such as cellular telephones, that a surround sound solution for a mobile terminal has to use a much lower bit rate than the 384 kbit/s used in the Dolby Digital 5.1 system. Due to the limited processing power, the decoders of the mobile terminals must be computationally optimized, and due to the speaker configuration of the mobile terminal, the surround sound must be delivered through the earplugs or headphones.
A standard way of delivering multi-channel surround sound through headphones or earplugs is to perform a 3D audio or binaural rendering of each of the speaker signals.
In general, in 3D audio rendering a model of the audio scene is used and each incoming monophonic signal is filtered through a set of filters that model the transformations created by the human head, torso and ears. These filters are called head related filters (HRFs) having head related transfer functions (HRTFs) and if appropriately designed, they give a good 3D audio scene perception.
The diagram of FIG. 1 illustrates a method of complete 3D audio rendering of an audio signal according to a 5.1 surround system. The six multi-channel signals according to the 5.1 surround system are:
    surround right (SR),
    right (R),
    center (C),
    low frequency (LFE),
    left (L),
    surround left (SL).
In the example illustrated in FIG. 1 the center and low frequency signals are combined into one signal. Since the acoustics for the left and the right side are assumed to be symmetric, the head related filtering can be implemented using five different filters $H_I^B$, $H_C^B$, $H_C$, $H_I^F$ and $H_C^F$. Seen from one side of the head, these filters model the acoustics for sound arriving from speakers located on the same side (ipsilateral) and opposite side (contralateral) of the head, here denoted by the subscript indices I and C. These dimensions are combined with the origin on the medial axis (front or back), giving the superscript indices F and B. The sound located in the center of the audio scene is modeled with the filter $H_C$.
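The symmetric five-filter rendering described above can be sketched in the time domain as follows. This is only an illustrative sketch: the function names, the dictionary keys ("IF", "IB", "CF", "CB", "C") and the toy filter values are assumptions and are not taken from FIG. 1. All channels are assumed to have equal length, as are all filters.

```python
def convolve(h, x):
    """Plain linear convolution of filter h with signal x."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y


def add(*signals):
    """Sample-wise sum of equally long signals."""
    return [sum(samples) for samples in zip(*signals)]


def render_binaural(ch, hrf):
    """ch: channel name -> samples; hrf: filter key -> impulse response."""
    centre = add(ch["C"], ch["LFE"])           # C and LFE share one filter
    left = add(convolve(hrf["IF"], ch["L"]),   # ipsilateral front
               convolve(hrf["IB"], ch["SL"]),  # ipsilateral back
               convolve(hrf["CF"], ch["R"]),   # contralateral front
               convolve(hrf["CB"], ch["SR"]),  # contralateral back
               convolve(hrf["C"], centre))
    right = add(convolve(hrf["IF"], ch["R"]),
                convolve(hrf["IB"], ch["SR"]),
                convolve(hrf["CF"], ch["L"]),
                convolve(hrf["CB"], ch["SL"]),
                convolve(hrf["C"], centre))
    return left, right


# Toy filters (hypothetical values, two taps each).
hrf = {"IF": [1.0, 0.0], "IB": [0.8, 0.1], "CF": [0.0, 0.5],
       "CB": [0.0, 0.4], "C": [0.7, 0.0]}
silence = [0.0, 0.0, 0.0]
channels = {name: list(silence) for name in ("L", "R", "C", "LFE", "SL", "SR")}
channels["L"] = [1.0, 0.0, 0.0]               # impulse in the left speaker only
left, right = render_binaural(channels, hrf)
```

With an impulse in the left speaker only, the left ear receives the ipsilateral front response and the right ear the contralateral front response, which in this toy setup is delayed and attenuated.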
The quality in terms of 3D perception of such rendering depends on how closely the HRFs model or represent the listener's own head related filtering when she/he is listening. Hence, it may be advantageous if the HRFs can be adapted and personalized for each listener if a good or very good quality is desired. This adaptation and personalization step may include modeling, measurement and in general a user dependent tuning in order to refine the quality of the perceived 3D audio scene.
Current state-of-the-art standardized multi-channel audio codecs require a large bandwidth or a high bit-rate in order to reach an acceptable quality, which precludes their use for services such as wireless mobile streaming.
For instance, even if the Dolby Digital 5.1 codec (AC-3 codec) has a very low complexity when compared to an AAC multi-channel codec, it requires a much higher bit-rate for similar quality. Both codecs, the AAC multi-channel codec and the AC-3 codec, remain to this day unusable in the wireless mobile domain because of the high demands that they make on computational complexity and bitrate.
New parametric multi-channel codecs based on the principles of binaural cue coding have been developed. The recently standardized parametric stereo tool is a good example of the low complexity/high quality parametric technique for encoding stereophonic sound. The extension of parametric stereo to multi-channel coding is currently under standardization in MPEG under the name Spatial Audio coding, and is also known as MPEG-surround.
The principles of parametric multi-channel coding can be explained and understood from the block diagram of FIG. 2 that illustrates a general case. A parametric surround encoder 3, also called a multi-channel parametric surround encoder, receives a multichannel, composite audio signal comprising the individual signals $x_1(n)$ to $x_N(n)$, where N is the number of input channels. For a 5.1 surround system N=6 as stated above. The encoder 3 then forms in a down-mixing unit 5 a composite down-mixed signal comprising the individual down-mixed signals $z_1(n)$ to $z_M(n)$. The number M of down-mixed channels (M<N) depends on the required or allowable maximum bit-rate, the required quality and the availability of an M-channel audio encoder 7. One key aspect of the encoding process is that the down-mixed composite signal, typically a stereo signal although it could also be a mono signal, is derived from the multi-channel input signal, and it is this down-mixed composite signal, rather than the original multi-channel signal, that is compressed in the audio encoder 7 for transmission over the channel 11. The parametric encoder 3, and in particular the down-mixing unit 5 thereof, may be capable of performing a down-mixing process such that it creates a more or less true equivalent of the multi-channel signal in the mono or stereo domain. The parametric surround encoder also comprises a spatial parameter estimation unit 9 that computes, from the input signals $x_1(n)$ to $x_N(n)$, the cues or spatial parameters that in some way can be said to describe the down-mixing process or the assumptions made therein. The compressed audio signal output from the M-channel audio encoder is the main signal. Together with the spatial parameters, which constitute side information, it is transmitted over an interface 11, such as a wireless interface, to the receiving side, which in the case considered here typically is a mobile terminal.
Alternatively, the down-mixing could be supplied by some external unit, such as from a unit employing Artistic Downmix.
On the receiving side, a complementary parametric surround decoder 13 includes an audio decoder 15 and should be constructed to be capable of creating the best possible multi-channel decoding based on knowledge of the down-mixing algorithm used on the transmitting side and the encoded spatial parameters or cues that are received in parallel to the compressed multi-channel signal. The audio decoder 15 produces signals $\hat{z}_1(n)$ to $\hat{z}_M(n)$ that should be as similar as possible to the signals $z_1(n)$ to $z_M(n)$ on the transmitting side. These are, together with the spatial parameters, input to a spatial synthesis unit 17 that produces output signals $\hat{x}_1(n)$ to $\hat{x}_N(n)$ that should be as similar as possible to the original input signals $x_1(n)$ to $x_N(n)$ on the transmitting side. The output signals $\hat{x}_1(n)$ to $\hat{x}_N(n)$ can be input to a binaural rendering system such as that shown in FIG. 1.
Depending on the bandwidth of the transmission channel over the interface 11, which generally is relatively low, there will be a loss of information, and hence the signals $\hat{z}_1(n)$ to $\hat{z}_M(n)$ and $\hat{x}_1(n)$ to $\hat{x}_N(n)$ on the receiving side cannot be the same as their counterparts on the transmitting side. Even though they are not quite true equivalents of their counterparts, they may be sufficiently good equivalents perceptually.
In general, such a surround encoding process is independent of the compression algorithm used for the transmitted channels in the audio encoder 7 and the audio decoder 15 of FIG. 2. The encoding process can use any of a number of high-performance compression algorithms such as AMR-WB+, MPEG-1 Layer III, MPEG-4 AAC or MPEG-4 High Efficiency AAC, or it could even use PCM.
In general, the above operations are done in the transformed signal domain, the transform used being e.g. the Fourier transform or MDCT. This is especially beneficial if the spatial parameter estimation and synthesis in the units 9 and 17 use the same type of transform as that used in the audio encoder 7, also called core codec.
FIG. 3 is a detailed block diagram of an efficient parametric audio encoder. The N-channel discrete time input signal, denoted in vector form as $x^N(n)$, is first transformed to the frequency domain in a transform unit 21, and in general to a transform domain that gives a signal $x^N(k,m)$. The index k is the index of the transform coefficients, or sub-bands if a frequency domain transform is chosen. The index m represents the decimated time domain index that is also related to the input signal possibly through overlapped frames.
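The relation between the time index n and the index pair (k, m) can be illustrated with a framed discrete Fourier transform. The sketch below is not the transform of the transform unit 21; it assumes a plain DFT without windowing, and the function name `framed_dft` and its parameters `frame_len` and `hop` are hypothetical.

```python
import cmath


def framed_dft(x, frame_len, hop):
    """Return frames[m][k]: DFT coefficient k of the m-th overlapped frame.

    Assumption: rectangular frames, no window, incomplete last frame dropped.
    """
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        coeffs = [
            sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len)
        ]
        frames.append(coeffs)
    return frames


# A constant signal of 8 samples, frames of 4 samples overlapping by 2.
frames = framed_dft([1.0] * 8, frame_len=4, hop=2)
```

Here the decimated time index m advances by one for every `hop` input samples, while k indexes the coefficients within a frame. For the constant input, all energy ends up in the coefficient k = 0 of every frame.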
The signal is thereafter down-mixed in a down-mixing unit 5 to generate the M-channel downmix signal $z^M(k,m)$, where M<N. A sequence of spatial model parameter vectors $p^N(k,m)$ is estimated in an estimation unit 9. This can be done in either an open-loop or a closed-loop fashion.
Spatial parameters consist of psycho-acoustical cues that are representative of the surround sound sensation. For instance, in an MPEG surround encoder, these parameters consist of inter-channel differences in level, phase and coherence, equivalent to the ILD, ITD and IC cues, that capture the spatial image of a multi-channel audio signal relative to a transmitted down-mixed signal $z^M(k,m)$ or, in the closed-loop case, the decoded signal $\tilde{z}^M(k,m)$. The cues $p^N(k,m)$ can be encoded in a very compact form, such as in a spatial parameter quantization unit 23 producing the signal $\tilde{p}^N(k,m)$, followed by a spatial parameter encoder 25. The M-channel audio encoder 7 produces the main bitstream which in a multiplexer 27 is multiplexed with the spatial side information produced by the parameter encoder. From the multiplexer the multiplexed signal is transmitted to a demultiplexer 29 on the receiving side in which the side information and the main bitstream are recovered as seen in the block diagram of FIG. 4.
On the receiving side the main bitstream is decoded to synthesize a high quality multichannel representation using the received spatial parameters. The main bitstream is first decoded in an M-channel audio decoder 31 from which the decoded signals $\hat{z}^M(k,m)$ are input to the spatial synthesis unit 17. The spatial side information holding the spatial parameters is extracted by the demultiplexer 29 and provided to a spatial parameter decoder 33 that produces the decoded parameters $\hat{p}^N(k,m)$ and transmits them to the synthesis unit 17. The spatial synthesis unit produces the signal $\hat{x}^N(k,m)$, which is provided to the signal F/T transform unit 35 transforming it into the time domain to produce the signal $\hat{x}^N(n)$, i.e. the multichannel decoded signal.
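To make the level and coherence cues more concrete, the following sketch down-mixes two channels within one band and computes a level difference in dB and a normalized coherence. It is a generic textbook-style illustration and not the estimator of an MPEG surround encoder; the function name `downmix_and_cues`, the averaging down-mix and the formulas are assumptions, and both channels are assumed to have non-zero energy.

```python
import math


def downmix_and_cues(x1, x2):
    """x1, x2: complex transform coefficients of two channels in one band."""
    z = [(a + b) / 2 for a, b in zip(x1, x2)]        # simple mono down-mix
    e1 = sum(abs(a) ** 2 for a in x1)                # channel energies
    e2 = sum(abs(b) ** 2 for b in x2)
    cross = sum(a * b.conjugate() for a, b in zip(x1, x2))
    level_db = 10 * math.log10(e1 / e2)              # level difference
    coherence = abs(cross) / math.sqrt(e1 * e2)      # 1.0 = fully coherent
    return z, level_db, coherence


# The second channel is the first one scaled by 0.5.
x1 = [1 + 0j, 0 + 1j]
x2 = [0.5 + 0j, 0 + 0.5j]
z, level_db, coherence = downmix_and_cues(x1, x2)
```

Because the second channel is a scaled copy of the first, the coherence is one and the level difference is about 6 dB. A decoder that knows the down-mix and these two numbers could restore the level relation between the channels without receiving both channels.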
A 3D audio rendering of a multi-channel surround sound can be delivered to a mobile terminal user by using an efficient parametric surround decoder to first obtain the multiple surround sound channels, using for instance the multi-channel decoder described above with reference to FIG. 4. Thereupon, the system illustrated in FIG. 1 is used to synthesize a binaural 3D-audio rendered multichannel signal. This operation is shown in the schematic of FIG. 5.
Alternatively, a more efficient binaural decoder as described in International patent application No. PCT/SE2007/000006, “Personalized Decoding of Multi-Channel Surround Sound”, filed Jan. 5, 2007, can be used. The operation of this binaural decoder is summarized below.
The processing in an MPEG surround decoder can be defined by two matrix multiplications as illustrated in the diagram of FIG. 15, the multiplications shown as including matrix units M1 and M2, also called the predecorrelator matrix unit and the mix matrix unit, respectively, to which the respective signals are input. The first matrix multiplication forms the input signals to decorrelation units or decorrelators D1, D2, . . . , and the second matrix multiplication forms the output signals based on the down-mix input and the output from the decorrelators. The above operations are done for each hybrid subband, indexed by the hybrid subband index k.
In the following, the index n is used for the number of a time slot, k is used to index a hybrid subband, and l is used to index the parameter set. The processing of the input channels to form the output channels can then be described as:
$$v^{n,k} = M_1^{n,k} x^{n,k} \qquad (2)$$
$$y^{n,k} = M_2^{n,k} w^{n,k} \qquad (3)$$
where $M_1^{n,k}$ is a two-dimensional matrix mapping a certain number of input channels to a certain number of channels going into the decorrelators, and is defined for every time-slot n and every hybrid subband k, and $M_2^{n,k}$ is a two-dimensional matrix mapping a certain number of pre-processed channels to a certain number of output channels and is defined for every time-slot n and every hybrid subband k. The matrix $M_2^{n,k}$ comes in two versions depending on whether time-domain temporal shaping (TP) or temporal envelope shaping (TES) of the decorrelated signal is used, the two versions denoted $M_{2\_wet}^{n,k}$ and $M_{2\_dry}^{n,k}$. Both matrices $M_1^{n,k}$ and $M_2^{n,k}$ are derived using the binaural parameters transmitted to the decoder. The derivation of $M_1^{n,k}$ and $M_2^{n,k}$ is described in further detail in ISO/IEC 14496-3:200X/PDAM 4, MPEG Surround N7530, October 2005, Nice, France.
The input vector $x^{n,k}$ to the first matrix unit M1 corresponds to the decoded signals $\hat{z}^M(k,m)$ of FIG. 4 obtained from the M-channel audio decoder 31. The vector $w^{n,k}$ that is input to the mix matrix unit M2 is a combination of the outputs d1, d2, . . . from the decorrelators D1, D2, . . . , the output from the first matrix multiplication, i.e. from the predecorrelator matrix unit M1, and residual signals res1, res2, . . . , and is defined for every time-slot n and every hybrid subband k. The output vector $y^{n,k}$ has components lf, ls, rf, rs, cf and lfe that basically correspond to the signals L, SL, R, SR, C and LFE as described above. The components cannot be used directly: they must be transformed to the time domain and in some way be rendered before being provided to the earphones used.
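The two matrix multiplications of Eqs. (2) and (3) can be sketched for a single time slot and a single hybrid subband as follows. The sketch is heavily simplified: it uses a mono down-mix, two output channels, no residual signals, and a hypothetical stand-in decorrelator (a 90 degree phase rotation), whereas real decorrelators are filters with memory. All names and matrix values are assumptions.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def toy_decorrelator(s):
    # Hypothetical stand-in: rotates the phase by 90 degrees.
    return 1j * s


def decode_slot(m1, m2, x):
    """One time slot n and one hybrid subband k."""
    v = matvec(m1, x)                                   # Eq. (2)
    # w combines the direct signal with the decorrelator outputs.
    w = [v[0]] + [toy_decorrelator(s) for s in v[1:]]
    return matvec(m2, w)                                # Eq. (3)


m1 = [[1.0], [1.0]]                  # mono down-mix -> direct + decorrelator input
m2 = [[0.5, 0.5], [0.5, -0.5]]       # mixes direct and decorrelated parts
y = decode_slot(m1, m2, [2 + 0j])
```

The two outputs share the same direct part but receive the decorrelated part with opposite signs, which is the basic mechanism by which a mix matrix can control the coherence between output channels.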
A method for 3D audio rendering uses a decoder that includes a “Reconstruct from Model” block. This block takes extra input, such as a representation of the personal HRFs and other rendering parameters in the hybrid filter-bank domain (compare items 43, 37 and 17′ of FIG. 14), and uses it to transform derivatives of the model parameters into other model parameters that allow the two binaural signals to be generated directly in the transform domain. Only the binaural 2-channel signal then has to be transformed into the discrete time domain, such as in the transform unit 35 of FIG. 14, which illustrates personalized binaural decoding based on MPEG surround.
A third matrix $M_3^{n,k}$, symbolically shown as the parameter modification matrix M3 in FIG. 16, is in this example a linear mapping from 6 channels to two channels, which are used as input to the user headphones 39 through the transform unit 35. The matrix multiplication can be written as
$$\begin{pmatrix} r \\ l \end{pmatrix} = M_3^{n,k}\, y^{n,k} \qquad (4)$$
By linearity (the associative law) it is clear that the matrices $M_2^{n,k}$ and $M_3^{n,k}$ can be combined together to form a new set of parameters stored in a new mix matrix $M_4^{n,k} = M_3^{n,k} M_2^{n,k}$. This combining operation is illustrated in FIG. 17, where the multiplication unit corresponding to the new matrix is shown as the mix matrix unit M4 and the multiplication of the two matrices is made in a multiplying unit 45.
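The effect of combining the two matrices can be checked numerically. The following sketch uses small toy matrices with integer entries (a 2×3 matrix standing in for M3 and a 3×2 matrix standing in for M2); the sizes and values are assumptions chosen only to keep the arithmetic exact.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


m3 = [[1, 2, 3], [4, 5, 6]]        # toy stand-in for M3 (2 outputs)
m2 = [[1, 0], [0, 1], [1, 1]]      # toy stand-in for M2
w = [2, 3]

two_step = matvec(m3, matvec(m2, w))   # first Eq. (3), then Eq. (4)
m4 = matmul(m3, m2)                    # combined mix matrix
one_step = matvec(m4, w)               # single multiplication
```

Both routes give the same two output values, but the combined matrix is applied with one multiplication per time slot and subband, and the intermediate multi-channel signal is never formed.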
The new mix matrix $M_4^{n,k}$ has parameters that depend on both the bit-stream parameters and the user-predefined head related filters (HRFs) and, if desired, on other dynamic rendering parameters as well.
For the case of head related filters only, the matrix $M_3^{n,k}$ can be written as
$$M_3^{n,k} = \begin{bmatrix} H_C^F(k) & H_C^B(k) & H_I^F(k) & H_I^B(k) & H_C(k) & H_C(k) \\ H_I^F(k) & H_I^B(k) & H_C^F(k) & H_C^B(k) & H_C(k) & H_C(k) \end{bmatrix} \qquad (5)$$
the matrix elements being the five different filters which are used to implement the head related filtering and as above are denoted $H_I^B$, $H_C^B$, $H_C$, $H_I^F$ and $H_C^F$. In the system of FIG. 15 the filters are represented in the hybrid domain. Such operations of transforming a representation of filters from the time domain to the frequency or transform domain are well known in the signal processing literature.
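The structure of the matrix in Eq. (5) can be sketched for one hybrid subband k as follows. The dictionary keys ("IF", "IB", "CF", "CB", "C"), the function names and the complex gain values are hypothetical; each filter is assumed to be represented by a single complex gain in the subband.

```python
def build_m3(h):
    """h: filter key -> complex gain of that filter in one hybrid subband."""
    row_r = [h["CF"], h["CB"], h["IF"], h["IB"], h["C"], h["C"]]
    row_l = [h["IF"], h["IB"], h["CF"], h["CB"], h["C"], h["C"]]
    return [row_r, row_l]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


# Toy gains (hypothetical values) for one subband.
h = {"IF": 1.0 + 0j, "IB": 0.8 + 0j, "CF": 0.5j, "CB": 0.4j, "C": 0.7 + 0j}
m3 = build_m3(h)

# y has components (lf, ls, rf, rs, cf, lfe); only lf is active here.
y = [1 + 0j, 0j, 0j, 0j, 0j, 0j]
r, l = matvec(m3, y)
```

The second row is the first row with the left and right column pairs swapped, which is how the assumed left/right symmetry lets five stored filters fill all twelve matrix entries.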
Here the filters that form the matrix $M_3^{n,k}$ are functions of the hybrid subband index k and are similar to those illustrated in FIG. 1. This is further detailed in FIG. 17. First, the spatial parameters represented in the 20 spatial parameter bands are used to generate the matrix $M_{2p}^{n,j}$, the mixing matrix in the parameter band domain. Then, the mapping function given in Table A.30 of the above cited ISO/IEC document is used to map the parameter subband indices to the hybrid subband indices k to generate $M_2^{n,k}$. The resulting mixing matrix is multiplied with $M_3^{n,k}$, and the result is the final mixing matrix $M_4^{n,k}$.
It should be noted that for this simple case the matrix $M_3^{n,k}$ is independent of the time slot index n. Head related filters might also be changed dynamically if the user wants another virtual loudspeaker configuration to be experienced through the headphones 39.
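The step from the parameter band domain to the hybrid subband domain amounts to a table lookup, which can be sketched as below. The mapping shown is a made-up example with three parameter bands and five hybrid subbands; it is not Table A.30, and the function name `expand_to_hybrid` is hypothetical.

```python
def expand_to_hybrid(per_band, band_of_subband):
    """Return one matrix per hybrid subband k by looking up its parameter band."""
    return [per_band[j] for j in band_of_subband]


# Hypothetical mapping: entry k gives the parameter band j of hybrid subband k.
band_of_subband = [0, 0, 1, 2, 2]
# Toy 1x2 mixing matrices, one per parameter band.
m2_per_band = [[[1.0, 0.0]], [[0.5, 0.5]], [[0.0, 1.0]]]
m2_per_subband = expand_to_hybrid(m2_per_band, band_of_subband)
```

Several hybrid subbands share the matrix of one parameter band, so the mixing matrices need to be derived only once per parameter band before being multiplied by the subband-dependent head related filters.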
From the discussion above it is obvious that an efficient handling of digital filters and their coefficients can be advantageous in some cases, e.g. for applications where there are limited resources for processing and/or storing such as in mobile telephones.