In film theaters around the world, multi-channel surround audio systems have long placed film audiences in the center of the audio spaces of the film scenes being played before them, giving them a realistic and convincing feeling of “being there”. This audio technology has moved into the homes of ordinary people as home surround sound theatre systems and now provides them with the sense of “being there” in their own living rooms.
The next field where this technology will be used includes mobile wireless units or terminals, in particular small units such as cellular phones, mp3-players (including similar music players) and PDAs (Personal Digital Assistants). There the immersive nature of the surround sound is even more important because of the small screens. Moving this technology to the mobile terminal is, however, not a trivial matter. The main obstacles are that:
The available bit-rate is in many cases low, especially in wireless mobile channels.
The processing power of the mobile terminal is rather limited.
Small mobile terminals generally have only two micro speakers and ear-plugs or headphones.
This means, in particular for mobile terminals such as cellular phones, that a surround sound solution on a mobile terminal has to use a much lower bit-rate than for example the 384 kbits/sec that is used in the Dolby Digital 5.1 system. Due to the limited processing power, the decoders of the mobile terminals must be computationally optimized and due to the speaker configuration of the mobile terminal the surround sound must be delivered through the earplugs or headphones.
A standard way of delivering multi-channel surround sound through headphones or earplugs is to perform a 3D audio or binaural rendering of the multichannel surround sound.
In general, in 3D audio rendering a model of the audio scene is used and each incoming monophonic signal is filtered through a set of filters that model the transformations created by the human head, torso and ears. These filters are called head related filters (HRFs), having head related transfer functions (HRTFs), and if appropriately designed they give a good 3D audio scene perception.
The diagram of FIG. 1 illustrates a method of complete 3D audio rendering of a multichannel 5.1 audio signal. The six multi-channel signals are:
surround right (SR), right (R), center (C), low frequency effects (LFE), left (L) and surround left (SL).
In the example illustrated in FIG. 1 the center and low frequency signals are combined into one signal. Then, five different filters, denoted HIB, HCB, HC, HIF and HCF, are needed in order to implement this method of head related filtering. The SR signal is input to filters HIB and HCB, the R signal is input to filters HIF and HCF, the C and LFE signals are jointly input to filter HC, the L signal is input to filters HIF and HCF and the SL signal is input to filters HIB and HCB. The signals output from the filters HIB, HCB, HC, HIF and HCF are summed in a right summing element 1R to give a signal intended to be provided to the right headphone, not shown. The signals output from the filters HIB, HCB, HC, HIF and HCF are summed in a left summing element 1L to give a signal intended to be provided to the left headphone, not shown. In this case a symmetric head is assumed, and therefore the filters for the left ear and the right ear are assumed to be similar.
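As a rough illustration of the FIG. 1 structure, the following Python sketch sums the filtered channels into a left and a right headphone signal. It assumes that the filter subscripts denote ipsilateral/contralateral (I/C) and front/back (F/B) paths, that all filters are FIR filters of equal length, and that the symmetric-head assumption lets both ears share the same five filters. The function names, dictionary keys and tap values are hypothetical.

```python
def fir(signal, taps):
    """Direct-form FIR filtering (full linear convolution)."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(taps):
            out[i + j] += s * h
    return out


def mix(*signals):
    """Summing element: sample-wise sum of equally long signals."""
    return [sum(samples) for samples in zip(*signals)]


def render_5_1(ch, h):
    """ch: channels keyed L, R, C, LFE, SL, SR; h: filters keyed IF, CF, IB, CB, C."""
    centre = mix(ch["C"], ch["LFE"])  # C and LFE are combined before filtering
    # Same-side channels use the ipsilateral filters, opposite-side channels
    # the contralateral ones; the centre filter output feeds both ears.
    left = mix(fir(ch["L"], h["IF"]), fir(ch["R"], h["CF"]),
               fir(ch["SL"], h["IB"]), fir(ch["SR"], h["CB"]),
               fir(centre, h["C"]))
    right = mix(fir(ch["R"], h["IF"]), fir(ch["L"], h["CF"]),
                fir(ch["SR"], h["IB"]), fir(ch["SL"], h["CB"]),
                fir(centre, h["C"]))
    return left, right


# Single-tap toy filters (hypothetical gains) and a source present only in L.
filters = {"IF": [1.0], "CF": [0.5], "IB": [0.8], "CB": [0.4], "C": [0.7]}
silence = [0.0]
channels = {"L": [1.0], "R": silence, "C": silence,
            "LFE": silence, "SL": silence, "SR": silence}
left, right = render_5_1(channels, filters)
```

With these toy values the left ear receives the source through the ipsilateral front filter and the right ear through the contralateral one, so the source is louder on the left.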
The quality in terms of 3D perception of such rendering depends on how closely the HRFs model or represent the listener's own head related filtering when she/he is listening. Hence, it may be advantageous if the HRFs can be adapted and personalized for each listener if a good or very good quality is desired. This adaptation and personalization step may include modeling, measurement and in general a user dependent tuning in order to refine the quality of the perceived 3D audio scene.
Current state-of-the-art standardized multi-channel audio codecs require a large amount of bandwidth in order to reach an acceptable quality, and thus they prohibit the use of such codecs for services such as wireless mobile streaming.
For instance, even if the Dolby Digital 5.1 (AC-3 codec) has very low complexity when compared to the AAC (Advanced Audio Coding) multi-channel codec, it requires a much higher bit-rate for similar quality. Both codecs, the AAC multi-channel codec and the AC-3 codec, remain unusable in the wireless mobile domain to this day because of the high demands that they make on computational complexity and bit-rate.
New parametric multi-channel codecs based on the principles of binaural cue coding have been developed. The recently standardized MPEG parametric stereo tool is a good example of the low complexity/high quality parametric techniques for encoding stereo sound. The extension of parametric stereo to multi-channel coding is currently undergoing standardization in MPEG under the name Spatial Audio Coding, and is also known as MPEG Surround.
The principles behind the parametric multi-channel coding can be explained and understood from the block diagram of FIG. 2 that illustrates a general case.
The parametric surround encoder 3, also referred to as a multi-channel parametric surround encoder, receives a multi-channel audio signal comprising the individual signals $x_1(n)$ to $x_N(n)$, where N is the number of input channels. The encoder 3 then forms, in a down-mixing unit 5, a down-mixed signal comprising the individual down-mixed signals $z_1(n)$ to $z_M(n)$. The number of down-mixed channels M<N depends on the desired bit-rate, quality and the availability of an M-channel audio encoder 7. One key aspect of the encoding process is that the down-mixed signal, typically a stereo signal but possibly a mono signal, is derived from the multi-channel input signal, and it is this down-mixed signal, rather than the original multi-channel signal, that is compressed in the audio encoder 7 for transmission over the wireless channel 11. In addition, the parametric surround encoder comprises a spatial parameter estimation unit 9 that computes, from the input signals $x_1(n)$ to $x_N(n)$, the spatial cues or spatial parameters such as inter-channel level differences, time differences and coherence. The compressed audio signal output from the M-channel audio encoder (the main signal) is transmitted, together with the spatial parameters that constitute side information, to the receiving side, which in the case considered here typically is a mobile terminal.
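The down-mixing step can be pictured as an M x N gain matrix applied to the N input channels. The Python sketch below shows a five-to-two down-mix; the channel order, the omission of the LFE channel and the -3 dB gains are illustrative assumptions only and are not taken from any particular codec.

```python
import math


def downmix(channels, matrix):
    """Apply an M x N down-mix matrix to N equally long channel signals."""
    length = len(channels[0])
    return [
        [sum(g * ch[n] for g, ch in zip(row, channels)) for n in range(length)]
        for row in matrix
    ]


# Assumed channel order: L, R, C, SL, SR (LFE left out for brevity).
g = 1.0 / math.sqrt(2.0)  # -3 dB gain, an illustrative choice
STEREO_DOWNMIX = [
    [1.0, 0.0, g, g, 0.0],  # z1: left down-mix channel
    [0.0, 1.0, g, 0.0, g],  # z2: right down-mix channel
]

x = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
z = downmix(x, STEREO_DOWNMIX)  # N = 5 input channels, M = 2 output channels
```

Only the M rows of `z` would be passed on to the core audio encoder; the spatial parameters are computed separately from the original N channels.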
On the receiving side, a parametric surround decoder 13 includes an M-channel audio decoder 15. The audio decoder 15 produces signals $\hat{z}_1(n)$ to $\hat{z}_M(n)$ that are the decoded versions of $z_1(n)$ to $z_M(n)$. These are, together with the spatial parameters, input to a spatial synthesis unit 17 that produces output signals $\hat{x}_1(n)$ to $\hat{x}_N(n)$. Because the decoding process is parametric in nature, the decoded signals $\hat{x}_1(n)$ to $\hat{x}_N(n)$ are not necessarily objectively close to the original multichannel signals $x_1(n)$ to $x_N(n)$ but are subjectively a faithful reproduction of the multichannel audio scene.
Depending on the bandwidth of the transmitting channel over the interface 11, which generally is relatively low, there will be a loss of information, and hence the signals $\hat{z}_1(n)$ to $\hat{z}_M(n)$ and $\hat{x}_1(n)$ to $\hat{x}_N(n)$ on the receiving side cannot be the same as their counterparts on the transmitting side. Even though they are not quite true equivalents of their counterparts, they may be sufficiently good equivalents.
In general, such a surround encoding process is independent of the compression algorithm used in the audio encoder 7 (core encoder) and the audio decoder 15 (core decoder) in FIG. 2. The core encoding process can use any of a number of high performance compression algorithms such as AMR-WB+ (extended adaptive multirate wide band), MPEG-1 Layer III (Moving Picture Experts Group), MPEG-4 AAC or MPEG-4 High Efficiency AAC, and it could even use PCM (Pulse Code Modulation).
In general, the above operations are performed in a transformed signal domain, such as the Fourier transform domain, or more generally on some time-frequency decomposition. This is especially beneficial if the spatial parameter estimation and synthesis in the units 9 and 17 use the same type of transform as that used in the audio encoder 7.
FIG. 3 is a detailed block diagram of an efficient parametric audio encoder. The N-channel discrete-time input signal, denoted in vector form as $x^N(n)$, is first transformed to the frequency domain in a transform unit 21 that gives a signal $x^N(k, m)$. The index k is the index of the transform coefficients, or frequency sub-bands. The index m is the decimated time-domain index, which is related to the input signal possibly through overlapped frames.
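One possible reading of the indices k and m is a framed discrete Fourier transform, sketched below in Python for a single channel. The Hann window, the frame length and the hop size are assumptions chosen for illustration; the transform actually used in unit 21 is not specified here, and the function name is hypothetical.

```python
import cmath
import math


def framed_dft(x, frame_len, hop):
    """Return X[m][k]: DFT coefficient k of the m-th (possibly overlapping) frame."""
    # Periodic Hann window (an assumed choice for this sketch).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * cmath.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len)
        ])
    return frames


# A cosine that sits exactly on bin k = 2 of an 8-point transform.
signal = [math.cos(2.0 * math.pi * 2 * n / 8) for n in range(16)]
X = framed_dft(signal, frame_len=8, hop=4)  # 50 % overlap gives 3 frames here
peak_bin = max(range(5), key=lambda k: abs(X[0][k]))
```

Here the frame index m advances once per hop of four input samples, which is what makes it a decimated time index.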
The signal is thereafter down-mixed in a down-mixing unit 5 to generate the M-channel down-mix signal $z^M(k, m)$, where M<N. A sequence of spatial model parameter vectors $p^N(k, m)$ is estimated in an estimation unit 9. This can be done in either an open-loop or a closed-loop fashion.
The spatial parameters consist of psycho-acoustical cues that are representative of the surround sound sensation. For instance, these parameters consist of inter-channel level differences (ILD), time differences (ITD) and coherence (IC), which capture the spatial image of a multi-channel audio signal relative to a transmitted down-mixed signal $z^M(k, m)$ (or, if in closed loop, the decoded signal $\tilde{z}^M(k, m)$). The cues $p^N(k, m)$ can be encoded in a very compact form, for example in a spatial parameter quantization unit 23 producing the signal $\tilde{p}^N(k, m)$, followed by a spatial parameter encoder 25. The M-channel audio encoder 7 produces the main bit stream, which in a multiplexer 27 is multiplexed with the spatial side information produced by the parameter encoder. From the multiplexer the multiplexed signal is transmitted to a demultiplexer 29 on the receiving side, in which the side information and the main bit stream are recovered, as seen in the block diagram of FIG. 4.
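To make the cues more concrete, the Python sketch below estimates a level difference and a coherence value for one pair of channels in one sub-band, followed by a uniform scalar quantizer. The exact definitions, the pairing of channels and the 1.5 dB step size are assumptions for illustration; time differences are left out, and both channels are assumed to have non-zero energy.

```python
import math


def spatial_cues(x1, x2):
    """Level difference (dB) and normalised coherence of two sub-band sequences."""
    e1 = sum(abs(a) ** 2 for a in x1)  # energy of channel 1 in this band
    e2 = sum(abs(b) ** 2 for b in x2)  # energy of channel 2 in this band
    cross = sum(a * b.conjugate() for a, b in zip(x1, x2))
    ild_db = 10.0 * math.log10(e1 / e2)
    coherence = abs(cross) / math.sqrt(e1 * e2)
    return ild_db, coherence


def quantize(value, step):
    """Uniform scalar quantiser: integer index and reconstructed value."""
    index = round(value / step)
    return index, index * step


# Channel 2 is channel 1 at half amplitude: fully coherent, about 6 dB quieter.
band_1 = [2 + 0j, 0 + 2j]
band_2 = [1 + 0j, 0 + 1j]
ild_db, coherence = spatial_cues(band_1, band_2)
ild_index, ild_rec = quantize(ild_db, step=1.5)  # hypothetical 1.5 dB step
```

Only the integer index would need to be entropy coded and sent as side information, which is why the cues occupy far fewer bits than the waveforms they describe.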
On the receiving side the main bit stream is decoded to synthesize a high quality multichannel representation using the received spatial parameters. The main bit stream is first decoded in an M-channel audio decoder 31, from which the decoded signals $\hat{z}^M(k, m)$ are input to the spatial synthesis unit 17. The spatial side information holding the spatial parameters is extracted by the demultiplexer 29 and provided to a spatial parameter decoder 33 that produces the decoded parameters $\tilde{p}^N(k, m)$ and transmits them to the synthesis unit 17. The spatial synthesis unit produces the signal $\tilde{x}^N(k, m)$, which is provided to the frequency-to-time transform unit 35 to produce the signal $\hat{x}^N(n)$, i.e. the multichannel decoded signal.
A personalized 3D audio rendering of a multi-channel surround sound can be delivered to a mobile terminal user by using an efficient parametric surround decoder to first obtain the multiple surround sound channels, using for instance the multi-channel decoder described above with reference to FIG. 4. Thereupon, the system illustrated in FIG. 1 is used to synthesize a binaural 3D-audio rendered multichannel signal. This operation is shown in the schematic of FIG. 5.
Work has also been done in which spatial or 3D audio filtering has been performed in the subband domain. In C. A. Lanciani, and R. W. Schafer, “Application of Head-related Transfer Functions to MPEG Audio Signals”, Proc. 31st Symposium on System Theory, Mar. 21-23, 1999, Auburn, Ala., U.S.A., it is disclosed how an MPEG coded mono signal could be spatialized by performing the head related (HR) filtering operation in the subband domain. In A. B. Touimi, M. Emerit and J. M. Pernaux, “Efficient Method for Multiple Compressed Audio Streams Spatialization,” Proc. 3rd International Conference on Mobile and Ubiquitous Multimedia, pp. 229-235, Oct. 27-29, 2004, College Park, Md., U.S.A., it is disclosed how a number of individually MPEG coded mono signals can be spatialized by doing the HR filtering operations in the subband domain. The solution is based on a special implementation of the HR filters, in which all HR filters are modeled as a linear combination of a few predefined basis filters.
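The benefit of modeling every HR filter as a linear combination of a few basis filters follows from the linearity of filtering: the sources can be mixed per basis filter first, so the number of convolutions equals the number of basis filters rather than the number of sources. The Python sketch below checks this equivalence in the time domain with made-up weights and filters; it is an illustration of the principle only and does not reproduce the subband-domain method of the cited paper. All names are hypothetical.

```python
def convolve(signal, taps):
    """Full linear convolution."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(taps):
            out[i + j] += s * h
    return out


def weighted_sum(vectors, weights):
    """Sample-wise sum of equally long vectors, each scaled by its weight."""
    return [sum(w * v[n] for w, v in zip(weights, vectors))
            for n in range(len(vectors[0]))]


def spatialize_direct(sources, weights, basis):
    """One convolution per source with its full filter h_i = sum_j w[i][j] * basis[j]."""
    out = [0.0] * (len(sources[0]) + len(basis[0]) - 1)
    for src, w in zip(sources, weights):
        h = weighted_sum(basis, w)
        out = [a + b for a, b in zip(out, convolve(src, h))]
    return out


def spatialize_via_basis(sources, weights, basis):
    """Mix the sources per basis filter first: one convolution per basis filter."""
    out = [0.0] * (len(sources[0]) + len(basis[0]) - 1)
    for j, b in enumerate(basis):
        mixed = weighted_sum(sources, [w[j] for w in weights])
        out = [a + c for a, c in zip(out, convolve(mixed, b))]
    return out


sources = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0], [3.0, 0.0, 1.0]]  # three mono signals
basis = [[1.0, 0.5], [0.0, -1.0]]                               # two basis filters
weights = [[1.0, 0.0], [0.5, 0.5], [0.25, 2.0]]                 # one weight row per source
direct = spatialize_direct(sources, weights, basis)
cheap = spatialize_via_basis(sources, weights, basis)
```

Both functions produce the same output for one ear, but the second needs two convolutions instead of three; the gap widens as the number of sources grows.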
Applications of 3D audio rendering are numerous and include gaming, mobile TV shows using standards such as 3GPP MBMS or DVB-H, listening to music concerts, watching movies and, in general, multimedia services that contain a multi-channel audio component.
The methods described above of rendering multi-channel surround sound, although attractive since they allow a whole new set of services to be provided to wireless mobile units, have many drawbacks:
First of all, the computational demands of such rendering are prohibitive since both decoding and 3D rendering have to be performed in parallel and in real time. The complexity of a parametric multi-channel decoder, even if low when compared to a full waveform multi-channel decoder, is still quite high and at least higher than that of a simple stereo decoder. The synthesis stage of spatial decoding has a complexity that is at least proportional to the number of encoded channels. Additionally, the number of filtering operations in 3D rendering is also proportional to the number of channels.
The second disadvantage consists of the temporary memory that is needed in order to store the intermediate decoded channels. They are in fact buffered since they are needed in the second stage of 3D rendering.
Finally, one of the main disadvantages is that the quality of such 3D audio rendering can be very limited due to the fact that inter-channel correlations may be canceled. The inter-channel correlations are essential due to the way parametric multi-channel coding synthesizes the signals.
In MPEG surround, for instance, the correlations (ICC) and channel level differences (CLD) are estimated only between pairs of channels. The ICC- and the CLD-parameters are encoded and transmitted to the decoder. In the decoder, the received parameters are used in a synthesis tree as depicted in FIG. 7 for one 5-1-5 configuration (in this case the 5-1-5₁ configuration). FIG. 6 illustrates a surround system configuration having a 5-1-5₁ parameterization. From FIG. 6 it can be seen that CLD and ICC parameters in the 5-1-5₁ configuration are estimated only between pairs of channels.
Because the correlations (ICC) and channel level differences (CLD) are estimated only between pairs of channels, not all individual correlations are available. This in turn prohibits individual channel manipulation and re-use, such as 3D rendering. For instance, if two un-coded channels, for example RF and RS, are uncorrelated and are encoded using the 5-1-5₁ configuration, then no control over their correlation is available, since that correlation is not transmitted to the decoder as such; only the correlation on the second level of the tree is provided. At the decoder side, this in turn would lead to two correlated decoded channels. The decoder has neither access to nor control over the correlation between certain individual channels, namely those that belong to different third-level boxes. In the example of FIG. 6, these are all pairs of channels that belong to different loudspeaker groupings. This can also be seen in FIG. 7, where the pairs of channels are the ones that belong to different third-level tree boxes (OTT3, OTT4 and OTT2) in the 5-1-5₁ configuration. This may not be a problem when listening in a loudspeaker environment; however, it becomes a problem if the channels are combined, as in 3D rendering, leading to possible unwanted channel cancellation or over-amplification.
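The cancellation and over-amplification effect can be seen from the power of a sum of two channels, which equals the sum of the individual powers plus twice their cross term. The Python sketch below uses short made-up sequences to compare the mix of two uncorrelated channels with the mix obtained when the second channel is decoded as fully correlated with, or inverted relative to, the first. The signals and function names are hypothetical and the example ignores the HR filtering itself.

```python
import math


def power(x):
    """Mean power of a signal."""
    return sum(v * v for v in x) / len(x)


def correlation(a, b):
    """Normalised correlation coefficient at lag zero."""
    num = sum(p * q for p, q in zip(a, b))
    return num / math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))


def mixed_power(a, b):
    """Power of the sample-wise sum of two channels."""
    return power([p + q for p, q in zip(a, b)])


# Two unit-power channels that are uncorrelated before encoding.
front = [1.0, 1.0, -1.0, -1.0]
surround = [1.0, -1.0, 1.0, -1.0]

original = mixed_power(front, surround)                    # powers simply add
in_phase = mixed_power(front, front)                       # decoded as fully correlated
out_of_phase = mixed_power(front, [-v for v in front])     # decoded as inverted copy
```

With these sequences the intended mix has power 2, whereas the correlated decode doubles it to 4 and the inverted decode cancels it to 0, even though each decoded channel on its own still has unit power.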