This type of coding/decoding is based on the extraction of spatial information parameters so that, upon decoding, these spatial characteristics may be reproduced for the listener, in order to recreate the same spatial image as in the original signal.
Such a technique for parametric coding/decoding is for example described in the document by J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, entitled “Parametric Coding of Stereo Audio” in EURASIP Journal on Applied Signal Processing 2005:9, 1305-1322. This example is reconsidered with reference to FIGS. 1 and 2 respectively describing a parametric stereo coder and decoder.
Thus, FIG. 1 describes a coder receiving two audio channels, a left channel (denoted L for Left in English) and a right channel (denoted R for Right in English).
The time-domain channels L(n) and R(n), where n is the integer index of the samples, are processed by the blocks 101, 102, 103 and 104, respectively, which perform a fast Fourier analysis. The transformed signals L[j] and R[j], where j is the integer index of the frequency coefficients, are thus obtained.
The block 105 performs a channel reduction processing, or “downmix” in English, so as to obtain in the frequency domain, starting from the left and right signals, a monophonic signal hereinafter referred to as ‘mono signal’ which here is a sum signal.
An extraction of spatial information parameters is also carried out in the block 105. The extracted parameters are as follows.
The parameters ICLD (for “Inter-Channel Level Difference” in English), also referred to as ‘inter-channel intensity differences’, characterize the energy ratios by frequency sub-band between the left and right channels. These parameters allow sound sources to be positioned in the stereo horizontal plane by “panning”. They are defined in dB by the following formula:
$$\mathrm{ICLD}[k] = 10 \log_{10}\left( \frac{\sum_{j=B[k]}^{B[k+1]-1} L[j] \cdot L^{*}[j]}{\sum_{j=B[k]}^{B[k+1]-1} R[j] \cdot R^{*}[j]} \right) \ \mathrm{dB} \qquad (1)$$
where L[j] and R[j] correspond to the spectral (complex) coefficients of the L and R channels, the values B[k] and B[k+1], for each frequency band of index k, define the division into sub-bands of the discrete spectrum, and the symbol * indicates the complex conjugate.
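As an illustration of equation (1), the following minimal sketch computes the ICLD of one sub-band. The function name `icld_db`, the toy spectra and the band boundaries are hypothetical choices made for this example, and the sketch assumes the right-channel energy in the band is non-zero.

```python
import math

def icld_db(L, R, B, k):
    """ICLD of sub-band k in dB, following equation (1).

    L, R: sequences of complex spectral coefficients.
    B:    sub-band boundaries; band k covers j = B[k] .. B[k+1]-1.
    Assumes the right-channel energy in the band is non-zero.
    """
    band = range(B[k], B[k + 1])
    energy_left = sum(abs(L[j]) ** 2 for j in band)   # sum of L[j].L*[j]
    energy_right = sum(abs(R[j]) ** 2 for j in band)  # sum of R[j].R*[j]
    return 10.0 * math.log10(energy_left / energy_right)

# Hypothetical toy spectra: the left channel has twice the amplitude of the right.
R = [1 + 1j, 0.5 - 2j, -1 + 0j, 0.25j]
L = [2 * r for r in R]
B = [0, 2, 4]  # two sub-bands of two coefficients each (illustrative only)
icld_values = [icld_db(L, R, B, k) for k in range(len(B) - 1)]
```

With an amplitude ratio of 2 in every coefficient, the energy ratio is 4 in each sub-band, so each ICLD equals 10·log10(4), that is about 6 dB.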
The parameters ICPD (for “Inter-Channel Phase Difference” in English), also referred to as ‘phase differences’, are defined according to the following equation:
$$\mathrm{ICPD}[k] = \angle\left( \sum_{j=B[k]}^{B[k+1]-1} L[j] \cdot R^{*}[j] \right) \qquad (2)$$
where ∠ indicates the argument (the phase) of the complex operand.
In an equivalent manner to the ICPD, an ICTD (for “Inter-Channel Time Difference” in English) may also be defined whose definition, known to those skilled in the art, is not recalled here.
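The ICPD of equation (2) can be sketched in the same way. The function name `icpd` and the toy data are assumptions made for this example; the phase is returned in radians in the interval [−π, π], as given by `cmath.phase`.

```python
import cmath

def icpd(L, R, B, k):
    """ICPD of sub-band k in radians, following equation (2)."""
    band = range(B[k], B[k + 1])
    # Sum of L[j] times the complex conjugate of R[j] over the sub-band.
    cross_spectrum = sum(L[j] * R[j].conjugate() for j in band)
    return cmath.phase(cross_spectrum)

# Hypothetical toy spectra: L is R rotated by 0.5 rad in every coefficient.
R = [1 + 1j, 0.5 - 2j, -1 + 0j, 0.25j]
L = [cmath.exp(0.5j) * r for r in R]
B = [0, 2, 4]  # illustrative sub-band boundaries
icpd_values = [icpd(L, R, B, k) for k in range(len(B) - 1)]
```

Because every product L[j]·R*[j] then has the same argument, each sub-band yields an ICPD of 0.5 rad.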
In contrast to the parameters ICLD, ICPD and ICTD, which are localization parameters, the parameters ICC (for “Inter-Channel Coherence” in English) represent the inter-channel correlation (or coherence) and are associated with the spatial width of the sound sources. Their definition is not recalled here, but it is noted in the article by Breebaart et al. that the ICC parameters are not needed in the sub-bands reduced to a single frequency coefficient, the reason being that the amplitude and phase differences completely describe the spatialization in this “degenerate” case.
These ICLD, ICPD and ICC parameters are extracted by the block 105, by analyzing the stereo signals. If the ICTD parameters were also coded, these could also be extracted by sub-band from the spectra L[j] and R[j]; however, the extraction of the ICTD parameters is generally simplified by assuming an identical inter-channel time difference for each sub-band and, in this case, these parameters may be extracted from the time-domain channels L(n) and R(n) by means of cross-correlations.
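The ICTD definition is not recalled in this text, so the following is only a generic sketch of a full-band time-difference estimate by cross-correlation, not the method of the cited article. The function name `estimate_delay`, the sign convention (a positive result means the right channel lags the left) and the test signal are assumptions.

```python
def estimate_delay(left, right, max_lag):
    """Return the lag d (in samples) that maximizes sum_n left[n] * right[n + d]."""
    def cross_correlation(lag):
        return sum(
            left[n] * right[n + lag]
            for n in range(len(left))
            if n + lag in range(len(right))  # skip samples outside the signal
        )
    return max(range(-max_lag, max_lag + 1), key=cross_correlation)

# Hypothetical test signal: the right channel is the left channel delayed by 3 samples.
left = [0.0] * 32
left[10], left[11] = 1.0, 0.5
right = [0.0] * 3 + left[:-3]
delay = estimate_delay(left, right, max_lag=8)
```

Here the cross-correlation peaks at a lag of 3 samples, which is the delay applied to the right channel.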
The mono signal M[j] is transformed back into the time domain (blocks 106 to 108) by inverse fast Fourier processing (inverse FFT, windowing and overlap-add, known as OLA), and a mono coding (block 109) is subsequently carried out. In parallel, the stereo parameters are quantized and coded in the block 110.
Generally speaking, the spectrum of the signals (L[j], R[j]) is divided according to a non-linear frequency scale of the ERB (Equivalent Rectangular Bandwidth) or Bark type, with a number of sub-bands typically ranging from 20 to 34 for a signal sampled at 16 to 48 kHz. This scale defines the values of B[k] and B[k+1] for each sub-band k. The parameters (ICLD, ICPD, ICC) are coded by scalar quantization, potentially followed by entropy coding and/or differential coding. For example, in the article previously cited, the ICLD is coded by a non-uniform quantizer (going from −50 to +50 dB) with differential entropy coding. The non-uniform quantization step exploits the fact that the higher the value of the ICLD, the lower the auditory sensitivity to variations in this parameter.
For the coding of the mono signal (block 109), several techniques for quantization with or without memory are possible, for example Pulse Code Modulation (PCM) coding, its adaptive differential version known as Adaptive Differential Pulse Code Modulation (ADPCM), or more sophisticated techniques such as perceptual transform coding or Code Excited Linear Prediction (CELP) coding.
This document is more particularly focused on ITU-T Recommendation G.722, which uses ADPCM coding with embedded codes in sub-bands.
The input signal of a G.722-type coder, in wideband, has a minimum bandwidth of [50-7000 Hz] with a sampling frequency of 16 kHz. This signal is decomposed into two sub-bands, [0-4000 Hz] and [4000-8000 Hz], by quadrature mirror filters (QMF); each of the sub-bands is then coded separately by an ADPCM coder.
The low band is coded by an embedded-codes ADPCM coding over 6, 5 and 4 bits, whereas the high band is coded by an ADPCM coder with 2 bits per sample. The total data rate is 64, 56 or 48 kbit/s depending on the number of bits used for the decoding of the low band.
The recommendation G.722, dating from 1988, was first used in the ISDN (Integrated Services Digital Network) for audio and videoconference applications. For several years, this coder has been used for improved-quality HD (High Definition) voice telephony, or “HD voice”, over fixed IP networks.
A quantized signal frame according to the G.722 standard is composed of quantization indices coded over 6, 5 or 4 bits per sample in low band (0-4000 Hz) and 2 bits per sample in high band (4000-8000 Hz). Since the frequency of transmission of the scalar indices is 8 kHz in each sub-band, the data rate is 64, 56 or 48 kbit/s.
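The data rates quoted above follow directly from the bit allocation and the 8 kHz index rate per sub-band. The short sketch below reproduces the arithmetic; the constant and function names are chosen for this example only.

```python
SUBBAND_INDEX_RATE_HZ = 8000  # one quantization index per sub-band every 1/8000 s
HIGH_BAND_BITS = 2            # bits per sample in the high band

def g722_bit_rate(low_band_bits):
    """Total bit rate in bit/s for a given low-band allocation (6, 5 or 4 bits)."""
    return (low_band_bits + HIGH_BAND_BITS) * SUBBAND_INDEX_RATE_HZ

rates = [g722_bit_rate(bits) for bits in (6, 5, 4)]
```

For 6, 5 and 4 low-band bits, this gives 64000, 56000 and 48000 bit/s respectively.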
In the decoder 200, with reference to FIG. 2, the mono signal is decoded (block 201), and a de-correlator is used (block 202) to produce two versions $\hat{M}(n)$ and $\hat{M}'(n)$ of the decoded mono signal. This decorrelation allows the spatial width of the mono source $\hat{M}(n)$ to be increased and thus prevents it from being a point-like source. These two signals $\hat{M}(n)$ and $\hat{M}'(n)$ are passed into the frequency domain (blocks 203 to 206) and the decoded stereo parameters (block 207) are used by the stereo synthesis (or shaping) (block 208) to reconstruct the left and right channels in the frequency domain. These channels are finally reconstructed in the time domain (blocks 209 to 214).
Thus, as mentioned for the coder, the block 105 performs a downmix, by combining the stereo channels (left, right) so as to obtain a mono signal which is subsequently coded by a mono coder. The spatial parameters (ICLD, ICPD, ICC, etc.) are extracted from the stereo channels and transmitted in addition to the bitstream coming from the mono coder.
Several techniques have been developed for the downmix. This downmix may be carried out in the time or frequency domain. Two types of downmix are generally differentiated:
Passive downmix, which corresponds to a direct matrixing of the stereo channels in order to combine them into a single signal;
Active (or adaptive) downmix, which includes a control of the energy and/or of the phase in addition to the combination of the two stereo channels.
The simplest example of passive downmix is given by the following time matrixing:
$$M(n) = \frac{1}{2}\left( L(n) + R(n) \right) = \begin{bmatrix} 1/2 & 1/2 \end{bmatrix} \cdot \begin{bmatrix} L(n) \\ R(n) \end{bmatrix} \qquad (3)$$
This type of downmix has, however, the drawback of poorly conserving the energy of the signals after the stereo to mono conversion when the L and R channels are not in phase: in the extreme case where L(n)=−R(n), the mono signal is zero, which is undesirable.
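The energy loss of the passive downmix of equation (3) can be checked numerically. The sketch below uses a hypothetical sinusoidal test signal; the helper names `passive_downmix` and `energy` are chosen for this example.

```python
import math

def passive_downmix(left, right):
    """Time-domain passive downmix of equation (3): M(n) = (L(n) + R(n)) / 2."""
    return [0.5 * (l + r) for l, r in zip(left, right)]

def energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)

# Hypothetical test signal: 64 samples of a sinusoid with a 16-sample period.
left = [math.sin(2 * math.pi * n / 16) for n in range(64)]

in_phase = passive_downmix(left, left)               # R(n) = L(n)
opposed = passive_downmix(left, [-v for v in left])  # R(n) = -L(n)
```

When the channels are identical the mono signal keeps the energy of one channel, whereas in phase opposition every sample of the mono signal is exactly zero.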
A mechanism for active downmix improving the situation is given by the following equation:
$$M(n) = \gamma(n) \, \frac{L(n) + R(n)}{2} \qquad (4)$$
where γ(n) is a factor which compensates for any potential loss of energy.
However, combining the signals L(n) and R(n) in the time domain does not allow a precise control (with sufficient frequency resolution) of any potential phase differences between the L and R channels; when the L and R channels have comparable amplitudes and nearly opposite phases, “fade-out” or “attenuation” phenomena (loss of “energy”) may be observed on the mono signal in certain frequency sub-bands with respect to the stereo channels.
This is why it is often more advantageous in terms of quality to carry out the downmix in the frequency domain, even if this involves calculating time/frequency transforms and leads to a delay and an additional complexity with respect to a time domain downmix.
The preceding active downmix can thus be transposed with the spectra of the left and right channels, in the following manner:
$$M[k] = \gamma[k] \, \frac{L[k] + R[k]}{2} \qquad (5)$$
where k corresponds to the index of a frequency coefficient (Fourier coefficient for example representing a frequency sub-band). The compensation parameter may be set as follows:
$$\gamma[k] = \min\left( 2, \sqrt{\frac{|L[k]|^{2} + |R[k]|^{2}}{|L[k] + R[k]|^{2} / 2}} \right) \qquad (6)$$
It is thus ensured that the overall energy of the downmix is the sum of the energies of the left and right channels. Here, the factor γ[k] is saturated at an amplification of 6 dB.
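The following sketch illustrates the active downmix of equation (5) on a single frequency coefficient. It assumes that the compensation gain is the square root of the energy ratio, capped at 2 (6 dB), as the saturation described above implies; the function name and the handling of an exactly cancelled coefficient are choices made for this example.

```python
import math

def active_downmix_bin(Lk, Rk, max_gain=2.0):
    """Active downmix of one frequency coefficient with a capped gain.

    Assumption: gamma = sqrt((|L|^2 + |R|^2) / (|L + R|^2 / 2)), limited to max_gain.
    """
    sum_energy = abs(Lk + Rk) ** 2
    if sum_energy == 0.0:
        # L[k] = -R[k]: the sum is zero and no finite gain can restore it.
        return 0j
    gamma = min(max_gain, math.sqrt((abs(Lk) ** 2 + abs(Rk) ** 2) / (sum_energy / 2)))
    return gamma * (Lk + Rk) / 2

# Channels in quadrature: the gain is below the cap and compensates the loss.
quadrature = active_downmix_bin(1 + 0j, 1j)

# Channels nearly in opposition: the gain saturates and the coefficient stays attenuated.
near_opposition = active_downmix_bin(1 + 0j, -0.9 + 0j)
```

Under this assumption, as long as the gain is not saturated, the energy |M[k]|² equals the mean of the two channel energies, that is their sum up to the factor 1/2 carried by equation (5). In the near-opposition case the gain is limited to 2 and the mono coefficient has a magnitude of only about 0.1, although each channel has a magnitude close to 1.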
The stereo to mono downmix technique in the document by Breebaart et al. cited previously is carried out in the frequency domain. The mono signal M[k] is obtained by a linear combination of the L and R channels according to the equation:
$$M[k] = w_1 L[k] + w_2 R[k] \qquad (7)$$
where w1, w2 are gains with complex values. If w1=w2=0.5, the mono signal is considered as an average of the two L and R channels. The gains w1, w2 are generally adapted as a function of the short-term signal, in particular for aligning the phases.
One particular case of this frequency-domain downmix technique is provided in the document entitled “A stereo to mono downmixing scheme for MPEG-4 parametric stereo encoder” by Samsudin, E. Kurniawati, N. Boon Poh, F. Sattar, S. George, in IEEE Trans., ICASSP 2006. In this document, the L and R channels are aligned in phase prior to carrying out the channel reduction processing.
More precisely, the phase of the L channel for each frequency sub-band is chosen as the reference phase, and the R channel is aligned according to the phase of the L channel for each sub-band by the following formula:
$$R'[k] = e^{i \cdot \mathrm{ICPD}[b]} \cdot R[k] \qquad (8)$$
where $i = \sqrt{-1}$, R′[k] is the aligned R channel, k is the index of a coefficient in the b-th frequency sub-band, and ICPD[b] is the inter-channel phase difference in the b-th frequency sub-band, given by:
$$\mathrm{ICPD}[b] = \angle\left( \sum_{k=k_b}^{k_{b+1}-1} L[k] \cdot R^{*}[k] \right) \qquad (9)$$
where $k_b$ defines the frequency intervals of the corresponding sub-band and * is the complex conjugate. It is to be noted that when the sub-band with index b is reduced to a single frequency coefficient, the following is found:
$$R'[k] = |R[k]| \cdot e^{i \angle L[k]} \qquad (10)$$
Finally, the mono signal obtained by the downmixing in the document by Samsudin et al. cited previously is calculated by averaging the L channel and the aligned R channel, according to the following equation:
$$M[k] = \frac{L[k] + R'[k]}{2} \qquad (11)$$
The alignment in phase therefore allows the energy to be conserved and the problems of attenuation to be avoided by eliminating the influence of the phase. This downmixing corresponds to the downmixing described in the document by Breebaart et al. where:
$$M[k] = w_1 L[k] + w_2 R[k] \quad \text{with} \quad w_1 = \frac{1}{2} \quad \text{and} \quad w_2 = \frac{e^{i \cdot \mathrm{ICPD}[b]}}{2} \qquad (12)$$
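The phase-aligned downmix of equations (8), (9) and (11) can be sketched for one sub-band as follows. The function name `aligned_downmix`, the boundary list `kb` and the toy spectra are assumptions made for this example, which is not taken from the document by Samsudin et al.

```python
import cmath

def aligned_downmix(L, R, kb, b):
    """Phase-aligned downmix of sub-band b, with L as the phase reference.

    kb: sub-band boundaries; sub-band b covers k = kb[b] .. kb[b+1]-1.
    """
    band = range(kb[b], kb[b + 1])
    # Equation (9): inter-channel phase difference of the sub-band.
    icpd_b = cmath.phase(sum(L[k] * R[k].conjugate() for k in band))
    # Equation (8): rotate R onto the phase of L.
    rotation = cmath.exp(1j * icpd_b)
    # Equation (11): average of L and the aligned R.
    return [(L[k] + rotation * R[k]) / 2 for k in band]

# Hypothetical worst case for a passive downmix: R = -L in the whole sub-band.
L = [1 + 1j, 2j]
R = [-value for value in L]
kb = [0, 2]  # a single sub-band of two coefficients (illustrative only)
M = aligned_downmix(L, R, kb, 0)
```

In this example the ICPD is π, so the aligned right channel coincides with the left channel and the mono signal equals L, whereas a plain average of L and R would be zero.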
An ideal conversion of a stereo signal to a mono signal must avoid the problems of attenuation for all the frequency components of the signal.
This downmixing operation is important for parametric stereo coding because the decoded stereo signal is only a spatial shaping of the decoded mono signal.
The technique of downmixing in the frequency domain described previously does indeed conserve the energy level of the stereo signal in the mono signal by aligning the R channel and the L channel prior to performing the processing. This phase alignment allows the situations where the channels are in phase opposition to be avoided.
The method of Samsudin et al. nevertheless makes the downmix processing entirely dependent on the channel (L or R) chosen for setting the phase reference.
In the extreme cases, if the reference channel is zero (“dead” silence) and if the other channel is non-zero, the phase of the mono signal after downmixing becomes constant, and the resulting mono signal will, in general, be of poor quality; similarly, if the reference channel is a random signal (ambient noise, etc.), the phase of the mono signal may become random or be poorly conditioned with, here again, a mono signal that will generally be of poor quality.
An alternative technique for frequency downmixing has been proposed in the document entitled “Parametric stereo extension of ITU-T G.722 based on a new downmixing scheme” by T. M. N. Hoang, S. Ragot, B. Kövesi, P. Scalart, Proc. IEEE MMSP, 4-6 Oct. 2010. This document provides a downmixing technique which overcomes drawbacks of the downmixing technique provided by Samsudin et al. According to this document, the mono signal M[k] is calculated from the stereo channels L[k] and R[k] by the following formula:
$$M[k] = |M[k]| \cdot e^{i \angle M[k]}$$
where the amplitude |M[k]| and the phase ∠M[k] for each sub-band are defined by:
$$\begin{cases} |M[k]| = \dfrac{|L[k]| + |R[k]|}{2} \\ \angle M[k] = \angle\left( L[k] + R[k] \right) \end{cases}$$
The amplitude of M[k] is the average of the amplitudes of the L and R channels. The phase of M[k] is given by the phase of the signal summing the two stereo channels (L+R).
The method of Hoang et al. preserves the energy of the mono signal, like the method of Samsudin et al., and it avoids the problem of total dependency on one of the stereo channels (L or R) for the calculation of the phase ∠M[k]. However, it has a disadvantage when the L and R channels are nearly in phase opposition in certain sub-bands (the extreme case being L=−R): the sum L[k]+R[k] is then close to zero, so its phase is poorly defined. Under these conditions, the resulting mono signal will be of poor quality.
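The sensitivity of this downmix near phase opposition can be illustrated on a single frequency coefficient. The function name `hoang_downmix_bin` and the perturbation values are assumptions made for this example.

```python
import cmath

def hoang_downmix_bin(Lk, Rk):
    """Mono coefficient with averaged amplitude and the phase of the sum L + R."""
    magnitude = (abs(Lk) + abs(Rk)) / 2
    phase = cmath.phase(Lk + Rk)  # returns 0.0 when L + R is exactly zero
    return cmath.rect(magnitude, phase)

# Two hypothetical right channels, both almost equal to -L and differing only
# by a perturbation of 1e-6 in the imaginary part.
Lk = 1 + 0j
M_plus = hoang_downmix_bin(Lk, -1 + 1e-6j)
M_minus = hoang_downmix_bin(Lk, -1 - 1e-6j)
```

The two results are close to +i and −i respectively: the amplitude is preserved (about 1 in both cases), but a perturbation of one part in a million on the right channel flips the phase of the mono coefficient by π. When L+R is exactly zero, `cmath.phase` returns 0, which is an arbitrary value rather than a meaningful phase.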
There thus exists a need for a method of coding/decoding which allows channels to be combined while managing the stereo signals in phase opposition or whose phase is poorly conditioned in order to avoid the problems of quality that these signals can create.