Such a parametric coding/decoding technique is for example described in the document by J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, entitled “Parametric Coding of Stereo Audio”, in EURASIP Journal on Applied Signal Processing 2005:9, pp. 1305-1322. This example is taken up with reference to FIGS. 1 and 2, which respectively describe a parametric stereo coder and decoder.
Thus, FIG. 1 describes a stereo coder receiving two audio channels, a left channel (denoted L) and a right channel (denoted R).
The temporal signals L(n) and R(n), where n is the integer index of the samples, are processed by the blocks 101, 102, 103 and 104 which perform a short-term Fourier analysis. The transformed signals L[k] and R[k], where k is the integer index of the frequency coefficients, are thus obtained.
The block 105 performs a downmix processing to obtain, in the frequency domain, a monophonic signal (hereinafter called the mono signal) from the left and right signals.
An extraction of spatial information parameters is also performed in the block 105. The extracted parameters are as follows.
The ICLD (for “InterChannel Level Difference”) parameters, also called interchannel intensity differences, characterize the energy ratios per frequency sub-band between the left and right channels. These parameters make it possible to position sound sources in the stereo horizontal plane by “panning”. They are defined in dB by the following formula:
$$\mathrm{ICLD}[b]=10\cdot\log_{10}\left\{\frac{\sum_{k=k_b}^{k_{b+1}-1}L[k]\cdot L^{*}[k]}{\sum_{k=k_b}^{k_{b+1}-1}R[k]\cdot R^{*}[k]}\right\}\ \mathrm{dB}\quad(1)$$

where L[k] and R[k] correspond to the (complex) spectral coefficients of the L and R channels, each frequency band of index b comprises the frequency lines in the interval [k_b, k_{b+1}−1] and the * symbol indicates the complex conjugate.
The ICPD (“InterChannel Phase Difference”) parameters, also called phase differences, are defined according to the following relationship:

$$\mathrm{ICPD}[b]=\angle\left(\sum_{k=k_b}^{k_{b+1}-1}L[k]\cdot R^{*}[k]\right)\quad(2)$$

where ∠ indicates the argument (the phase) of the complex operand. It is also possible to define, in a way equivalent to the ICPD, an interchannel time difference called ICTD, the definition of which, known to the person skilled in the art, is not recalled here.
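As an illustration, the two parameters of equations (1) and (2) can be computed directly from the short-term spectra. The helper name and the toy band layout below are assumptions made for the example, not taken from the cited article:

```python
import numpy as np

# Illustrative computation of ICLD (equation (1)) and ICPD (equation (2))
# per frequency sub-band from the complex short-term spectra L[k], R[k].
def icld_icpd(L, R, band_edges):
    icld, icpd = [], []
    for b in range(len(band_edges) - 1):
        sl = slice(band_edges[b], band_edges[b + 1])
        e_l = np.sum(L[sl] * np.conj(L[sl])).real        # left-channel energy
        e_r = np.sum(R[sl] * np.conj(R[sl])).real        # right-channel energy
        icld.append(10.0 * np.log10(e_l / e_r))          # equation (1), in dB
        icpd.append(np.angle(np.sum(L[sl] * np.conj(R[sl]))))  # equation (2)
    return np.array(icld), np.array(icpd)

# Toy check: R is L attenuated by 6 dB and delayed in phase by pi/4.
k = np.arange(8)
L = np.exp(1j * 0.3 * k)
R = 10.0 ** (-6.0 / 20.0) * L * np.exp(-1j * np.pi / 4)
icld, icpd = icld_icpd(L, R, [0, 4, 8])
```

The extracted values recover the construction: an ICLD of 6 dB and an ICPD of π/4 in each sub-band.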
Unlike the ICLD, ICPD and ICTD parameters, which are localization parameters, the ICC (“InterChannel Coherence”) parameters represent the inter-channel correlation (or coherence) and are associated with the spatial width of the sound sources. Their definition is not recalled here, but it is noted in the article by Breebaart et al. that the ICC parameters are not necessary in the sub-bands reduced to a single frequency coefficient: in effect, the amplitude and phase differences fully describe the spatialization in this “degenerate” case.
These ICLD, ICPD and ICC parameters are extracted by analysis of the stereo signals, by the block 105. If the ICTD or ITD parameters were also coded, they could likewise be extracted for each sub-band from the spectra L[k] and R[k]; however, the extraction of the ITD parameters is generally simplified by assuming an identical inter-channel time difference in each sub-band, in which case a single parameter can be extracted from the time channels L(n) and R(n) through cross-correlation.
The mono signal M[k] is transformed into the time domain (blocks 106 to 108) by short-term Fourier synthesis (inverse FFT, windowing and overlap-add, or OLA), and a mono coding (block 109) is then performed. In parallel, the stereo parameters are quantized and coded in the block 110.
Generally, the spectrum of the signals (L[k], R[k]) is divided according to a nonlinear frequency scale of ERB (Equivalent Rectangular Bandwidth) or Bark type, with a number of sub-bands typically ranging from 20 to 34 for a signal sampled at 16 to 48 kHz according to the Bark scale. This scale defines the values of k_b and k_{b+1} for each sub-band b. The parameters (ICLD, ICPD, ICC, ITD) are coded by scalar quantization, possibly followed by entropy coding and/or differential coding. For example, in the abovementioned article, the ICLD is coded by a non-uniform quantizer (ranging from −50 to +50 dB) with differential entropy coding. The non-uniform quantization step exploits the fact that the auditory sensitivity to variations of this parameter becomes increasingly weaker as the ICLD value increases.
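A sub-band layout of this kind can be sketched by mapping the FFT bins onto a Bark-type scale. The sketch below uses Zwicker's classical approximation of the Bark scale; the exact band boundaries used by the codecs discussed here may differ:

```python
import numpy as np

# Illustrative construction of the k_b sub-band boundaries from a
# Bark-type scale (Zwicker's approximation of the Bark index).
def bark(f):
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

fs, nfft = 16000, 512
freqs = np.arange(nfft // 2 + 1) * fs / nfft   # FFT bin centre frequencies
z = bark(freqs)                                # Bark index of each bin
# one sub-band per Bark unit: k_b is the first bin reaching Bark value b
edges = [int(np.searchsorted(z, b)) for b in range(int(z[-1]) + 1)]
edges.append(nfft // 2 + 1)                    # close the last band
```

For a signal sampled at 16 kHz this yields on the order of 22 sub-bands, consistent with the 20 to 34 range mentioned above.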
For the coding of the mono signal (block 109), several quantization techniques with or without memory are possible, for example the “Pulse Code Modulation” (PCM) coding, its version with adaptive prediction called “Adaptive Differential Pulse Code Modulation” (ADPCM) or more advanced techniques such as the perceptual coding by transform or the “Code Excited Linear Prediction” (CELP) coding or a multi-mode coding.
The interest here is more particularly focused on the 3GPP EVS (“Enhanced Voice Services”) recommendation which uses a multi-mode coding. The algorithmic details of the EVS codec are provided in the 3GPP specifications TS 26.441 to 26.451 and they are not therefore repeated here. Hereinbelow, reference will be made to these specifications by the reference EVS.
The input signal of the EVS codec is sampled at the frequency of 8, 16, 32 or 48 kHz and the codec can represent telephone audio bands (narrowband, NB), wideband (WB), super-wideband (SWB) or full band (FB). The bit rates of the EVS codec are divided into two modes:
- “EVS Primary”:
  - fixed bit rates: 7.2, 8, 9.6, 13.2, 16.4, 24.4, 32, 48, 64, 96, 128 kbit/s;
  - variable bit rate (VBR) mode with an average bit rate close to 5.9 kbit/s for active speech;
  - “channel-aware” mode at 13.2 kbit/s in WB and SWB only;
- “EVS AMR-WB IO”, for which the bit rates are identical to those of the 3GPP AMR-WB codec (9 modes).
To that is added the discontinuous transmission (DTX) mode, in which the frames detected as inactive are replaced by SID frames (SID Primary or SID AMR-WB IO) which are transmitted intermittently, approximately once every 8 frames.
On the decoder 200, referring to FIG. 2, the mono signal is decoded (block 201) and a decorrelator is used (block 202) to produce two versions M̂(n) and M̂′(n) of the decoded mono signal. This decorrelation, necessary only when the ICC parameter is used, makes it possible to augment the spatial width of the mono source M̂(n). These two signals M̂(n) and M̂′(n) are transformed into the frequency domain (blocks 203 to 206) and the decoded stereo parameters (block 207) are used by the stereo synthesis (or formatting) (block 208) to reconstruct the left and right channels in the frequency domain. These channels are finally reconstructed in the time domain (blocks 209 to 214).
Thus, as mentioned for the coder, the block 105 performs a downmix processing by combining the stereo channels (left, right) to obtain a mono signal which is then coded by a mono coder. The spatial parameters (ICLD, ICPD, ICC, etc.) are extracted from the stereo channels and transmitted in addition to the bit stream from the mono coder.
Several techniques have been developed for the stereo to mono downmix processing. This downmix can be performed in the time or frequency domain. Two types of downmix are generally distinguished:
- the passive downmix, which corresponds to a direct matrixing of the stereo channels to combine them into a single signal; the coefficients of the downmix matrix are generally real and of predetermined (fixed) values;
- the active (adaptive) downmix, which includes a control of the energy and/or of the phase in addition to the combining of the two stereo channels.
The simplest example of passive downmix is given by the following time matrixing:
$$M(n)=\frac{1}{2}\left(L(n)+R(n)\right)=\begin{bmatrix}1/2 & 1/2\end{bmatrix}\begin{bmatrix}L(n)\\ R(n)\end{bmatrix}\quad(3)$$
This type of downmix does however have the drawback of not conserving the energy of the signals well after the stereo to mono conversion when the L and R channels are not in phase: in the extreme case where L(n)=−R(n), the mono signal is nil, which is not desirable.
An active downmix mechanism improving the situation is given by the following equation:
$$M(n)=\gamma(n)\,\frac{L(n)+R(n)}{2}\quad(4)$$

where γ(n) is a factor which compensates any energy loss.
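The attenuation of the passive downmix and its compensation by γ(n) can be sketched as follows. The frame-wise energy-matching rule used for γ below is an illustrative assumption, since the text does not specify how γ(n) is computed:

```python
import numpy as np

# Sketch of equations (3) and (4): the passive downmix loses energy when
# L and R are nearly in phase opposition; a gain gamma compensates it.
n = np.arange(256)
L = np.sin(2 * np.pi * 0.05 * n)
R = np.sin(2 * np.pi * 0.05 * n + 3.0)     # nearly in phase opposition

M_passive = 0.5 * (L + R)                  # equation (3): strong attenuation

e_stereo = 0.5 * (np.sum(L ** 2) + np.sum(R ** 2))   # target mono energy
gamma = np.sqrt(e_stereo / np.sum(M_passive ** 2))   # energy compensation
M_active = gamma * (L + R) / 2             # equation (4)
```

The passive mono signal here retains only a small fraction of the stereo energy, while the active version restores it; note, however, that a single time-domain gain cannot act with fine frequency resolution, as discussed below.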
However, the combining of the signals L(n) and R(n) in the time domain does not make it possible to control any phase differences between the L and R channels finely (with sufficient frequency resolution); when the L and R channels have comparable amplitudes and almost opposite phases, phenomena of “erasure” or “attenuation” (loss of “energy”) on the mono signal can be observed by frequency sub-bands in relation to the stereo channels.
This is why it is often more advantageous in quality terms to perform the downmix in the frequency domain, even if that involves computing time/frequency transforms and induces additional delay and complexity compared to a time downmix.
It is thus possible to transpose the preceding active downmix with the spectra of the left and right channels, as follows:
$$M[k]=\gamma[k]\,\frac{L[k]+R[k]}{2}\quad(5)$$

where k corresponds to the index of a frequency coefficient (Fourier coefficient, for example representing a frequency sub-band). The compensation parameter can be set as follows:

$$\gamma[k]=\min\left(2,\ \sqrt{\frac{\left|L[k]\right|^{2}+\left|R[k]\right|^{2}}{\left|L[k]+R[k]\right|^{2}/2}}\right)\quad(6)$$
There is thus an assurance that the energy of the downmix matches the half-sum of the energies of the left and right channels. The factor γ[k] is here saturated at an amplification of 6 dB (γ[k] ≤ 2).
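This per-coefficient compensation can be sketched as follows. The handling of the degenerate case where both channels are nil (0/0) is an assumption made for the example, not specified in the text:

```python
import numpy as np

# Sketch of equations (5)-(6): per-coefficient active downmix with an
# energy-compensation factor gamma[k] saturated at 2 (+6 dB).
def active_downmix(L, R):
    s = L + R
    num = np.abs(L) ** 2 + np.abs(R) ** 2
    den = np.abs(s) ** 2 / 2
    gamma = np.where(den > 0,
                     np.minimum(2.0, np.sqrt(num / np.maximum(den, 1e-30))),
                     1.0)                   # gamma = 1 when both channels nil
    return gamma * s / 2                    # equation (5)

L = np.array([1.0 + 0j, 1.0 + 0j, 0.0 + 0j])
R = np.array([1.0 + 0j, -0.9 + 0j, 0.0 + 0j])  # 2nd bin near phase opposition
M = active_downmix(L, R)
```

On the in-phase bin γ[k] is 1; on the near-opposition bin the compensation saturates at 6 dB, which limits but does not fully restore the energy, illustrating the residual attenuation problem discussed below.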
The stereo to mono downmix technique of the document by Breebaart et al. cited previously is performed in the frequency domain. The mono signal M[k] is obtained by a linear combining of the L and R channels according to the equation:

$$M[k]=w_{1}L[k]+w_{2}R[k]\quad(7)$$

where w1, w2 are complex-valued gains. If w1 = w2 = 0.5, the mono signal is an average of the two L and R channels. The gains w1, w2 are generally adapted to the short-term signal, in particular to align the phases.
A particular case of this frequency downmix technique is proposed in the document entitled “A stereo to mono downmixing scheme for MPEG-4 parametric stereo encoder” by Samsudin, E. Kurniawati, N. Boon Poh, F. Sattar, S. George, in Proc. ICASSP, 2006. In this document, the L and R channels are aligned in phase before performing the downmix processing.
More specifically, the phase of the L channel for each frequency sub-band is chosen as the reference phase, and the R channel is aligned with the phase of the L channel for each sub-band by the following formula:

$$R'[k]=e^{j\cdot \mathrm{ICPD}[b]}\,R[k]\quad(8)$$

where j = √(−1), R′[k] is the aligned R channel, k is the index of a coefficient in the b-th frequency sub-band, and ICPD[b] is the inter-channel phase difference in the b-th frequency sub-band given by equation (2). Note that when the sub-band of index b is reduced to a single frequency coefficient, the following applies:

$$R'[k]=\left|R[k]\right|\cdot e^{j\angle L[k]}\quad(9)$$
Finally, the mono signal obtained by the downmix of the document by Samsudin et al. cited previously is computed by averaging the L channel and the aligned R′ channel, according to the following equation:

$$M[k]=\frac{L[k]+R'[k]}{2}\quad(10)$$
The phase alignment therefore makes it possible to conserve the energy and to avoid the problems of attenuation by eliminating the influence of the phase. This downmix corresponds to the downmix described in the document by Breebaart et al., where:

$$M[k]=w_{1}L[k]+w_{2}R[k]\quad(11)$$

with w1 = 0.5 and

$$w_{2}=\frac{e^{j\cdot \mathrm{ICPD}[b]}}{2}$$

in the case where the sub-band of index b comprises only one frequency coefficient of index k.
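The phase-aligned downmix of equations (8) and (10) can be sketched as follows; the helper name and the single-band layout are illustrative:

```python
import numpy as np

# Sketch of the phase-aligned downmix: the R channel is aligned on the
# phase of the L channel per sub-band (via the ICPD of equation (2)),
# then the two channels are averaged (equation (10)).
def aligned_downmix(L, R, band_edges):
    M = np.empty_like(L)
    for b in range(len(band_edges) - 1):
        sl = slice(band_edges[b], band_edges[b + 1])
        icpd = np.angle(np.sum(L[sl] * np.conj(R[sl])))  # equation (2)
        R_aligned = np.exp(1j * icpd) * R[sl]            # equation (8)
        M[sl] = (L[sl] + R_aligned) / 2                  # equation (10)
    return M

# Even in exact phase opposition (L = -R), where the passive downmix
# would be nil, the energy is conserved:
L = np.exp(1j * np.linspace(0.0, 1.0, 8))
R = -L
M = aligned_downmix(L, R, [0, 8])
```

Here the aligned mono signal coincides with L, so no attenuation occurs; the dependency on the L channel as phase reference is precisely the weakness discussed below.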
An ideal conversion of a stereo signal to a mono signal should avoid the problems of attenuation for all the frequency components of the signal.
This downmix operation is important for the parametric stereo coding because the decoded stereo signal is only a spatial formatting of the decoded mono signal.
The downmix technique in the frequency domain described previously does conserve the energy level of the stereo signal well in the mono signal by aligning the R channel and the L channel before performing the processing. This phase alignment makes it possible to avoid the situations where the channels are in phase opposition.
The method described in the document by Samsudin et al. referenced above however makes the downmix processing totally dependent on the channel (L or R) chosen to set the reference phase.
In the extreme cases, if the reference channel is nil (“total” silence) and the other channel is non-nil, the phase of the mono signal after downmix becomes constant, and the resulting mono signal will generally be of poor quality; similarly, if the reference channel is a random signal (ambient noise, etc.), the phase of the mono signal can become random or be ill-conditioned with, here again, a mono signal which will generally be of poor quality.
An alternative frequency downmix technique has been proposed in the document entitled “Parametric stereo extension of ITU-T G.722 based on a new downmixing scheme” by T. M. N Hoang, S. Ragot, B. Kovesi, P. Scalart, Proc. IEEE MMSP, 4-6 Oct. 2010. This document proposes a downmix technique which resolves the drawbacks of the downmix proposed by Samsudin et al. According to this document, the mono signal M[k] is computed from the stereo channels L[k] and R[k] by the polar decomposition M[k]=|M[k]|.ej∠M[k], where the amplitude |M[k]| and the phase ∠M[k] for each sub-band are defined by:
$$\begin{cases}\left|M[k]\right|=\dfrac{\left|L[k]\right|+\left|R[k]\right|}{2}\\[2mm]\angle M[k]=\angle\left(L[k]+R[k]\right)\end{cases}\quad(12)$$

The amplitude of M[k] is the average of the amplitudes of the L and R channels. The phase of M[k] is given by the phase of the signal summing the two stereo channels (L+R).
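The polar downmix of equation (12) can be sketched as follows; the helper name is illustrative:

```python
import numpy as np

# Sketch of equation (12): amplitude is the average of the channel
# amplitudes, phase is the phase of the sum signal L + R.
def polar_downmix(L, R):
    mag = (np.abs(L) + np.abs(R)) / 2      # average amplitude
    phase = np.angle(L + R)                # phase of the sum signal
    return mag * np.exp(1j * phase)

# A nil left channel no longer forces the mono phase to a constant:
R = np.exp(1j * np.linspace(0.0, 2.0, 6))
L = np.zeros_like(R)
M = polar_downmix(L, R)
```

With a nil L channel the mono signal reduces to R/2, so its phase follows the active channel instead of an ill-conditioned reference, which is the improvement over the previous method.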
The method of Hoang et al. preserves the energy of the mono signal like the method of Samsudin et al., and it avoids the problem of total dependency on one of the stereo channels (L or R) for the phase computation ∠M[k]. However, it presents a disadvantage when the L and R channels are virtually in phase opposition in certain sub-bands (with, as an extreme case, L = −R). In these conditions, the resulting mono signal will be of poor quality.
In the ITU-T G.722 annex D codec and in the article “Parametric stereo coding scheme with a new downmix method and whole band inter channel time/phase differences” by W. Wu, L. Miao, Y. Lang, D. Virette, Proc. ICASSP, 2013, another method making it possible to manage the phase opposition of the stereo signals has been described. The method relies in particular on the estimation of a full-band phase parameter. It is possible to check experimentally that the quality of this method is unsatisfactory for stereo signals where the phase relationship between channels is complex, or for stereo speech signals with sound pick-up of AB type (using two omnidirectional microphones spaced apart). In effect, this method consists in computing the phase of the downmix signal from the phases of the L and R signals, and this computation can result in audio artifacts for certain signals, because the phase defined by short-term FFT analysis is a parameter that is difficult to interpret and manipulate.
Furthermore, this method does not directly take account of the phase changes which can occur between successive frames, possibly bringing about phase jumps.
There is thus a need for a coding/decoding method of limited complexity which makes it possible to combine channels with a “robust” quality, that is to say a good quality regardless of the type of multi-channel signal, while managing the signals in phase opposition, the signals whose phase is ill-conditioned (e.g. a nil channel or a channel containing only noise), and the signals for which the channels exhibit complex phase relationships that it would be better not to “manipulate”, so as to avoid the quality problems that these signals can create.