Reducing the number of channels is essential for achieving multichannel coding at low bit-rates. For example, parametric stereo coding schemes are based on an appropriate mono downmix of the left and right input channels. The mono signal obtained in this way is encoded and transmitted by the mono codec along with side information describing the auditory scene in parametric form. The side information usually consists of several spatial parameters per frequency sub-band. These could include, for example:
Inter-channel Level Difference (ILD), measuring the level difference (or balance) between channels.
Inter-channel Time Difference (ITD) or Inter-channel Phase Difference (IPD), describing the time or phase difference between channels, respectively.
However, downmix processing is prone to signal cancellation and coloration caused by inter-channel phase misalignment, which leads to undesired quality degradation. As an example, if the channels are coherent and nearly out of phase, the downmix signal is likely to show perceivable spectral bias, such as the characteristics of a comb filter.
The downmix operation can be performed in the time domain simply as a weighted sum of the left and right channels, as expressed by
$$m[n] = w_1[n]\,l[n] + w_2[n]\,r[n],$$
where l[n] and r[n] are the left and right channels, n is the time index, and w1[n] and w2[n] are weights that determine the mixing. If the weights are constant over time, the downmix is called passive. Its disadvantage is that it ignores the input signal, so the quality of the resulting downmix signal depends strongly on the input signal characteristics. Adapting the weights over time can reduce this problem to some extent.
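The cancellation risk of a passive downmix can be illustrated with a minimal sketch. The equal weights of 0.5, the sinusoidal test signals, and the helper names `passive_downmix`, `energy` and `tone` are assumptions chosen for illustration and are not taken from the source.

```python
import math


def passive_downmix(left, right, w1=0.5, w2=0.5):
    """Passive downmix m[n] = w1*l[n] + w2*r[n] with time-constant weights."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return [w1 * l + w2 * r for l, r in zip(left, right)]


def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def tone(num_samples, cycles, phase):
    """Unit-amplitude sinusoid spanning an integer number of cycles (test signal)."""
    return [math.sin(2.0 * math.pi * cycles * n / num_samples + phase)
            for n in range(num_samples)]


left = tone(480, 5, 0.0)

# Coherent channels, in phase: the downmix keeps the channel energy.
mix_in_phase = passive_downmix(left, tone(480, 5, 0.0))

# Coherent channels, nearly out of phase (IPD = 0.95*pi): strong cancellation.
mix_near_out_of_phase = passive_downmix(left, tone(480, 5, 0.95 * math.pi))

ratio_in_phase = energy(mix_in_phase) / energy(left)
ratio_near_out_of_phase = energy(mix_near_out_of_phase) / energy(left)
```

With these assumed signals, the in-phase downmix keeps the energy of one channel, while the nearly out-of-phase downmix retains less than one percent of it, because the constant weights cannot react to the phase relation of the inputs.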
However, to solve the main issues, an active downmix is usually performed in the frequency domain, using for example a Short-Term Fourier Transform (STFT). The weights can thereby be made dependent on the frequency index k and the time index n and can fit the signal characteristics better. The downmix signal is then expressed as:
$$M[k,n] = W_1[k,n]\,L[k,n] + W_2[k,n]\,R[k,n]$$
where M[k,n], L[k,n] and R[k,n] are the STFT components of the downmix signal, the left channel and the right channel, respectively, at frequency index k and time index n. The weights W1[k,n] and W2[k,n] can be adaptively adjusted in time and in frequency. The adaptation aims at preserving the average energy or amplitude of the two input channels by minimizing the spectral bias caused by comb-filtering effects.
The most straightforward method for active downmixing is to equalize the energy of the downmix signal so that each frequency bin or sub-band has the average energy of the two input channels [1]. The downmix signal, as shown in FIG. 7b, can then be formulated as:
$$M[k] = W[k]\,\bigl(L[k] + R[k]\bigr), \quad \text{where} \quad W[k] = \sqrt{\frac{|L[k]|^2 + |R[k]|^2}{2\,|L[k] + R[k]|^2}}$$
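The behavior of this energy equalization can be sketched for a single frequency bin. The sketch assumes that the gain W[k] includes a square root, so that the downmix energy equals the average energy of the two channels. The function names and the choice to return `None` where the gain is undefined are illustrative assumptions.

```python
import cmath
import math


def equalization_gain(L, R):
    """Gain W[k] for one bin, or None where the denominator |L + R|^2 is zero."""
    sum_energy = abs(L + R) ** 2
    if sum_energy == 0.0:
        return None
    return math.sqrt((abs(L) ** 2 + abs(R) ** 2) / (2.0 * sum_energy))


def equalized_downmix_bin(L, R):
    """Energy-equalized downmix M[k] = W[k] * (L[k] + R[k]) for one bin."""
    gain = equalization_gain(L, R)
    if gain is None:
        return None
    return gain * (L + R)


# Equal-amplitude bins (ILD = 0 dB) with an IPD approaching pi:
# the gain grows without bound as the sum signal cancels.
ipd_fractions = (0.0, 0.5, 0.9, 0.99)
gains = [equalization_gain(1.0 + 0.0j, cmath.rect(1.0, f * math.pi))
         for f in ipd_fractions]

# Exactly phase-inverted bins of equal amplitude: the downmix is undefined.
singular = equalized_downmix_bin(1.0 + 0.0j, -1.0 + 0.0j)

# Arbitrary bins: the downmix energy equals the average channel energy.
L_bin, R_bin = 0.8 + 0.2j, 0.1 - 0.5j
M_bin = equalized_downmix_bin(L_bin, R_bin)
```

For equal-amplitude bins the gain in this sketch reduces to 1/(2 cos(IPD/2)), which makes the growth towards IPD = pi explicit.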
Such a straightforward solution has several shortcomings. First, the downmix signal is undefined when the two channels have phase-inverted time-frequency components of equal amplitude (ILD=0 dB and IPD=pi). This singularity results from the denominator becoming zero in this case. The output of a simple active downmix is then unpredictable. This behavior is shown in FIG. 7a for various inter-channel level differences, where the phase is plotted as a function of the IPD.
For ILD=0 dB, the phase of the sum of the two channels is discontinuous at IPD=pi, resulting in a step of pi radians. Under other conditions, the phase evolves regularly and continuously modulo 2pi.
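The phase step can be checked numerically with a short sketch. It assumes a unit-amplitude left bin with zero phase and a right bin whose amplitude and phase are set by the ILD and the IPD; the sign conventions (positive ILD meaning a louder left channel) and the function names are assumptions made for illustration.

```python
import cmath
import math


def sum_phase(ild_db, ipd):
    """Phase of L + R for L = 1 and R = 10^(-ILD/20) * exp(j*IPD)."""
    right_amplitude = 10.0 ** (-ild_db / 20.0)
    return cmath.phase(1.0 + cmath.rect(right_amplitude, ipd))


def phase_step_at_pi(ild_db, delta=0.01):
    """Absolute change of the sum phase between IPD = pi - delta and pi + delta."""
    return abs(sum_phase(ild_db, math.pi + delta)
               - sum_phase(ild_db, math.pi - delta))


# ILD = 0 dB: the phase jumps by almost pi across IPD = pi.
step_ild_0 = phase_step_at_pi(0.0)

# ILD = 6 dB: the phase changes only slightly across IPD = pi.
step_ild_6 = phase_step_at_pi(6.0)
```

With `delta = 0.01`, the step at ILD = 0 dB is pi minus `delta`, whereas at ILD = 6 dB it is on the order of `delta` itself, in line with the continuous evolution described for the other conditions.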
The second kind of problem stems from the large variance of the normalization gains needed to achieve such an energy equalization. Indeed, the normalization gains can fluctuate drastically from frame to frame and between adjacent frequency sub-bands. This leads to an unnatural coloration of the downmix signal and to block effects. The use of synthesis windows for the STFT together with the overlap-add method results in smoothed transitions between processed audio frames. However, a large change in the normalization gains between successive frames can still lead to audible transition artefacts. Moreover, this drastic equalization can also lead to audible artefacts due to aliasing from the side lobes of the frequency response of the analysis window of the block transform.
As an alternative, the active downmix can be achieved by performing a phase alignment of the two channels before computing the sum signal [2-4]. The energy equalization to be done on the new sum signal is then limited, since the two channels are already in phase before they are summed. In [2], the phase of the left channel is used as the reference for aligning the two channels in phase. If the phase of the left channel is not well conditioned (e.g. a zero or low-level noise channel), the downmix signal is directly affected. In [3], this important issue is solved by taking the phase of the sum signal before rotation as the reference. Still, the singularity problem at ILD=0 dB and IPD=pi is not treated. For this reason, [4] amends the approach by using a broadband phase difference parameter in order to improve stability in such a case. Nonetheless, none of these approaches considers the second kind of problem, which is related to instability. The phase rotation of the channels can also lead to an unnatural mixing of the input channels and can create severe instabilities and block effects, especially when large changes occur in the processing over time and frequency.
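The idea of aligning to a reference phase, and the conditioning issue it raises, can be sketched for a single bin. This is a hypothetical illustration of the general principle of using the left channel as the phase reference, not a reproduction of the method in [2]; the function name and the test values are assumptions.

```python
import cmath


def align_to_left_and_sum(L, R):
    """Rotate R onto the phase of L, then sum the two bins (illustrative only)."""
    reference_phase = cmath.phase(L)
    R_aligned = cmath.rect(abs(R), reference_phase)
    return L + R_aligned


# Well-conditioned reference: phase-inverted bins no longer cancel,
# and the magnitudes simply add.
M_good = align_to_left_and_sum(1.0 + 0.0j, -1.0 + 0.0j)

# Ill-conditioned reference: the left bin is low-level noise whose phase is
# arbitrary, so the phase of the downmix follows the noise instead of R.
R_bin = cmath.rect(1.0, 0.3)
M_noise_a = align_to_left_and_sum(cmath.rect(1e-6, 2.0), R_bin)
M_noise_b = align_to_left_and_sum(cmath.rect(1e-6, -1.0), R_bin)
```

In the two noisy cases the right bin is identical, yet the downmix phases differ by 3 radians because they are inherited from the low-level left bin.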
Finally, there are more elaborate techniques such as [5] and [6], which are based on the observation that signal cancellation during downmixing occurs only for time-frequency components that are coherent between the two channels. In [5], the coherent components are filtered out before the incoherent parts of the input channels are summed. In [6], the phase alignment is computed only for the coherent components before the channels are summed. Moreover, the phase alignment is regularized over time and frequency to avoid problems of stability and discontinuity. Both techniques are computationally demanding, since in [5] filter coefficients need to be identified for every frame and in [6] a covariance matrix between the channels has to be computed.