Several techniques currently exist for center channel extraction. They are typically based on summing the stereo channel signals, feeding the sum to the center channel, and subtracting a signal derived from that sum from the stereo channels. However, with loudspeaker playback, these approaches often have difficulty achieving a stable audio image for listeners located away from the sweet spot and preserving the width of the stereo image.
One approach generates a center channel from the stereo channels using the following passive 2-to-3 channel up-mix matrix:
$$\begin{pmatrix} L' \\ C' \\ R' \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0.707 & 0.707 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} L \\ R \end{pmatrix}, \qquad \text{(EQ. 1)}$$
where the factor 0.707 has the effect of equalizing the energy of the three channels when L and R are uncorrelated and of equal energy. However, with this approach the sound image may be narrowed by approximately 25% while the center-panned sound sources may be boosted by 1.25 dB relative to sources panned to the sides. The up-mix matrix may be generalized into a class of energy preserving N-to-M up-mix decoders, which allows the width of the audio image to be controlled.
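As an illustration, the matrix of EQ. 1 can be applied sample by sample and its energy behavior checked numerically. The following is a minimal sketch, not a definitive implementation: the function names and test signals are hypothetical, and the boost calculation assumes that the 1.25 dB figure refers to total output energy relative to total input energy for a center-panned source versus a side-panned source.

```python
import math

G = 0.707  # center gain taken from EQ. 1


def passive_upmix(left, right):
    """Apply the 2-to-3 matrix of EQ. 1: L' = L, C' = G * (L + R), R' = R."""
    center = [G * (l + r) for l, r in zip(left, right)]
    return list(left), center, list(right)


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def total_gain_db(left, right):
    """Total output energy over total input energy, in dB (assumed metric)."""
    lp, cp, rp = passive_upmix(left, right)
    e_in = energy(left) + energy(right)
    e_out = energy(lp) + energy(cp) + energy(rp)
    return 10.0 * math.log10(e_out / e_in)


# Uncorrelated, equal-energy inputs (orthogonal by construction):
# the center channel energy comes out close to that of L and R.
left = [1.0, 0.0, -1.0, 0.0]
right = [0.0, 1.0, 0.0, -1.0]
_, center, _ = passive_upmix(left, right)
center_energy = energy(center)

# Center-panned source (L == R) versus side-panned source (R silent).
src = [1.0, -0.5, 0.25, 0.75]
silence = [0.0] * len(src)
center_boost_db = total_gain_db(src, src) - total_gain_db(src, silence)
```

Under these assumptions, `center_energy` is approximately equal to the energy of each input channel, and `center_boost_db` is approximately 1.25 dB.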
However, the left and right loudspeakers may need to be placed farther apart when the center loudspeaker is added, which is typically not practical. Furthermore, the perceived localization of the sound sources may be significantly altered for listeners outside the sweet spot.
Another approach is to use an active up-mix matrix (or matrix steering) to improve the signal separation by introducing signal-dependent matrix coefficients. This approach may use principal component analysis to identify the dominant signal component and its panning position. The fundamental limitation of this approach is typically its inability to track multiple dominant sources simultaneously, which may cause instability in the audio image. The approach may be extended by introducing sub-band processing, which enables one dominant signal component to be detected in each frequency band. However, listening tests often reveal audible artifacts due to parameter adaptation inaccuracies, as well as degraded performance with delay panning.
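The principal component analysis step can be sketched for a single block of samples as follows. This is a generic illustration rather than the method of any particular decoder: the function name is hypothetical, the signals are assumed to be zero-mean, and the panning position is expressed as the angle of the principal axis of the 2x2 left/right second-moment matrix (0 for hard left, π/4 for center, π/2 for hard right).

```python
import math


def dominant_pan_angle(left, right):
    """Angle in radians of the principal axis of the 2x2 L/R moment matrix.

    Assumes zero-mean signals, so raw second moments stand in for covariances.
    """
    r_ll = sum(l * l for l in left)
    r_rr = sum(r * r for r in right)
    r_lr = sum(l * r for l, r in zip(left, right))
    # Closed-form principal-axis angle of a symmetric 2x2 matrix.
    return 0.5 * math.atan2(2.0 * r_lr, r_ll - r_rr)


# A single source amplitude-panned with gains (cos 30 deg, sin 30 deg).
src = [1.0, -0.5, 0.25, 0.75]
alpha = math.radians(30.0)
left = [math.cos(alpha) * s for s in src]
right = [math.sin(alpha) * s for s in src]
estimated_deg = math.degrees(dominant_pan_angle(left, right))
```

Because the analysis yields a single angle per block, a mixture of several equally strong sources collapses to one estimate, which is consistent with the limitation described above.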
Another typical objective of center channel extraction is the removal of the singer's voice from a recording, which is useful for applications such as karaoke. A frequency-domain method for separating center-panned sources may be used, but it lacks generality. For example, there is no general description of how to generate a center channel signal compatible with the created stereo signal.
With another approach, center channel extraction is obtained by dividing a stereo signal into time-frequency plane components and applying a left-right similarity measure for deriving a panning index for the dominant source of each component. A similarity measure φ(m,k) is computed as
$$\varphi(m,k) = \frac{2\,\left|X_L(m,k)\,X_R^*(m,k)\right|}{\left|X_L(m,k)\right|^2 + \left|X_R(m,k)\right|^2}, \qquad \text{(EQ. 2)}$$
where $X_L(m,k)$ and $X_R(m,k)$ denote the short-time Fourier transforms of the stereo signal.
The center channel signal is extracted by selecting the time-frequency components whose similarity measure equals 1 (the maximum) and synthesizing a signal by inverse STFT. This signal is subtracted from the original stereo channels so that the three-channel presentation remains spatially indistinguishable from the two-channel presentation for a listener located at the sweet spot. A disadvantage of this approach is that it does not take inter-channel time differences into account, and it is thus limited to recordings made using amplitude panning or coincident microphone techniques.
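The similarity measure and the selection of center components can be sketched for one STFT frame as follows. This is a minimal sketch under stated assumptions: the magnitude of the cross term is taken so that the measure is real with a maximum of 1, the tolerance `tol` and the function names are hypothetical, and the frame values are made up for illustration. The inverse STFT and the subtraction from the stereo channels are omitted.

```python
def similarity(x_l, x_r):
    """Similarity of one time-frequency component (0.0 for an empty bin)."""
    denom = abs(x_l) ** 2 + abs(x_r) ** 2
    if denom == 0.0:
        return 0.0
    return 2.0 * abs(x_l * x_r.conjugate()) / denom


def select_center_bins(frame_l, frame_r, tol=1e-3):
    """Indices k whose similarity is within `tol` of the maximum value 1."""
    return [
        k
        for k, (a, b) in enumerate(zip(frame_l, frame_r))
        if similarity(a, b) >= 1.0 - tol
    ]


# One hypothetical STFT frame with four frequency bins.
frame_l = [1 + 1j, 2 + 0j, 0.5j, 0j]
frame_r = [1 + 1j, 1 + 0j, 0j, 0j]
scores = [similarity(a, b) for a, b in zip(frame_l, frame_r)]
center_bins = select_center_bins(frame_l, frame_r)
```

In this example only the first bin, where the left and right components are identical, is selected as a center component. The second bin has a 2:1 level difference and scores 0.8, and the bin present in one channel only scores 0.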