A “near-end” video conference endpoint captures video and audio of participants in a room during a conference, for example, and then transmits the captured video and audio to “far-end” video conference endpoints. During the conference, reproduced voice conversations should sound natural and clear to the participants, as if the far-end and near-end participants were in the same room. Participants usually occupy random positions in the room, and it is common practice to distribute a number of microphones on a table, on walls, and/or in a ceiling of the room. Typically, a conference sound mixer is used to mix the microphone channels from the microphones with the highest sound levels, the highest signal-to-noise ratio (SNR), or the highest direct-sound-to-reverberation ratio (DRR), in an attempt to detect participant voices with good sound quality. Use of such distributed microphones has drawbacks. For example, from an aesthetic perspective, the distributed microphones add clutter to the room. Also, installing, configuring, and maintaining the distributed microphones (and mixers) can be time consuming and expensive. In addition, the audio signals captured at the spatially distributed microphones may be highly coherent but arrive with different and random phase delays such that, when the signals are mixed together, the resultant signal may be distorted due to a comb filtering effect.
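By way of a non-limiting illustration, the comb filtering effect mentioned above can be sketched numerically. The sketch below is a simplified model and not a description of any particular mixer: it assumes only two microphones, an equal-weight mix, and a pure time delay between the two captured copies of the same voice signal. The function names `mixed_gain` and `notch_frequencies`, and the example 1 ms delay, are hypothetical and chosen only for illustration.

```python
import cmath
import math
from typing import List


def mixed_gain(freq_hz: float, delay_s: float) -> float:
    """Gain of an equal-weight mix of a signal and a delayed copy of itself.

    Assumed model: H(f) = 0.5 * (1 + exp(-j * 2 * pi * f * delay)),
    whose magnitude is |cos(pi * f * delay)|.
    """
    return abs(0.5 * (1 + cmath.exp(-2j * math.pi * freq_hz * delay_s)))


def notch_frequencies(delay_s: float, max_hz: float) -> List[float]:
    """Frequencies up to max_hz at which the two copies cancel.

    Cancellation occurs where the delay equals an odd number of half
    periods: f = (2k + 1) / (2 * delay) for k = 0, 1, 2, ...
    """
    notches = []
    k = 0
    while (2 * k + 1) / (2 * delay_s) <= max_hz:
        notches.append((2 * k + 1) / (2 * delay_s))
        k += 1
    return notches


# Hypothetical geometry: the talker is 0.343 m farther from microphone B
# than from microphone A, giving roughly a 1 ms difference in arrival time.
SPEED_OF_SOUND_M_S = 343.0
delay = 0.343 / SPEED_OF_SOUND_M_S

if __name__ == "__main__":
    for f in (250.0, 500.0, 1000.0, 1500.0, 2000.0):
        print(f"{f:7.1f} Hz  gain = {mixed_gain(f, delay):.3f}")
    print("notches below 4 kHz:", notch_frequencies(delay, 4000.0))
```

Under these assumptions, a 1 ms delay places cancellation notches at 500 Hz, 1500 Hz, 2500 Hz, and so on, which falls within the range of speech. Because the delay depends on where each participant sits relative to each microphone, the notch positions vary from talker to talker, which is consistent with the unpredictable distortion described above.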