Multimedia conferences generally involve a number of participant devices (e.g., laptops and phones), each of which encodes an audio signal and transmits the encoded audio signal to a server. Some of these encoded audio signals may include participant speech, but often they contain only background noise. The server fully decodes and/or encodes each audio stream, including the audio streams that carry only background noise. Based on the audio energies of the fully decoded and/or encoded audio streams, the server determines which participants are currently speaking and mixes only the highest-energy audio stream(s) into a mixed audio signal. The server then sends the mixed audio signal to the participant devices.
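
By way of illustration only, the server-side selection described above may be sketched as follows. The sketch assumes each participant's stream has already been fully decoded into one frame of floating-point PCM samples in the range [-1.0, 1.0]; the names frame_energy, mix_loudest, and max_speakers are hypothetical and are not drawn from any particular conferencing system. Mean squared amplitude is used here as one possible measure of audio energy.

```python
from typing import Dict, List


def frame_energy(samples: List[float]) -> float:
    """Return the mean squared amplitude of one decoded PCM frame."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def mix_loudest(frames: Dict[str, List[float]], max_speakers: int = 1) -> List[float]:
    """Mix only the highest-energy streams into a single frame.

    `frames` maps a participant id to that participant's decoded frame.
    Every stream must already be decoded before its energy can be measured,
    including streams that turn out to carry only background noise.
    """
    # Rank participants from highest to lowest frame energy.
    ranked = sorted(frames, key=lambda pid: frame_energy(frames[pid]), reverse=True)
    selected = ranked[:max_speakers]
    if not selected:
        return []

    # Sum the selected streams sample by sample, clipping to the valid range.
    length = min(len(frames[pid]) for pid in selected)
    mixed = []
    for i in range(length):
        total = sum(frames[pid][i] for pid in selected)
        mixed.append(max(-1.0, min(1.0, total)))
    return mixed


if __name__ == "__main__":
    decoded = {
        "alice": [0.5, -0.5, 0.5, -0.5],    # active speaker
        "bob": [0.01, -0.01, 0.02, 0.0],    # background noise only
        "carol": [0.25, 0.25, -0.25, -0.25],
    }
    print(mix_loudest(decoded, max_speakers=2))
```

In this sketch the low-energy stream is decoded and measured like every other stream and is then left out of the mix, which mirrors the per-stream processing cost described above.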