Video conferencing systems that mix and/or relay multiple incoming media streams are known. In some video conferencing systems that are designed to handle video conferencing sessions involving multiple participants, a server receives incoming media streams from some or all of the conference participants, and determines which of the incoming media streams are to be mixed and/or relayed back to the conference participants as outgoing media streams. In some situations, the video conferencing server can receive a large number of incoming media streams. Usually, only a subset of the incoming media streams needs to be mixed and/or relayed. Determining which media streams to mix and/or relay can, in some situations, require a significant amount of processing resources.
One approach involves determining, at the video conferencing server, which of the incoming media streams represent conference participants that are speaking. Commonly, this determination is made in the signal domain using, for example, voice activity detection (VAD). Because the determination operates on the decoded signal, the video conferencing server must decode each of the incoming media streams before it can identify the participants that are speaking.
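The signal-domain approach described above can be illustrated with a minimal sketch. It assumes each incoming media stream has already been decoded into frames of 16-bit PCM samples, which is the costly step noted above, and it substitutes a simple RMS energy threshold for a full VAD algorithm. The function names, the threshold of 500, and the 50% active-frame ratio are hypothetical values chosen for illustration only.

```python
import math
from typing import Dict, List, Sequence


def frame_rms(samples: Sequence[int]) -> float:
    """Root-mean-square level of one decoded PCM frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speaking(frames: List[Sequence[int]],
                threshold: float = 500.0,       # assumed energy threshold
                min_active_ratio: float = 0.5   # assumed fraction of active frames
                ) -> bool:
    """Crude stand-in for VAD: enough frames exceed the energy threshold."""
    if not frames:
        return False
    active = sum(1 for f in frames if frame_rms(f) >= threshold)
    return active / len(frames) >= min_active_ratio


def select_speaking_streams(decoded: Dict[str, List[Sequence[int]]]) -> List[str]:
    """Return the IDs of streams whose decoded audio looks like speech.

    Every stream must be decoded before it reaches this function, so the
    work grows with the number of incoming streams, not with the number
    of streams that are eventually mixed or relayed.
    """
    return sorted(sid for sid, frames in decoded.items() if is_speaking(frames))
```

For example, given one loud stream and one near-silent stream, only the loud stream would be selected for mixing and/or relaying:

```python
decoded = {
    "alice": [[1000, -1000] * 80] * 5,  # loud frames
    "bob": [[10, -10] * 80] * 5,        # near-silent frames
}
speaking = select_speaking_streams(decoded)
```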