In traditional multipoint videoconferencing, each terminal sends one video stream and one audio stream to a conference bridge. Typically, the conference bridge decodes the audio stream from each terminal to determine its voice activity. The terminals with the highest voice activity, or loudest talkers, are marked as active participants. This determination may be facilitated by encoding voice activity measurements into packets in the audio streams. Depending on the number of video segments that may be displayed at a remote conference site, a number of video streams associated with the active participants may be decoded and re-encoded for display at one or more remote sites. Alternatively, the video segments may simply be re-routed to the conference sites without re-encoding.
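The selection of active participants described above can be illustrated with a minimal sketch. It assumes the bridge already holds one voice activity score per terminal; the function name `select_active_terminals`, the score scale, and the terminal identifiers are hypothetical and are not taken from any particular conferencing system.

```python
import heapq

def select_active_terminals(activity_by_terminal, max_segments):
    """Return the ids of the terminals with the highest voice activity.

    activity_by_terminal: dict mapping a terminal id to its measured activity
                          (hypothetical scale, higher means louder).
    max_segments:         number of video segments the remote site can display.
    """
    # nlargest iterates over the dict keys and ranks them by their score.
    return heapq.nlargest(
        max_segments, activity_by_terminal, key=activity_by_terminal.get
    )

# Hypothetical scores for four terminals; the remote site shows two segments.
scores = {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.7}
active = select_active_terminals(scores, 2)
```

With these example scores, terminals "A" and "D" would be marked as active, and only their video streams would be decoded and re-encoded or re-routed.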
In certain conference systems, the most active audio streams may be mixed for distribution to remote conference sites. More advanced conference systems may perform multiple mixes to prevent an echo effect at local sites where there is an active talker. Thus, each conference site having an active talker may receive a unique mixed audio stream that includes only the voices of the active talkers located at the remote conference site or sites.
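The multiple-mix approach can be sketched as follows. This is only an illustration under stated assumptions: audio is represented as already-decoded frames of 16-bit PCM samples of equal length, mixing is a plain clipped sum, and the names `mix_for_sites`, `frames_by_site`, and `active_sites` are hypothetical.

```python
PCM_MIN, PCM_MAX = -32768, 32767  # assumed 16-bit signed PCM sample range

def mix_for_sites(frames_by_site, active_sites):
    """Build one mixed frame per receiving site, excluding that site's own audio.

    frames_by_site: dict mapping a site id to a list of decoded PCM samples.
    active_sites:   ids of the sites currently marked as active talkers.
    """
    mixes = {}
    for receiver, own_frame in frames_by_site.items():
        # Leave out the receiver's own audio so its talkers hear no echo.
        sources = [frames_by_site[s] for s in active_sites if s != receiver]
        if not sources:
            mixes[receiver] = [0] * len(own_frame)  # nothing remote to hear
            continue
        # Sum the samples position by position and clip to the PCM range.
        mixes[receiver] = [
            max(PCM_MIN, min(PCM_MAX, sum(samples))) for samples in zip(*sources)
        ]
    return mixes

frames = {"A": [100, 200], "B": [10, 20], "C": [1, 2]}
mixes = mix_for_sites(frames, active_sites=["A", "B"])
```

In this example, sites "A" and "B" each receive a unique mix containing only the other active site, while the inactive site "C" receives the mix of both.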
Given the myriad of possible audio streams that may be generated in a single multipoint conference, encoding and decoding these streams may be a computationally demanding task. Accordingly, some multipoint conferencing systems may tag packets, or frames, of audio data with an activity metric so that the conference bridge can quickly determine which audio streams are active without having to decode the audio. Other multipoint conferencing systems may analyze key parts of the packet payload to determine voice activity without the computationally burdensome process of decoding the packet and measuring its activity.
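A minimal sketch of the tagging approach is shown below. The packet layout is purely an assumption made for illustration: a 3-byte header holding a 16-bit stream id and an 8-bit activity metric, followed by the encoded audio payload, which the bridge treats as opaque. The names `tag_packet`, `read_activity`, and `active_streams` are hypothetical.

```python
import struct

# Hypothetical header: stream id (uint16) and activity metric (uint8),
# in network byte order. Real systems define their own layouts.
HEADER = struct.Struct("!HB")

def tag_packet(stream_id, activity, encoded_payload):
    """Terminal side: prepend the activity metric to the encoded audio."""
    return HEADER.pack(stream_id, activity) + encoded_payload

def read_activity(packet):
    """Bridge side: read the header only; the payload is never decoded."""
    stream_id, activity = HEADER.unpack_from(packet)
    return stream_id, activity

def active_streams(packets, threshold):
    """Return the ids of streams whose tagged activity meets the threshold."""
    return {sid for sid, act in map(read_activity, packets) if act >= threshold}

packets = [tag_packet(1, 10, b"\x01\x02"), tag_packet(2, 90, b"\x03\x04")]
loud = active_streams(packets, threshold=50)
```

Because the bridge inspects only the fixed-size header, the cost of finding the active streams does not depend on the audio codec or on the payload size.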