Desktop video conferencing over packet-based transport mechanisms is gaining popularity in the marketplace. The technology has particular potential for establishing video conferences over the Internet or other data networks employing the Internet Protocol (IP). It is similar to the technology used in the more established Voice-over-IP arena, and the same signaling protocols serve both.
Typically, a signaling protocol such as H.323, from the International Telecommunication Union (ITU), or the Session Initiation Protocol (SIP), from the Internet Engineering Task Force (IETF), is used to establish voice, video and data channels between multiple participants.
Each participant in such a call is referred to as a multimedia endpoint, or endpoint for short. It should be noted that an endpoint may be a logical entity as well as a physical terminal. For example, the audio stream may originate from a desktop telephone set whereas the video originates from an adjacent personal computer or other similar device capable of transmitting video. As part of the call set-up, these distinct devices are logically represented and presented as a single endpoint. Similarly, the audio, video and other media “streams” may in fact be carried as a single multiplexed signal over a single physical channel. Nevertheless, this single multiplexed channel can be viewed as consisting of a number of logically distinct media channels.
The following is a description, given by way of example, of a typical packet-based video conference implemented in accordance with ITU Recommendation H.323. The H.323 standard is described in the Recommendation H.323 document published by the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) under the title “Packet Based Multimedia Communications Systems”. This Recommendation is an umbrella for a set of standards describing equipment, terminals and services for multimedia conferencing over networks such as the Internet.
Multiple participants or endpoints connected to a packet-based data network establish signaling and media channels with a combined conference and call server, which is a physical embodiment of the H.323 entity known as a Multipoint Control Unit (MCU). The MCU incorporates a Multipoint Controller (MC) and a Multipoint Processor (MP).
The MC processes the signaling channels from the endpoints and thereby provides the call control capability to negotiate with all endpoints and achieve common levels of communication. The MC also interfaces with the MP.
The MP allows mixing, switching and other processing of media streams under the control of the MC. Thus, the MP manages the media streams coming from the endpoints, and mixes the streams which are transmitted to the endpoints.
In alternative implementations, the MC may be incorporated in a call server, and the MP incorporated in a physically separate conference server, so that the MP has media and data channels re-directed to it by the MC which terminates the signaling channel from each endpoint.
In either case, for each incoming audio stream, the MP normally employs a mixing mechanism to collate and distribute the various combinations of voice packets to each endpoint. This mechanism can either mix all voice channels or use a more advanced algorithm to, for example, identify the N loudest speakers and mix and distribute only those.
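The N-loudest selection described above can be sketched as follows. This is a minimal illustration only, not the mechanism of any particular MP: the function names, the per-endpoint level measurements, the 16-bit sample format, and the choice to exclude each receiver's own audio from its mix are all assumptions made for the example.

```python
from typing import Dict, List


def loudest_speakers(levels: Dict[str, float], n: int) -> List[str]:
    """Return the ids of the n endpoints with the highest audio level, loudest first.

    `levels` maps a (hypothetical) endpoint id to a measured audio level.
    Ties are broken by endpoint id so the result is deterministic.
    """
    return sorted(levels, key=lambda ep: (-levels[ep], ep))[:n]


def mix_for_endpoint(frames: Dict[str, List[int]],
                     selected: List[str],
                     receiver: str) -> List[int]:
    """Sum the selected endpoints' samples for one receiver.

    Assumption: the receiver's own audio is left out of its mix, and samples
    are 16-bit PCM values, so the sum is clamped to the 16-bit range.
    """
    length = len(next(iter(frames.values()), []))
    mixed = [0] * length
    for ep in selected:
        if ep == receiver:
            continue  # do not echo a speaker's own voice back to them
        for i, sample in enumerate(frames[ep]):
            mixed[i] += sample
    return [max(-32768, min(32767, s)) for s in mixed]


if __name__ == "__main__":
    levels = {"a": 0.9, "b": 0.2, "c": 0.5}
    frames = {"a": [100, 200], "b": [1, 1], "c": [30000, -5]}
    chosen = loudest_speakers(levels, 2)
    for endpoint in frames:
        print(endpoint, mix_for_endpoint(frames, chosen, endpoint))
```

Setting n to the number of endpoints reduces this sketch to the simpler case in which all voice channels are mixed.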
To handle multiple video streams, the MP may also adopt a mixing strategy, in which the video streams from all participants are combined into a “picture-in-picture” image containing reduced images of all conference participants. This combined image is then transmitted to each endpoint, so that all participants may be viewed from each desktop.
Although the combination of video images in this way has its merits for the participants, it requires the MP to decode each signal, reduce the image to the required size, mix this reduced image with each of the other reduced images to form a combined image, and then encode this image according to the codec being used by each endpoint.
It will be appreciated that if an MP is required to host a large number of conferences, each with a large number of participants, these processor-intensive decode, mix and encode operations on each signal may prove wasteful of valuable MP resources.
A further difficulty with this type of mixing is that for large conferences, the end result may be of limited use to each participant. For example, if a conference has 20–30 participants, the individual images received in the “picture-in-picture” image may not be of high enough resolution to be usable.
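The resolution problem can be made concrete with a short calculation. The sketch below assumes a near-square tile grid and a CIF (352×288) composite frame; neither the layout nor the frame size is specified in the description above, and the function name is hypothetical.

```python
import math
from typing import Tuple


def tile_size(participants: int, width: int = 352, height: int = 288) -> Tuple[int, int]:
    """Return the (width, height) in pixels of each reduced image.

    Assumes the participants are laid out on a near-square grid that fills
    a composite frame of the given size (CIF by default).
    """
    if participants < 1:
        raise ValueError("at least one participant is required")
    cols = math.isqrt(participants - 1) + 1  # ceil(sqrt(participants))
    rows = -(-participants // cols)          # ceil(participants / cols)
    return width // cols, height // rows


if __name__ == "__main__":
    for count in (4, 20, 25, 30):
        print(count, tile_size(count))
```

Under these assumptions, a 25-participant conference leaves each participant a tile of roughly 70×57 pixels, and a 30-participant conference roughly 58×57 pixels.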
As an alternative to mixing all of the video streams, a common approach is for the MP to distribute the video stream of the loudest speaker to all of the other conference participants. (The loudest speaker in this scenario generally receives the video stream of the second loudest speaker.)
This idea can be extended by using an audio mixing algorithm to pick out the N loudest speakers, where N is small (typically 2–3) compared to the number of endpoints in the conference, and having the MP mix the video streams from these endpoints only.
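The loudest-speaker switching rule can be sketched as a simple routing table. The function names and the idea of a pre-computed loudness ranking are assumptions for illustration; the ranking itself would come from whatever audio level detection the MP uses.

```python
from typing import Dict, List, Optional


def video_source_for(receiver: str, ranked: List[str]) -> Optional[str]:
    """Pick whose video a receiver gets, given endpoints ranked loudest first.

    Everyone receives the loudest speaker; the loudest speaker receives the
    second loudest. Returns None when no suitable source exists.
    """
    if not ranked:
        return None
    if receiver != ranked[0]:
        return ranked[0]
    return ranked[1] if len(ranked) > 1 else None


def routing_table(endpoints: List[str], ranked: List[str]) -> Dict[str, Optional[str]]:
    """Map every endpoint to the endpoint whose video stream is forwarded to it."""
    return {ep: video_source_for(ep, ranked) for ep in endpoints}


if __name__ == "__main__":
    # "b" is currently loudest, "c" second loudest.
    print(routing_table(["a", "b", "c"], ["b", "c", "a"]))
```

Because the MP only forwards an existing stream to each receiver, no decode, scale or encode step appears anywhere in this sketch.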
The advantages of these two approaches are clear. When only the video stream of the loudest speaker is distributed (along with the video of the second loudest speaker to the loudest speaker), the MP need not perform any processor-intensive mixing operations. When the video streams of the N loudest speakers are mixed (with N being substantially less than the total number of participants), the processing power required by the MP is substantially reduced compared to mixing the video from all participants. Both mechanisms model typical voice conference calls quite well, since such calls usually have a small number of active participants (talkers) and a larger number of passive participants (listeners).
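A rough way to compare the three strategies is to count the per-frame operations the MP performs. The model below is a simplifying assumption, not a measurement: it charges one decode and one scale per mixed stream and one encode per distinct codec in use among the receivers, and it ignores any per-receiver variation in the composite image. The function and strategy names are hypothetical.

```python
from typing import Dict


def mp_video_ops(participants: int, distinct_codecs: int,
                 strategy: str, n: int = 3) -> Dict[str, int]:
    """Estimate per-frame MP operations for one conference.

    strategy is one of:
      "switch"  - forward the loudest speaker's stream unchanged
      "mix_n"   - mix only the n loudest speakers
      "mix_all" - mix every participant into one picture-in-picture image
    """
    if strategy == "switch":
        mixed = 0
    elif strategy == "mix_n":
        mixed = min(n, participants)
    elif strategy == "mix_all":
        mixed = participants
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Assumed cost model: encoding is only needed when something was mixed.
    encodes = distinct_codecs if mixed else 0
    return {"decodes": mixed, "scales": mixed, "encodes": encodes}


if __name__ == "__main__":
    for name in ("mix_all", "mix_n", "switch"):
        print(name, mp_video_ops(participants=30, distinct_codecs=2, strategy=name))
```

Under this model, a 30-participant conference needs 30 decodes per frame when all streams are mixed, 3 when only the three loudest are mixed, and none when the MP simply switches streams.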