This invention relates generally to real-time multipoint video conferencing.
Video teleconferencing systems allow for the simultaneous exchange of audio, video and data information among a plurality of audio-video terminals. A multipoint video conference typically involves three or more participants. The audio, video and data signals associated with each participant are typically compressed by a user audio-video terminal (AVT) and sent to a multipoint control unit (MCU) for further processing. The MCU performs switching functions that allow all of the participants to communicate in the video conference. A principal function of an MCU is to process the received signals and transmit the processed signals back to the user terminals. The MCU links multiple video conferencing sites together by receiving data units of digital signals from the audio-video terminals, processing the received data units, and retransmitting the processed data units to the appropriate audio-video terminals as data units or frames of digital signals.
The digital signals include audio information, video information, data and control information. Audio processing is typically relatively straightforward. The audio signals from two or more audio-video terminals are decoded and summed to form a composite audio signal, and the composite signal is re-encoded as one audio signal. The re-encoded, summed audio signal is transmitted to those terminals whose audio is not contained in the summed signal. Thus, the participants at each of the terminals can hear what the other participants are saying. Audio encoding can be selective; for example, the MCU can encode only the two or three loudest audio signals in the videoconference. Other arrangements are possible.
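By way of illustration only, the audio processing described above can be sketched as follows. The sketch assumes that each terminal's audio has already been decoded into a frame of 16-bit PCM samples of equal length, and it omits the decoding and re-encoding steps entirely. The function names (rms, clip16, mix_for_terminals), the max_sources parameter and the use of RMS level as the loudness measure are hypothetical choices made for this example and are not taken from any particular MCU.

```python
from typing import Dict, List


def rms(samples: List[int]) -> float:
    """Root-mean-square level of one decoded PCM frame (assumed loudness measure)."""
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


def clip16(value: int) -> int:
    """Clamp a summed sample to the signed 16-bit PCM range."""
    return max(-32768, min(32767, value))


def mix_for_terminals(frames: Dict[str, List[int]],
                      max_sources: int = 3) -> Dict[str, List[int]]:
    """Build one composite frame per terminal.

    frames maps a terminal id to its decoded PCM frame; all frames are
    assumed to have the same length. Only the max_sources loudest
    terminals contribute, and a terminal's own audio is never included
    in the composite it receives.
    """
    # Select the loudest terminals (hypothetical selection rule).
    loudest = sorted(frames, key=lambda t: rms(frames[t]), reverse=True)[:max_sources]
    length = len(next(iter(frames.values()))) if frames else 0

    mixes: Dict[str, List[int]] = {}
    for listener in frames:
        # Exclude the listener's own signal from its composite.
        sources = [t for t in loudest if t != listener]
        mixes[listener] = [
            clip16(sum(frames[t][i] for t in sources)) for i in range(length)
        ]
    return mixes
```

In this sketch each composite frame would subsequently be re-encoded before transmission; a terminal that is among the loudest sources receives a composite of the other selected sources only.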
Video processing, however, is more difficult since there is no simple way to sum several video signals. A multipoint control unit can handle video signals in one of two ways. In the so-called "switched video mode," one video source is selected as the broadcaster and is sent to all of the terminals. Typically, the broadcaster is the current speaker, who in turn receives video from the previous speaker. In this mode, essentially no video processing is needed except for switching the video source. In the second mode, the so-called "continuous presence" mode, multiple compressed video bit streams are received by the MCU. These bit streams are processed and combined into one video bit stream so that participants can view multiple persons simultaneously. The combination of several digital bit streams into one stream is also known as "digital video mixing." While non-real-time video mixing is relatively easy, real-time video mixing presents significant challenges because it requires highly complex processing. For example, the delay incurred in processing the video bit streams has to be as small as possible so as to facilitate desirable interaction among the conference participants.
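By way of illustration only, the source selection performed in the switched video mode described above can be sketched as a simple routing table. The function name route_switched_video and its parameters are hypothetical and are introduced solely for this example; how the current speaker is detected is outside the scope of the sketch.

```python
from typing import Dict, List, Optional


def route_switched_video(terminals: List[str],
                         current_speaker: str,
                         previous_speaker: Optional[str]) -> Dict[str, Optional[str]]:
    """Map each terminal id to the id of the video source it should receive.

    Every terminal receives the current speaker (the broadcaster), except
    the current speaker, who receives the previous speaker. If there is no
    previous speaker yet, the broadcaster's entry is None.
    """
    routes: Dict[str, Optional[str]] = {}
    for terminal in terminals:
        if terminal == current_speaker:
            routes[terminal] = previous_speaker
        else:
            routes[terminal] = current_speaker
    return routes
```

Because only the mapping from source to destination changes when the speaker changes, the compressed video bit streams themselves can be forwarded without decoding, which is consistent with the observation that this mode needs essentially no video processing.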