Transmission of moving pictures in real time is employed in several applications, such as video conferencing, net meetings and video telephony.
Video conferencing systems allow for simultaneous exchange of audio, video and data information among multiple conferencing sites. Systems known as Multipoint Control Units (MCUs) perform switching functions to allow the endpoints of multiple sites to intercommunicate in a conference. An endpoint conventionally refers to a video conference terminal, which is either a stand-alone terminal equipped with at least a camera, a display, a loudspeaker or headphone, and a processor, or a video conferencing software client installed on a general-purpose computer with corresponding capabilities. In the following specification, such a terminal is also referred to as a “real endpoint” to distinguish it from a “virtual endpoint”, which is defined later in the specification.
The MCU links the sites together by receiving frames of conference signals from the sites, processing the received signals, and retransmitting the processed signals to the appropriate sites. The conference signals include audio, video, data and control information. In a switched conference, the video signal from one of the endpoints, typically that of the loudest speaker, is broadcast to each of the participants. In a continuous presence conference, video signals from two or more sites are spatially mixed to form a composite video signal for viewing by conference participants. When the different video streams have been mixed together into one single video stream, the composed video stream is transmitted to the different parties of the video conference, where each transmitted video stream preferably follows a set of schemes indicating who will receive which video stream. In general, different users prefer to receive different video streams. The continuous presence or composite image is a combined picture that may include live video streams, still images, menus or other visual images from participants in the conference. The combined picture may, for example, be composed of several equally sized pictures, or of one main picture together with one or more smaller pictures in inset windows, commonly referred to as Picture-in-Picture (PIP). PIPs typically require a much lower resolution than the main picture because of their smaller size on the screen.
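As an illustration only, the two conference modes described above can be sketched as follows. This is a minimal sketch and not an actual MCU implementation: the `Endpoint` class, its `audio_level` field, and the function names are hypothetical, and the square-grid tiling is just one possible way of arranging equally sized pictures.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    audio_level: float  # hypothetical loudness measure; larger means louder


def switched_source(endpoints):
    """Switched conference: choose the loudest endpoint, whose video is sent to all."""
    return max(endpoints, key=lambda e: e.audio_level)


def equal_grid(width, height, count):
    """Continuous presence: tile `count` equally sized pictures in a near-square grid.

    Returns one (x, y, w, h) rectangle per picture inside a width x height frame.
    """
    cols = 1
    while cols * cols < count:
        cols += 1
    rows = -(-count // cols)  # ceiling division
    cell_w, cell_h = width // cols, height // rows
    return [
        ((i % cols) * cell_w, (i // cols) * cell_h, cell_w, cell_h)
        for i in range(count)
    ]


if __name__ == "__main__":
    sites = [Endpoint("A", 0.2), Endpoint("B", 0.9), Endpoint("C", 0.5)]
    print(switched_source(sites).name)
    print(equal_grid(1280, 720, 4))
```

In the switched case a single source is selected for all participants, whereas in the continuous presence case each selected stream is scaled into its rectangle before the composite picture is encoded.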
A key problem with existing MCUs using the H.323 and SIP standards is the lack of scalability. To host large meetings, one of three solutions may be used:
All endpoints call into a single large MCU in a single location. The problem with this approach is excessive bandwidth consumption. As an example, if a video conference includes a large number of endpoints in both the USA and Europe, with the MCU residing in New York, a very large amount of bandwidth would be required across the Atlantic between the MCU and the endpoints in Europe.
Another possibility is to cascade several MCUs using H.243 or a similar protocol. The problem with this approach is that the user experience may be broken. When all endpoints call into the same MCU, a participant typically views the 4 to 10 most recent speakers simultaneously. When endpoints call into two different MCUs, an endpoint can see only one of the endpoints connected to the other MCU.
Non-standards-based MCUs that address the problems discussed above already exist, using techniques such as SVC (Scalable Video Coding), but with these the investment in standards-based endpoints would be lost, and interoperability problems would also arise.
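The bandwidth concern raised for the single-MCU option, and the trade-off made by cascading, can be illustrated with a rough calculation. This is a sketch under stated assumptions: the function names, the per-stream bit rate and the endpoint count are hypothetical, each remote endpoint is assumed to send one stream to the MCU and receive one stream from it, and the cascade link is assumed to carry one stream in each direction, consistent with an endpoint seeing only one endpoint of the other MCU.

```python
def single_mcu_wan_kbps(remote_endpoints, stream_kbps):
    """Long-haul bandwidth when every remote endpoint calls one distant MCU.

    Assumes each remote endpoint sends one stream and receives one stream.
    """
    return remote_endpoints * 2 * stream_kbps


def cascaded_wan_kbps(stream_kbps):
    """Long-haul bandwidth when two MCUs are cascaded.

    Assumes a single stream is exchanged in each direction between the MCUs.
    """
    return 2 * stream_kbps


if __name__ == "__main__":
    # Hypothetical figures: 50 European endpoints at 1000 kbit/s per stream.
    print(single_mcu_wan_kbps(50, 1000))
    print(cascaded_wan_kbps(1000))
```

Under these assumptions the long-haul load of the single-MCU option grows linearly with the number of remote endpoints, while the cascaded option keeps it constant at the cost of showing only one remote participant.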