Real-time multi-party multimedia communication has long been a challenging technical problem. The most straightforward approach is for each user to send media data (such as video, audio, images, text, and documents) directly to every other user, as illustrated in FIG. 1.
Such a prior art mesh connection of users typically requires very high bandwidth because each user has to receive different media data from every other user and has to send identical media data to every other user. The total bandwidth of the data traffic in the network therefore increases quickly, roughly with the square of the number of users. The required processing power of each user terminal also increases with the number of users. Therefore, such a mesh connection of multiple users is typically disadvantageous.
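The growth in traffic for the mesh arrangement of FIG. 1 can be illustrated with a small calculation. The following is a minimal sketch, assuming that every user sends one unidirectional stream to each other user; the function name `mesh_stream_counts` and the user counts in the loop are illustrative and do not come from the prior art systems described here.

```python
def mesh_stream_counts(num_users: int) -> dict:
    """Count unidirectional media streams in a full mesh of num_users.

    Assumption for illustration: each user sends exactly one stream
    to every other user and receives one stream from every other user.
    """
    if num_users < 2:
        raise ValueError("a conference needs at least two users")
    per_user_out = num_users - 1         # copies of its own media each user sends
    per_user_in = num_users - 1          # distinct streams each user receives and decodes
    total = num_users * (num_users - 1)  # streams carried by the network as a whole
    return {
        "per_user_out": per_user_out,
        "per_user_in": per_user_in,
        "total": total,
    }


if __name__ == "__main__":
    # Hypothetical conference sizes, chosen only to show the trend.
    for n in (2, 4, 6, 12):
        print(n, mesh_stream_counts(n))
```

Doubling the number of users roughly quadruples the total number of streams, while the number of streams each terminal must decode grows linearly.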
The prior art video conferencing system of FIG. 2 attempts to solve this problem by using a Multipoint Control Unit (“MCU”) as a central connection point for all users.
To save bandwidth, the MCU receives encoded video bitstreams from all users, decodes them, mixes all or a selected number of video sequences into one video sequence, encodes the combined video sequence, and sends a single bitstream to each user individually. In the process of mixing multiple video sequences, the resolution of some input video sequences typically has to be reduced in order for the combined video sequence to fit into a given resolution. For example, if User 1, User 2, and User 3 use the Common Intermediate Format (“CIF”) for their video, and User 4, User 5, and User 6 use the Quarter CIF (“QCIF”) for their video, the video resolution of the first three users is 352×288 pixels and the video resolution of the last three users is 176×144 pixels. Assuming that the video sequences of the first four users are mixed into a single CIF video sequence, the resolution of the first three video sequences has to be reduced from CIF to QCIF before they are combined with the fourth one into the output video sequence. FIG. 3 illustrates the process for this example. The choice of which video sequences are mixed together is typically made by either voice activated selection (“VAS”) or chair control. In the above example, if VAS is used, the four video sequences associated with the four loudest voices in the video conference are selected for mixing. If chair control is used, one of the users is designated as the chairperson, and this user determines which video sequences are mixed together.
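The selection and mixing steps in the above example can be sketched as follows. This is only an illustrative sketch, not the implementation of any particular MCU: the function names, the 2×2 tile arrangement, and the voice levels are assumptions made for the example, and only the CIF and QCIF resolutions and the per-user formats are taken from the text.

```python
CIF = (352, 288)   # width, height in pixels
QCIF = (176, 144)


def select_by_voice(levels: dict, count: int = 4) -> list:
    """Voice activated selection: return the `count` loudest users, loudest first."""
    return sorted(levels, key=levels.get, reverse=True)[:count]


def tile_layout(users: list, canvas: tuple = CIF) -> dict:
    """Assign each selected user a quarter-size tile (x, y, width, height).

    Assumption: the mixed picture is a 2x2 grid filled left to right,
    top to bottom.
    """
    if len(users) > 4:
        raise ValueError("a 2x2 layout holds at most four video sequences")
    tile_w, tile_h = canvas[0] // 2, canvas[1] // 2
    return {
        user: ((i % 2) * tile_w, (i // 2) * tile_h, tile_w, tile_h)
        for i, user in enumerate(users)
    }


def needs_downscale(source: tuple, tile: tuple = QCIF) -> bool:
    """True if the decoded source is larger than the tile it must fit into."""
    return source[0] > tile[0] or source[1] > tile[1]


# Formats from the example in the text.
resolutions = {
    "User 1": CIF, "User 2": CIF, "User 3": CIF,
    "User 4": QCIF, "User 5": QCIF, "User 6": QCIF,
}
# Hypothetical voice levels, chosen so that Users 1 to 4 are the loudest.
voice_levels = {
    "User 1": 0.9, "User 2": 0.8, "User 3": 0.7,
    "User 4": 0.6, "User 5": 0.2, "User 6": 0.1,
}

selected = select_by_voice(voice_levels)
layout = tile_layout(selected)
for user in selected:
    print(user, layout[user], "downscale" if needs_downscale(resolutions[user]) else "as is")
```

With these assumed voice levels, Users 1 to 3 are reduced from CIF to QCIF and User 4 is placed unchanged, which matches the case shown in FIG. 3. Under chair control, `selected` would instead be a list supplied by the chairperson.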
With a single MCU, the number of users is typically limited because both the bandwidth and the processing power required of the MCU increase with the number of users. To handle a large number of simultaneous video conferences with many users, multiple MCUs are cascaded in the prior art, as illustrated in FIG. 4. In a traditional video conferencing system, there typically is a Gatekeeper that, among other things, keeps information about which users are connected to which MCUs and how the MCUs are cascaded so that video calls between users can be made through the appropriate MCUs. For each MCU, the connection to another MCU is typically treated the same as the connection to a user. For example, if a video conference involves the three users on MCU 1, two of the users on MCU 2, two of the users on MCU 3, and three of the users on MCU 4, each individual MCU mixes its own local video and sends the mixed video to its neighbor MCU as a single video bitstream. This means that the video from User 1.1 is sent to User 4.1 through three video mixers on MCU 1, MCU 3, and MCU 4.
One of the problems in such a prior art cascaded MCU video conferencing system is the end-to-end delay, especially on an IP network. First, video processing on each MCU introduces a delay. Second, each MCU typically has to wait for all relevant video packets to arrive before decoding and mixing multiple video sequences. Third, there is transmission delay across the network. The total end-to-end delay can therefore sometimes be too long for users to have real-time interactive communication. The amount of delay typically increases with the number of cascaded MCUs in the delivery path between any two end-points.
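How these delay components accumulate along a cascaded path can be shown with a simple additive model. The sketch below is an assumption-laden illustration: the `Hop` structure, the additive model itself, and every millisecond value are hypothetical and are not measurements of any real system. Only the path through MCU 1, MCU 3, and MCU 4 is taken from the example of FIG. 4.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    name: str
    processing_ms: float  # decode, mix, and encode on this MCU
    wait_ms: float        # waiting for all relevant packets before mixing
    link_ms: float        # transmission delay on the link leaving this MCU


def end_to_end_delay_ms(access_link_ms: float, hops: list) -> float:
    """Sum the sender's access-link delay and the per-MCU delays along the path."""
    return access_link_ms + sum(
        hop.processing_ms + hop.wait_ms + hop.link_ms for hop in hops
    )


# Hypothetical per-hop figures, identical for each MCU for simplicity.
cascaded_path = [
    Hop("MCU 1", processing_ms=60.0, wait_ms=20.0, link_ms=30.0),
    Hop("MCU 3", processing_ms=60.0, wait_ms=20.0, link_ms=30.0),
    Hop("MCU 4", processing_ms=60.0, wait_ms=20.0, link_ms=30.0),
]
single_mcu_path = cascaded_path[:1]

print("one MCU:   ", end_to_end_delay_ms(30.0, single_mcu_path), "ms")
print("three MCUs:", end_to_end_delay_ms(30.0, cascaded_path), "ms")
```

Under this model every additional MCU in the delivery path adds its own processing, waiting, and link delay, so the total grows with the number of cascaded MCUs between the two end-points.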
Therefore, one disadvantage of a traditional prior art video conferencing system is the inability to handle many users. Another disadvantage is that the cost per user is typically relatively high. A further disadvantage is that the complexity of call setup typically increases very quickly as the number of users and cascaded MCUs grows.