1. Field of the Invention
The present invention relates to systems allowing simultaneous exchange of audio, video and data information by the use of telecommunication. In particular, it relates to videoconferencing and web conferencing systems.
2. Discussion of the Background
In particular, the invention describes a system and a method allowing simultaneous exchange of audio, video and data information between a plurality of units, using existing telecommunication networks.
There are a number of technological systems available for arranging meetings between participants located in different areas. These systems may include audio visual multipoint conferences or videoconferencing, web conferencing and audio conferencing.
The most realistic substitute for in-person meetings is a high-end videoconferencing system. Conventional videoconferencing systems comprise a number of end-points communicating real-time video, audio and/or data streams over and between various networks such as WAN, LAN and circuit switched networks. The end-points include one or more monitor(s), camera(s), microphone(s) and/or data capture device(s) and a codec. The codec encodes outgoing streams and decodes incoming streams.
Multimedia conferences may be divided into three main categories: centralized, decentralized and hybrid conferences, wherein each category has a plurality of variations for running a conference.
Centralized Conferences
Traditional Audio Visual Multipoint conferences have a central Multipoint Control Unit (MCU) connected to three or more endpoints. These MCUs perform switching functions to allow the audiovisual terminals to intercommunicate in a conference. The central function of an MCU is to link multiple video teleconferencing sites (endpoints, EP) together by receiving frames of digital signals from audiovisual terminals (EP), processing the received signals, and retransmitting the processed signals to appropriate audiovisual terminals (EP) as frames of digital signals. The digital signals may include audio, video, data and control information. Video signals from two or more audiovisual terminals (EP) can be spatially mixed to form a composite video signal for viewing by teleconference participants. The MCU acts as a selective router of media streams in this scenario. A part of the MCU called the Multipoint Controller (MC) controls the conference. Each endpoint has a control channel for sending and receiving control signals to and from the MC. The MC acts on and sends commands to the endpoints.
Voice Switch Single Stream
In a centralized conference, the MCU will receive incoming video streams from all of the participants. It may relay one video stream from one endpoint to all the other endpoints. In the voice switched single stream solution, the stream selected is typically that of the participant who talks the loudest, i.e., the speaker. This stream is called the Current View. The Previous View is the video stream from the participant at the endpoint who was the speaker before the current speaker. In a Voice Switched Conference, the Current View video stream is sent to all endpoints other than the current speaker, and the Previous View is sent to the current speaker. A problem for the MCU is to ensure that the Current View and the Previous View are receivable by all endpoints in the conference.
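By way of illustration only, the voice switched selection described above may be sketched as follows. The function names, the endpoint labels and the per-endpoint audio level values are hypothetical and are not part of any particular MCU; the sketch assumes the MCU already has one loudness measurement per endpoint.

```python
def update_speaker(audio_levels, current_speaker, previous_speaker):
    """Pick the loudest endpoint; shift the views only when the speaker changes.

    audio_levels is a hypothetical mapping of endpoint name to measured loudness.
    Returns the new (current_speaker, previous_speaker) pair.
    """
    loudest = max(audio_levels, key=audio_levels.get)
    if loudest != current_speaker:
        # The former speaker becomes the source of the Previous View.
        return loudest, current_speaker
    return current_speaker, previous_speaker


def voice_switched_routing(endpoints, current_speaker, previous_speaker):
    """Map each receiving endpoint to the endpoint whose video it is sent."""
    routing = {}
    for ep in endpoints:
        # The current speaker receives the Previous View;
        # every other endpoint receives the Current View.
        routing[ep] = previous_speaker if ep == current_speaker else current_speaker
    return routing


levels = {"A": 0.2, "B": 0.9, "C": 0.1}
current, previous = update_speaker(levels, current_speaker="A", previous_speaker="C")
routing = voice_switched_routing(["A", "B", "C"], current, previous)
```

In this example endpoint B becomes the speaker, so A and C receive the stream from B while B receives the stream from A.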
Switch Single Stream by Other Means
The Current View may also be controlled by sending commands between the MCU and the endpoints. One such mechanism is called floor control. An endpoint can send a floor request command to the MCU so that its video will be sent to all other participants. The Previous View will then typically be a Voice Switched View between all the other participants in the conference. The Current View can be released by sending a floor release command. Other known methods of controlling the Current View include chair control. Floor control and chair control both switch a single stream, so the principle of a Current View and the switching of a single stream is the same.
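The floor request and floor release commands may be pictured, purely as a sketch, as a small state holder in the MCU. The class name, the method names and the first-come-first-served policy are assumptions made for illustration; the actual command set and arbitration policy depend on the conferencing protocol in use.

```python
class FloorControl:
    """Hypothetical floor state kept by the MCU."""

    def __init__(self):
        self.holder = None  # endpoint currently holding the floor, if any

    def request(self, endpoint):
        # Assumption: the floor is granted only when it is free.
        if self.holder is None:
            self.holder = endpoint
            return True
        return False

    def release(self, endpoint):
        # Only the holder can release the floor.
        if self.holder == endpoint:
            self.holder = None
            return True
        return False

    def view_for(self, receiver, audio_levels):
        """Return the endpoint whose video the receiver should get.

        Returns None when nobody holds the floor, meaning the MCU would
        fall back to ordinary voice switching.
        """
        if self.holder is None:
            return None
        if receiver != self.holder:
            return self.holder  # Current View: the floor holder
        # The floor holder gets a voice switched view of the other endpoints.
        others = {ep: lvl for ep, lvl in audio_levels.items() if ep != receiver}
        return max(others, key=others.get)


floor = FloorControl()
granted_a = floor.request("A")
granted_b = floor.request("B")
levels = {"A": 0.9, "B": 0.3, "C": 0.5}
view_b = floor.view_for("B", levels)
view_a = floor.view_for("A", levels)
```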
Continuous Presence
In a conference, one would often like to see more than one participant. This can be achieved in several ways. The MCU can combine the incoming video streams to make one or more outgoing video streams. Several incoming low-resolution video streams from the endpoints can be combined into a high-resolution stream. The high-resolution stream is then sent from the MCU to all or some of the endpoints in the conference. This stream is called a Combined View.
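As a sketch of the geometry only, four low-resolution pictures can be tiled into one picture of twice the width and height, for example four QCIF pictures (176×144) into one CIF picture (352×288). The function name is hypothetical, and the sketch operates on decoded sample arrays represented as lists of rows; it makes no claim about how a given MCU combines coded streams.

```python
def combine_2x2(tiles):
    """Tile four equally sized frames into one frame twice as wide and high.

    Each frame is a list of rows, and each row is a list of sample values.
    The order of tiles is top-left, top-right, bottom-left, bottom-right.
    """
    top_left, top_right, bottom_left, bottom_right = tiles
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, bottom_right)]
    return top + bottom


def solid_frame(value, width=176, height=144):
    """Build a QCIF-sized test frame filled with a single sample value."""
    return [[value] * width for _ in range(height)]


combined = combine_2x2([solid_frame(1), solid_frame(2), solid_frame(3), solid_frame(4)])
```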
The characteristics of the low-resolution streams limit the format of the high-resolution stream from the MCU. Strict limitations on the incoming low-resolution streams are necessary to ensure that the combined high-resolution stream is receivable by all the endpoints receiving it. As long as every receiver receives the same multimedia stream, the MCU has to find “the least common mode” to ensure acceptable viewing and listening characteristics at the receiver with the poorest capacity. With the many variations of monitors, the MCU should also compensate for different monitor formats such as 4:3 or 16:9. This is not possible with a common mode. The least common mode solution does not scale particularly well, and it places heavy restrictions on the receivers that have a capacity exceeding that of the receiver with the poorest capacity.
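The least common mode may be illustrated with a minimal sketch in which each receiver advertises a maximum picture size and bit rate, and the MCU takes the minimum of each value. The `Capability` fields and the example figures are hypothetical; real capability sets contain many more parameters and are not always comparable field by field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical receive capability advertised by one endpoint."""
    max_width: int
    max_height: int
    max_bitrate_kbps: int


def least_common_mode(capabilities):
    """Return the mode that every listed endpoint is able to receive."""
    return Capability(
        max_width=min(c.max_width for c in capabilities),
        max_height=min(c.max_height for c in capabilities),
        max_bitrate_kbps=min(c.max_bitrate_kbps for c in capabilities),
    )


caps = [
    Capability(1280, 720, 2048),
    Capability(352, 288, 384),
    Capability(704, 576, 768),
]
common = least_common_mode(caps)
```

Here the two more capable endpoints are held to the 352×288, 384 kbit/s mode of the weakest one, which is the restriction the paragraph above describes.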
Rescaled View
A more flexible solution is to let the MCU rescale all the incoming video streams and make a view receivable by all endpoints that receive it. In order to do the rescaling, the MCU needs to decode all the incoming video streams. The decoded (raw) data is then rescaled and transformed. The different raw data streams are then combined into a composite picture according to a set layout, and tailored to the receiver requirements for bit rate and coding standard. The combined raw data stream is then encoded, yielding a new video stream containing one or more of the incoming streams. This solution is called the Rescaled View. To make a Rescaled View, the MCU must understand video streams and have the capacity to encode and decode them. The more endpoints in the conference, the more capacity the MCU needs in order to decode all the incoming streams. The heavy data manipulation performed by the MCU will add extra delay to the multimedia streams and hence reduce the quality of the multimedia conference; the higher the number of endpoints, the heavier the data manipulation. Scalability is a concern in a solution like this. The layout may differ between decoders so that end users do not see themselves in delayed video on the monitor. Depending on the number of different layouts, different outgoing streams must be encoded. An MCU might differentiate between the endpoints individually or by groups of endpoints, for example two groups, one for low bit rates giving a first view and one for high bit rates giving a second view.
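The processing load implied by the two-group example may be sketched as follows. The function name, the bit rate threshold and the endpoint figures are assumptions chosen for illustration; the sketch only counts decode and encode operations and does not perform any video processing.

```python
def plan_rescaled_view(endpoint_bitrates_kbps, threshold_kbps):
    """Split endpoints into a low and a high bit rate group.

    Returns the groups together with the number of streams the MCU must
    decode (one per endpoint) and encode (one per non-empty group).
    """
    groups = {"low": [], "high": []}
    for endpoint, rate in endpoint_bitrates_kbps.items():
        key = "low" if rate < threshold_kbps else "high"
        groups[key].append(endpoint)
    decodes = len(endpoint_bitrates_kbps)
    encodes = sum(1 for members in groups.values() if members)
    return groups, decodes, encodes


rates = {"A": 384, "B": 1920, "C": 768, "D": 256}
groups, decodes, encodes = plan_rescaled_view(rates, threshold_kbps=512)
```

If each endpoint instead received its own layout, the number of encodes would grow with the number of endpoints rather than with the number of groups.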
Decentralized Conference
In a decentralized multipoint scenario, one will only need one centralized MC. Each endpoint will send its media data to all other endpoints—typically by multicast. Each endpoint will mix the audio from all the other endpoints, and will combine or select which video streams to show locally. The MC will still act as the controller for the conference, and each endpoint will have a control connection with the MC.
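The local audio mixing performed by each endpoint may be sketched as a sample-wise sum of the frames received from the other endpoints. The function name and the use of 16-bit signed PCM samples with hard clipping are assumptions made for illustration only.

```python
def mix_audio(frames_by_endpoint, local_endpoint):
    """Mix one audio frame from every endpoint except the local one.

    frames_by_endpoint maps an endpoint name to a list of 16-bit signed
    samples; all frames are assumed to have the same length.
    """
    remote = [frame for ep, frame in frames_by_endpoint.items() if ep != local_endpoint]
    if not remote:
        return []
    mixed = []
    for samples in zip(*remote):
        total = sum(samples)
        # Clip to the 16-bit signed range to avoid wrap-around.
        mixed.append(max(-32768, min(32767, total)))
    return mixed


frames = {
    "A": [1000, -2000, 30000],
    "B": [500, 500, 10000],
    "C": [-100, 100, 100],
}
mix_at_a = mix_audio(frames, "A")
mix_at_c = mix_audio(frames, "C")
```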
In a decentralized conference, each endpoint must provide the MCU functionality needed to show a Current/Previous View, a Combined View or a Rescaled View. The complexity of an endpoint supporting decentralized conferences is higher than that of an endpoint supporting centralized conferences.
Hybrid Conference
A hybrid conference uses a combination of centralized and decentralized conferences. Some endpoints will be in a centralized conference, and others will be in a decentralized conference. A hybrid conference may have centralized handling of one media stream, and a decentralized distribution of another. Before the start of the multimedia conference, the centralized MCU will send commands to each endpoint participating in the conference. These commands will, among other things, ask the endpoint to inform the MCU of its bit rate capabilities and its codec processing capacity. The information received will be used by the centralized MCU to set up a multimedia hybrid conference, wherein the characteristics of each endpoint are taken into account.
The term hybrid will also be used where audio is mixed at the MCU and each endpoint selects and decodes one or more incoming video streams for local view.
Scalable Signal Compression
Scalable signal compression algorithms are a major requirement of the rapidly evolving global network, which involves a variety of channels with widely differing capacities. Many applications require data to be simultaneously decodable at a variety of rates. Examples include applications such as multicast in heterogeneous networks, where the channels dictate the feasible bit rates for each user. Similarly, scalable signal compression is motivated by the co-existence of endpoints of differing complexity and cost. A compression technique is scalable if it offers a variety of decoding rates and/or processing requirements using the same basic algorithm, and where the lower rate information streams are embedded within the higher rate bit-streams in a manner that minimizes redundancy.
Several algorithms have been proposed that allow scalability of video communication, including frame rate scalability (temporally scalable coding), visual quality scalability (SNR scalable coding) and spatial scalability. Common to these methods is that video is coded in layers, where the scalability comes from decoding one or more layers.
Temporally Scalable Coding
Video is coded in frames, and a temporally scalable video coding algorithm allows extraction of video of multiple frame rates from a single coded stream. The video is divided into multiple interleaved sets of frames. By decoding more than one set of frames, the frame rate is increased.
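One possible way to interleave the sets of frames is a dyadic hierarchy, in which each additional set doubles the frame rate. This particular arrangement, the function names and the use of three layers are assumptions made for illustration; other interleaving patterns are equally possible.

```python
def temporal_layer(frame_index, num_layers):
    """Return the layer (0 is the base layer) a frame belongs to."""
    period = 1 << (num_layers - 1)  # the base layer holds every period-th frame
    position = frame_index % period
    if position == 0:
        return 0
    layer = num_layers - 1
    # Each factor of two in the position moves the frame one layer down.
    while position % 2 == 0:
        position //= 2
        layer -= 1
    return layer


def extract_frames(frames, max_layer, num_layers):
    """Keep only the frames belonging to layers 0..max_layer."""
    return [
        frame
        for index, frame in enumerate(frames)
        if temporal_layer(index, num_layers) <= max_layer
    ]


frames = list(range(8))  # frame numbers stand in for coded frames
base_only = extract_frames(frames, max_layer=0, num_layers=3)
two_layers = extract_frames(frames, max_layer=1, num_layers=3)
all_layers = extract_frames(frames, max_layer=2, num_layers=3)
```

With three layers, decoding the base set alone yields a quarter of the full frame rate, two sets yield half, and all three sets yield the full frame rate.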
Spatial Scalable Coding
A spatially scalable compression algorithm is an algorithm in which the first layer has a coarse resolution, and the video resolution can be improved by decoding more layers.
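The layering principle may be sketched on a single row of samples: the base layer is a half-resolution version of the row, and the enhancement layer carries the difference between the original and the upsampled base. The function names, the averaging filter and the sample repetition are illustrative assumptions; quantization and entropy coding, which a real codec would apply, are omitted.

```python
def upsample(base):
    """Double the resolution by repeating every sample."""
    return [value for value in base for _ in range(2)]


def spatial_encode(row):
    """Split an even-length row into a coarse base layer and an enhancement layer."""
    base = [(row[i] + row[i + 1]) // 2 for i in range(0, len(row), 2)]
    enhancement = [original - coarse for original, coarse in zip(row, upsample(base))]
    return base, enhancement


def spatial_decode(base, enhancement=None):
    """Decode the coarse row, or the full-resolution row if the enhancement is given."""
    if enhancement is None:
        return list(base)
    return [coarse + delta for coarse, delta in zip(upsample(base), enhancement)]


row = [10, 12, 50, 54, 100, 90]
base, enhancement = spatial_encode(row)
coarse = spatial_decode(base)
full = spatial_decode(base, enhancement)
```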
SNR Scalable Coding (Visual Quality Scalable Coding)
SNR-scalable compression refers to encoding a sequence in such a way that different quality video can be reconstructed by decoding a subset of the encoded bit stream. Scalable compression is useful in today's heterogeneous networking environment in which different users have different rate, resolution, display, and computational capabilities.
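Quality layering may be sketched, under simplifying assumptions, as coarse quantization in the base layer with the quantization remainder carried in an enhancement layer. The function names, the step size of 16 and the restriction to non-negative sample values are assumptions made for illustration; real SNR scalable codecs operate on transform coefficients rather than raw samples.

```python
def snr_encode(samples, coarse_step=16):
    """Split non-negative samples into a coarse base layer and a refinement layer."""
    base = [s // coarse_step for s in samples]
    enhancement = [s % coarse_step for s in samples]
    return base, enhancement


def snr_decode(base, enhancement=None, coarse_step=16):
    """Reconstruct at low quality from the base alone, or exactly with the refinement."""
    if enhancement is None:
        # Without the refinement, reconstruct at the middle of each interval.
        return [b * coarse_step + coarse_step // 2 for b in base]
    return [b * coarse_step + e for b, e in zip(base, enhancement)]


samples = [0, 37, 200, 255]
base, enhancement = snr_encode(samples)
low_quality = snr_decode(base)
full_quality = snr_decode(base, enhancement)
```

Both reconstructions have the same resolution; only the precision of each sample differs, which distinguishes this form of scalability from spatial scalability.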
In a traditional centralized system, the endpoints will send a “full-scale” picture to an MCU; as an example, a coded CIF picture (352×288 pixels) will be sent to the MCU. To improve the quality of the conference, it is helpful to present a composite picture at each endpoint. This composite picture may show one participant as the main fraction of a full screen, whereas all the other participants are shown as smaller sub-pictures. Which participants are displayed at each site, at what size, and how many, may depend on processing and display capabilities and on the conference situation. If each endpoint is supposed to receive composite pictures, the MCU has to perform heavy data manipulation as described under Continuous Presence and Rescaled View. After decoding the coded CIF data streams to video pictures, the MCU will compose composite pictures that will be re-encoded and sent to the appropriate endpoint.
This solution places heavy demands on the capacity of the central MCU and will, in cases where heavy use of encoding and decoding is necessary, introduce an annoying delay between the participants of a multimedia conference.