Videoconferencing systems allow two or more remote participants/endpoints to communicate with each other in real time using both audio and video. When only two remote participants are involved, direct transmission of communications over suitable electronic networks between the two endpoints can be used. When more than two participants/endpoints are involved, a Multipoint Conferencing Unit (MCU), or bridge, is commonly used to connect to all the participants/endpoints. The MCU mediates communications between the multiple participants/endpoints, which may be connected, for example, in a star configuration.
For a videoconference, the participants/endpoints or terminals are equipped with suitable encoding and decoding devices. An encoder formats local audio and video output at a transmitting endpoint into a coded form suitable for signal transmission over the electronic communication network. A decoder, in contrast, processes a received signal, which has encoded audio and video information, into a decoded form suitable for audio playback or image display at a receiving endpoint.
Traditionally, an end-user's own image is also displayed on his/her screen to provide feedback (to ensure, for example, proper positioning of the person within the video window).
In practical videoconferencing system implementations over communication networks, the quality of an interactive videoconference between remote participants is determined by end-to-end signal delays. End-to-end delays of greater than 200 ms prevent realistic live or natural interactions between the conferencing participants. Such long end-to-end delays cause the videoconferencing participants to unnaturally restrain themselves from actively participating or responding in order to allow in-transit video and audio data from other participants to arrive at their endpoints.
The end-to-end signal delays include acquisition delays (e.g., the time it takes to fill up a buffer in an A/D converter), coding delays, transmission delays (the time it takes to submit a packet-full of data to the network interface controller of an endpoint), and transport delays (the time a packet travels in a communication network from endpoint to endpoint). Additionally, signal-processing times through mediating MCUs contribute to the total end-to-end delay in the given system.
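The delay components listed above can be summed into a simple budget and compared against the 200 ms interactivity limit. The following is a minimal sketch only: the class name, field names, and all numeric values are hypothetical illustrations, not measurements from any particular system.

```python
from dataclasses import dataclass

# Interactivity threshold cited in the text above, in milliseconds.
MAX_INTERACTIVE_DELAY_MS = 200.0


@dataclass
class DelayBudget:
    """Hypothetical per-path delay components, all in milliseconds."""
    acquisition_ms: float
    coding_ms: float
    transmission_ms: float
    transport_ms: float
    mcu_processing_ms: float = 0.0  # summed over every MCU on the path

    def total_ms(self) -> float:
        # End-to-end delay is the sum of all components on the path.
        return (self.acquisition_ms + self.coding_ms + self.transmission_ms
                + self.transport_ms + self.mcu_processing_ms)

    def is_interactive(self) -> bool:
        # Delays greater than the threshold prevent natural interaction.
        return self.total_ms() <= MAX_INTERACTIVE_DELAY_MS


# Illustrative values only (assumed, not measured).
direct = DelayBudget(acquisition_ms=33.0, coding_ms=40.0,
                     transmission_ms=5.0, transport_ms=60.0)
via_mcu = DelayBudget(acquisition_ms=33.0, coding_ms=40.0,
                      transmission_ms=5.0, transport_ms=60.0,
                      mcu_processing_ms=120.0)
```

With these assumed figures, the direct path stays within the limit, while the same path routed through a mediating MCU exceeds it. This is the effect that makes MCU processing time a concern.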
An MCU's primary tasks are to mix the incoming audio signals so that a single audio stream is transmitted to all participants, and to mix video frames or pictures transmitted by individual participants/endpoints into a common composite video frame stream, which includes a picture of each participant. It is noted that the terms frame and picture are used interchangeably herein, and further that coding of interlaced frames as individual fields or as combined frames (field-based or frame-based picture coding) can be incorporated as is obvious to persons skilled in the art. The MCUs, which are deployed in conventional communication network systems, only offer a single common resolution (e.g., CIF or QCIF resolution) for all the individual pictures mixed into the common composite video frame distributed to all participants in a videoconferencing session. Thus, conventional communication network systems do not readily provide customized videoconferencing functionality by which a participant can view other participants at different resolutions. Such desirable functionality allows the participant, for example, to view another specific participant (e.g., a speaking participant) in CIF resolution and view other, silent participants in QCIF resolution. MCUs can be configured to provide this desirable functionality by repeating the video mixing operation as many times as the number of participants in a videoconference. However, in such configurations, the MCU operations introduce considerable end-to-end delay. Further, the MCU must have sufficient digital signal processing capability to decode multiple audio streams, mix, and re-encode them, and also to decode multiple video streams, composite them into a single frame (with appropriate scaling as needed), and re-encode them into a single stream. Video conferencing solutions (such as the systems commercially marketed by Polycom Inc., 4750 Willow Road, Pleasanton, Calif. 94588, and Tandberg, 200 Park Avenue, New York, N.Y. 10166) must use dedicated hardware components to provide acceptable quality and performance levels.
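The processing burden described in the paragraph above can be made concrete by counting video operations per output frame interval. The sketch below is illustrative only. The names are hypothetical, and it assumes that each customized layout requires its own composition and its own re-encode, which the text implies but does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class McuVideoWorkload:
    """Hypothetical count of video operations per output frame interval."""
    decodes: int
    compositions: int
    encodes: int


def mcu_video_workload(num_participants: int,
                       per_viewer_layouts: bool) -> McuVideoWorkload:
    """Count MCU video operations for a conventional transcoding MCU.

    Every incoming stream is decoded once. With one common layout, a
    single composite frame is built and encoded. With per-viewer layouts,
    the mixing step is repeated once per participant (assumption: each
    repetition also needs its own re-encode).
    """
    if num_participants < 2:
        raise ValueError("a conference needs at least two participants")
    mixes = num_participants if per_viewer_layouts else 1
    return McuVideoWorkload(decodes=num_participants,
                            compositions=mixes,
                            encodes=mixes)


common = mcu_video_workload(5, per_viewer_layouts=False)
custom = mcu_video_workload(5, per_viewer_layouts=True)
```

Under these assumptions, a five-party conference with customized views needs five compositions and five encodes instead of one of each, so the composition and encoding load grows linearly with the number of participants.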
The performance levels of and the quality delivered by a videoconferencing solution are also a strong function of the underlying communication network over which it operates. Videoconferencing solutions, which use ITU H.261, H.263, and H.264 standard video codecs, require a robust communication channel with little or no loss for delivering acceptable quality. The required communication channel transmission speeds or bitrates can range from 64 Kbps up to several Mbps. Early videoconferencing solutions used dedicated ISDN lines, and newer systems often utilize high-speed Internet connections (e.g., fractional T1, T1, T3, etc.) for high-speed transmission. Further, some videoconferencing solutions exploit Internet Protocol (“IP”) communications, but these are implemented in a private network environment to ensure bandwidth availability. In any case, conventional videoconferencing solutions incur substantial costs associated with implementing and maintaining the dedicated high-speed networking infrastructure needed for quality transmissions.
The costs of implementing and maintaining a dedicated videoconferencing network are avoided by recent “desktop videoconferencing” systems, which exploit high bandwidth corporate data network connections (e.g., 100 Mbit Ethernet). In these desktop videoconferencing solutions, common personal computers (PCs), which are equipped with USB-based digital video cameras and appropriate software applications for performing encoding/decoding and network transmission, are used as the participant/endpoint terminals.
Recent advances in multimedia and telecommunications technology involve integration of video communication and conferencing capabilities with Internet Protocol (“IP”) communication systems such as IP PBX, instant messaging, web conferencing, etc. In order to effectively integrate video conferencing into such systems, both point-to-point and multipoint communications must be supported. However, the available network bandwidth in IP communication systems can fluctuate widely (e.g., depending on time of day and overall network load), making these systems unreliable for the high bandwidth transmissions required for video communications. Further, videoconferencing solutions implemented on IP communication systems must accommodate both network channel heterogeneity and endpoint equipment diversity associated with the Internet system. For example, participants may access videoconferencing services over IP channels having very different bandwidths (e.g., DSL vs. Ethernet) using a diverse variety of personal computing devices.
The communication networks on which videoconferencing solutions are implemented can be categorized as providing two basic communication channel architectures. In one basic architecture, a guaranteed quality of service (QoS) channel is provided via a dedicated direct or switched connection between two points (e.g., ISDN connections, T1 lines, and the like). Conversely, in the second basic architecture, the communication channels do not guarantee QoS, but are only “best-effort” packet delivery channels such as those used in Internet Protocol (IP)-based networks (e.g., Ethernet LANs).
Implementing video conferencing solutions on IP-based networks may be desirable, at least due to the low cost, high total bandwidth, and widespread availability of access to the Internet. As noted previously, IP-based networks typically operate on a best-effort basis, i.e., there is no guarantee that packets will reach their destination, or that they will arrive in the order they were transmitted. However, techniques have been developed to provide different levels of quality of service (QoS) over the putatively best-effort channels. The techniques may include protocols such as DiffServ, which specifies and controls network traffic by class so that certain types of traffic get precedence, and RSVP. These protocols can ensure certain bandwidth and/or delays for portions of the available bandwidth. Techniques such as forward error correction (FEC) and automatic repeat request (ARQ) mechanisms may also be used to recover lost packets and to mitigate the effects of packet loss.
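As one illustration of forward error correction, a sender can append a single XOR parity packet to each group of equal-length packets, which lets the receiver rebuild any one lost packet in the group without retransmission. This is a minimal sketch of that generic scheme, not the mechanism of any particular codec or protocol. The function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: List[bytes]) -> bytes:
    """XOR all equal-length packets in a group into one parity packet."""
    if not packets:
        raise ValueError("packet group must not be empty")
    if len({len(p) for p in packets}) != 1:
        raise ValueError("all packets in a group must have equal length")
    return reduce(xor_bytes, packets)


def recover(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one missing packet (marked None) using the parity."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("single-parity FEC can repair only one loss per group")
    if not missing:
        return list(received)
    # XOR of the parity with every surviving packet yields the lost packet.
    rebuilt = reduce(xor_bytes, (p for p in received if p is not None), parity)
    repaired = list(received)
    repaired[missing[0]] = rebuilt
    return repaired


group = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(group)
repaired = recover([b"abcd", None, b"ijkl"], parity)
```

The trade-off is extra bandwidth (one parity packet per group) in exchange for avoiding the round-trip delay that an ARQ retransmission would add.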
Implementing video conferencing solutions on IP-based networks requires consideration of the video codecs used. Standard video codecs, such as the H.261 and H.263 codecs designated for videoconferencing and the MPEG-1 and MPEG-2 Main Profile codecs designated for Video CDs and DVDs, respectively, are designed to provide a single bitstream (“single-layer”) at a fixed bitrate. Some of these codecs may be deployed without rate control to provide a variable bitrate stream (e.g., MPEG-2, as used in DVDs). However, in practice, even without rate control, a target operating bitrate is established depending on the specific infrastructure. These video codec designs are based on the assumption that the network is able to provide a constant bitrate and a practically error-free channel between the sender and the receiver. The H-series standard codecs, which are designed specifically for person-to-person communication applications, offer some additional features to increase robustness in the presence of channel errors, but can still tolerate only a very small percentage of packet losses (typically only up to 2-3%).
Further, the standard video codecs are based on “single-layer” coding techniques, which are inherently incapable of exploiting the differentiated QoS capabilities provided by modern communication networks. An additional limitation of the single-layer coding techniques for video communications is that even if a lower spatial resolution display is required or desired in an application, a full resolution signal must be received and decoded with downscaling performed at a receiving endpoint or MCU. This wastes bandwidth and computational resources.
In contrast to the aforementioned single-layer video codecs, in “scalable” video codecs based on “multi-layer” coding techniques, two or more bitstreams are generated for a given source video signal: a base layer and one or more enhancement layers. The base layer may be a basic representation of the source signal at a minimum quality level. The minimum quality representation may be reduced in the SNR (quality), spatial, or temporal resolution aspects or a combination of these aspects of the given source video signal. The one or more enhancement layers correspond to information for increasing the quality of the SNR (quality), spatial, or temporal resolution aspects of the base layer. Scalable video codecs have been developed in view of heterogeneous network environments and/or heterogeneous receivers. The base layer can be transmitted using a reliable channel, i.e., a channel with guaranteed Quality of Service (QoS). Enhancement layers can be transmitted with reduced or no QoS. The effect is that recipients are guaranteed to receive a signal with at least a minimum level of quality (the base layer signal). Similarly, with heterogeneous receivers that may have different screen sizes, a small picture size signal may be transmitted to, e.g., a portable device, and a full size picture may be transmitted to a system equipped with a large display.
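The way a layered bitstream serves heterogeneous receivers can be sketched as a per-receiver layer selection. The example below is a hypothetical illustration: the layer names, the bitrates, and the greedy selection rule are assumptions made for clarity, not part of any standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    """One layer of a scalable bitstream (hypothetical names and rates)."""
    name: str
    bitrate_kbps: int


def select_layers(layers: List[Layer], available_kbps: int) -> List[Layer]:
    """Choose the layers to forward to one receiver.

    layers[0] is the base layer; the rest are enhancement layers in
    dependency order. The base layer is always sent (it is assumed to
    travel on the reliable channel). Enhancement layers are added in
    order until the next one would exceed the receiver's bandwidth.
    """
    if not layers:
        raise ValueError("at least a base layer is required")
    chosen = [layers[0]]
    used = layers[0].bitrate_kbps
    for layer in layers[1:]:
        if used + layer.bitrate_kbps > available_kbps:
            break  # later layers depend on this one, so stop here
        chosen.append(layer)
        used += layer.bitrate_kbps
    return chosen


# Illustrative layer set; bitrates are assumed values.
LAYERS = [Layer("base", 128), Layer("enh1", 256), Layer("enh2", 512)]
dsl_layers = select_layers(LAYERS, available_kbps=400)
lan_layers = select_layers(LAYERS, available_kbps=10_000)
```

Because each receiver is served by dropping layers rather than by re-encoding, a low-bandwidth endpoint receives only the layers it can use, while a well-connected endpoint receives the full set from the same source encoding.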
Standards such as MPEG-2 specify a number of techniques for performing scalable coding. However, practical use of “scalable” video codecs has been hampered by the increased cost and complexity associated with scalable coding, and the lack of widespread availability of high bandwidth IP-based communication channels suitable for video.
Consideration is now being given to developing improved scalable codec solutions for video conferencing and other applications. Desirable scalable codec solutions will offer improved bandwidth, temporal resolution, spatial quality, spatial resolution, and computational power scalability. Attention is in particular directed to developing scalable video codecs that are consistent with simplified MCU architectures for versatile videoconferencing applications. Desirable scalable codec solutions will enable zero-delay MCU architectures that allow cascading of MCUs in electronic networks with no or minimal end-to-end delay penalties.