Videoconferencing systems allow two or more remote participants/endpoints to communicate with each other in real time using both audio and video. When only two remote participants are involved, the two endpoints can communicate directly over suitable electronic networks. When more than two participants/endpoints are involved, a Multipoint Conferencing Unit (MCU), or bridge, is commonly used to connect all the participants/endpoints. The MCU mediates communications between the multiple participants/endpoints, which may be connected, for example, in a star configuration. It is noted that even when only two participants are involved, it may still be advantageous to utilize an MCU between the two participants.
For a videoconference, the participants/endpoints or terminals are equipped with suitable encoding and decoding devices. An encoder formats local audio and video output at a transmitting endpoint into a coded form suitable for signal transmission over the electronic network. A decoder, in contrast, processes a received signal, which has encoded audio and video information, into a decoded form suitable for audio playback or image display at a receiving endpoint.
Traditionally, an end-user's own image is also displayed on his/her screen to provide feedback (to ensure, for example, proper positioning of the person within the video window).
In practical videoconferencing system implementations over communication networks, the quality of an interactive videoconference between remote participants is determined by end-to-end signal delays. End-to-end delays of greater than 200 ms prevent realistic live or natural interactions between the conferencing participants. Such long end-to-end delays cause the videoconferencing participants to unnaturally restrain themselves from actively participating or responding in order to allow in-transit video and audio data from other participants to arrive at their endpoints.
The end-to-end signal delays include acquisition delays (e.g., the delay corresponding to the time it takes to fill up a buffer in an A/D converter), coding delays, transmission delays (e.g., the delay corresponding to the time it takes to submit a packet of data to the network interface controller of an endpoint), and transport delays (the delay corresponding to the time it takes a packet to travel from endpoint to endpoint over the network). Additionally, signal-processing times through mediating MCUs contribute to the total end-to-end delay in the given system.
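The delay components listed above add up to a single end-to-end figure that can be checked against the 200 ms interactivity limit. The following is a minimal sketch of that arithmetic; the function names and every millisecond value are hypothetical placeholders chosen for illustration, not measurements from any system.

```python
# Illustrative delay budget. All millisecond figures below are made-up
# placeholders, not measurements.
INTERACTIVE_LIMIT_MS = 200  # limit cited in the text above


def end_to_end_delay_ms(acquisition, coding, transmission, transport,
                        mcu_processing=0.0):
    """Sum the delay components named in the text (all in milliseconds)."""
    return acquisition + coding + transmission + transport + mcu_processing


def is_interactive(total_ms, limit_ms=INTERACTIVE_LIMIT_MS):
    """Return True when the total delay does not exceed the limit."""
    return total_ms <= limit_ms


# Same hypothetical endpoint-to-endpoint path, without and with an MCU.
direct = end_to_end_delay_ms(acquisition=33, coding=40,
                             transmission=5, transport=60)
via_mcu = end_to_end_delay_ms(acquisition=33, coding=40,
                              transmission=5, transport=60,
                              mcu_processing=90)

print(direct, is_interactive(direct))
print(via_mcu, is_interactive(via_mcu))
```

With these assumed values, the MCU's processing time alone is enough to move an otherwise acceptable path past the limit, which is why MCU signal processing matters to the overall budget.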
An MCU's primary tasks are to mix the incoming audio signals so that a single audio stream is transmitted to all participants, and to mix video frames or pictures transmitted by individual participants/endpoints into a common composite video frame stream, which includes a picture of each participant. It is noted that the terms frame and picture are used interchangeably herein, and further that coding of interlaced frames as individual fields or as combined frames (field-based or frame-based picture coding) can be incorporated as is obvious to persons skilled in the art. The MCUs deployed in conventional communication network systems offer only a single common resolution (e.g., CIF or QCIF resolution) for all the individual pictures mixed into the common composite video frame distributed to all participants in a videoconferencing session. Thus, conventional communication network systems do not readily provide customized videoconferencing functionality, which enables a participant to view other participants at different resolutions. The customized functionality may, for example, enable the participant to view another specific participant (e.g., a speaking participant) in CIF resolution, and to view other silent participants in QCIF resolution. The MCUs in a network can be configured to provide such customized functionality by repeating the video mixing operation as many times as there are participants in a videoconference. However, in such configurations, the MCU operations introduce considerable end-to-end delays. Further, the MCU must have sufficient digital signal processing capability to decode multiple audio streams, mix them, and re-encode them, and also to decode multiple video streams, composite them into a single frame (with appropriate scaling as needed), and re-encode the result into a single stream. Videoconferencing solutions (such as the systems commercially marketed by Polycom Inc., 4750 Willow Road, Pleasanton, Calif. 94588, and Tandberg, 200 Park Avenue, New York, N.Y. 10166) must use dedicated hardware components to provide acceptable quality and performance levels.
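The cost of customized layouts in a conventional (decode, composite, re-encode) MCU can be made concrete by counting operations per output frame interval. The sketch below is a simplified counting model only; the `MixingWork` type and `transcoding_mcu_work` function are hypothetical names, and the model assumes one composite and one encode per distinct output stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MixingWork:
    """Per-frame-interval video operations in a transcoding MCU (simplified)."""
    decodes: int
    composites: int
    encodes: int


def transcoding_mcu_work(participants: int, personalized: bool) -> MixingWork:
    """Count operations for a common layout versus per-participant layouts.

    Assumption: every incoming stream is decoded once, and each distinct
    output stream needs its own composite and its own encode.
    """
    if participants < 1:
        raise ValueError("at least one participant is required")
    outputs = participants if personalized else 1
    return MixingWork(decodes=participants,
                      composites=outputs,
                      encodes=outputs)


print(transcoding_mcu_work(8, personalized=False))
print(transcoding_mcu_work(8, personalized=True))
```

Under this model, the decode load is the same in both cases, while the compositing and encoding load grows linearly with the number of participants once each participant receives a personal layout.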
Traditional video codecs, whose bitstreams and decoding operation are standardized in ITU-T Recommendation H.261; ITU-T Recommendation H.262|ISO/IEC 13818-2 (MPEG-2 Video) Main profile; ITU-T Recommendation H.263 baseline profile; ISO/IEC 11172-2 (MPEG-1 Video); ISO/IEC 14496-2 simple profile or advanced simple profile; ITU-T Recommendation H.264|ISO/IEC 14496-10 (MPEG4-AVC) baseline profile or main profile or high profile, are specified to provide a single bitstream at a given spatial resolution and bit rate. Hence, when an encoded video signal is required at a lower spatial resolution or lower bit rate than that at which it was originally encoded, the full resolution signal must be received and decoded, potentially downscaled, and re-encoded at the desired spatial resolution and bit rate. This process of decoding, potentially downscaling, and re-encoding requires significant computational resources and typically adds significant subjective distortion to the video signal and delay to the video transmission.
Further, the standard video codecs for video communications are based on “single-layer” coding techniques, which are inherently incapable of exploiting the differentiated QoS capabilities provided by modern communication networks. An additional limitation of single-layer coding techniques for video communications is that even if a lower spatial resolution display is required or desired in an application, a full resolution signal must be received and decoded with downscaling performed at a receiving endpoint or MCU. This wastes bandwidth and computational resources.
In contrast to the aforementioned single-layer video codecs, in “scalable” video codecs based on “multi-layer” coding techniques, two or more bitstreams are generated for a given source video signal: a base layer and one or more enhancement layers. The base layer may be a basic representation of the source signal at a minimum quality level. The minimum quality representation may be reduced in the quality (i.e., signal-to-noise ratio (“SNR”)), spatial resolution, or temporal resolution of the given source video signal, or in a combination of these aspects. The one or more enhancement layers carry information for increasing the SNR quality, spatial resolution, or temporal resolution of the base layer. Scalable video codecs have been developed in view of heterogeneous network environments and/or heterogeneous receivers.
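The practical benefit of the base-plus-enhancement structure is that a lower-rate or lower-resolution version can be obtained by simply dropping layers, with no decoding or re-encoding. The following is a minimal sketch under simplifying assumptions: the `Layer` type, the `select_layers` function, the layer names, and all bit rates are hypothetical, and each enhancement layer is assumed to depend on every layer listed before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    kbps: int  # hypothetical bit rate contributed by this layer alone


def select_layers(layers, budget_kbps):
    """Return the longest prefix of layers (base first) that fits the budget.

    Assumption: layers form a simple chain, so a layer is only usable when
    all earlier layers are also delivered. An empty result means that even
    the base layer does not fit.
    """
    chosen = []
    used = 0
    for layer in layers:
        if used + layer.kbps > budget_kbps:
            break
        chosen.append(layer)
        used += layer.kbps
    return chosen


# Hypothetical three-layer stream: base, temporal enhancement, spatial enhancement.
stream = [
    Layer("base (QCIF, low frame rate)", 128),
    Layer("temporal enhancement", 64),
    Layer("spatial enhancement (CIF)", 320),
]

for budget in (100, 200, 600):
    print(budget, [layer.name for layer in select_layers(stream, budget)])
```

A receiver or intermediate server using this kind of selection only forwards or discards whole layers, which is the property that makes scalable coding attractive for heterogeneous networks and receivers.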
Scalable coding has been a part of standards such as ITU-T Recommendation H.262|ISO/IEC 13818-2 (MPEG-2 Video) SNR scalable or spatially scalable or high profiles. However, practical use of such “scalable” video codecs in videoconferencing applications has been hampered by the increased cost and complexity associated with scalable coding, and by the lack of widespread availability of high bandwidth IP-based communication channels suitable for video.
Co-pending and commonly assigned International patent application No. PCT/US06/02836, incorporated by reference herein, describes practical scalable video coding techniques specifically addressing videoconferencing applications. Further, co-pending and commonly assigned International patent application No. PCT/US06/02835, incorporated by reference herein, describes a conference server architecture designed to exploit and benefit from the features of scalable video coding techniques for videoconferencing applications. Co-pending and commonly assigned International patent application No. PCT/US06/061815, incorporated by reference herein, describes techniques for providing error resilience, layer switching, and random access capabilities in conference server architectures designed to exploit and benefit from the features of scalable video coding techniques for videoconferencing applications.
Currently, an extension of ITU-T Recommendation H.264|ISO/IEC 14496-10 standard, which offers a more efficient trade-off than previously standardized scalable video codecs, is being considered (Annex G, Scalable Video Coding—SVC). Further developments in video coding research and standardization include the concept of multiple slice groups for error resilience and video mixing in MCUs, i.e., for compositing multiple input videos into one output video. (See S. Wenger and M. Horowitz, “Scattered Slices: A New Error Resilience Tool for H.26L,” JVT-B027, Document of Joint Video Team (JVT) of ITU-T SG16/Q.6 and ISO/IEC JTC 1/SC 29/WG 11 and ITU-T Recommendation H.264|ISO/IEC 14496-10). When all input video signals are coded using ITU-T Recommendation H.264|ISO/IEC 14496-10, no decoding and re-encoding may be needed in an MCU because the various input signals can be placed into the output picture of the MCU as separate slice groups. (See M. M. Hannuksela and Y. K. Wang, “Coding of Parameter Sets,” JVT-C078, Document of Joint Video Team (JVT) of ITU-T SG16/Q.6 and ISO/IEC JTC 1/SC 29/WG 11).
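The idea of placing input signals into the output picture as separate slice groups can be illustrated with a macroblock-to-slice-group map. The sketch below is a geometric illustration only, assuming 16x16 macroblocks, a CIF output picture (352x288), and four QCIF inputs (176x144) tiled in quadrants; the `slice_group_map` function and its layout arguments are hypothetical, and no bitstream syntax is produced or parsed.

```python
MB = 16            # macroblock size in luma samples
CIF = (352, 288)   # width, height
QCIF = (176, 144)


def mb_dims(width, height):
    """Picture dimensions in macroblock units."""
    return width // MB, height // MB


def slice_group_map(out_size, in_size, origins_mb):
    """Assign each output macroblock to the slice group of the input covering it.

    origins_mb[i] is the top-left corner (in macroblocks) where input i is
    placed. Uncovered macroblocks are marked -1. Overlapping or out-of-bounds
    placements raise ValueError.
    """
    out_w, out_h = mb_dims(*out_size)
    in_w, in_h = mb_dims(*in_size)
    grid = [[-1] * out_w for _ in range(out_h)]
    for group, (x0, y0) in enumerate(origins_mb):
        if x0 < 0 or y0 < 0 or x0 + in_w > out_w or y0 + in_h > out_h:
            raise ValueError(f"input {group} does not fit in the output picture")
        for y in range(y0, y0 + in_h):
            for x in range(x0, x0 + in_w):
                if grid[y][x] != -1:
                    raise ValueError(f"input {group} overlaps another input")
                grid[y][x] = group
    return grid


# Four QCIF inputs (11x9 macroblocks each) tile a CIF output (22x18) exactly.
quad = slice_group_map(CIF, QCIF, [(0, 0), (11, 0), (0, 9), (11, 9)])
for row in (quad[0], quad[9]):
    print("".join(str(g) for g in row))
```

Because each input occupies its own rectangular group of macroblocks in this layout, its coded macroblock data could in principle be carried over into the composite picture without being decoded, which is the motivation for coded domain composition described above.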
Consideration is now being given to improving conference server or MCU architectures for videoconferencing applications. In particular, attention is being directed toward developing server architectures for compositing one or more input video signals into a single output video signal, together with possible server-generated data, using coded domain composition techniques such as multiple slice groups. Desirable conference server architectures will support videoconferencing features such as continuous presence, personal view or layout, rate matching, error resilience, and random entry, and will avoid the complexity and delay overhead of the conventional MCU.