Modern videoconferencing systems allow two or more remote participants/endpoints to communicate video and audio with each other in real-time. When only two remote participants are involved, communications can be transmitted directly between the two endpoints over suitable electronic networks. When more than two participants/endpoints are involved, a Multipoint Conferencing Unit (MCU), or bridge, is commonly used to connect all the participants/endpoints. The MCU mediates communications between the multiple participants/endpoints, which may be connected, for example, in a star configuration. The MCU may also be used for point-to-point communication, to provide firewall traversal, rate matching, and other functions.
A videoconferencing system requires each user endpoint to be equipped with a device or devices that can encode and decode both video and audio. The encoder is used to transform local audio and video information into a form suitable for communicating to the other parties, whereas the decoder is used to decode and display the video images, or play back the audio, received from other videoconference participants. Traditionally, an end-user's own image is also displayed on his/her own display screen to provide feedback, for example, to ensure proper positioning of the person within the video window.
When more than two participants are present (and in some cases even with only two participants), one or more MCUs are typically used to coordinate communication between the various parties. The MCU's primary tasks are to mix the incoming audio signals so that a single audio stream is transmitted to all participants, and to mix the incoming video signals into a single video signal so that each of the participants is shown in a corresponding portion of a display frame of this mixed video signal.
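The audio-mixing task described above can be sketched in a few lines. This is a hypothetical illustration, not the MCU's actual implementation: each participant receives the sum of all other participants' audio (an "N-minus-one" mix), so no one hears an echo of their own voice. A real mixer would also clip or normalize the summed samples.

```python
def mix_audio(frames: dict) -> dict:
    """frames maps a participant id to one frame of PCM samples.
    Returns, per participant, the mix of everyone else's samples."""
    # Sum all incoming streams sample by sample.
    total = [sum(samples) for samples in zip(*frames.values())]
    mixes = {}
    for pid, samples in frames.items():
        # Subtract each participant's own contribution from the total.
        mixes[pid] = [t - s for t, s in zip(total, samples)]
    return mixes

out = mix_audio({"a": [1, 2], "b": [10, 20], "c": [100, 200]})
# Participant "a" hears b + c: [110, 220]
```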
Videoconferencing systems may use traditional video codecs, which are specified to provide a single bitstream at a given spatial resolution and bitrate. Examples include the codecs whose bitstreams and decoding operations are standardized in ITU-T Recommendation H.261; ITU-T Recommendation H.262|ISO/IEC 13818-2 (MPEG-2 Video) Main profile; ITU-T Recommendation H.263 baseline profile; ISO/IEC 11172-2 (MPEG-1 Video); ISO/IEC 14496-2 simple profile or advanced simple profile; and ITU-T Recommendation H.264|ISO/IEC 14496-10 (MPEG4-AVC) baseline, main, or high profile. In systems using such traditional video codecs, if a lower spatial resolution or lower bitrate is required for an encoded video signal (e.g., at a receiver endpoint) than the spatial resolution or bitrate at which it was originally encoded, then the full-resolution signal must be received and decoded, potentially downscaled, and re-encoded at the desired lower spatial resolution or bitrate. This process of decoding, potentially downsampling, and re-encoding requires significant computational resources and typically adds significant subjective distortion to the video signal and delay to the video transmission.
A video compression technique that has been developed explicitly for heterogeneous environments is scalable coding. In scalable codecs, two or more bitstreams are generated for a given source video signal: a base layer and one or more enhancement layers. The base layer offers a basic representation of the source signal at a given bitrate, spatial resolution, and temporal resolution; the video quality at that spatial and temporal resolution generally increases with the bitrate. The enhancement layer(s) offer additional bits that can be used to increase the video quality, spatial resolution, and/or temporal resolution.
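The base/enhancement-layer idea can be illustrated with a toy example of SNR (quality) scalability: the base layer carries a coarsely quantized version of a sample, and the enhancement layer carries a quantized refinement of the remaining error. The quantizer step sizes here are arbitrary choices for illustration, not values from any codec.

```python
def encode(sample: int, base_step: int = 16, enh_step: int = 4):
    """Split one sample into a coarse base layer and a refinement layer."""
    base = (sample // base_step) * base_step   # coarse representation
    residual = sample - base
    enh = (residual // enh_step) * enh_step    # quantized refinement
    return base, enh

def decode(base: int, enh=None) -> int:
    # Decoding the base layer alone yields a lower-quality signal;
    # adding the enhancement layer reduces the reconstruction error.
    return base if enh is None else base + enh

b, e = encode(75)
# base-only reconstruction: 64 (error 11); base + enhancement: 72 (error 3)
```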
Although scalable coding has been part of standards such as the SNR scalable, spatially scalable, and high profiles of ITU-T Recommendation H.262|ISO/IEC 13818-2 (MPEG-2 Video), it has not been widely used in the marketplace. The increased cost and complexity associated with scalable coding, as well as the lack of wide use of IP-based communication channels suitable for video, have been considerable impediments to widespread adoption of scalable-coding-based technology for practical videoconferencing applications.
Commonly assigned International patent application PCT/US06/028365, which is incorporated herein by reference in its entirety, discloses scalable video coding techniques specifically addressing practical videoconferencing applications. These scalable video coding techniques, or codecs, enable novel videoconferencing system architectures, which are further described in commonly assigned International patent applications PCT/US06/028366, PCT/US06/028367, PCT/US06/027368, PCT/US06/061815, and PCT/US06/62569, each of which is incorporated herein by reference in its entirety.
The Scalable Video Coding Server (SVCS) and Compositing Scalable Video Coding Server (CSVCS) MCU architectures described in PCT/US06/028366 and PCT/US06/62569 enable the adaptation of incoming video signals to the video resolutions requested for outgoing video signals according to the needs of the receiving participants. Compared to traditional MCUs, the SVCS and CSVCS architectures require only a small fraction of the computational resources, preserve the input video quality completely, and add only a small fraction of the delay in the transmission path.
Currently, an extension of ITU-T Recommendation H.264|ISO/IEC 14496-10 is being standardized which offers a more efficient rate-distortion trade-off than previously standardized scalable video codecs. This extension is called SVC (Scalable Video Coding).
An SVC bitstream typically represents multiple temporal, spatial, and SNR resolutions, each of which can be decoded. The multiple resolutions are represented by base layer Network Abstraction Layer (NAL) units and enhancement layer NAL units. The multiple resolutions of the same signal exhibit statistical dependencies and can be efficiently coded using prediction. Prediction is applied to macroblock modes (mb_type and, in the case of intra coding, prediction modes), motion information (motion vector, sub_mb_type, and picture reference index), as well as intra content and inter coding residuals, enhancing the rate-distortion performance of spatial and SNR scalability. The prediction for each of the elements described above is signaled in the enhancement layer through flags, i.e., only the data in lower layers that are signaled for prediction are needed for decoding the current layer.
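The flag-controlled inter-layer prediction described above can be sketched as follows. This is an illustrative model only: the flag and field names (`base_mode_flag`, `mb_type`) are assumed for the example, and real SVC signaling is defined by the bitstream syntax, not Python dictionaries.

```python
def decode_mb_mode(enh_mb: dict, base_mb: dict) -> str:
    """For one coded element (here, the macroblock mode), a flag in the
    enhancement layer selects between inheriting the element from the
    reference (lower) layer and decoding a newly transmitted value."""
    if enh_mb["base_mode_flag"]:
        # Inherit the mode from the reference layer; no mode is re-sent.
        return base_mb["mb_type"]
    # Otherwise the enhancement layer carries its own mode, as in H.264.
    return enh_mb["mb_type"]

base = {"mb_type": "INTER_16x16"}
inherited = decode_mb_mode({"base_mode_flag": True, "mb_type": None}, base)
# inherited == "INTER_16x16"
```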
Macroblock mode prediction is switched on a macroblock basis, indicating a choice between transmitting a new macroblock mode (as in H.264) and utilizing the macroblock mode in the reference. In SVC, the reference can be from the same layer, but can also be a lower layer macroblock.
Motion information prediction is switched, on a macroblock or an 8×8 block basis, between inter-picture motion vector prediction as in H.264 and, in the case of SVC, inter-layer motion vector prediction from a reference layer. For the latter prediction type, the motion information from the base layer or layers with higher priority is re-used (for SNR scalability) or scaled (for spatial scalability) as a predictor. In addition to the prediction switch, a motion vector refinement may be transmitted.
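The inter-layer motion vector prediction described above can be sketched for the dyadic case. This is a simplified model under stated assumptions: a uniform resolution ratio, and a refinement vector simply added to the scaled predictor; the actual SVC scaling and refinement syntax is normative and more involved.

```python
def predict_mv(base_mv, spatial_ratio=2, refinement=(0, 0)):
    """base_mv: (mvx, mvy) from the base layer or a higher-priority layer.
    For SNR scalability spatial_ratio is 1 (the vector is re-used);
    for spatial scalability it is the resolution ratio (e.g., 2).
    An optional transmitted refinement is added to the predictor."""
    return (base_mv[0] * spatial_ratio + refinement[0],
            base_mv[1] * spatial_ratio + refinement[1])

# A base-layer vector (3, -1) predicts (6, -2) at twice the resolution;
# a transmitted refinement of (1, 0) yields the final vector (7, -2).
```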
Inter coding residual prediction, which is switched on/off on a macroblock basis, re-uses (for SNR scalability) or up-samples (for spatial scalability) the inter coding residuals from a base layer or layers with higher priority as a predictor; a residual signal may additionally be transmitted as an SNR enhancement to the predictor.
Similarly, intra content prediction, which is switched on/off on a macroblock basis, directly re-uses (for SNR scalability) or up-samples (for spatial scalability) the intra-coded signal from a base layer or layers with higher priority as a predictor; a residual signal may additionally be transmitted as an SNR enhancement to the predictor.
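The up-sample-and-refine structure shared by the two prediction types above can be sketched as follows. A simple 2× nearest-neighbor up-sampler stands in for the normative SVC up-sampling filter, which it does not reproduce; blocks are plain lists of sample rows.

```python
def upsample2x(block):
    """Up-sample a base-layer block to twice the resolution (illustrative
    nearest-neighbor filter, not the normative SVC filter)."""
    out = []
    for row in block:
        wide = [v for v in row for _ in (0, 1)]  # repeat each sample horizontally
        out.append(wide)
        out.append(list(wide))                   # repeat the row vertically
    return out

def reconstruct(base_block, residual):
    """Enhancement-layer reconstruction: up-sampled base-layer predictor
    plus a transmitted residual (the SNR enhancement)."""
    pred = upsample2x(base_block)
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(pred, residual)]
```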
As is known in the prior art, an SVC bitstream may be decodable at multiple temporal, spatial, and SNR resolutions. In video conferencing, a participant is only interested in a particular resolution. Hence, the data necessary to decode this resolution must be present in the received bitstream. All other data can be discarded at any point in the path from the transmitting participant to the receiving participant, including at the transmitting participant's encoder, and typically at an SVCS/CSVCS. When data transmission errors are expected, however, it may be beneficial to include additional data (e.g., part of the base layer signal) to facilitate error recovery and error concealment.
For resolutions higher than the resolution currently decoded at a receiver, complete packets (NAL units) can be discarded (typically by an SVCS/CSVCS), such that only packets containing the currently decoded resolution remain in the bitstream sent to the receiver. Furthermore, packets on which the decoding of the current resolution does not depend can be discarded even when they are assigned to lower resolutions. In both cases, high-level syntax elements (from the NAL header information) can be used to identify which packets can be discarded.
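The packet-level thinning an SVCS/CSVCS can perform may be sketched as follows. The field names mirror the SVC scalability identifiers (dependency/spatial, quality/SNR, temporal), but this simple threshold model is illustrative only: it captures the "keep only what the requested resolution needs" idea and not the finer dependency analysis mentioned above.

```python
from typing import NamedTuple

class NalUnit(NamedTuple):
    dependency_id: int   # spatial layer
    quality_id: int      # SNR layer
    temporal_id: int     # temporal layer
    payload: bytes

def thin(stream, max_d, max_q, max_t):
    """Keep only the NAL units at or below the receiver's requested
    operation point; everything above it is simply dropped."""
    return [n for n in stream
            if n.dependency_id <= max_d
            and n.quality_id <= max_q
            and n.temporal_id <= max_t]

stream = [NalUnit(0, 0, 0, b""), NalUnit(1, 0, 0, b""), NalUnit(0, 1, 1, b"")]
base_only = thin(stream, 0, 0, 0)   # only the base-layer NAL unit survives
```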
Consideration is now being given to alternate or improved architectures for videoconferencing systems that use SVC coding techniques for video signals. In particular, attention is being directed to architectures that provide flexibility in processing SVC bit-streams.