Transmission of digital video on packet-based networks such as those based on the Internet Protocol (IP) is extremely challenging, at least because data transport is typically done on a best-effort basis. In modern packet-based communication systems, errors typically manifest as packet losses rather than bit errors. Furthermore, such packet losses are typically the result of congestion in intermediary routers, and not the result of physical layer errors (one exception to this is wireless and cellular networks). When an error in transmission or receipt of a video signal occurs, it is important to ensure that the receiver can quickly recover from the error and return to an error-free display of the incoming video signal. However, in typical digital video communication systems, the receiver's robustness is reduced by the fact that the incoming data is heavily compressed in order to conserve bandwidth. Further, the video compression techniques employed in the communication systems (e.g., state-of-the-art codecs ITU-T H.264 and H.263 or ISO MPEG-2 and MPEG-4 codecs) can create a very strong temporal dependency between sequential video packets or frames. In particular, the use of motion compensated prediction in these codecs (e.g., involving the use of P or B frames) creates a chain of frame dependencies in which a displayed frame depends on past frame(s). The chain of dependencies can extend all the way to the beginning of the video sequence. As a result of the chain of dependencies, the loss of a given packet can affect the decoding of a number of the subsequent packets at the receiver. Error propagation due to the loss of the given packet terminates only at an “intra” (I) refresh point, or at a frame that does not use any temporal prediction at all.
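The error propagation described above can be illustrated with a minimal sketch. It assumes a simplified model that is not part of the text: each frame is carried in one packet, every P frame predicts only from the immediately preceding frame, and B frames are not used. The function name `corrupted_frames` is hypothetical.

```python
def corrupted_frames(frame_types, lost_index):
    """Return the indices of frames that cannot be decoded correctly.

    Simplified model (an assumption): each P frame depends only on the
    frame just before it, and an I frame uses no temporal prediction.
    """
    corrupted = []
    for i in range(lost_index, len(frame_types)):
        if i > lost_index and frame_types[i] == "I":
            break  # intra refresh point: error propagation stops here
        corrupted.append(i)
    return corrupted


frames = list("IPPPPPPPIPPP")
# Losing frame 2 damages every frame up to, but not including, the next I frame.
print(corrupted_frames(frames, 2))
```

In this model a single loss damages all frames up to the next I frame, which is why the spacing of intra refresh points bounds the recovery time.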
Error resilience in digital video communication systems requires having at least some level of redundancy in the transmitted signals. However, this requirement is contrary to the goals of video compression techniques, which strive to eliminate or minimize redundancy in the transmitted signals.
On a network that offers differentiated services (e.g., DiffServ IP-based networks, private networks over leased lines, etc.), a video data communication application may exploit network features to deliver some or all of the video signal data in a lossless or nearly lossless manner to a receiver. However, in an arbitrary best-effort network (such as the Internet) that has no provision for differentiated services, a data communication application has to rely on its own features for achieving error resilience. Known techniques (e.g., the Transmission Control Protocol, TCP) that are useful in generic data communications are not appropriate for video or audio communications, which have the added constraint of low end-to-end delay arising out of human interface requirements. For example, TCP techniques may be used for error resilience in data transport using the File Transfer Protocol. TCP keeps retransmitting data until it receives confirmation that all data has been received, even if this involves a delay of several seconds. However, TCP is inappropriate for video data transport in a live or interactive videoconferencing application because the end-to-end delay, which is unbounded, would be unacceptable to participants.
A related problem is that of random access. Assume that a receiver joins an existing transmission of a video signal. Typical instances are a user who joins a videoconference or a user who tunes in to a broadcast. Such a user would have to find a point in the incoming bitstream where he/she can start decoding and be in synchronization with the encoder. Providing such random access points, however, has a considerable impact on compression efficiency. Note that a random access point is, by definition, an error resilience feature since at that point any error propagation terminates (i.e., it is an error recovery point). Hence, the better the random access support provided by a particular coding scheme, the faster the error recovery that the coding scheme can provide. The converse may not always be true; it depends on the assumptions made about the duration and extent of the errors that the error resilience technique has been designed to address. For error resilience, some state information could be assumed to be available at the receiver at the time the error occurred.
As an example, in MPEG-2 video codecs for digital television systems (digital cable TV or satellite TV), I pictures are used at periodic intervals (typically 0.5 sec) to enable fast switching into a stream. The I pictures, however, are considerably larger than their P or B counterparts (typically by 3-6 times) and are thus to be avoided, especially in low bandwidth and/or low delay applications.
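The cost of periodic I pictures can be estimated with a rough calculation. The sketch below assumes, purely for illustration, a frame rate of 30 fps and that all non-I pictures have the same size; neither assumption comes from the text, and the function name `intra_overhead` is hypothetical.

```python
def intra_overhead(gop_length, intra_to_inter_ratio):
    """Bit rate of a stream with one I picture per GOP, relative to a
    hypothetical stream of equal length made only of inter pictures.

    Assumes every inter picture has the same size (a simplification).
    """
    return (intra_to_inter_ratio + (gop_length - 1)) / gop_length


FPS = 30  # assumed frame rate
gop_length = int(FPS * 0.5)  # one I picture every 0.5 s
for ratio in (3, 6):  # I picture 3x and 6x the size of an inter picture
    print(ratio, round(intra_overhead(gop_length, ratio), 3))
```

Under these assumptions the periodic I pictures raise the average bit rate by roughly 13% to 33%, and they also produce a burst every half second that a low-delay channel must absorb.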
In interactive applications such as videoconferencing, the concept of requesting an intra update is often used for error resilience. In operation, the update involves a request from the receiver to the sender for an intra picture transmission, which enables the decoder to be synchronized. The bandwidth overhead of this operation is significant. Moreover, this overhead is incurred whenever packet errors occur. If the packet losses are caused by congestion, then the use of the intra pictures only exacerbates the congestion problem.
Another traditional technique for error resilience, which has been used in the past (e.g., in the H.261 standard) to mitigate drift caused by mismatch in IDCT implementations, is to periodically code each macroblock in intra mode. The H.261 standard requires forced intra coding every 132 times a macroblock is transmitted.
The coding efficiency decreases with increasing percentage of macroblocks that are forced to be coded as intra in a given frame. Conversely, when this percentage is low, the time to recover from a packet loss increases. The forced intra coding process requires extra care to avoid motion-related drift, which further limits the encoder's performance since some motion vector values have to be avoided, even if they are the most effective.
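The trade-off between the forced intra percentage and the recovery time can be sketched as follows. The example assumes a round-robin refresh order, a QCIF picture (176x144 pixels, i.e., 99 macroblocks of 16x16 pixels), and a frame rate of 30 fps; these are illustrative assumptions, and the function name `recovery_frames` is hypothetical.

```python
import math


def recovery_frames(total_macroblocks, forced_intra_per_frame):
    """Frames needed for a round-robin forced intra refresh to code
    every macroblock in intra mode at least once."""
    return math.ceil(total_macroblocks / forced_intra_per_frame)


QCIF_MACROBLOCKS = (176 // 16) * (144 // 16)  # 11 x 9 macroblocks
FPS = 30  # assumed frame rate

for per_frame in (3, 11, 33):
    frames = recovery_frames(QCIF_MACROBLOCKS, per_frame)
    share = per_frame / QCIF_MACROBLOCKS
    print(f"{share:.0%} intra per frame -> {frames} frames ({frames / FPS:.2f} s)")
```

The figures are a lower bound on the recovery time: the sketch ignores the motion-related drift mentioned above, in which an already refreshed macroblock is predicted from a region that is still corrupted.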
In addition to traditional, single-layer codecs, layered or scalable coding is a well-known technique in multimedia data encoding. Scalable coding is used to generate two or more “scaled” bitstreams collectively representing a given medium in a bandwidth-efficient manner. Scalability can be provided in a number of different dimensions, namely temporal, spatial, and quality (also referred to as SNR “Signal-to-Noise Ratio” scalability or fidelity scalability). For example, a video signal may be scalably coded in different layers at CIF and QCIF resolutions, and at frame rates of 7.5, 15, and 30 frames per second (fps). Depending on the codec's structure, any combination of spatial resolutions and frame rates may be obtainable from the codec bitstream. The bits corresponding to the different layers can be transmitted as separate bitstreams (i.e., one stream per layer) or they can be multiplexed together in one or more bitstreams. For convenience in description herein, the coded bits corresponding to a given layer may be referred to as that layer's bitstream, even if the various layers are multiplexed and transmitted in a single bitstream. Codecs specifically designed to offer scalability features include, for example, MPEG-2 (ISO/IEC 13818-2, also known as ITU-T H.262) and the currently developed SVC (known as ITU-T H.264 Annex G or MPEG-4 Part 10 SVC). Scalable coding techniques specifically designed for video communication are described in commonly assigned international patent application No. PCT/US06/028365, “SYSTEM AND METHOD FOR SCALABLE AND LOW-DELAY VIDEOCONFERENCING USING SCALABLE VIDEO CODING”.
It is noted that even codecs that are not specifically designed to be scalable can exhibit scalability characteristics in the temporal dimension. For example, consider an MPEG-2 Main Profile codec, a non-scalable codec, which is used in DVDs and digital TV environments. Further, assume that the codec is operated at 30 fps and that a group of pictures (GOP) structure of IBBPBBPBBPBBPBB (period N=15 frames) is used. By sequential elimination of the B pictures, followed by elimination of the P pictures, it is possible to derive a total of three temporal resolutions: 30 fps (all picture types included), 10 fps (I and P only), and 2 fps (I only). The sequential elimination process results in a decodable bitstream because the MPEG-2 Main Profile codec is designed so that coding of the P pictures does not rely on the B pictures, and similarly coding of the I pictures does not rely on other P or B pictures. In the following, single-layer codecs with temporal scalability features are considered to be a special case of scalable video coding, and are thus included in the term scalable video coding, unless explicitly indicated otherwise.
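The three temporal resolutions quoted above follow directly from counting picture types in the GOP. The short sketch below reproduces the arithmetic; the function name `frame_rate` is hypothetical.

```python
GOP = "IBBPBBPBBPBBPBB"  # GOP structure from the text, period N=15
FULL_RATE_FPS = 30


def frame_rate(kept_types, gop=GOP, full_rate=FULL_RATE_FPS):
    """Frame rate that remains when only the given picture types are kept."""
    kept = sum(1 for picture_type in gop if picture_type in kept_types)
    return full_rate * kept / len(gop)


for kept_types in ("IPB", "IP", "I"):
    print(kept_types, frame_rate(kept_types))  # 30.0, 10.0 and 2.0 fps
```

Keeping only the I and P pictures retains 5 of every 15 pictures, and keeping only the I pictures retains 1 of every 15, which gives 10 fps and 2 fps respectively.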
Scalable codecs typically have a pyramidal bitstream structure in which one of the constituent bitstreams (called the “base layer”) is essential in recovering the original medium at some basic quality. Use of one or more of the remaining bitstreams (hereinafter called “the enhancement layer(s)”) along with the base layer increases the quality of the recovered medium. Data losses in the enhancement layers may be tolerable, but data losses in the base layer can cause significant distortions or complete loss of the recovered medium.
Scalable codecs pose challenges similar to those posed by single layer codecs for error resilience and random access. However, the coding structures of the scalable codecs have unique characteristics that are not present in single layer video codecs. Further, unlike single layer coding, scalable coding may involve switching from one scalability layer to another (e.g., switching back and forth between CIF and QCIF resolutions). Instantaneous switching between different resolutions, with very little bit rate overhead, is desirable for random access in scalable coding systems in which multiple signal resolutions (spatial/temporal/quality) may be available from the encoder.
A problem related to those of error resilience and random access is that of rate control. The output of a typical video encoder has a variable bit rate, due to the extensive use of prediction, transform and entropy coding techniques. In order to construct a constant bit rate stream, buffer-constrained rate control is typically employed in a video communication system. In such a system, an output buffer at the encoder is assumed, which is emptied at a constant rate (the channel rate); the encoder monitors the buffer's occupancy and makes parameter selections (e.g., quantizer step size) in order to avoid buffer overflow or underflow. Such a rate control mechanism, however, can only be applied at the encoder, and further assumes that the desired output rate is known. In some video communication applications, including videoconferencing, it is desirable that such rate control decisions are made at an intermediate gateway (e.g., at a Multipoint Control Unit—MCU), which is situated between the sender and the receiver. Bitstream-level manipulation, or transcoding, can be used at the gateway, but at considerable processing and complexity cost. It is therefore desirable to employ a technique that achieves rate control without requiring any additional processing at the intermediate gateway.
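The buffer-constrained rate control loop described above can be sketched as a toy simulation. The model is an assumption made only for illustration: a frame of complexity c coded with quantizer step q is taken to produce c / q bits, and the encoder steers the buffer toward half occupancy with a simple proportional rule. The names `simulate_rate_control`, `Q_MIN`, `Q_MAX`, and `gain` are hypothetical and do not correspond to any particular codec.

```python
Q_MIN, Q_MAX = 1.0, 64.0  # hypothetical quantizer step range


def simulate_rate_control(complexities, channel_bits, buffer_size, gain=0.5):
    """Toy buffer-constrained rate control at the encoder.

    Assumed model: a frame of complexity c coded with step q yields c / q bits.
    Returns one (qstep, bits, occupancy) tuple per frame.
    """
    occupancy = 0.0
    trace = []
    for c in complexities:
        # Spend more bits when the buffer is below half full, fewer when above.
        target_bits = channel_bits + gain * (buffer_size / 2 - occupancy)
        target_bits = max(target_bits, 1.0)
        qstep = min(max(c / target_bits, Q_MIN), Q_MAX)
        bits = c / qstep
        # The encoder adds the coded frame; the channel drains a fixed amount.
        occupancy += bits - channel_bits
        # Clip to the physical buffer limits (clipping marks under/overflow).
        occupancy = min(max(occupancy, 0.0), buffer_size)
        trace.append((qstep, bits, occupancy))
    return trace


trace = simulate_rate_control([80_000] * 30, channel_bits=5_000, buffer_size=20_000)
for qstep, bits, occupancy in trace[:5]:
    print(f"q={qstep:.2f} bits={bits:.0f} occupancy={occupancy:.0f}")
```

In this sketch the quantizer step settles at the value for which each frame produces exactly the channel rate. The loop needs both the target rate and access to the encoder's parameter choices, which is why, as noted above, it cannot simply be moved to an intermediate gateway.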
Consideration is now being given to improving error resilience, capabilities for random access to the coded bitstreams, and rate control in video communication systems. Attention is directed to developing error resilience, rate control, and random access techniques that have a minimal impact on end-to-end delay and on the bandwidth used by the system.