Providing high quality digital video communications between senders and receivers over packet-based modern communication networks (e.g., a network based on the Internet Protocol (IP)) is technically challenging, at least due to the fact that data transport on such networks is typically carried out on a best-effort basis. Transmission errors in modern communication networks generally manifest themselves as packet losses and not as bit errors, which were characteristic of earlier communication systems. The packet losses often are the result of congestion in intermediary routers, and not the result of physical layer errors.
When a transmission error occurs in a digital video communication system, it is important to ensure that the receiver can quickly recover from the error and return to an error-free display of the incoming video signal. However, in typical digital video communication systems, the receiver's robustness is reduced by the fact that the incoming data is heavily compressed in order to conserve bandwidth. Further, the video compression techniques employed in the communication systems (e.g., state-of-the-art codecs such as ITU-T H.264 and H.263 or ISO MPEG-2 and MPEG-4) can create a strong temporal dependency between sequential video packets or frames. In particular, the use of motion-compensated prediction (e.g., involving P or B frames) in these codecs creates a chain of frame dependencies in which a displayed frame depends on past frame(s). The chain of dependencies can extend all the way to the beginning of the video sequence. As a result of the chain of dependencies, the loss of a given packet can affect the decoding of a number of the subsequent packets at the receiver. Error propagation due to the loss of the given packet terminates only at an “intra” (I) refresh point, or at a frame that does not use any temporal prediction at all.
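The effect of the dependency chain can be illustrated with a minimal sketch. It assumes a simplified sequence containing only I and P frames, in which every P frame predicts from the immediately preceding frame; the function name and the frame pattern are hypothetical and chosen only for illustration.

```python
def affected_frames(frame_types, lost_index):
    """Return indices of frames that cannot be decoded correctly after a loss.

    Simplifying assumption: the sequence holds only I and P frames, and each
    P frame predicts from the frame immediately before it.
    """
    affected = []
    for i in range(lost_index, len(frame_types)):
        # An intra frame after the loss uses no temporal prediction,
        # so error propagation stops there.
        if i > lost_index and frame_types[i] == "I":
            break
        affected.append(i)
    return affected


sequence = list("IPPPPIPPP")  # hypothetical frame pattern
print(affected_frames(sequence, 2))  # loss of a P frame
print(affected_frames(sequence, 5))  # loss of the intra frame itself
```

In this model, losing a P frame corrupts every frame up to the next I frame, while losing the I frame itself corrupts everything that follows it until another I frame arrives.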
Error resilience in digital video communication systems requires having at least some level of redundancy in the transmitted signals. However, this requirement is contrary to the goals of video compression techniques, which strive to eliminate or minimize redundancy in the transmitted signals.
On a network that offers differentiated services (e.g., DiffServ IP-based networks, private networks over leased lines, etc.), a video data communication application may exploit network features to deliver some or all of the video signal data in a lossless or nearly lossless manner to a receiver. However, in an arbitrary best-effort network (such as the Internet) that has no provision for differentiated services, a data communication application has to rely on its own features for achieving error resilience. Known techniques (e.g., the Transmission Control Protocol, TCP) that are useful in text or alpha-numeric data communications are not appropriate for video or audio communications, which have the added constraint of low end-to-end delay arising out of human interface requirements. For example, TCP keeps retransmitting data until it receives confirmation that all data has been received, even if this involves a delay of several seconds. Such behavior is acceptable for text or alpha-numeric data transport, but it is inappropriate for video data transport in a live or interactive videoconferencing application because the end-to-end delay, which is unbounded, would be unacceptable to participants.
An aspect of error resilience in video communication systems relates to random access (e.g., when a receiver joins an existing transmission of a video signal), which has a considerable impact on compression efficiency. Instances of random access are, for example, a user who joins a videoconference, or a user who tunes in to a broadcast. Such a user would have to find a suitable point in the incoming bitstream signal to start decoding and be synchronized with the encoder. A random access point is effectively an error resilience feature, since any error propagation terminates at that point (i.e., it is an error recovery point). Thus, a coding scheme that provides good random access support will generally have an error resilience technique that provides for faster error recovery. However, the converse depends on the specific assumptions about the duration and extent of the errors that the error resilience technique is designed to address. The error resilience technique may assume that some state information is available at the receiver at the time an error occurs. In such a case, the error resilience technique does not assure good random access support.
In MPEG-2 video codecs for digital television systems (digital cable TV or satellite TV), I pictures are used at periodic intervals (typically 0.5 sec) to enable fast switching into a stream. The I pictures, however, are considerably larger than their P or B counterparts (typically by 3-6 times) and are thus to be avoided, especially in low bandwidth and/or low delay applications.
In interactive applications such as videoconferencing, the concept of requesting an intra update is often used for error resilience. In operation, the update involves a request from the receiver to the sender for an intra picture transmission, which enables the decoder to be synchronized. The bandwidth overhead of this operation is significant, and it is incurred each time packet errors occur. If the packet losses are caused by congestion, then the use of the intra pictures only exacerbates the congestion problem.
Another traditional technique for error robustness, which has been used in the past to mitigate drift caused by mismatch in IDCT implementations (e.g., in the H.261 standard), is to periodically code each macroblock in intra mode. The H.261 standard requires forced intra coding of a macroblock at least once every 132 times that macroblock is transmitted.
The coding efficiency decreases with increasing percentage of macroblocks that are forced to be coded as intra in a given frame. Conversely, when this percentage is low, the time to recover from a packet loss increases. The forced intra coding process requires extra care to avoid motion-related drift, which further limits the encoder's performance since some motion vector values have to be avoided, even if they are the most effective.
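The trade-off between the intra-coded share of each frame and the recovery time can be made concrete with a rough sketch. It assumes a cyclic refresh schedule in which each frame intra codes a distinct group of macroblocks, and it ignores the motion-related drift discussed above; the function name, the CIF picture size, and the 30 fps frame rate are illustrative assumptions.

```python
def full_refresh_frames(total_macroblocks, intra_per_frame):
    """Frames needed until every macroblock has been intra coded once.

    Assumes a cyclic schedule in which each frame refreshes a distinct
    group of macroblocks (an illustrative assumption).
    """
    # Integer ceiling division: a partial final pass still costs one frame.
    return -(-total_macroblocks // intra_per_frame)


CIF_MACROBLOCKS = 22 * 18  # 352x288 luma samples in 16x16 macroblocks = 396
FPS = 30                   # assumed frame rate

for intra_per_frame in (11, 33, 99):
    frames = full_refresh_frames(CIF_MACROBLOCKS, intra_per_frame)
    share = intra_per_frame / CIF_MACROBLOCKS
    print(f"{share:.1%} intra per frame -> refresh in {frames} frames "
          f"({frames / FPS:.2f} s)")
```

Under these assumptions, raising the number of forced intra macroblocks per frame shortens the worst-case recovery time in direct proportion, at the cost of a larger intra-coded share in every frame.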
In addition to traditional, single-layer codecs, layered or scalable coding is a well-known technique in multimedia data encoding. Scalable coding is used to generate two or more “scaled” bitstreams collectively representing a given medium in a bandwidth-efficient manner. Scalability can be provided in a number of different dimensions, namely temporal, spatial, and quality (the last also referred to as SNR, or “Signal-to-Noise Ratio,” scalability). For example, a video signal may be scalably coded in different layers at CIF and QCIF resolutions, and at frame rates of 7.5, 15, and 30 frames per second (fps). Depending on the codec's structure, any combination of spatial resolutions and frame rates may be obtainable from the codec bitstream. The bits corresponding to the different layers can be transmitted as separate bitstreams (i.e., one stream per layer) or they can be multiplexed together in one or more bitstreams. For convenience in description herein, the coded bits corresponding to a given layer may be referred to as that layer's bitstream, even if the various layers are multiplexed and transmitted in a single bitstream. Codecs specifically designed to offer scalability features include, for example, MPEG-2 (ISO/IEC 13818-2, also known as ITU-T H.262) and the H.264 Scalable Video Coding extension currently under development (known as ITU-T H.264 Annex G or MPEG-4 Part 10 SVC). Scalable video coding (SVC) techniques specifically designed for video communication are described in commonly assigned international patent application No. PCT/US06/028365 “SYSTEM AND METHOD FOR SCALABLE AND LOW-DELAY VIDEOCONFERENCING USING SCALABLE VIDEO CODING”.
It is noted that even codecs that are not specifically designed to be scalable can exhibit scalability characteristics in the temporal dimension. For example, consider an MPEG-2 Main Profile codec, a non-scalable codec, which is used in DVDs and digital TV environments. Further, assume that the codec is operated at 30 fps and that a GOP structure of IBBPBBPBBPBBPBB (period N=15 frames) is used. By sequential elimination of the B pictures, followed by elimination of the P pictures, it is possible to derive a total of three temporal resolutions: 30 fps (all picture types included), 10 fps (I and P only), and 2 fps (I only). The sequential elimination process results in a decodable bitstream because the MPEG-2 Main Profile codec is designed so that coding of the P pictures does not rely on the B pictures, and similarly coding of the I pictures does not rely on other P or B pictures. In the following, single-layer codecs with temporal scalability features are considered to be a special case of scalable video coding, and are thus included in the term scalable video coding, unless explicitly indicated otherwise.
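The three temporal resolutions in this example follow directly from counting the picture types in the GOP. The short sketch below reproduces that arithmetic; the function and variable names are hypothetical, and only the GOP pattern and the 30 fps rate are taken from the text.

```python
GOP = "IBBPBBPBBPBBPBB"  # GOP structure from the text, period N = 15
FULL_RATE_FPS = 30


def temporal_layer_fps(gop, kept_types, full_rate_fps):
    """Frame rate that remains when only the given picture types are kept."""
    kept = sum(1 for picture in gop if picture in kept_types)
    return full_rate_fps * kept / len(gop)


# Drop B pictures first, then P pictures, as described above.
for kept_types in ("IPB", "IP", "I"):
    print(kept_types, temporal_layer_fps(GOP, kept_types, FULL_RATE_FPS))
```

The GOP holds one I picture, four P pictures, and ten B pictures, so keeping I and P retains 5 of 15 pictures (10 fps) and keeping only I retains 1 of 15 (2 fps).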
Scalable codecs typically have a pyramidal bitstream structure in which one of the constituent bitstreams (called the “base layer”) is essential in recovering the original medium at some basic quality. Use of one or more of the remaining bitstreams (called the “enhancement layers”) along with the base layer increases the quality of the recovered medium. Data losses in the enhancement layers may be tolerable, but data losses in the base layer can cause significant distortions or complete loss of the recovered medium.
Scalable codecs pose challenges similar to those posed by single layer codecs for error resilience and random access. However, the coding structures of the scalable codecs have unique characteristics that are not present in single layer video codecs. Further, unlike single layer coding, scalable coding may involve switching from one scalability layer to another (e.g., switching back and forth between CIF and QCIF resolutions).
Simulcasting is a coding solution for videoconferencing that is less complex than scalable video coding but has some of the advantages of the latter. In simulcasting, two different versions of the source are encoded (e.g., at two different spatial resolutions) and transmitted. Each version is independent, in that its decoding does not depend on reception of the other version. Like scalable and single-layer coding, simulcasting poses random access and robustness issues. In the following, simulcasting is considered a special case of scalable coding (where no inter-layer prediction is performed) and both are referred to simply as scalable video coding techniques unless explicitly indicated otherwise.
Specific techniques for providing error resilience and random access in video communication systems are described in commonly assigned International patent application Nos. PCT/US06/061815, “SYSTEMS AND METHODS FOR ERROR RESILIENCE AND RANDOM ACCESS IN VIDEO COMMUNICATIONS SYSTEMS,” and PCT/US07/063335, “SYSTEM AND METHOD FOR PROVIDING ERROR RESILIENCE, RANDOM ACCESS, AND RATE CONTROL IN SCALABLE VIDEO COMMUNICATIONS.” Among other things, these patent applications disclose the concept of LR pictures, i.e., pictures that constitute the lowest temporal layer of a scalably coded video signal (at the lowest spatial or quality resolution) and which are transmitted reliably from a sender to a receiver. Reliable transmission of the LR pictures ensures a minimum level of quality at a receiving decoder. A receiver can immediately detect if an LR picture has been lost and take steps to obtain the lost picture (e.g., by requesting its retransmission from the sender) using, for example, a “key picture indices” mechanism, which is also disclosed in International patent application No. PCT/US06/061815. It is noted that the sender and receiver are not necessarily the encoder and decoder, respectively, but may be a Scalable Video Communication Server (SVCS) as disclosed in commonly assigned International patent application No. PCT/US06/028366, a Compositing SVCS (CSVCS) as disclosed in commonly assigned International patent application No. PCT/US06/62569, or a Multicast SVCS (MSVCS) as disclosed in commonly assigned International patent application No. PCT/US07/80089.
A potential limitation of the systems and methods described in International patent application No. PCT/US06/061815 occurs when the lowest temporal level pictures are transported over more than one packet. This may occur, for example, in coding high-definition video, where each frame may be transported using more than one transport-layer packet, or when a picture is coded using more than one slice and each slice is transported in its own packet. In these cases, all packets belonging to the same frame will have the same key picture index. If all slices are lost due to packet losses in the network, then a receiver can properly detect the loss of the entire picture and initiate corrective action. If, however, some or all of the slices are received, then a receiver cannot immediately infer whether the received slices contain the entire picture or only a partial picture, unless it proceeds to decode the slice data. This inference is straightforward in a receiver that decodes the received data, but it presents significant complexity for an intermediate receiver (e.g., an SVCS, CSVCS, or MSVCS, or any Media-Aware Network Element (MANE)) that is normally not equipped to perform decoding of the video data.
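The limitation can be illustrated with a deliberately simplified model of index-based loss detection. The sketch assumes that the key picture index increases by one per lowest-temporal-level picture and never wraps around, and that a receiver inspects only the index carried by each packet; this is an illustrative assumption and not the exact mechanism of the referenced application. All names are hypothetical.

```python
def missing_key_picture_indices(received_indices):
    """Return key picture indices that never appear in the received packets.

    Simplified model (an assumption): indices increase by one per picture
    and do not wrap around.
    """
    seen = set(received_indices)
    return [i for i in range(min(seen), max(seen) + 1) if i not in seen]


# Each tuple is (key_picture_index, slice_number); three slices per picture,
# and every slice travels in its own packet.
sent = [(idx, s) for idx in range(4) for s in range(3)]

# Case 1: every slice of picture 2 is lost, so the index gap reveals the loss.
case1 = [p for p in sent if p[0] != 2]
print(missing_key_picture_indices(idx for idx, _ in case1))

# Case 2: only one slice of picture 2 is lost, so no index is missing and
# the partial picture goes unnoticed without decoding the slice data.
case2 = [p for p in sent if p != (2, 1)]
print(missing_key_picture_indices(idx for idx, _ in case2))
```

In the first case the receiver sees indices 0, 1, and 3 and can flag picture 2 as lost. In the second case all four indices are present, so an intermediate receiver that looks only at the index has no indication that one slice of picture 2 is missing.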
Consideration is now being given to improving the error resilience of coded bitstreams in video communications systems. Attention is directed towards developing error resilience techniques that have a minimal impact on end-to-end delay and on the bandwidth used by the system, and that address the possibility of fragmentation of coded video data into multiple slices. Desirable error resilience techniques will be applicable to both scalable and single-layer video coding.