This invention relates to systems and methods for coding video data, and more particularly, to motion-compensation-based video coding schemes that employ error resilience techniques in the enhancement layer bitstream.
Efficient and reliable delivery of video data is increasingly important as the Internet and wireless channel networks continue to grow in popularity. Video is very appealing because it offers a much richer user experience than static images and text. It is more interesting, for example, to watch a video clip of a winning touchdown or a Presidential speech than it is to read about the event in stark print. Unfortunately, video data is significantly larger than other data types commonly delivered over the Internet. As an example, one second of uncompressed video data may consume one or more Megabytes of data.
Delivering such large amounts of data over error-prone networks, such as the Internet and wireless networks, presents difficult challenges in terms of both efficiency and reliability. These challenges arise as a result of inherent causes such as bandwidth fluctuations, packet losses, and channel errors. For most Internet applications, packet loss is a key factor that affects the decoded visual quality. For wireless applications, wireless channels are typically noisy and suffer from a number of channel degradations, such as random errors and burst errors, due to fading and multiple path reflections. Although the Internet and wireless channels exhibit different kinds of degradation, the harm to the video bitstream is the same: the loss of one or more video packets may render some consecutive macroblocks and frames undecodable.
To promote efficient delivery, video data is typically encoded prior to delivery to reduce the amount of data actually being transferred over the network. Image quality is lost as a result of the compression, but such loss is generally tolerated as necessary to achieve acceptable transfer speeds. In some cases, the loss of quality may not even be detectable to the viewer.
Video compression is well known. One common type of video compression is a motion-compensation-based video coding scheme, which is used in such coding standards as MPEG-1, MPEG-2, MPEG-4, H.261, and H.263.
One particular type of motion-compensation-based video coding scheme is a layer-based coding scheme, such as fine-granularity layered coding. Layered coding is a family of signal representation techniques in which the source information is partitioned into sets called “layers”. The layers are organized so that the lowest, or “base layer”, contains the minimum information for intelligibility. The other layers, called “enhancement layers”, contain additional information that incrementally improves the overall quality of the video. With layered coding, lower layers of video data are often used to predict one or more higher layers of video data.
The quality at which digital video data can be served over a network varies widely depending upon many factors, including the coding process and transmission bandwidth. “Quality of Service”, or simply “QoS”, is the moniker used to generally describe the various quality levels at which video can be delivered. Layered video coding schemes offer a wide range of QoSs that enable applications to adapt to different video qualities. For example, applications designed to handle video data sent over the Internet (e.g., multi-party video conferencing) must adapt quickly to continuously changing data rates inherent in routing data over many heterogeneous sub-networks that form the Internet. The QoS of video at each receiver must be dynamically adapted to whatever the current available bandwidth happens to be. Layered video coding is an efficient approach to this problem because it encodes a single representation of the video source to several layers that can be decoded and presented at a range of quality levels.
Apart from coding efficiency, another concern for layered coding techniques is reliability. In layered coding schemes, a hierarchical dependence exists for each of the layers. A higher layer can typically be decoded only when all of the data for lower layers or the same layer in the previous prediction frame is present. If information at a layer is missing, any data for the same or higher layers is useless. In network applications, this dependency makes the layered encoding schemes very intolerant of packet loss, especially at the lower layers. If the loss rate is high in layered streams, the video quality at the receiver is very poor.
FIG. 1 depicts a conventional layered coding scheme 100, known as “fine-granularity scalable” or “FGS”. Three frames are shown, including a first or intraframe 102 followed by two predicted frames 104 and 106 that are predicted from the intraframe 102. The frames are encoded into four layers: a base layer 108, a first layer 110, a second layer 112, and a third layer 114. The base layer 108 typically contains the video data that, when played, is minimally acceptable to a viewer. Each of the additional layers 110-114, also known as “enhancement layers”, contains incrementally more components of the video data to enhance the base layer. The quality of video thereby improves with each additional enhancement layer. This technique is described in more detail in an article by Weiping Li, entitled “Fine Granularity Scalability Using Bit-Plane Coding of DCT Coefficients”, ISO/IEC JTC1/SC29/WG11, MPEG98/M4204 (December 1998).
One characteristic of the FGS coding scheme illustrated in FIG. 1 is that the enhancement layers 110-114 in the predicted frames can be predictively coded from the base layer 108 in a preceding reference frame. In this example, the enhancement layers of predicted frame 104 can be predicted from the base layer of intraframe 102. Similarly, the enhancement layers of predicted frame 106 can be predicted from the base layer of preceding predicted frame 104.
With layered coding, the various layers can be sent over the network as separate sub-streams, where the quality level of the video increases as each sub-stream is received and decoded. The base layer 108 is sent as one bitstream and one or more enhancement layers 110-114 are sent as one or more other bitstreams.
FIG. 2 illustrates the two bitstreams: a base layer bitstream 200 containing the base layer 108 and an enhancement layer bitstream 202 containing the enhancement layers 110-114. Generally, the base layer is very sensitive to any packet losses and errors and hence, any errors in the base bitstream 200 may cause a decoder to lose synchronization and propagate errors. Accordingly, the base layer bitstream 200 is transmitted in a well-controlled channel to minimize error or packet-loss. The base layer is encoded to fit in the minimum channel bandwidth and is typically protected using error protection techniques, such as FEC (Forward Error Correction) techniques. The goal is to deliver and decode at least the base layer 108 to provide minimal quality video.
Research has been done on how to integrate error protection and error recovery capabilities into the base layer syntax. For more information on such research, the reader is directed to R. Talluri, “Error-resilient video coding in the ISO MPEG-4 standard”, IEEE Communications Magazine, pp. 112-119, June 1998; and Y. Wang, Q. F. Zhu, “Error control and concealment for video communication: A review”, Proceedings of the IEEE, vol. 86, no. 5, pp. 974-997, May 1998.
The enhancement layer bitstream 202 is delivered and decoded, as network conditions allow, to improve the video quality (e.g., display size, resolution, frame rate, etc.). In addition, a decoder can be configured to choose and decode a particular portion or subset of these layers to get a particular quality according to its preference and capability.
The enhancement layer bitstream 202 is normally very robust to packet losses and/or errors. The enhancement layers in the FGS coding scheme provide an example of such robustness. The bitstream is transmitted with frame marks 204 that demarcate each frame in the bitstream (FIG. 2). If a packet loss or error 206 occurs in the enhancement layer bitstream 202, the decoder simply drops the rest of the enhancement layer bitstream for that frame and searches for the next frame mark to start the next frame decoding. In this way, only one frame of enhancement data is lost. The base layer data for that frame is not lost since it resides in a separate bitstream 200 with its own error detection and correction. As a result, occasionally dropping portions of the enhancement layer bitstream 202 does not result in any annoying visual artifacts or error propagations.
Therefore, the enhancement layer bitstream 202 is not normally encoded with any error detection and error protection syntax. However, errors in the enhancement bitstream 202 cause a very dramatic decrease in bandwidth efficiency. This is because the rate of video data transfer is limited by channel error rate rather than by channel bandwidth. Although the channel bandwidth may be very broad, the actual data transmission rates are very small due to the fact that the rest of the stream is discarded whenever an error is detected in the enhancement layer bitstream.
Accordingly, there is a need for new methods and systems that improve the error resilience of the enhancement layer to thereby improve bandwidth efficiency. However, any such improvements should minimize any additional overhead in the enhancement bitstream.
Prior to describing such new solutions, however, it might be helpful to provide a more detailed discussion of one approach to model packet loss or errors that might occur in the enhancement layer bitstream. FIG. 3 shows a state diagram for a two-state Markov model 300 proposed in E. N. Gilbert, “Capacity of a Burst-Noise Channel”, Bell System Technical Journal, 1960, which can be used to simulate both packet losses in an Internet channel and symbol errors in a wireless channel. This model characterizes the loss or error sequences generated by data transmission channels. Losses or errors occur with low probability in a good state (G), referenced as number 302, and occur with high probability in a bad state (B), referenced as number 304. The losses or errors occur in clusters or bursts with relatively long error-free intervals (gaps) between them. The state transitions are shown in FIG. 3 and summarized by its transition probability matrix P:
$$P = \begin{bmatrix} \alpha & 1-\alpha \\ 1-\beta & \beta \end{bmatrix}$$
This model can be used to generate the cluster and burst sequences of packet losses or symbol errors. In this case, it is common to set α≈1 and β=0.5. Random packet losses and symbol errors are a special case of the model 300. Here, the model parameters can be set to α≈1 and β=1, where the error rate is 1−α.
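The burst behavior of this model can be illustrated with a short simulation. The following is a minimal sketch, not the method of the source: it assumes the simplified form of the model in which every symbol sent in the bad state B is corrupted and every symbol sent in the good state G is clean, and the function names `simulate_gilbert` and `good_run_lengths` are hypothetical.

```python
import random


def simulate_gilbert(alpha, beta, n, seed=0):
    """Return n booleans, True meaning the symbol is lost or in error.

    Assumption (not from the source): a symbol is corrupted exactly when
    the channel is in the bad state B. alpha is the probability of staying
    in G, beta the probability of staying in B, as in the matrix P.
    """
    rng = random.Random(seed)
    in_bad = False  # start in the good state G
    errors = []
    for _ in range(n):
        errors.append(in_bad)
        stay = beta if in_bad else alpha
        if rng.random() >= stay:  # leave the current state
            in_bad = not in_bad
    return errors


def good_run_lengths(errors):
    """Lengths of the error-free runs that are terminated by an error."""
    runs, run = [], 0
    for is_error in errors:
        if is_error:
            if run:
                runs.append(run)
            run = 0
        else:
            run += 1
    return runs


if __name__ == "__main__":
    errs = simulate_gilbert(alpha=0.996, beta=0.6, n=400_000, seed=1)
    runs = good_run_lengths(errs)
    print("symbol error rate:", sum(errs) / len(errs))
    print("mean good run length:", sum(runs) / len(runs))
```

Under these assumptions the errors arrive in short bursts separated by long clean gaps, which is the pattern the state diagram describes.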
The occupancy times in the good state G determine how much of the enhancement bitstream can be delivered. We therefore define a Good Run Length (GRL) as the number of good symbols between adjacent error points. The good run length follows a geometric distribution, as given by M. Yajnik, “Measurement and Modeling of the Temporal Dependence in Packet Loss”, UMASS CMPSCI Technical Report #98-78:
$$p(k) = (1-\alpha)\,\alpha^{k-1}, \qquad k = 1, 2, \ldots, \infty$$
Thus, the mean of GRL should be:
$$m = \lim_{N \to \infty} \sum_{k=1}^{N} k \times p(k) = \lim_{N \to \infty} \frac{1-\alpha^{N}}{1-\alpha}$$
Since α is always less than 1, the above mean of GRL is close to (1−α)⁻¹. In other words, the average length of continuous good symbols is (1−α)⁻¹ when the enhancement bitstream is transmitted over this channel.
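The limit can be checked with a short derivation. This is a sketch that uses only the geometric distribution of the good run length stated above; it shows that the finite sum carries an extra term that vanishes as N grows:

$$
\begin{aligned}
\sum_{k=1}^{N} k\,(1-\alpha)\,\alpha^{k-1}
  &= (1-\alpha)\,\frac{d}{d\alpha}\sum_{k=0}^{N}\alpha^{k}
   = (1-\alpha)\,\frac{d}{d\alpha}\,\frac{1-\alpha^{N+1}}{1-\alpha} \\
  &= \frac{1-\alpha^{N}}{1-\alpha} - N\alpha^{N}
\end{aligned}
$$

For 0 ≤ α < 1, both α^N and Nα^N tend to zero as N → ∞, so the mean converges to 1/(1−α).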
In a common FGS or PFGS enhancement bitstream, there are no additional error protection and error recovery capacities. Once there are packet losses and errors in the enhancement bitstream, the decoder simply drops the rest of the enhancement layer bitstream of that frame and searches for the next synchronization marker. Therefore, the correctly decoded bitstream in every frame lies between the frame header and the location where the first error occurred. According to the simulation channel modeled above, although the channel bandwidth may be very broad, the average decoded length of the enhancement bitstream is only (1−α)⁻¹ symbols. Similarly, the mean of the bad run length is close to (1−β)⁻¹. In other words, the occupancy times for the good state and the bad state are both geometrically distributed, with respective means (1−α)⁻¹ and (1−β)⁻¹. Thus, the average symbol error rate produced by the two-state Markov model is:
$$er = \frac{1-\alpha}{1-\alpha+1-\beta}$$
To demonstrate what a value for (1−α)⁻¹ in a typical wireless channel might be, suppose the average symbol error rate is 0.01 and its fading degree is 0.6. The corresponding parameter β of the two-state Markov model 300 is 0.6 (equal to the fading degree) and the parameter α is about 0.996, calculated using the above formula. In such a wireless channel, the effective data transmitted (i.e., the mean good run length) is about 250 symbols per frame. Generally, each symbol consists of 8 bits in the channel coding and transmission. Thus, the effective transmitted data per frame is around 2,000 bits (i.e., 250 symbols × 8 bits/symbol). The number of transmitted bits per frame as predicted by channel bandwidth would be far larger than this number.
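The arithmetic in this example can be reproduced by solving the error-rate formula er = (1−α)/(1−α+1−β) for α. The following is a minimal sketch; the function names are hypothetical, and the 8 bits per symbol figure is the one stated in the text.

```python
def alpha_from_error_rate(er, beta):
    """Solve er = (1 - a) / ((1 - a) + (1 - b)) for a, given er and b."""
    one_minus_alpha = er * (1.0 - beta) / (1.0 - er)
    return 1.0 - one_minus_alpha


def mean_good_run_length(alpha):
    """Mean number of consecutive good symbols, 1 / (1 - alpha)."""
    return 1.0 / (1.0 - alpha)


alpha = alpha_from_error_rate(er=0.01, beta=0.6)
grl = mean_good_run_length(alpha)
bits_per_frame = grl * 8  # 8 bits per symbol, as stated in the text

print(f"alpha ~ {alpha:.3f}")
print(f"mean good run length ~ {grl:.1f} symbols")
print(f"effective data per frame ~ {bits_per_frame:.0f} bits")
```

The exact values come out slightly below the rounded figures quoted in the text (roughly 247.5 symbols rather than 250), which is consistent with "about 250 symbols" and "around 2,000 bits".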
Our experimental results also demonstrate that the number of actual decoded bits per frame is almost constant (e.g., about 5,000 bits) across various channel bandwidths (the number of bits determined by channel bandwidth is very large compared to this value). Why are the actual decoded bits at the decoder more than the theoretical value of 2,000 bits? The reason for this discrepancy is that there are no additional error detection/protection tools in the enhancement bitstream. Only the variable-length code table offers a very weak capacity to detect errors. Generally, the location in the bitstream where an error is detected is not where the error actually occurred, and is often far past it, so the decoder continues to decode bits beyond the actual error position before it stops.
It is noted that results similar to those of the above burst error channel can be obtained for packet-loss and random-error channels. Analysis of a random-error channel is relatively simple in that the mean of GRL is the reciprocal of the channel error rate. The analysis of packet loss, however, is more complicated. Those who are interested in a packet loss analysis are directed to M. Yajnik, “Measurement and Modeling of the Temporal Dependence in Packet Loss”, UMASS CMPSCI Technical Report #98-78. In short, when the enhancement bitstream is delivered through a packet-loss channel or a wireless channel, the effective data transmission rate is determined only by channel error conditions, not by channel bandwidth.
A video coding scheme employs a scalable layered coding, such as progressive fine-granularity scalable (PFGS) layered coding, to encode video data frames into multiple layers. The layers include a base layer of comparatively low quality video and multiple enhancement layers of increasingly higher quality video.
The video coding scheme adds error resilience to the enhancement layer to improve its robustness. In the described implementation, in addition to the existing start codes associated with headers of each video object plane (VOP) and each bit plane, additional unique resynchronization marks are inserted into the enhancement layer bitstream, which partition the enhancement layer bitstream into a larger number of smaller video packets. With the addition of many resynchronization marks within each frame of video data, the decoder can recover very quickly and with minimal data loss in the event of a packet loss or channel error in the received enhancement layer bitstream.
As the decoder receives the enhancement layer bitstream, the decoder attempts to detect any errors in the packets. Upon detection of an error, the decoder seeks forward in the bitstream for the next known resynchronization mark. Once this mark is found, the decoder is able to begin decoding the next video packet.
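The recovery behavior described above can be sketched as follows. This is an illustrative toy, not the decoder of the described implementation: the mark pattern `RESYNC_MARK` is a hypothetical placeholder (this section does not give the actual bit pattern), the error detector is passed in as a hypothetical predicate `packet_ok`, and the sketch assumes the mark never occurs inside a packet payload. For simplicity it discards a whole video packet on error, whereas a real decoder might keep the portion decoded before the error.

```python
RESYNC_MARK = b"\xff\xfe"  # hypothetical placeholder, not the real mark


def decode_with_resync(bitstream, packet_ok):
    """Return the payloads of all video packets that decode cleanly.

    bitstream: bytes in which each video packet starts with RESYNC_MARK.
    packet_ok: callable(payload) -> bool, a stand-in for error detection.
    """
    decoded = []
    pos = bitstream.find(RESYNC_MARK)
    while pos != -1:
        start = pos + len(RESYNC_MARK)
        nxt = bitstream.find(RESYNC_MARK, start)  # seek forward to next mark
        payload = bitstream[start:nxt] if nxt != -1 else bitstream[start:]
        if packet_ok(payload):
            decoded.append(payload)
        # On an error nothing is kept from this packet; in both cases
        # decoding resumes at the next resynchronization mark.
        pos = nxt
    return decoded


if __name__ == "__main__":
    stream = (RESYNC_MARK + b"aaa" + RESYNC_MARK + b"bXb"
              + RESYNC_MARK + b"ccc")
    # Toy error detector: a payload containing b"X" counts as corrupted.
    print(decode_with_resync(stream, lambda p: b"X" not in p))
```

In this toy run only the middle packet is lost, whereas a decoder that resynchronizes only at frame boundaries would also discard everything after it in that frame.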
The video coding scheme also facilitates redundant encoding of header information from the higher level VOP header down into lower level bit plane headers and video packet headers. Header extension codes are added to the bit plane and video packet headers to identify whether the redundant data is included. If present, the redundant data may be used to check the accuracy of the VOP header data or recover this data in the event the VOP header is not correctly received.
For delivery over the Internet or wireless channel, the enhancement layer bitstream is packed into multiple transport packets. Video packets at the same location, but belonging to different enhancement layers, are packed into the same transport packet. Every transport packet can comprise one or multiple video packets in the same enhancement layers subject to the enhancement bitstream length and the transport packet size. Additionally, video packets with large frame correlations are packed into the same transport packet.