1. Field of the Invention
The invention relates to communication of digital video signals over multi-channel packet switched networks and more particularly to a system and method for encoding and decoding video data with low delay to conceal the effects of packet loss on the quality of the video transported over such networks.
2. Description of the Prior Art
In recent years, videoconference applications have begun the transition from transporting compressed audio and video data streams over nearly lossless circuit switched networks such as POTS and ISDN phone lines to packet switched networks. On a packet switched network, data streams are partitioned into smaller data bundles called packets. Packet switched networks often have significantly higher error rates than their circuit switched counterparts.
Errors associated with packet switched networks take the form of lost packets of data, which are supposed to travel over the Internet from a source node to a destination node. However, given the distributed and computationally simplistic architecture of the Internet, and given that Internet transport policies are only best-effort, it is common for packets to get lost (i.e., to fail to reach their intended destination).
Packet loss in the context of videoconferencing has a negative effect on the video portions of a conference. A loss of as little as one percent of packets containing video data can make a video portion of the conference difficult to comprehend.
Top layers of network protocols can minimize packet loss by using acknowledgement and re-sending procedures. However, while acknowledgement and re-sending procedures may suffice for traditional static web content such as web pages, JPEG images, and applets, they are unsuitable for interactive video, also referred to as conversational video, because waiting for retransmitted packets adds delay. Interactive video requires that a stream of sequential images arrive at a client's location at a consistent rate that allows for real-time playback with a minimum latency.
There are several methods that attempt to resolve the packet loss and delay problems associated with interactive video by using prioritization and reservation of network resources via Quality of Service (QoS) enabled networks. These methods, including IP Precedence, Diff-Serv, RSVP, and MPLS, can be used to prioritize audio/video data over non-real time traffic (e.g., HTTP and FTP). Another QoS network method, and one that provides context for the present invention, uses a multi-channel system that requires compressed video data to be divided and transported over separate channels. In addition, one or more of these channels are guaranteed to have a very low packet loss rate. Typically these high quality channels represent a small fraction of the total bandwidth. For a multi-channel QoS approach to be effective, it is necessary for the video encoder to make encoding decisions that exploit the special nature of a multi-channel QoS network.
Video encoding algorithms in use today, such as MPEG, MPEG2, MPEG4, H.261, and H.263, employ techniques that are based on a concept of a block. FIG. 1 depicts a relationship, in these common encoding algorithms, between a video sequence 100, an individual video picture/frame 110, and a constituent block 120. The video sequence 100 is composed of the individual frames 110. The frame 110 is in turn composed of a grid of blocks 120, which preferably are 8 pixel by 8 pixel fragments of the frame 110. Alternatively, video encoding algorithms may employ techniques based on a concept of a macroblock, a collection of six blocks (not shown). Four blocks are spatially situated to cover a 16 pixel by 16 pixel fragment of the frame 110 containing luminance information and two blocks contain chrominance information.
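The block and macroblock structure described above can be illustrated with a short calculation. The following is a minimal sketch, not part of any codec: the function names are hypothetical, and the 352 pixel by 288 pixel (CIF) frame size in the usage note is an assumed example that does not appear in this description.

```python
def grid(width, height, size):
    """Return (columns, rows) of size-by-size tiles that cover the frame."""
    if width % size or height % size:
        raise ValueError("frame dimensions must be multiples of the tile size")
    return width // size, height // size


def macroblock_counts(width, height):
    """Count macroblocks and their six constituent 8x8 blocks.

    Each macroblock covers a 16x16 luminance area (four 8x8 blocks)
    and carries two 8x8 chrominance blocks, as described above.
    """
    columns, rows = grid(width, height, 16)
    macroblocks = columns * rows
    return {
        "macroblocks": macroblocks,
        "luminance_blocks": macroblocks * 4,
        "chrominance_blocks": macroblocks * 2,
        "total_blocks": macroblocks * 6,
    }
```

For the assumed CIF frame, grid(352, 288, 8) gives a 44 by 36 grid of luminance blocks, and macroblock_counts(352, 288) reports 396 macroblocks holding 2376 blocks in total.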
FIG. 2 depicts some critical concepts in the video encoding art. A depicted video sequence 200 comprises individual frames 201 through 213. The frames 201–213, in their most elemental form, are conglomerations of pixel values (values measuring color and luminosity of an individual pixel). To store and transport the frames 201–213 in terms of pure pixel values requires memory and bandwidth amounts that exceed practical limits for real-time playback of a video sequence over a network. Encoding methods address this problem, in part, by taking advantage of spatial and temporal redundancies present in the sequence of the frames 201–213. In other words, pixel values are not independent and random with respect to each other, either within a frame or across frames. Rather, pixel values correlate with pixel values that are proximate in the frame (spatial predictability) and across frames (temporal predictability). This nature of frame sequences makes it possible to develop encoding algorithms that can reduce the memory and bandwidth requirements by substituting predicted frames for the full pixel valued frames.
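To give a sense of why pure pixel values exceed practical bandwidth limits, the uncompressed bit rate can be estimated as follows. This is a rough sketch under stated assumptions: the CIF resolution, the 30 frames per second rate, and the 12 bits per pixel figure (8-bit samples with 4:2:0 chrominance subsampling) are illustrative values not taken from this description.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=12):
    """Return the uncompressed bit rate in bits per second.

    bits_per_pixel=12 is an assumption: 8 bits of luminance plus, on
    average, 4 bits of chrominance per pixel (4:2:0 subsampling).
    """
    return width * height * bits_per_pixel * frames_per_second


# Assumed example: a 352 x 288 frame at 30 frames per second.
CIF_RATE = raw_bit_rate(352, 288, 30)
```

Under these assumptions the raw stream needs roughly 36.5 Mbit/s, which is why encoders rely on spatial and temporal prediction rather than sending full pixel valued frames.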
Frames are encoded (i.e., converted from a pixel-valued-format to a compressed format) on the basis of individual blocks 120 (FIG. 1) or macroblocks (not shown). The blocks 120 of the frame 110 (FIG. 1) are encoded with either a transform technique or a motion compensation/transform technique.
The transform technique is used where the blocks 120 cannot be predicted from a previous set of blocks (e.g., at a scene cut). A frame encoded with the transform technique is referred to as an intra-picture or I frame because all compression is derived solely from intra-frame/spatial predictability, as opposed to inter-frame/temporal predictability.
Alternatively, the motion compensation/transform technique, also simply referred to as motion compensation, is used to encode blocks 120 in a manner that eliminates temporal redundancy (i.e., exploits the predictability of blocks across frames). Motion compensation substitutes a block's pixel values with a motion vector (that points from the block being coded to a reference block with a similar pixel arrangement) and transform coded residual terms (which are the content difference between the chosen reference block and the block being coded). Frames that are coded with the motion compensation/transform technique are referred to as bidirectional (B) frames and predicted (P) frames. P frames use only previous frames for reference. B frames use both previous and subsequent P or I frames for reference. The advantage of using B frames over P frames is that B frames produce, in general, a more accurate frame prediction, thereby increasing coding efficiency. The disadvantage of using B frames is the playback delay caused by having to load subsequent P or I frames before a B frame can be decoded and rendered for the viewer.
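The motion vector and residual terms described above can be sketched in a few lines. This is a simplified illustration only: it uses an exhaustive search with a sum of absolute differences cost, which is one common choice and not necessarily the one used by any particular codec, and it omits the transform coding of the residual. All function names, the 4 pixel block size, and the search range of 2 pixels are hypothetical. Frames are plain lists of rows of integer pixel values.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between a block of cur and a block of ref."""
    return sum(
        abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def find_motion_vector(cur, ref, bx, by, size=4, search=2):
    """Exhaustive search; returns ((dx, dy), cost) of the best reference block."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, bx, by, rx, ry, size)
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best


def residual(cur, ref, bx, by, mv, size=4):
    """Difference between the block being coded and its reference block."""
    dx, dy = mv
    return [
        [cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i] for i in range(size)]
        for j in range(size)
    ]


def reconstruct(ref, bx, by, mv, res, size=4):
    """Decoder side: reference block plus residual recovers the coded block."""
    dx, dy = mv
    return [
        [ref[by + dy + j][bx + dx + i] + res[j][i] for i in range(size)]
        for j in range(size)
    ]
```

The decoder needs only the motion vector, the residual, and the already decoded reference frame. This is also why a damaged reference frame damages every block predicted from it.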
Referring back to FIG. 2, the exemplary video sequence 200 is depicted where frames 201 through 213 are displayed in sequential order. The frames 201, 207, and 213 are I frames. The frames 202, 203, 205, 206, 208, 209, 211, and 212 are B frames. The frames 204 and 210 are P frames. Thus, the frame 202 is dependent on both frames 201 and 204. The frame 204 is dependent on the frame 201. Given these frame dependencies, the frames 201–204 must be loaded into a decoder in the following order: 201, 204, 202, and 203. Arrows in FIG. 2 depict a similar frame dependency and frame load order for the remaining frames 205–213. FIG. 2 serves to illustrate how B frames introduce video playback latency, because B frames can be loaded and played only after first loading subsequent frame dependencies.
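The load order implied by these dependencies can be derived mechanically from the display order. The sketch below is illustrative only; the function name and the tuple representation are hypothetical, and the handling of trailing B frames that have no later reference frame is an assumption that does not arise in the sequence 200.

```python
def load_order(frames):
    """Return frame numbers in the order a decoder must load them.

    frames is a list of (frame_number, frame_type) pairs in display
    order, where frame_type is "I", "P", or "B".
    """
    order, pending_b = [], []
    for number, kind in frames:
        if kind == "B":
            # A B frame must wait for the next I or P frame it depends on.
            pending_b.append(number)
        else:
            order.append(number)
            order.extend(pending_b)
            pending_b.clear()
    order.extend(pending_b)  # assumption: trailing B frames are loaded last
    return order


# Frames 201 through 213 with the frame types given for FIG. 2.
SEQUENCE_200 = list(zip(range(201, 214), "IBBPBBIBBPBBI"))
```

Calling load_order(SEQUENCE_200) begins with 201, 204, 202, 203, matching the order stated above. Each B frame is held back until a later frame has been loaded, which is the source of the added playback latency.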
When a packet containing video is lost, the decoder encounters an error. In most encoder-decoder (codec) implementations, these decoder errors will propagate to succeeding video pictures until an intra-picture is loaded and decoded. Video conferencing uses a fixed bit rate, and since intra-pictures require many more bits to encode than non-intra-pictures, the intra-pictures are sent much less frequently. Indeed, in many implementations of H.261 and H.263, intra-pictures are sent only when a decoder error occurs and the decoder has signaled the encoder to send an intra-picture. This error handling strategy produces good results in low loss networks, but not in packet switched networks, where packets are lost frequently.
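The propagation of a decoder error until the next intra-picture can be modeled with a simple loop. This is a deliberately simplified sketch: it assumes a low delay sequence of I and P frames only, that each P frame is predicted from the immediately preceding frame, and that a lost packet damages a whole frame. The function name and the index-based representation are hypothetical.

```python
def corrupted_frames(frame_types, lost):
    """Return the indices of frames that are displayed with errors.

    frame_types is a string of "I" and "P" characters in decode order;
    lost is a set of frame indices whose packets did not arrive.
    """
    corrupted, damaged = [], False
    for index, kind in enumerate(frame_types):
        if index in lost:
            damaged = True   # the error starts at the lost frame
        elif kind == "I":
            damaged = False  # an intact intra-picture stops propagation
        if damaged:
            corrupted.append(index)
    return corrupted
```

With frame types "IPPPPIPPP" and frame 2 lost, frames 2, 3, and 4 are corrupted and the intra-picture at index 5 restores the picture. With no later intra-picture, as in "IPPPPPPPP", the error persists through the end of the sequence.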
In light of the detrimental effects of packet loss or delay on encoded video data, there exists a need in the art of videoconferencing for a method to minimize the effect of packet loss on video without adding delay to the received video. The present invention provides a method and system for encoding video with low delay for transport over a multi-channel QoS packet switched network so as to exploit special properties of that network.