Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. For instance, a pixel may comprise an 8-bit luminance sample (also called a luma sample) that defines the grayscale component of the pixel and two 8-bit chrominance sample values (also called chroma samples) that define the color component of the pixel. Thus, the number of bits per second, or bitrate, of a typical raw digital video sequence may be five million bits per second or more.
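The bitrate arithmetic above can be sketched as follows. The 176x144 frame size is an assumed example (it is not taken from the text), chosen because it falls in the "tens of thousands of pixels" range; the function name `raw_bitrate` is likewise hypothetical.

```python
def raw_bitrate(width, height, fps, bits_per_pixel=24):
    """Return the raw (uncompressed) bitrate in bits per second."""
    return width * height * fps * bits_per_pixel

# Assumed example: a 176x144 frame (25,344 pixels) at 15 frames per second,
# with one 8-bit luma sample and two 8-bit chroma samples per pixel.
qcif_bitrate = raw_bitrate(176, 144, 15)
print(qcif_bitrate)  # 9123840
```

Even at this small frame size and the lower frame rate, the raw bitrate exceeds nine million bits per second.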
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bitrate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bitrate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which the quality of the video does not suffer, but decreases in the bitrate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which quality of the video suffers, but achievable decreases in the bitrate are more dramatic. Lossy compression is often used in conjunction with lossless compression—in a system design in which the lossy compression establishes an approximation of information and lossless compression techniques are applied to represent the approximation.
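As a minimal sketch of the combined design described above, the following example applies a lossy stage (uniform quantization of 8-bit samples) followed by a lossless stage (here `zlib`, chosen only for illustration). The function names and the quantization step are assumptions; an actual codec would use transform coding and a purpose-built entropy coder.

```python
import zlib

def encode(samples, step):
    """Lossy stage: quantize 8-bit samples. Lossless stage: zlib-compress the result."""
    levels = bytes(s // step for s in samples)  # approximation of the input
    return zlib.compress(levels)                # exact representation of the approximation

def decode(payload, step):
    """Undo the lossless stage exactly, then reconstruct approximate sample values."""
    levels = zlib.decompress(payload)
    # Reconstruct at the middle of each quantization bin, clamped to the 8-bit range.
    return [min(255, level * step + step // 2) for level in levels]

samples = [(i * 7) % 256 for i in range(1024)]  # hypothetical 8-bit sample data
step = 8                                        # hypothetical quantization step
decoded = decode(encode(samples, step), step)
max_error = max(abs(a - b) for a, b in zip(samples, decoded))
print(max_error <= step // 2)  # True
```

Only the quantization stage discards information; the lossless stage reproduces the quantized values exactly, so the reconstruction error is bounded by the quantization step.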
In general, video compression techniques include “intrapicture” compression and “interpicture” compression, where a picture is, for example, a progressively scanned video frame, an interlaced video frame (having alternating lines for video fields), or an interlaced video field. Generally speaking, video sequences contain a significant amount of redundancy within a given frame and between sequential frames. For example, the human eye generally does not notice slight differences in otherwise similar backgrounds in successive video frames. Compression exploits these redundancies by removing a portion of the redundant material from the bitstream being sent and then restoring it at the receiving end when the picture is decompressed. The two redundancies commonly removed from video frames are spatial and temporal. Spatial redundancies occur between neighboring pixels within a single frame. Frames that are compressed using only spatial redundancies, known as intraframes or I-frames, contain all of the information needed to reconstitute the image within the frame itself; they are self-contained. Frames that use temporal redundancies, such as P-frames and B-frames, require information from other frames to be decoded. P-frames (predictively encoded frames) are encoded, and must be decoded, using information from previous I- and/or P-frames. B-frames (bi-directionally predictively encoded frames) are encoded using information from both previous and subsequent I- and P-frames. Motion estimation removes temporal redundancy in successive video frames (interframes) by encoding only the unique matter along with a motion-predicted image created from a previously encoded image known as a reference frame. If a reference frame is lost, then its succeeding predictive frames cannot be decoded, and the transmission errors propagate to successive frames.
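Motion estimation as described above can be illustrated with a minimal exhaustive block-matching sketch. The sum-of-absolute-differences (SAD) cost, the block size, the search range, and all function names are assumptions made for illustration; the text does not prescribe a particular search method.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )

def best_motion_vector(cur, ref, bx, by, n, search):
    """Try every displacement within +/- `search` pixels; return (dx, dy, cost)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame.
            if by + dy < 0 or by + dy + n > height:
                continue
            if bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], best[0]

# Hypothetical 8x8 reference frame, and a current frame shifted right by one pixel.
ref = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[y][max(x - 1, 0)] for x in range(8)] for y in range(8)]

print(best_motion_vector(cur, ref, 2, 2, 4, 2))  # (-1, 0, 0)
```

A cost of zero means the block is fully predicted from the reference frame, so only the motion vector needs to be sent; a nonzero cost would leave a residual (the "unique matter") to encode.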
For progressive frames, intrapicture compression techniques compress individual frames (typically called I-frames or key frames), and interpicture compression techniques compress frames (typically called predicted frames, P-frames, or B-frames) with reference to preceding and/or following frames (typically called reference or anchor frames). I-frames (self-contained) and P-frames (which generally refer to preceding frames) can themselves be reference frames, while B-frames, which refer to both preceding and following frames, are typically not used as references themselves.
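The reference relationships among I-, P-, and B-frames, and the way a lost reference frame affects its dependents, can be sketched as follows. The model is a simplification under stated assumptions: frames are listed in display order, each P-frame refers to the nearest preceding I- or P-frame, each B-frame refers to the nearest I- or P-frame on either side, and B-frames are never references. The function name `decodable_frames` is hypothetical.

```python
def decodable_frames(types, lost):
    """types: string of 'I', 'P', 'B' in display order.
    lost: set of frame indices that were not received intact.
    Returns a list of booleans, one per frame."""
    ok = [False] * len(types)
    anchors = [i for i, t in enumerate(types) if t in "IP"]

    # Anchor (reference) frames first: a P-frame needs its preceding anchor.
    prev_anchor = None
    for i in anchors:
        if types[i] == "I":
            ok[i] = i not in lost
        else:
            ok[i] = i not in lost and prev_anchor is not None and ok[prev_anchor]
        prev_anchor = i

    # B-frames need the nearest anchor on each side.
    for i, t in enumerate(types):
        if t != "B":
            continue
        before = [a for a in anchors if a < i]
        after = [a for a in anchors if a > i]
        ok[i] = (i not in lost and bool(before) and bool(after)
                 and ok[before[-1]] and ok[after[0]])
    return ok

# Hypothetical sequence in which the first P-frame (index 2) is lost.
print(decodable_frames("IBPBPIBP", {2}))
# [True, False, False, False, False, True, True, True]
```

In this example the loss of a single P-frame makes four frames undecodable, and decoding recovers only at the next I-frame.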
When the data is decompressed prior to the resulting video being displayed, a decoder typically performs the inverse of the compression operations. For example, a decoder may perform entropy decoding, inverse quantization, and an inverse transform while decompressing the data. When motion compensation is used, the decoder (like the encoder) reconstructs a frame from one or more previously reconstructed frames (which are now used as reference frames), and the newly reconstructed frame may then be used as a reference frame for motion compensation for later frames.
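The reconstruction loop described above can be sketched in a heavily simplified form. This example assumes zero motion (each sample is predicted from the co-located sample of the previous reconstructed frame), omits the transform and entropy coding stages, and codes the first frame against an all-zero prediction; the function names and the quantization step are hypothetical. Its purpose is only to show that the encoder and decoder both predict from previously reconstructed frames.

```python
def reconstruct(reference, levels, step):
    """Inverse quantization plus prediction: reference + dequantized residual."""
    return [r + level * step for r, level in zip(reference, levels)]

def encode_sequence(frames, step):
    """Quantize each frame's residual against the previously reconstructed frame."""
    reference = [0] * len(frames[0])
    coded = []
    for frame in frames:
        levels = [(c - r + step // 2) // step for c, r in zip(frame, reference)]
        coded.append(levels)
        # The encoder reconstructs exactly as the decoder will, so both
        # sides use the same reference frame for the next prediction.
        reference = reconstruct(reference, levels, step)
    return coded

def decode_sequence(coded, step):
    """Rebuild each frame from the previous reconstructed frame."""
    reference = [0] * len(coded[0])
    output = []
    for levels in coded:
        reference = reconstruct(reference, levels, step)
        output.append(reference)
    return output

frames = [[10, 20, 30, 40], [12, 19, 33, 40], [15, 18, 36, 41]]  # hypothetical samples
step = 4
decoded = decode_sequence(encode_sequence(frames, step), step)
worst = max(abs(a - b) for f, d in zip(frames, decoded) for a, b in zip(f, d))
print(worst <= step // 2)  # True
```

Because each residual is computed against the reconstructed (not the original) previous frame, quantization error does not accumulate from frame to frame.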
Packets sent through networks are subject to loss; that is, packets are dropped. This loss occurs randomly and unpredictably. Furthermore, compressed video stream data is highly sensitive to delay because the packets need to be reassembled in the same order in which they were sent, and too many delayed packets give rise to a jumpy, interrupted signal. Transmission delay problems can also occur when a new I-frame is resent; the new frame is susceptible to all of the same problems that corrupted the lost reference frame. Moreover, in lossy coding schemes, the compression is designed to meet a target bitrate for storage and transmission. High compression is achieved by lowering the quality of the reconstituted image. Therefore, any extra loss caused by dropped or delayed packets may degrade the image below an acceptable quality level.
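The reassembly requirement described above can be sketched as follows. The example assumes each packet carries a sequence number, which is an assumption for illustration (the text does not specify a packet format), and the function name `reassemble` is hypothetical.

```python
def reassemble(packets, expected_count):
    """packets: list of (sequence_number, payload) tuples in arrival order.
    Returns (payloads in sending order, list of missing sequence numbers)."""
    by_seq = {}
    for seq, payload in packets:
        by_seq.setdefault(seq, payload)  # keep the first copy, ignore duplicates
    missing = [s for s in range(expected_count) if s not in by_seq]
    ordered = [by_seq[s] for s in range(expected_count) if s in by_seq]
    return ordered, missing

# Hypothetical arrival: packet 1 is dropped and packet 2 arrives first.
arrived = [(2, b"c"), (0, b"a"), (3, b"d")]
print(reassemble(arrived, 4))  # ([b'a', b'c', b'd'], [1])
```

A receiver that finds a nonempty `missing` list must either wait for the late packet, adding delay, or decode the frame with data missing.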
The capacity to handle packet loss is crucial for a real-time video codec (RTC) to perform well in noisy networks, that is, networks susceptible to loss. Most existing video codecs cope with packet loss by requesting a new I-frame when a reference frame is lost. Others use slice-based coding, which adds to the signaling overhead. FIG. 1 illustrates traditional packet loss recovery by requesting a new I-frame.
In this prior art method, an I-frame 104 is received at the decoder 102. It is then used to interpret/reconstruct the subsequent dependent P-frame 106. The next dependent frame, P-frame 108, is corrupted because too many of its packets are received out of order or lost. The subsequent P-frame 110 and any following frames can no longer be reconstructed. At the destination node (here represented by decoder 102), an I-frame request is generated and sent to the source node (here represented by encoder 114). No subsequent P-frame or B-frame can be reconstructed until a new I-frame is received. Once the source node 114 receives the request, it assembles a new I-frame 112 and sends it to the destination node 102 using the communications channel. After receiving the new I-frame 112, the destination node can successfully decode the subsequent P-frames. However, this results in a delay 116 equivalent to the time needed to send the initial request, plus the time to encode the I-frame, plus the time to send the I-frame to the destination node 102. Moreover, a sender and a receiver may be separated by a large physical distance, with the long trip creating a noticeable lag time and a corresponding degradation in the quality of the video.
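The delay 116 described above is the sum of three components, which can be sketched as follows. All timing values are hypothetical, as are the function names; the estimate of affected frames simply assumes that frames continue to arrive at the nominal frame rate during the delay.

```python
import math

def recovery_delay_ms(request_ms, encode_ms, iframe_send_ms):
    """Delay = time to send the request + time to encode the new I-frame
    + time to send the I-frame back to the destination node."""
    return request_ms + encode_ms + iframe_send_ms

def frames_affected(delay_ms, fps):
    """Approximate number of frame intervals that elapse during the delay."""
    return math.ceil(delay_ms * fps / 1000)

# Hypothetical figures: 80 ms for the request, 20 ms to encode,
# and 100 ms to deliver the (large) I-frame.
delay = recovery_delay_ms(80, 20, 100)
print(delay)                       # 200
print(frames_affected(delay, 30))  # 6
```

Under these assumed figures, roughly six frame intervals at 30 frames per second pass before decoding can resume.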
In another prior art method, I-frames 104, 112 are sent at regular intervals. When an I-frame 112 is lost due to corruption or delay, the decoder must wait until the next I-frame is received and, in the meantime, decodes the subsequent P-frames 110, 108, 106 incorrectly.
Therefore, there exists a need for improved methods and systems for transmitting compressed video over a lossy packet-based network.