Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. For instance, a pixel may comprise an 8-bit luminance sample (also called a luma sample) that defines the grayscale component of the pixel and two 8-bit chrominance sample values (also called chroma samples) that define the color component of the pixel. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
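The bit rate arithmetic above can be sketched in a few lines. This is only an illustration: the 176x144 frame size is an assumed example (a frame with tens of thousands of pixels), and the function name `raw_bit_rate` is hypothetical.

```python
def raw_bit_rate(width, height, bits_per_pixel=24, frames_per_second=15):
    """Return the uncompressed bit rate in bits per second.

    Assumes every pixel carries the same number of bits (for example,
    one 8-bit luma sample plus two 8-bit chroma samples = 24 bits).
    """
    pixels_per_frame = width * height
    return pixels_per_frame * bits_per_pixel * frames_per_second


# Assumed example: a 176x144 frame (25,344 pixels) at 15 frames per second.
print(raw_bit_rate(176, 144))  # 9123840 bits per second
```

Under these assumptions the raw sequence needs roughly 9 million bits per second, which is consistent with the "5 million bits per second or more" figure.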
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which case the quality of the video does not suffer, but decreases in the bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which case the quality of the video suffers, but achievable decreases in the bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression, in a system design in which the lossy compression establishes an approximation of information and lossless compression techniques are applied to represent the approximation.
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression, where a picture is, for example, a progressively scanned video frame, an interlaced video frame (having alternating lines for video fields), or an interlaced video field. For progressive frames, intra-picture compression techniques compress individual frames (typically called I-frames or key frames), and inter-picture compression techniques compress frames (typically called predicted frames, P-frames, or B-frames) with reference to preceding and/or following frames (typically called reference or anchor frames).
The predicted frames may be divided into regions called macroblocks. A matching region in a reference frame for a particular macroblock is specified by sending motion vector information for the macroblock. A motion vector indicates the location of the region in the reference frame whose pixels are to be used as a predictor for the pixels of the current macroblock. The pixel-by-pixel difference, often called the error signal or residual, between the current macroblock (or the blocks thereof) and the macroblock predictor is derived. This error signal usually has lower entropy than the original signal. Therefore, the information can be encoded at a lower rate. An encoder performs motion estimation by determining a motion vector for a region of a frame by searching for a matching region in one or more reference frames to use as a predictor. An encoder or decoder performs motion compensation by applying the motion vector to find the predictor in the one or more reference frames.
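The relationship between motion estimation, motion compensation, and the residual can be illustrated with a minimal sketch. It assumes an exhaustive (full) search over integer-pixel displacements and a sum-of-absolute-differences matching cost; neither choice is dictated by the description above, and practical encoders typically use faster searches. Frames are plain lists of rows of sample values, and all function names are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block of the current frame
    at (cx, cy) and a candidate block of the reference frame at (rx, ry)."""
    return sum(
        abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
        for dy in range(size)
        for dx in range(size)
    )


def motion_estimate(cur, ref, cx, cy, size, search_range):
    """Encoder side: try every displacement within +/- search_range and
    return the motion vector (mx, my) with the lowest matching cost."""
    height, width = len(ref), len(ref[0])
    best_cost, best_mv = None, (0, 0)
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = cx + mx, cy + my
            # Skip candidates that fall partly outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (mx, my)
    return best_mv


def motion_compensate(ref, cx, cy, mv, size):
    """Encoder or decoder side: fetch the predictor block that the
    motion vector points to in the reference frame."""
    mx, my = mv
    return [row[cx + mx:cx + mx + size] for row in ref[cy + my:cy + my + size]]


def residual(cur, predictor, cx, cy, size):
    """Pixel-by-pixel difference between the current block and its predictor."""
    return [
        [cur[cy + y][cx + x] - predictor[y][x] for x in range(size)]
        for y in range(size)
    ]
```

When the current block is an exact displaced copy of a region in the reference frame, the residual is all zeros, which is the extreme case of the "lower entropy" error signal described above.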
The motion vector value for a macroblock is often correlated with the motion vectors for spatially surrounding macroblocks. Thus, compression of the data used to transmit the motion vector information can be achieved by coding the differential between the motion vector and a motion vector predictor formed from neighboring motion vectors.
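Differential motion vector coding can be sketched as follows. The predictor here is the component-wise median of three neighboring motion vectors (left, above, above-right); that particular predictor is an assumption chosen for illustration, since the text only says the predictor is formed from neighboring motion vectors. Function names are hypothetical.

```python
def median3(a, b, c):
    """Return the middle value of three numbers."""
    return sorted((a, b, c))[1]


def predict_mv(left, above, above_right):
    """Form a predictor as the component-wise median of three
    neighboring motion vectors, each given as (x, y)."""
    return (
        median3(left[0], above[0], above_right[0]),
        median3(left[1], above[1], above_right[1]),
    )


def encode_mv(mv, predictor):
    """Encoder: transmit only the difference from the predictor."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])


def decode_mv(differential, predictor):
    """Decoder: add the received difference back to the same predictor."""
    return (differential[0] + predictor[0], differential[1] + predictor[1])


predictor = predict_mv((4, -2), (5, -1), (3, -2))   # (4, -2)
differential = encode_mv((5, -2), predictor)        # (1, 0)
assert decode_mv(differential, predictor) == (5, -2)
```

Because neighboring macroblocks often move together, the differentials cluster near zero and can be represented with fewer bits than the motion vectors themselves.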
Often in video compression techniques, blocks of pixels or other spatial domain video data such as residuals are transformed into transform domain data, which is often frequency domain (i.e., spectral) data. The resulting blocks of spectral data coefficients may be quantized and then entropy encoded.
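As a rough illustration of the transform and quantization steps, the sketch below applies a separable, orthonormal DCT-II to a block and then quantizes the coefficients with a single uniform step size. Both the choice of the DCT and the uniform quantizer are assumptions for the example, since the text does not name a specific transform or quantization rule. Entropy coding is omitted, and the function names are hypothetical.

```python
import math


def dct_1d(values):
    """Orthonormal 1-D DCT-II of a list of samples."""
    n = len(values)
    return [
        math.sqrt((1.0 if k == 0 else 2.0) / n)
        * sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
              for i, v in enumerate(values))
        for k in range(n)
    ]


def idct_1d(coeffs):
    """Inverse of dct_1d (orthonormal 1-D DCT-III)."""
    n = len(coeffs)
    return [
        sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def transpose(block):
    return [list(column) for column in zip(*block)]


def apply_2d(block, transform_1d):
    """Apply a 1-D transform to every row, then to every column."""
    rows = [transform_1d(row) for row in block]
    columns = [transform_1d(column) for column in transpose(rows)]
    return transpose(columns)


def quantize(coeffs, step):
    """Uniform quantization: map each coefficient to an integer level."""
    return [[round(c / step) for c in row] for row in coeffs]


def dequantize(levels, step):
    """Inverse quantization: scale levels back to coefficient values."""
    return [[level * step for level in row] for row in levels]


# A flat 4x4 block concentrates all of its energy in one (DC) coefficient.
block = [[10] * 4 for _ in range(4)]
levels = quantize(apply_2d(block, dct_1d), step=8)
reconstructed = apply_2d(dequantize(levels, step=8), idct_1d)
```

For this flat block only the top-left level is nonzero, which is why transformed and quantized blocks tend to be cheap to entropy encode. With less uniform content, quantization discards information and the reconstruction is only an approximation.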
When the data is decompressed prior to the resulting video being displayed, a decoder typically performs the inverse of the compression operations. For example, a decoder may perform entropy decoding, inverse quantization, and an inverse transform while decompressing the data. When motion compensation is used, the decoder (and encoder) reconstructs a frame from one or more previously reconstructed frames (which are now used as reference frames), and the newly reconstructed frame may then be used as a reference frame for motion compensation for later frames.
Many typical usage scenarios for digitally coded video involve transmission of the coded video between devices, and frequently between geographically distant locations. Further, many commonly used data transmission systems use packet-based transmission protocols, in which a data transmission is divided into separately routed units called “packets.” These various transmission systems that carry digital video are often subject to noise and other sources of transmission errors, and can experience “packet loss.” Such errors and packet loss can lead to failure to decode an individual frame, or multiple related frames of the video sequence.
It can therefore be desirable to encode partial regions of a picture in a video sequence as an independently decodable unit. This helps enable packetization of the video stream. Further, this introduces additional redundancy in the compressed video bitstream that increases its resilience to transmission errors and packet loss. For example, the decoding loss from a transmission error or lost packet can be limited to the partial region, instead of a full picture of the video sequence. However, this resilience is achieved at the cost of compression efficiency.
Numerous companies have produced video codecs. For example, Microsoft Corporation has produced a video encoder and decoder released for Windows Media Video 8. Aside from these products, numerous international standards specify aspects of video decoders and formats for compressed video information. These standards include the H.261, MPEG-1, H.262, H.263, and MPEG-4 standards. Directly or by implication, these standards also specify certain encoder details, but other encoder details are not specified. These products and standards use (or support the use of) different combinations of the compression and decompression techniques described above. In particular, these products and standards provide various techniques for partial picture unit coding.
One such technique divides a frame within the video sequence into slices. A slice is defined to contain one or more contiguous rows of macroblocks in their original left-to-right order. A slice begins at the first macroblock of a row, and ends at the last macroblock of the same or another row.
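The slice definition above can be made concrete with a small sketch that partitions a frame into slices of whole macroblock rows. It assumes macroblocks are numbered in raster (left-to-right, top-to-bottom) order and that every slice except possibly the last spans the same number of rows. These are illustrative assumptions, and `partition_into_slices` is a hypothetical name.

```python
def partition_into_slices(mb_rows, mb_cols, rows_per_slice):
    """Split a frame of mb_rows x mb_cols macroblocks into slices.

    Each slice is returned as (first_mb, last_mb), the raster-order
    addresses of its first and last macroblocks. Every slice starts at
    the first macroblock of a row and ends at the last macroblock of
    the same or a later row.
    """
    slices = []
    for first_row in range(0, mb_rows, rows_per_slice):
        last_row = min(first_row + rows_per_slice, mb_rows) - 1
        first_mb = first_row * mb_cols
        last_mb = (last_row + 1) * mb_cols - 1
        slices.append((first_mb, last_mb))
    return slices


# Assumed example: a 176x144 frame with 16x16 macroblocks is 11 macroblocks
# wide and 9 macroblock rows tall.
print(partition_into_slices(9, 11, 4))  # [(0, 43), (44, 87), (88, 98)]
```

Fewer rows per slice confine a loss to a smaller region but add more slice boundaries, which is the resilience versus efficiency trade-off noted earlier.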
Various standards, e.g., MPEG-1, MPEG-2, H.263 (with GOBs being roughly equivalent to slices, or with the Annex K slice structured coding mode), MPEG-4 part 2, and H.264/JVT/MPEG-4 part 10, have slices as part of their syntax. All of them disable intra prediction, motion vector prediction, and most other forms of prediction across slice boundaries for error/loss robustness reasons. Among these, only H.263 (Annex J) and H.264/JVT include loop filters. H.263 handling of interlace is rather primitive (field coding only, using Annex W supplemental enhancement indications). H.264 has a more error-robust header structure and allows the encoder to select whether or not loop filtering is to be applied across slice boundaries.
The implementations of slices in these various video decoding standards each strike a different balance between resiliency and coding efficiency.