Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits, although pixels of greater color depth can be represented by samples totaling 48 bits or more. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence can be 5 million bits/second or more.
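The bit rate figure above follows from simple multiplication. The sketch below works through that arithmetic; the function name and the two frame sizes (176 x 144 and 320 x 240) are illustrative assumptions, not values taken from this description.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=24):
    """Return the bit rate, in bits per second, of uncompressed video."""
    pixels_per_frame = width * height
    return pixels_per_frame * bits_per_pixel * frames_per_second


# Assumed example: 176 x 144 frame (25,344 pixels) at 15 frames per second.
print(raw_bit_rate(176, 144, 15))   # 9123840

# Assumed example: 320 x 240 frame (76,800 pixels) at 30 frames per second.
print(raw_bit_rate(320, 240, 30))   # 55296000
```

Even the smaller of these assumed frame sizes exceeds 5 million bits/second at 24 bits per pixel.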
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which case the quality of the video does not suffer but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which case the quality of the video suffers but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression, in a system design in which the lossy compression establishes an approximation of information and lossless compression techniques are applied to represent the approximation.
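The relationship between entropy and the lossy/lossless combination can be illustrated with a small sketch. It assumes a simple zero-order model in which each sample is treated independently, and it uses plain division by a step size as a stand-in for the lossy stage; real codecs use far more elaborate models, and the function name and sample values are hypothetical.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(samples):
    """Zero-order Shannon entropy: a lower bound, in bits per sample,
    for lossless coding that treats samples independently."""
    counts = Counter(samples)
    total = len(samples)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


flat = [128] * 16            # no variability
varied = list(range(16))     # 16 equally likely values

print(entropy_bits_per_symbol(flat))     # 0.0
print(entropy_bits_per_symbol(varied))   # 4.0

# Assumed lossy stage: quantize with step size 4, discarding detail.
# The lossless stage then only has to represent the coarser approximation.
quantized = [v // 4 for v in varied]
print(entropy_bits_per_symbol(quantized))  # 2.0
```

Quantization cannot be undone exactly, which is why quality suffers, but it halves the bits per sample that the lossless stage needs in this toy case.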
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression, where a picture is, for example, a progressively scanned video frame. For progressive video frames, intra-frame compression techniques compress individual frames (typically called I-frames or key frames). Inter-frame compression techniques compress frames (typically called predicted frames, P-frames, or B-frames for bi-directional prediction) with reference to preceding and/or following frames (typically called reference or anchor frames).
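As a rough illustration of inter-frame compression, the sketch below codes a frame as its sample-by-sample difference from a reference frame. This is a deliberate simplification: it omits motion compensation, transforms, and quantization, and the function names and sample values are hypothetical.

```python
def residual(current, reference):
    """Difference between the current frame and its reference frame."""
    return [[c - r for c, r in zip(cur_row, ref_row)]
            for cur_row, ref_row in zip(current, reference)]


def reconstruct(reference, res):
    """Decoder side: add the residual back onto the reference frame."""
    return [[r + d for r, d in zip(ref_row, res_row)]
            for ref_row, res_row in zip(reference, res)]


# Tiny 2 x 3 frames of luminance-like sample values (assumed data).
reference_frame = [[10, 10, 10], [20, 20, 20]]
current_frame = [[10, 11, 10], [20, 20, 22]]

res = residual(current_frame, reference_frame)
print(res)  # [[0, 1, 0], [0, 0, 2]]

# The decoder recovers the current frame from the reference and residual.
assert reconstruct(reference_frame, res) == current_frame
```

When successive frames are similar, the residual is mostly zeros and is cheaper to represent than the frame itself, which is the motivation for predicting from reference frames.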
Encoded video bitstreams often comprise several syntax layers. Syntax elements that encode characteristics of a video bitstream are divided among the several layers depending on the desired scope of the characteristics. For example, a sequence layer syntax element typically applies to all pictures in a sequence, whereas a picture layer syntax element generally will affect only one corresponding picture within the sequence.
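The difference in scope between layers can be sketched with a small data structure. The element names and values below (frame_rate, loop_filter, picture_type, quantizer) are hypothetical placeholders chosen for illustration and do not reproduce any particular bitstream syntax.

```python
# Sequence-layer elements: signaled once, apply to every picture.
sequence_layer = {"frame_rate": 30, "loop_filter": True}

# Picture-layer elements: signaled per picture, apply to that picture only.
picture_layers = [
    {"picture_type": "I", "quantizer": 4},
    {"picture_type": "P", "quantizer": 6},
    {"picture_type": "B", "quantizer": 8},
]


def settings_for_picture(index):
    """Combine sequence-wide settings with one picture's own settings."""
    settings = dict(sequence_layer)          # copy the shared settings
    settings.update(picture_layers[index])   # add picture-specific ones
    return settings


print(settings_for_picture(1))
# {'frame_rate': 30, 'loop_filter': True, 'picture_type': 'P', 'quantizer': 6}
```

Every picture sees the same frame_rate and loop_filter values, while picture_type and quantizer can change from one picture to the next.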
I. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing in raster scan fashion through successive lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
The primary aspect of interlaced video is that the raster scan of an entire video frame is performed in two passes by scanning alternate lines in each pass. For example, the first scan is made up of the even lines of the frame and the second scan is made up of the odd lines of the frame. This results in each frame containing two fields representing two different time epochs. FIG. 1 shows an interlaced video frame 100 that includes top field 110 and bottom field 120. In the frame 100, the even-numbered lines (top field) are scanned starting at one time (e.g., time t), and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). Because the two fields are scanned starting at different times, this timing can create jagged, tooth-like features in regions of an interlaced video frame where motion is present. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
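The rearrangement of an interlaced frame into a field structure can be sketched as follows. The sketch assumes a frame is held as a list of lines numbered from 0 and that the frame has an even number of lines; the function names are hypothetical.

```python
def split_fields(frame):
    """Group the even-numbered lines (top field) and the odd-numbered
    lines (bottom field) of an interlaced frame."""
    top_field = frame[0::2]      # lines 0, 2, 4, ...
    bottom_field = frame[1::2]   # lines 1, 3, 5, ...
    return top_field, bottom_field


def interleave_fields(top_field, bottom_field):
    """Restore the original alternating line arrangement."""
    frame = []
    for top_line, bottom_line in zip(top_field, bottom_field):
        frame.append(top_line)
        frame.append(bottom_line)
    return frame


# Each string stands in for one line of samples; t and t+1 mark scan times.
frame = ["line0@t", "line1@t+1", "line2@t", "line3@t+1"]

top, bottom = split_fields(frame)
print(top)     # ['line0@t', 'line2@t']
print(bottom)  # ['line1@t+1', 'line3@t+1']

assert interleave_fields(top, bottom) == frame
```

After the split, each field contains only lines scanned starting at the same time, which is why field coding avoids mixing the two time instants in high-motion regions.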
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
II. Sequence Layer Syntax Elements in a Previous WMV Encoder and Decoder
To encode and decode certain characteristics of video sequences, a previous Windows Media Video (“WMV”) encoder and decoder use sequence-layer syntax elements in the bitstream resulting from encoding a video sequence. The sequence-layer syntax elements are contained in one or more sequence headers in the bitstream and represent various encoding and display decisions for the pictures in the sequence.
The sequence-layer syntax elements include an element specifying encoding profiles or methods (PROFILE), a “sprite” mode element (SPRITEMODE), an interlace coding element (INTERLACE), a frame rate element (FRAMERATE), a bit rate element (BITRATE), a loop filtering element (LOOPFILTER), an I-picture coding technique element (X8INTRA), a multi-resolution coding element (MULTIRES), an inverse DCT transform element (FASTTX), a sub-pixel interpolation and rounding element (FASTUVMC), a broadcast element (BROADCAST), quantization elements (DQUANT, QUANTIZER), a variable-sized transform element (VSTRANSFORM), a DCT transform table-switching element (DCTTABSWITCH), an overlapped transform element (OVERLAP), a startcode synchronization marker element (STARTCODE), a pre-processing element (PREPROC), and a B-frame counter element (NUMBFRAMES). These sequence-layer elements indicate coding decisions/settings (e.g., on/off decisions for specific tools or options) that also affect decoding.
Although these sequence-layer elements allow an encoder and decoder to make encoding and display decisions on a sequence-by-sequence basis, the placement of these elements at sequence level is unnecessarily restrictive and inflexible in many contexts. On the other hand, to vary these decisions on a picture-by-picture basis, such elements would need to be signaled at picture level, which would result in undesirable increases in coding overhead. Although several shorter sequences with individual sequence headers can be sent in a bitstream, sequence headers typically contain more information than is needed for smaller chunks of video. In addition, frequently resetting control parameters in sequence headers that could otherwise remain constant is inefficient.
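The overhead trade-off described above can be made concrete with some arithmetic. All numbers in the sketch below (a 16-bit group of control elements, a 300-picture sequence, 30-picture sub-sequences) are hypothetical and are not taken from any actual encoder; the sketch also ignores the additional sequence header content that shorter sequences would have to repeat.

```python
def signaling_overhead_bits(element_bits, num_pictures, pictures_per_signal):
    """Total bits spent signaling a group of control elements when the
    group is re-sent once every `pictures_per_signal` pictures."""
    # Ceiling division: a partial group still needs its own signal.
    signals = -(-num_pictures // pictures_per_signal)
    return element_bits * signals


ELEMENT_BITS = 16     # assumed size of the control elements
NUM_PICTURES = 300    # assumed length of the sequence

# Once per sequence: minimal overhead, but no flexibility.
print(signaling_overhead_bits(ELEMENT_BITS, NUM_PICTURES, NUM_PICTURES))  # 16

# Once per picture: full flexibility, but the most overhead.
print(signaling_overhead_bits(ELEMENT_BITS, NUM_PICTURES, 1))             # 4800

# Once per 30-picture sub-sequence: an intermediate cost.
print(signaling_overhead_bits(ELEMENT_BITS, NUM_PICTURES, 30))            # 160
```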
III. Access Points and Trick Modes in Standards for Video Compression and Decompression
Several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group (“MPEG”) 1, 2, and 4 standards and the H.261, H.262 (another title for MPEG 2), H.263 and H.264 (also called JVT/AVC) standards from the International Telecommunication Union (“ITU”). These standards specify aspects of video decoders and formats for compressed video information. Directly or by implication, they also specify certain encoder details, but other encoder details are not specified. These standards use (or support the use of) different combinations of intraframe and interframe decompression and compression. In particular, some of the standards use or support the use of different access points, headers, and trick modes for decoders and/or editors.