Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits, although pixels of greater color depth can be represented by samples totaling 48 bits or more. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence can be 5 million bits/second or more.
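The bit rate arithmetic above can be sketched in a few lines. This is a minimal illustration only; the function name `raw_bit_rate` and the 176x144 frame size are assumptions chosen for the example and do not come from the text.

```python
def raw_bit_rate(width: int, height: int, bits_per_pixel: int, frames_per_second: int) -> int:
    """Return the uncompressed bit rate in bits per second."""
    pixels_per_frame = width * height
    return pixels_per_frame * bits_per_pixel * frames_per_second


# Hypothetical example: 176x144 pixels (about 25 thousand pixels per frame),
# 24 bits per pixel, 15 frames per second.
example_rate = raw_bit_rate(176, 144, 24, 15)
print(example_rate)  # 9123840
```

Even this small frame size at the lower frame rate exceeds 5 million bits/second, which is consistent with the estimate given above.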
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which quality of the video does not suffer but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which quality of the video suffers but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression—in a system design in which the lossy compression establishes an approximation of information and lossless compression techniques are applied to represent the approximation.
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression, where a picture is, for example, a progressively scanned video frame. For progressive video frames, intra-frame compression techniques compress individual frames (typically called I-frames or key frames). Interframe compression techniques compress frames (typically called predicted frames, P-frames, or B-frames for bi-directional prediction) with reference to preceding and/or following frames (typically called reference or anchor frames).
Encoded video bitstreams are often composed of several syntax layers. Syntax elements that encode characteristics of a video bitstream are divided among the several layers depending on the desired scope of the characteristics. For example, a sequence layer syntax element typically applies to all pictures in a sequence, whereas a picture layer syntax element generally will affect only one corresponding picture within the sequence.
I. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing in raster scan fashion through successive lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
The primary aspect of interlaced video is that the raster scan of an entire video frame is performed in two passes by scanning alternate lines in each pass. For example, the first scan is made up of the even lines of the frame and the second scan is made up of the odd lines of the frame. This results in each frame containing two fields representing two different time epochs. FIG. 1 shows an interlaced video frame 100 that includes top field 110 and bottom field 120. In the frame 100, the even-numbered lines (top field) are scanned starting at one time (e.g., time t), and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). Because the two fields are scanned starting at different times, this timing can create jagged tooth-like features in regions of an interlaced video frame where motion is present. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
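The field rearrangement described above can be sketched as follows. This is a minimal illustration that assumes a frame is represented as a list of lines, with line 0 at the top; the helper names `split_fields` and `merge_fields` are hypothetical.

```python
from typing import List, Sequence, Tuple

Line = Sequence[int]


def split_fields(frame: Sequence[Line]) -> Tuple[List[Line], List[Line]]:
    """Group even-numbered lines into the top field and odd-numbered lines into the bottom field."""
    top_field = list(frame[0::2])
    bottom_field = list(frame[1::2])
    return top_field, bottom_field


def merge_fields(top_field: Sequence[Line], bottom_field: Sequence[Line]) -> List[Line]:
    """Restore the original alternating line arrangement from the two fields."""
    frame: List[Line] = []
    for line_number in range(len(top_field) + len(bottom_field)):
        source = top_field if line_number % 2 == 0 else bottom_field
        frame.append(source[line_number // 2])
    return frame


# Hypothetical 4-line frame; each line holds two sample values.
frame = [[0, 0], [1, 1], [2, 2], [3, 3]]
top, bottom = split_fields(frame)
print(top)     # [[0, 0], [2, 2]]
print(bottom)  # [[1, 1], [3, 3]]
```

Splitting corresponds to the field structure used in field coding, while the merged form corresponds to the alternating line arrangement preserved in frame coding.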
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
II. Sequence Layer Syntax Elements in a Previous WMV Encoder and Decoder
To encode and decode certain characteristics of video sequences, a previous Windows Media Video (“WMV”) encoder and decoder use sequence-layer syntax elements in the bitstream resulting from encoding a video sequence. The sequence-layer syntax elements are contained in one or more sequence headers in the bitstream and represent various encoding and display decisions for the pictures in the sequence.
The sequence-layer syntax elements include an element specifying encoding profiles or methods (PROFILE), a “sprite” mode element (SPRITEMODE), an interlace coding element (INTERLACE), a frame rate element (FRAMERATE), a bit rate element (BITRATE), a loop filtering element (LOOPFILTER), an I-picture coding technique element (X8INTRA), a multi-resolution coding element (MULTIRES), an inverse DCT transform element (FASTTX), a sub-pixel interpolation and rounding element (FASTUVMC), a broadcast element (BROADCAST), quantization elements (DQUANT, QUANTIZER), a variable-sized transform element (VSTRANSFORM), a DCT transform table-switching element (DCTTABSWITCH), an overlapped transform element (OVERLAP), a startcode synchronization marker element (STARTCODE), a pre-processing element (PREPROC), and a B-frame counter element (NUMBFRAMES). These sequence-layer elements indicate coding decisions/settings (e.g., on/off decisions for specific tools or options) that also affect decoding.
Although these sequence-layer elements allow an encoder and decoder to make encoding and display decisions on a sequence-by-sequence basis, the placement of these elements at sequence level is unnecessarily restrictive and inflexible in many contexts. On the other hand, to vary these decisions on a picture-by-picture basis, such elements would need to be signaled at picture level, which would result in undesirable increases in coding overhead. Although several shorter sequences with individual sequence headers can be sent in a bitstream, sequence headers typically contain more information than is needed for smaller chunks of video. In addition, frequently resetting control parameters in sequence headers that could otherwise remain constant is inefficient.
III. Access Points and Trick Modes in Standards for Video Compression and Decompression
Several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group (“MPEG”) 1, 2, and 4 standards and the H.261, H.262 (another title for MPEG 2), H.263 and H.264 (also called JVT/AVC) standards from the International Telecommunication Union (“ITU”). These standards specify aspects of video decoders and formats for compressed video information. Directly or by implication, they also specify certain encoder details, but other encoder details are not specified. These standards use (or support the use of) different combinations of intraframe and interframe decompression and compression. In particular, some of the standards use or support the use of different access points, headers, and trick modes for decoders and/or editors.
A. Access Points
The MPEG-2/H.262 standard describes intra-coded pictures (e.g., coded I-frames) and group-of-pictures (“GOP”) headers. In MPEG-2, intra-coded pictures are coded without reference to other pictures and provide access points to the coded sequence where decoding can begin. Intra-coded pictures can be used at different places in a video sequence. For example, intra-coded pictures can be inserted periodically or can be used in places such as scene changes or where motion compensation is otherwise ineffective. A coded I-frame is an I-frame picture or a pair of field pictures, where the first field picture encoded in the bitstream is an I-picture and the second field picture encoded in the bitstream is an I-picture or a P-picture. The MPEG-2 standard does not allow a coded I-frame in which the first field picture encoded in the bitstream is a P-picture and the second field picture encoded in the bitstream is an I-picture. When a coded I-frame is a pair of field pictures, and the second field picture encoded in the bitstream is a P-picture, the P-picture is motion compensated relative to the I-picture (first field picture encoded in the bitstream) in the same frame.
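The rule for what counts as a coded I-frame can be expressed as a small check. This is a sketch under stated assumptions: picture types are represented as the strings "I" and "P" in bitstream order, and the function name `is_mpeg2_coded_i_frame` is hypothetical.

```python
from typing import Sequence


def is_mpeg2_coded_i_frame(picture_types: Sequence[str]) -> bool:
    """Check the coded I-frame rule described above.

    picture_types holds one entry for a frame picture, or two entries
    for a pair of field pictures in the order they appear in the bitstream.
    """
    types = tuple(picture_types)
    if len(types) == 1:
        # A single frame picture must be an I-frame picture.
        return types == ("I",)
    if len(types) == 2:
        # The first field must be an I-picture; the second may be I or P.
        first_field, second_field = types
        return first_field == "I" and second_field in ("I", "P")
    return False


print(is_mpeg2_coded_i_frame(["I", "P"]))  # True
print(is_mpeg2_coded_i_frame(["P", "I"]))  # False
```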
A GOP header is a construct in the MPEG-2 bitstream that signals the beginning of a group of pictures. Groups of pictures are typically used to signal the boundary of a set of video frames/fields all encoded with reference to the same I-frame. A GOP header is an optional header that may be signaled immediately before a coded I-frame to signal the beginning of a sequence of P and B pictures that are encoded with motion compensation relative to that I-frame. In particular, a closed GOP element indicates if the first consecutive B-pictures (if any) immediately following the coded I-frame in the bitstream (but typically preceding the coded I-frame in display order) can be reconstructed properly in the case of a random access. For such B-pictures, if a reference picture before the current coded I-frame is not available, the B-pictures cannot be reconstructed properly unless they only use backward prediction from the current coded I-frame or intra coding.
A decoder may therefore use information in a GOP header to avoid displaying B-pictures that cannot be correctly decoded. For a decoder, information in the GOP header thus indicates how the decoder can perform decoding from the GOP header, even if the GOP header is in the middle of a video sequence. For example, the closed_gop flag indicates the nature of the predictions used in the first consecutive B-pictures (if any) immediately following the first coded I-frame following the GOP header. The closed_gop flag is set to ‘1’ to indicate that these B-pictures have been encoded using only backward prediction or intra coding. The broken_link flag is set to ‘1’ to indicate that the first consecutive B-pictures (if any) immediately following the first coded I-frame following the GOP header may not be correctly decoded because the reference frame used for prediction is not available. This happens when editing has replaced the preceding pictures with pictures from another video sequence (e.g., a commercial). A decoder may use this flag to avoid displaying frames that cannot be correctly decoded.
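One way a decoder might act on these two flags is sketched below. The function name and the `random_access` parameter are hypothetical, and the policy is a conservative reading of the description above rather than behavior mandated by the standard.

```python
def can_display_leading_b_pictures(closed_gop: bool, broken_link: bool, random_access: bool) -> bool:
    """Decide whether the first consecutive B-pictures after the coded I-frame may be shown.

    random_access is True when decoding starts at this GOP header, so no
    pictures before the coded I-frame are available as references.
    """
    if broken_link:
        # The flag warns that the reference before the I-frame was replaced by editing.
        return False
    if random_access and not closed_gop:
        # The B-pictures may use forward prediction from a picture that was never decoded.
        return False
    return True


print(can_display_leading_b_pictures(closed_gop=True, broken_link=False, random_access=True))   # True
print(can_display_leading_b_pictures(closed_gop=False, broken_link=False, random_access=True))  # False
```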
The GOP header also includes other information such as time code information and a start code called group_start_code. The GOP header start code includes a 24-bit start code prefix (23 0s followed by a 1) followed by the GOP header start code value (B8 in hexadecimal).
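Locating GOP headers by their start code can be sketched as a byte search for 0x00 0x00 0x01 0xB8. The flag extraction below additionally assumes that the start code is followed by a 25-bit time code and then the 1-bit closed_gop and broken_link flags; that layout reflects a reading of the MPEG-2 video syntax and is not stated in the text above, so it should be verified against the standard. The names `GopFlags` and `find_gop_headers` are hypothetical.

```python
from typing import Iterator, NamedTuple

# 24-bit prefix (23 zero bits followed by a one bit) plus the value B8.
GOP_START_CODE = bytes([0x00, 0x00, 0x01, 0xB8])


class GopFlags(NamedTuple):
    offset: int        # byte offset of the start code in the input
    closed_gop: bool
    broken_link: bool


def find_gop_headers(data: bytes) -> Iterator[GopFlags]:
    """Yield the flags of every GOP header found in a byte-aligned stream."""
    position = data.find(GOP_START_CODE)
    while position != -1:
        payload = data[position + 4 : position + 8]
        if len(payload) == 4:
            word = int.from_bytes(payload, "big")
            # Assumed layout: the top 25 bits are the time code,
            # then closed_gop, then broken_link.
            closed_gop = bool((word >> 6) & 1)
            broken_link = bool((word >> 5) & 1)
            yield GopFlags(position, closed_gop, broken_link)
        position = data.find(GOP_START_CODE, position + 4)


# Hypothetical stream: one filler byte, then a GOP header with closed_gop set.
sample = bytes([0xFF]) + GOP_START_CODE + bytes([0x00, 0x00, 0x00, 0x40])
print(list(find_gop_headers(sample)))
```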
The MPEG-4 standard describes intra-coded video object planes (“I-VOPs”) and group of video object plane (“VOP”) headers. An I-VOP is a VOP coded using information only from itself. Non-intra-coded VOPs may be derived from progressive or interlaced frames. In MPEG-4, I-VOPs are coded without reference to other pictures and provide access points to the coded sequence where decoding can begin. A group of VOP header is an optional header that can be used immediately before a coded I-VOP to indicate to the decoder (e.g., via the broken_link flag) if the first consecutive B-VOPs immediately following the coded I-VOP can be reconstructed properly in the case of a random access. A group of VOP header must be followed by a coded I-VOP.
A group of VOP header includes information such as the closed_gov flag, which indicates whether the first consecutive B-VOPs (if any) immediately following the first coded I-VOP after the group of VOP header have been encoded using only backward prediction or intra coding. The broken_link flag may be set to ‘1’ to avoid displaying B-VOPs following the first I-VOP if they cannot be correctly decoded.
The group of VOP header also includes other information such as time code information and a start code. A group of VOPs start code includes a 24-bit start code prefix (23 0s followed by a 1) followed by the group of VOPs start code value (B3 in hexadecimal).
According to draft JVT-d157 of the JVT/AVC video standard, I-pictures or slices provide access points to a coded sequence where decoding can begin, and various information used in decoding is signaled in network abstraction layer (“NAL”) units. A NAL unit indicates what type of data to expect in the NAL unit, followed by the data itself, interspersed with emulation prevention data. A supplemental enhancement information (“SEI”) NAL unit contains one or more SEI messages. Each SEI message consists of an SEI header and an SEI payload. The type and size of the SEI payload are coded using an extensible syntax. The SEI payload may have an SEI payload header. For example, a payload header may indicate to which picture the particular data belongs.
Annex D of the draft JVT-d157 describes a syntax for a random access point SEI message. A random access point SEI message contains an indicator of a random access entry point for a decoder. The entry point is indicated as a count relative to the position of the SEI message in units of coded frame numbers prior to the frame number of the current picture. In a random access point SEI message, preroll_count indicates the entry point for the decoding process, and postroll_count indicates the recovery point of output. The exact_match_flag indicates whether decoded pictures at and subsequent to the recovery point in output order, obtained by starting the decoding process at the specified entry point, shall be an exact match to the pictures that would be produced by a decoder starting at the last prior instantaneous decoder refresh (“IDR”) point in the NAL unit stream. (An IDR picture is an I-picture that causes a decoder to mark all reference pictures in a decoded pictures buffer as unused immediately before decoding the IDR picture, and to indicate that later coded pictures can be decoded without inter prediction from any picture decoded prior to the IDR picture.) The broken_link_flag indicates the presence or absence of a splicing point in the NAL unit stream at the location of the random access point SEI message.
For additional information, see the standards themselves.
B. Trick Modes
The MPEG-2 standard describes special access, search and scan modes (examples of trick modes). According to ISO/IEC 13818-02, the 1-bit DSM_trick_mode_flag in a packetized elementary stream (“PES”) packet indicates that the PES packet in an MPEG-2 elementary stream is reconstructed from digital storage media (“DSM”) in a trick mode. When DSM_trick_mode_flag is set, eight bits of trick mode information (the DSM_trick_modes element) follow in the PES packet. The first three bits indicate the trick mode (e.g., fast forward, slow motion, freeze frame, fast reverse, slow reverse) and the remaining five bits provide information specific to the indicated trick mode. For example, ISO/IEC 13818-1:2000 specifies that if DSM_trick_mode_flag=1, the 3-bit element trick_mode_control indicates the specific trick mode, while the next five bits provide other information depending on the specific trick mode, such as indicators of which field should be displayed or whether an entire frame should be displayed (field_id), number of times a field or frame should be repeated (rep_cntrl), coefficient frequency truncation information (frequency_truncation), and intra-slice refresh information (intra_slice_refresh).
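The split of the eight-bit DSM_trick_modes element into a 3-bit mode and five mode-specific bits can be sketched as follows. The numeric code assigned to each mode and the sub-field widths are assumptions that follow the listing order above and a reading of ISO/IEC 13818-1; they should be checked against the standard before use. The function name `parse_dsm_trick_modes` is hypothetical.

```python
from typing import Dict, Union

# Assumed mapping of trick_mode_control values to modes (other values are reserved).
TRICK_MODE_NAMES = {
    0: "fast_forward",
    1: "slow_motion",
    2: "freeze_frame",
    3: "fast_reverse",
    4: "slow_reverse",
}


def parse_dsm_trick_modes(value: int) -> Dict[str, Union[int, str]]:
    """Split one DSM_trick_modes byte into trick_mode_control and its mode-specific bits."""
    if not 0 <= value <= 0xFF:
        raise ValueError("DSM_trick_modes is a single byte")
    control = value >> 5   # first three bits
    info = value & 0x1F    # remaining five bits
    mode = TRICK_MODE_NAMES.get(control, "reserved")
    fields: Dict[str, Union[int, str]] = {"trick_mode_control": control, "mode": mode}
    if mode in ("fast_forward", "fast_reverse"):
        # Assumed widths: field_id (2), intra_slice_refresh (1), frequency_truncation (2).
        fields["field_id"] = info >> 3
        fields["intra_slice_refresh"] = (info >> 2) & 0x1
        fields["frequency_truncation"] = info & 0x3
    elif mode in ("slow_motion", "slow_reverse"):
        # Assumed width: rep_cntrl uses all five bits.
        fields["rep_cntrl"] = info
    elif mode == "freeze_frame":
        # Assumed width: field_id (2) followed by three reserved bits.
        fields["field_id"] = info >> 3
    else:
        fields["raw_info"] = info
    return fields


print(parse_dsm_trick_modes(0b00100011))
```

With the assumed layout, the byte 0b00100011 decodes as slow motion with rep_cntrl equal to 3.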
When a decoder is decoding a PES packet with DSM_trick_mode_flag=1, ISO/IEC 13818-02 recommends that the decoder decode the bitstream and display according to the DSM_trick_modes element. For pre-processing, the MPEG-2 standard recommends that decoders clear a non-trick mode bitstream from the buffer when the decoder encounters a PES packet with DSM_trick_mode_flag=1. For post-processing, the MPEG-2 standard recommends that decoders clear a trick mode bitstream from the buffer when the decoder encounters a PES packet with DSM_trick_mode_flag=0. MPEG-2 recommends that a decoder decoding a PES packet with DSM_trick_mode_flag=1 decode one picture and display it until the next picture is decoded. If the decoder encounters a gap between slices, the recommendation is to decode the slice, display it according to the slice vertical position in the slice header, and fill the gap with a co-located part of the last displayed picture.
ISO/IEC 13818-06 describes a different approach for trick modes. According to ISO/IEC 13818-06, stream primitives (e.g., “Stream pause( )”, “Stream resume( )”, and “Stream play( )”) are used to emulate VCR-like controls for manipulating MPEG continuous media streams.
C. Limitations of the Standards
These international standards are limited in several important ways. For example, in MPEG-2, the first coded frame after a GOP header must be a “coded I-frame”—an intra-coded frame picture or a pair of field pictures where the first field picture encoded in the bitstream is an I-picture and the second field picture encoded in the bitstream is either an I-picture or a P-picture. GOP headers are not allowed to precede any other frame type. In MPEG-4, a group of VOP header must be followed by a coded I-VOP.
Trick mode signaling and processing according to 13818-01 and -02 have many disadvantages. They involve tight coordination between the decoder of the MPEG-2 video bitstream and the receiver-side components processing the PES packets and trick mode syntax elements therein. This complicates the design and implementation of the decoder of the MPEG-2 video bitstream. In addition, the MPEG-2 trick modes typically require the receiver-side components to adjust time stamps for individual pictures in order to maintain synchronization of various decoding buffers, which further complicates trick mode processing. While the mechanism described in 13818-06 simplifies decoder development and implementation to some extent, the latency between client and server can lead to unacceptable delays in trick mode performance, especially when consumers expect VCR-like functionality and responsiveness.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.