In the communications industry, much attention has been focused on making more effective use of the limited number of transmission channels currently available for delivering video information and programming to an end user, such as a home viewer of cable television. Various methodologies have thus been developed to achieve the effect of an increase in the number of transmission channels that can be broadcast within the frequency bandwidth that is currently allocated to a single video transmission channel. An increase in the number of available transmission channels would allow the communications industry to reduce costs and to increase broadcast capacity. It has been estimated that a typical cable operator could have the capability to deliver as many as 500 channels to a home viewer.
A dramatic increase in the number of separate channels that could be broadcast with the currently available transmission bandwidth may be realized by employing a process for compressing and decompressing video signals. Typically, the video program signal is converted to a digital format, then compressed and encoded in accordance with an established compression algorithm or methodology. This compressed digital system signal, or bitstream, which includes a video portion, an audio portion, and other informational portions, is then transmitted to a receiver. Transmission may be over existing television channels, cable television channels, satellite communication channels, and the like. A decoder is then typically employed at the receiver to decompress and decode the received system signal in accordance with the same compression algorithm. The decoded video information may then be output to a display device, such as a television monitor.
VIDEO ENCODING
Video compression and encoding is typically performed by a video encoder. The video encoder normally implements a selected data compression algorithm that conforms to a recognized standard or specification agreed to among the senders and receivers of digital video signals. One such emerging standard, developed by the Moving Pictures Experts Group, is generally referred to as the MPEG International Standard DIS 11172. The MPEG standard defines a format for compressed digital video which supports data rates of about 1 to 1.8 megabits per second, resolutions of about 352 pixels (picture elements) horizontally by about 288 lines vertically, picture rates of about 24 to 30 pictures per second, and several VCR-like viewing options such as Normal Forward, Play, Slow Forward, Fast Forward, Fast Reverse, and Freeze.
In order to compress a video signal, it is typically necessary to sample the analog data and represent this data with digital values of luminance and color difference. The MPEG standard specifies that a luminance component (Y) of a video signal be sampled with respect to the color difference signals (Cr, Cb) by a ratio of two-to-one (2:1). That is, for every two samples of the luminance component Y, there should be one sub-sample each of the color difference components Cr and Cb. It is currently believed that the 2:1 sampling ratio is appropriate because the human eye is much more sensitive to luminance (brightness) components than to color components. Video sampling typically is performed in both the vertical and horizontal directions. Once the video signal is sampled, it is reformatted, for example, into a non-interlaced signal. An interlaced signal is one that contains only part of the picture content (i.e., every other horizontal line) for each complete display scan. A non-interlaced signal, in contrast, is one that contains all of the picture content. After a video signal is sampled and reformatted, the encoder may process it further by converting it to a different resolution in accordance with the image area to be displayed. In doing so, the encoder must determine which type of picture is to be encoded. A picture may be considered as corresponding to a single frame of motion video, or to a frame of movie film. However, different picture types may be employed for digital video transmission.
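As a rough illustration of the 2:1 sampling ratio applied in both directions, the following Python sketch keeps one chrominance sample for every 2×2 neighborhood of a color difference plane. The function name `subsample_chroma` and the choice to average the four neighbors (rather than simply discard samples) are assumptions made for illustration; the text above does not specify a filter.

```python
def subsample_chroma(plane):
    """Reduce a chroma plane by 2:1 horizontally and vertically.

    `plane` is a list of rows with even width and height. Each output
    sample is the rounded average of a 2x2 neighborhood (an illustrative
    choice of filter, not one mandated by the text).
    """
    out = []
    for y in range(0, len(plane), 2):
        row = []
        for x in range(0, len(plane[y]), 2):
            total = (plane[y][x] + plane[y][x + 1]
                     + plane[y + 1][x] + plane[y + 1][x + 1])
            row.append((total + 2) // 4)  # integer average, rounded to nearest
        out.append(row)
    return out
```

Applied to a 16-sample by 16-line chrominance region, this yields an 8 by 8 result, while the luminance samples for the same region are kept in full.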
The most prevalent picture types are: I-Pictures (Intra-Coded Pictures) which are coded without reference to any other pictures and are often referred to as anchor frames; P-Pictures (Predictive-Coded Pictures) which are coded using motion-compensated prediction from the past I- or P-reference picture, and may also be considered anchor frames; and B-Pictures (Bi-directionally Predictive-Coded Pictures) which are coded using motion compensation from a previous and a future I- or P-Picture.
A typical coding scheme may employ a mixture of I-, P-, and B-Pictures. Typically, an I-Picture may occur every half a second, with two B-Pictures inserted between each pair of I- or P-Pictures. I-Pictures provide random access points within the coded sequence of pictures where decoding can begin, but are coded with only a moderate degree of compression. P-Pictures are coded more efficiently using motion compensated prediction from a past I- or P-Picture and are generally used as a reference for further prediction. B-Pictures provide the highest degree of compression but require both past and future reference pictures for motion compensation. B-Pictures are generally not used as references for prediction. The organization of the three picture types in a particular video sequence is very flexible. A fourth picture type is defined by the MPEG standard as a D-Picture, or DC-Picture, which is provided to allow a simple, but limited quality, Fast-Forward mode.
Once the picture types have been defined, the encoder may estimate motion vectors for each 16×16 macroblock in a picture. A macroblock consists of a 16-pixel by 16-line section of the luminance component (Y) and two spatially corresponding 8-pixel by 8-line sections, one for each chrominance component Cr and Cb. Motion vectors provide displacement information between a current picture and a previously stored picture. P-Pictures use motion compensation to exploit temporal redundancy, or lack of motion, between picture frames in the video. Apparent motion between sequential pictures is caused by pixels in a previous picture occupying different positions with respect to the pixels in a current macroblock. This displacement between pixels in a previous and a current macroblock is represented by motion vectors encoded in the MPEG bitstream. Typically, the encoder chooses which picture type is to be used for each given frame. Having defined the picture type, the encoder then estimates motion vectors for each 16×16 macroblock in the picture. Typically in P-Pictures, one vector is employed for each macroblock, and in B-Pictures, one or two vectors are used. When the encoder processes B-Pictures, it usually re-orders the picture sequence so that a video decoder receiving the digital video signal operates properly. Since B-Pictures are usually coded using motion compensation based on previously sent I- or P-Pictures, a B-Picture can only be decoded after the subsequent reference picture (an I- or P-Picture) has been decoded. Thus, the sequence of the series of pictures may be re-ordered by the encoder so that the pictures arrive at the decoder in a proper sequence for decoding of the video signal. The decoder may then re-order the pictures in proper sequence for viewing.
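Motion vector estimation can be sketched as a block-matching search. The Python below is a minimal sketch, not the method of any particular encoder: it assumes an exhaustive search over a small window and a sum-of-absolute-differences (SAD) cost, neither of which is specified in the text. The names `sad`, `estimate_motion`, and the `search` parameter are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block of `cur` and a block of `ref`."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))


def estimate_motion(cur, ref, cx, cy, size=16, search=4):
    """Find the displacement (dx, dy) of the best match in `ref`.

    The block has its top-left corner at (cx, cy) in the current picture.
    The returned vector points from that position to the best-matching
    block in the reference picture. Returns ((dx, dy), cost).
    """
    height, width = len(ref), len(ref[0])
    best, best_cost = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference picture.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost
```

A cost of zero means the block is found unchanged at the displaced position, which is the temporal redundancy that motion compensation exploits.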
As mentioned previously, a macroblock is a 16×16 region of video data, corresponding to 16 pixels in the horizontal direction and 16 display lines in the vertical direction. When sampling is performed by the video encoder, every luminance component (Y) of every pixel in the horizontal direction is captured, and every luminance component of every line in the vertical direction is captured. However, only every other Cb and Cr chrominance component is similarly captured. The result is a 16×16 block of luminance components and two 8×8 blocks, one each of Cr and Cb components. Each macroblock of video data thus consists of a total of six 8×8 blocks (four 8×8 luminance blocks, one 8×8 Cr block, and one 8×8 Cb block). The spatial picture area covered by four 8×8 blocks of luminance occupies an area equivalent to the region covered by each of the 8×8 chrominance blocks. Since only half as many chrominance samples are needed in each direction to cover the same area, they fit into an 8×8 block instead of a 16×16 block.
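The six-block structure of a macroblock can be illustrated with a short Python sketch. The function name `split_macroblock` and the ordering of the returned blocks (four luminance blocks from top-left to bottom-right, then Cb, then Cr) are assumptions chosen for illustration.

```python
def split_macroblock(y16, cb8, cr8):
    """Return the six 8x8 blocks that make up one macroblock.

    `y16` is a 16x16 luminance region; `cb8` and `cr8` are the spatially
    corresponding 8x8 chrominance regions. Assumed output order: top-left,
    top-right, bottom-left, bottom-right luminance, then Cb, then Cr.
    """
    def sub(plane, x0, y0):
        # Cut an 8x8 block whose top-left corner is at (x0, y0).
        return [row[x0:x0 + 8] for row in plane[y0:y0 + 8]]

    return [sub(y16, 0, 0), sub(y16, 8, 0),
            sub(y16, 0, 8), sub(y16, 8, 8),
            cb8, cr8]
```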
For a given macroblock of video data, the encoder is programmed to select a coding mode depending on the picture type, the effectiveness of motion compensation in the particular region of the picture, and the nature of the signal within the block. After the coding method is selected, the encoder performs a motion-compensated prediction of the block contents based on past and/or future reference pictures. The encoder then produces an error signal by subtracting the prediction from the actual data in the current macroblock. The error signal is similarly separated into 8×8 blocks (four luminance blocks and two chrominance blocks). A Discrete Cosine Transform (DCT) may then be performed on each block to achieve further compression. The DCT operation converts an 8×8 block of pixel values to an 8×8 matrix of horizontal and vertical coefficients of spatial frequency. Coefficients representing one or more non-zero horizontal or non-zero vertical spatial frequencies are called AC coefficients. An 8×8 block of pixel values can subsequently be reconstructed by a video decoder performing an Inverse Discrete Cosine Transform (IDCT) on the spatial frequency coefficients.
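The transform and its inverse can be sketched directly from the definition of the two-dimensional DCT. The Python below is an unoptimized illustration that assumes an orthonormal scaling, which the text does not specify; practical encoders use fast factorizations. The names `dct_2d` and `idct_2d` are hypothetical.

```python
import math

N = 8  # block size


def _scale(k):
    # Orthonormal scaling factors (an assumed normalization).
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def _basis(n, k):
    # Cosine basis value for sample position n and frequency index k.
    return math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct_2d(block):
    """Convert an 8x8 block of pixel values to 8x8 frequency coefficients.

    coeffs[v][u] holds vertical frequency v and horizontal frequency u;
    coeffs[0][0] is the zero-frequency term, all others are AC coefficients.
    """
    return [[_scale(u) * _scale(v) * sum(block[y][x] * _basis(x, u) * _basis(y, v)
                                         for y in range(N) for x in range(N))
             for u in range(N)]
            for v in range(N)]


def idct_2d(coeffs):
    """Reconstruct an 8x8 block of pixel values from its coefficients."""
    return [[sum(_scale(u) * _scale(v) * coeffs[v][u] * _basis(x, u) * _basis(y, v)
                 for v in range(N) for u in range(N))
             for x in range(N)]
            for y in range(N)]
```

Under this scaling, a block whose 64 pixels all equal 100 transforms to a single non-zero coefficient of 800 at position [0][0], with every AC coefficient equal to zero up to rounding error.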
Additional compression is provided through predictive coding since the difference in the average value of neighboring 8×8 blocks tends to be relatively small. Predictive coding is a technique employed to improve compression based on the blocks of pixel information previously operated on by an encoder. A prediction of the pixel values for a block yet to be encoded may be performed by the encoder. The difference between the predicted and actual pixel values may then be computed and encoded. The difference values represent prediction errors which may later be used by a video decoder to correct the information of a predicted block of pixel values.
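The idea of coding differences rather than values can be shown with a small Python sketch. It assumes the simplest possible predictor, namely that each block's average equals the previous block's average; the text does not name a specific predictor, and the function names and `initial_prediction` parameter are hypothetical.

```python
def predictive_encode(values, initial_prediction=0):
    """Replace each value by its difference from the previous value."""
    prediction = initial_prediction
    differences = []
    for value in values:
        differences.append(value - prediction)  # prediction error
        prediction = value                      # next prediction
    return differences


def predictive_decode(differences, initial_prediction=0):
    """Rebuild the original values by correcting each prediction."""
    prediction = initial_prediction
    values = []
    for diff in differences:
        prediction += diff
        values.append(prediction)
    return values
```

For block averages of 120, 122, 121, and 125 with an initial prediction of 128, the encoder emits -8, 2, -1, and 4. These small magnitudes are cheaper to code than the original values.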
In addition to the signal compression that is achieved by the encoding process itself, a substantial degree of intentional signal compression is achieved by a process of selecting a quantization step size, where the quantization intervals or steps are identified by an index. Quantization of the frequency coefficients corresponding to the higher spatial frequencies favors the creation of coefficient values of zero: the quantization step size is chosen so that the human visual perception system is unlikely to notice the loss of a particular spatial frequency unless the coefficient value for that spatial frequency rises above the particular quantization level chosen. The statistical encoding of the expected runs of consecutive zero-valued coefficients corresponding to the higher-order coefficients accounts for considerable compression gain.
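A minimal Python sketch of quantization follows. It assumes a single uniform step size for every coefficient, which is a simplification: as the paragraph above suggests, the step would typically be coarser for higher spatial frequencies. The names `quantize` and `dequantize` are hypothetical.

```python
def quantize(coeffs, step):
    """Map each coefficient to the index of its quantization interval.

    Magnitudes are divided by `step` and rounded to the nearest integer,
    so any coefficient smaller than half a step becomes zero.
    """
    def index(c):
        magnitude = int(abs(c) / step + 0.5)
        return magnitude if c >= 0 else -magnitude

    return [[index(c) for c in row] for row in coeffs]


def dequantize(indices, step):
    """Recover approximate coefficient values from the interval indices."""
    return [[i * step for i in row] for row in indices]
```

With a step of 16, the coefficients 800, 30, -7, and 3 quantize to the indices 50, 2, 0, and 0, and are recovered as 800, 32, 0, and 0. The two small coefficients are lost, which is the intentional part of the compression.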
In order to cluster non-zero coefficients early in the series and to encode as many zero coefficients as possible following the last non-zero coefficient in the ordering, the coefficient sequence is organized in a specified orientation termed zigzag ordering. Zigzag ordering concentrates the highest spatial frequencies at the end of the series. Once the zigzag ordering has been performed, the encoder typically performs "run-length coding" on the AC coefficients. This process reduces each 8×8 block of DCT coefficients to a number of events represented by a non-zero coefficient and the number of preceding zero coefficients. Because the high-frequency coefficients are more likely to be zero, run-length coding results in additional video compression.
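Zigzag ordering followed by run-length coding of the AC coefficients can be sketched in Python as follows. The function names are hypothetical, and the sketch makes two assumptions: the first coefficient in the scan is handled separately from the AC coefficients, and the zeros after the last non-zero coefficient are left implicit rather than emitted as events.

```python
def zigzag_order(n=8):
    """Return the (row, column) positions of an n x n block in zigzag order.

    Positions are visited one anti-diagonal at a time, alternating the
    direction of travel on successive diagonals.
    """
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))


def run_length_ac(block):
    """Return (first_coefficient, events) for a block of quantized coefficients.

    Each event is (run, level): the number of zero coefficients preceding
    a non-zero coefficient, and the value of that coefficient.
    """
    scan = [block[r][c] for r, c in zigzag_order(len(block))]
    first, ac = scan[0], scan[1:]
    events, run = [], 0
    for coeff in ac:
        if coeff == 0:
            run += 1
        else:
            events.append((run, coeff))
            run = 0
    return first, events  # trailing zeros are implied, not emitted
```

A block with only four non-zero coefficients near the top-left corner therefore reduces from 64 values to one leading value and three events.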
The encoder may then perform Variable-Length Coding (VLC) on the resulting data. VLC is a reversible procedure for coding data that assigns shorter code words to frequent events and longer code words to less frequent events, thereby achieving additional video compression. Huffman encoding is a particularly well-known form of VLC that reduces the number of bits necessary to represent a data set without losing any information. The final compressed video data is then ready to be transmitted to a storage device or over a transmission medium for reception and decompression by a remotely located decoder.
The MPEG standard specifies a particular syntax for a compressed bitstream. The MPEG video syntax comprises six layers, each of which supports either a signal processing function or a system function. The MPEG syntax layers correspond to a hierarchical structure. A "sequence" is the top layer of the video coding hierarchy and consists of a header and some number of "Groups-of-Pictures" (GOPs). The sequence header generally initializes the state of the decoder, which allows the decoder to decode any sequence without being affected by past decoding history. A GOP is a random access point, that is, it is the smallest coding unit that can be independently decoded within a sequence. A GOP typically consists of a header and some number of "pictures." The GOP header contains time and editing information.
As discussed previously, there are four types of pictures: I-Pictures, P-Pictures, B-Pictures, and D-Pictures. Because of the picture dependencies, the order in which the pictures are transmitted, stored, or retrieved is not the display order, but rather an order required by the decoder to properly decode the pictures in the bitstream. For example, a typical sequence of pictures, in display order, might be shown as follows:
I  B  B  P  B  B  P  B  B  P  B  B  I  B  B  P  B  B  P
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
By contrast, the bitstream order corresponding to the given display order would be as follows:
I  P  B  B  P  B  B  P  B  B  I  B  B  P  B  B  P  B  B
0  3  1  2  6  4  5  9  7  8  12 10 11 15 13 14 18 16 17
Because the B-Pictures depend on a subsequent I- or P-Picture in display order, the I- or P-Picture must be transmitted and decoded before the dependent B-Pictures.
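The re-ordering from display order to bitstream order can be sketched in a few lines of Python. The function name `display_to_bitstream` is hypothetical, and the sketch assumes that every B-Picture depends on the nearest following I- or P-Picture, as in the example sequence.

```python
def display_to_bitstream(picture_types):
    """Return display indices in the order the pictures are transmitted.

    `picture_types` is a sequence of "I", "P", or "B" in display order.
    B-Pictures are held back until the next I- or P-Picture has been
    emitted, since they cannot be decoded before that reference.
    """
    order = []
    pending_b = []
    for index, ptype in enumerate(picture_types):
        if ptype == "B":
            pending_b.append(index)
        else:
            order.append(index)      # send the reference picture first
            order.extend(pending_b)  # then the B-Pictures that precede it
            pending_b = []
    order.extend(pending_b)  # trailing B-Pictures with no later reference here
    return order
```

Calling `display_to_bitstream("IBBPBBPBBPBBIBBPBBP")` returns the indices 0, 3, 1, 2, 6, 4, 5, 9, 7, 8, 12, 10, 11, 15, 13, 14, 18, 16, 17, matching the bitstream order listed above.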
Each of the "picture" portions of a GOP consists of a header and one or more "slices." The picture header contains time stamp, picture type, and coding information. A slice consists of an integral number of macroblocks from a picture and can be used by a video decoder to recover from decoding errors. If the bitstream becomes unreadable within a picture, the decoder will normally be able to recover by waiting for the next slice, without having to drop the entire picture. A slice also includes a header that contains position and quantizer scale information. "Blocks" are the basic coding unit, and the DCT is applied at this block level. Each block typically contains 64 component pixels arranged in an 8×8 order. The pixel values are not individually coded, but are components of the coded block. A macroblock is the basic unit for motion compensation and quantizer scale changes. As discussed previously, each macroblock consists of a header and six component 8×8 blocks: four blocks of luminance, one block of Cb chrominance, and one block of Cr chrominance. The macroblock header contains quantizer scale and motion compensation information.