In the communications industry, much attention has been focused on making more effective use of the limited number of transmission channels currently available for delivering video information and programming to an end user, such as a home viewer of cable television. Various methodologies have thus been developed to achieve the effect of an increase in the number of transmission channels that can be broadcast within the frequency bandwidth that is currently allocated to a single video transmission channel. An increase in the number of available transmission channels would allow the communications industry to reduce costs and to increase broadcast capacity. It has been estimated that a typical cable operator could have the capability to deliver as many as 500 channels to a home viewer.
A dramatic increase in the number of separate channels that could be broadcast with the currently available transmission bandwidth may be realized by employing a process for compressing and decompressing video signals. Typically, the video program signal is converted to a digital format, compressed, and encoded in accordance with an established compression algorithm or methodology. This compressed digital system signal, or bitstream, which includes a video portion, an audio portion, and other informational portions, is then transmitted to a receiver. Transmission may be over existing television channels, cable television channels, satellite communication channels, and the like. A decoder is then typically employed at the receiver to decompress and decode the received system signal in accordance with the same compression algorithm previously mentioned. The decoded video information may then be output to a display device, such as a television monitor.
VIDEO ENCODING
Video compression and encoding is typically performed by a video encoder. The video encoder normally implements a selected data compression algorithm that conforms to a recognized standard or specification agreed to among the senders and receivers of digital video signals. One such emerging standard, developed by the Moving Picture Experts Group, is generally referred to as the MPEG International Standard DIS 11172. The MPEG standard defines a format for compressed digital video which supports data rates of about 1 to 1.8 megabits per second, resolutions of about 352 pixels (picture elements) horizontally by about 288 lines vertically, picture rates of about 24 to 30 pictures per second, and several VCR-like viewing options such as Normal Forward, Play, Slow Forward, Fast Forward, Fast Reverse, and Freeze.
In order to compress a video signal, it is typically necessary to sample the analog data and represent this data with digital values of luminance and color difference. The MPEG standard specifies that a luminance component (Y) of a video signal be sampled with respect to the color difference signals (Cr, Cb) by a ratio of two-to-one (2:1). That is, for every two samples of the luminance component Y, there should be one sub-sample each of the color difference components Cr and Cb. It is currently believed that the 2:1 sampling ratio is appropriate because the human eye is much more sensitive to luminance (brightness) components than to color components. Video sampling typically is performed in both the vertical and horizontal directions. Once the video signal is sampled, it is reformatted, for example, into a non-interlaced signal. An interlaced signal is one that contains only part of the picture content (i.e., every other horizontal line) for each complete display scan. A non-interlaced signal, in contrast, is one that contains all of the picture content. After a video signal is sampled and reformatted, the encoder may process it further by converting it to a different resolution in accordance with the image area to be displayed. In doing so, the encoder must determine which type of picture is to be encoded. A picture may be considered as corresponding to a single frame of motion video, or to a frame of movie film. However, different picture types may be employed for digital video transmission.
The most prevalent picture types are: I-Pictures (Intra-Coded Pictures) which are coded without reference to any other pictures and are often referred to as anchor frames; P-Pictures (Predictive-Coded Pictures) which are coded using motion-compensated prediction from the past I- or P-reference picture, and may also be considered anchor frames; and B-Pictures (Bi-directionally Predictive-Coded Pictures) which are coded using motion compensation from a previous and a future I- or P-Picture.
A typical coding scheme may employ a mixture of I-, P-, and B-Pictures. Typically, an I-Picture may occur every half a second, with two B-Pictures inserted between each pair of I- or P-Pictures. I-Pictures provide random access points within the coded sequence of pictures where decoding can begin, but are coded with only a moderate degree of compression. P-Pictures are coded more efficiently using motion compensated prediction from a past I- or P-Picture and are generally used as a reference for further prediction. B-Pictures provide the highest degree of compression but require both past and future reference pictures for motion compensation. B-Pictures are generally not used as references for prediction. The organization of the three picture types in a particular video sequence is very flexible. A fourth picture type is defined by the MPEG standard as a D-Picture, or DC-Picture, which is provided to allow a simple, but limited quality, Fast-Forward mode.
Once the picture types have been defined, the encoder may estimate motion vectors for each 16×16 macroblock in a picture. A macroblock consists of a 16-pixel by 16-line section of the luminance component (Y) and two spatially corresponding 8-pixel by 8-line sections, one for each chrominance component Cr and Cb. Motion vectors provide displacement information between a current picture and a previously stored picture. P-Pictures use motion compensation to exploit temporal redundancy, or lack of motion, between picture frames in the video. Apparent motion between sequential pictures is caused by pixels in a previous picture occupying different positions with respect to the pixels in a current macroblock. This displacement between pixels in a previous and a current macroblock is represented by motion vectors encoded in the MPEG bitstream. Typically, the encoder chooses which picture type is to be used for each given frame. Having defined the picture type, the encoder then estimates motion vectors for each 16×16 macroblock in the picture. Typically in P-Pictures, one vector is employed for each macroblock, and in B-Pictures, one or two vectors are used. When the encoder processes B-Pictures, it usually re-orders the picture sequence so that a video decoder receiving the digital video signal operates properly. Since B-Pictures are usually coded using motion compensation based on previously sent I- or P-Pictures, the B-Pictures can only be decoded after the subsequent reference picture (an I- or P-Picture) has been decoded. Thus, the sequence of the series of pictures may be re-ordered by the encoder so that the pictures arrive at the decoder in a proper sequence for decoding of the video signal. The decoder may then re-order the pictures in proper sequence for viewing.
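By way of illustration only, the estimation of a motion vector for one block may be sketched as an exhaustive block-matching search. The matching criterion used here (sum of absolute differences), the search range, and the names `sad` and `estimate_motion` are assumptions introduced for this sketch; the MPEG standard does not prescribe how an encoder finds its vectors.

```python
def sad(current_block, reference, top, left):
    """Sum of absolute differences between current_block and the same-sized
    region of the reference picture whose top-left corner is (top, left)."""
    size = len(current_block)
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(current_block[r][c] - reference[top + r][left + c])
    return total


def estimate_motion(current_block, reference, top, left, search=2):
    """Return the (dy, dx) displacement, within +/- search samples of
    (top, left), whose reference region best matches current_block."""
    size = len(current_block)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            # Skip candidate regions that fall outside the reference picture.
            if t < 0 or l < 0:
                continue
            if t + size > len(reference) or l + size > len(reference[0]):
                continue
            cost = sad(current_block, reference, t, l)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2]


# Hypothetical 8x8 reference picture with a distinct value at every position.
reference = [[r * 8 + c for c in range(8)] for r in range(8)]
# A 2x2 block that sits at (row 3, column 4) in the reference picture...
current_block = [[28, 29], [36, 37]]
# ...but is located at (row 2, column 3) in the current picture.
vector = estimate_motion(current_block, reference, top=2, left=3)
```

In this small example the search returns a displacement of one line down and one pixel to the right. A practical encoder would operate on 16×16 macroblocks and would usually employ a faster search strategy.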
As mentioned previously, a macroblock is a 16×16 region of video data, corresponding to 16 pixels in the horizontal direction and 16 display lines in the vertical direction. When sampling is performed by the video encoder, every luminance component (Y) of every pixel in the horizontal direction is captured, and every luminance component of every line in the vertical direction is captured. However, only every other Cb and Cr chrominance component is similarly captured. The result is a 16×16 block of luminance components and two 8×8 blocks, one each of Cr and Cb components. Each macroblock of video data thus consists of a total of six 8×8 blocks (four 8×8 luminance blocks, one 8×8 Cr block, and one 8×8 Cb block). The spatial picture area covered by the four 8×8 blocks of luminance is equivalent to the region covered by each of the 8×8 chrominance blocks. Since half as many chrominance samples are needed in each direction to cover the same area, they fit into an 8×8 block instead of a 16×16 block.
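The macroblock layout just described may be sketched as follows. The sketch assumes that chrominance sub-sampling is done by simply keeping every other sample in each direction, as the text states; a practical encoder might filter before decimating. The function names and the sample data are hypothetical.

```python
def subsample_2to1(plane):
    """Keep every other sample horizontally and every other line vertically."""
    return [row[::2] for row in plane[::2]]


def split_into_blocks(plane, size=8):
    """Cut a plane into size x size blocks, left to right, top to bottom."""
    blocks = []
    for top in range(0, len(plane), size):
        for left in range(0, len(plane[0]), size):
            blocks.append([row[left:left + size]
                           for row in plane[top:top + size]])
    return blocks


def build_macroblock(y16, cb16, cr16):
    """Return six 8x8 blocks: four luminance, then one Cb and one Cr."""
    return split_into_blocks(y16) + [subsample_2to1(cb16), subsample_2to1(cr16)]


# Hypothetical 16x16 input planes covering one macroblock.
y = [[r * 16 + c for c in range(16)] for r in range(16)]
cb = [[1] * 16 for _ in range(16)]
cr = [[2] * 16 for _ in range(16)]
macroblock = build_macroblock(y, cb, cr)
```

Each of the six resulting blocks is 8×8, so the two chrominance blocks together carry half as many samples as the four luminance blocks while covering the same picture area.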
For a given macroblock of video data, the encoder is programmed to select a coding mode depending on the picture type, the effectiveness of motion compensation in the particular region of the picture, and the nature of the signal within the block. After the coding method is selected, the encoder performs a motion-compensated prediction of the block contents based on past and/or future reference pictures. The encoder then produces an error signal by subtracting the prediction from the actual data in the current macroblock. The error signal is similarly separated into 8×8 blocks (four luminance blocks and two chrominance blocks). A Discrete Cosine Transform (DCT) may then be performed on each block to achieve further compression. The DCT operation converts an 8×8 block of pixel values to an 8×8 matrix of horizontal and vertical coefficients of spatial frequency. Coefficients representing one or more non-zero horizontal or non-zero vertical spatial frequencies are called AC coefficients. An 8×8 block of pixel values can subsequently be reconstructed by a video decoder performing an Inverse Discrete Cosine Transform (IDCT) on the spatial frequency coefficients.
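The DCT and IDCT operations may be illustrated with a direct, unoptimized sketch of the two-dimensional transform on an 8×8 block. The orthonormal scaling used here and the names `dct_8x8` and `idct_8x8` are assumptions of this sketch; real encoders and decoders use fast factorizations rather than the four nested loops shown.

```python
import math

N = 8


def _scale(k):
    """Normalization factor that makes the transform orthonormal."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def _basis(sample, freq):
    """Cosine basis value for one sample position and one frequency index."""
    return math.cos((2 * sample + 1) * freq * math.pi / (2 * N))


def dct_8x8(block):
    """Forward DCT: 8x8 pixel values -> 8x8 spatial frequency coefficients."""
    coeffs = [[0.0] * N for _ in range(N)]
    for v in range(N):          # vertical frequency
        for u in range(N):      # horizontal frequency
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += block[y][x] * _basis(x, u) * _basis(y, v)
            coeffs[v][u] = _scale(u) * _scale(v) * total
    return coeffs


def idct_8x8(coeffs):
    """Inverse DCT: 8x8 coefficients -> 8x8 pixel values."""
    block = [[0.0] * N for _ in range(N)]
    for y in range(N):
        for x in range(N):
            total = 0.0
            for v in range(N):
                for u in range(N):
                    total += (_scale(u) * _scale(v) * coeffs[v][u]
                              * _basis(x, u) * _basis(y, v))
            block[y][x] = total
    return block


# A flat block has energy only at zero frequency; every AC coefficient is
# (numerically) zero.
flat = [[100] * N for _ in range(N)]
flat_coeffs = dct_8x8(flat)

# A hypothetical textured block survives the forward/inverse round trip.
textured = [[(3 * r + 5 * c) % 256 for c in range(N)] for r in range(N)]
restored = idct_8x8(dct_8x8(textured))
```

The flat block shows why the transform aids compression: 64 identical pixel values collapse into a single non-zero coefficient.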
Additional compression is provided through predictive coding since the difference in the average value of neighboring 8×8 blocks tends to be relatively small. Predictive coding is a technique employed to improve compression based on the blocks of pixel information previously operated on by an encoder. A prediction of the pixel values for a block yet to be encoded may be performed by the encoder. The difference between the predicted and actual pixel values may then be computed and encoded. The difference values represent prediction errors which may later be used by a video decoder to correct the information of a predicted block of pixel values.
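A minimal sketch of this predictive coding, applied to the average values of successive blocks, is given below. It assumes that the prediction for each block is simply the value of the previous block and that the first prediction is a fixed starting value of 128; both choices, and the function names, are assumptions made for illustration.

```python
def predictive_encode(values, initial_prediction=128):
    """Replace each value with its difference from the previous value."""
    prediction = initial_prediction
    differences = []
    for value in values:
        differences.append(value - prediction)
        prediction = value
    return differences


def predictive_decode(differences, initial_prediction=128):
    """Rebuild the values by adding each difference to the running prediction."""
    prediction = initial_prediction
    values = []
    for difference in differences:
        prediction += difference
        values.append(prediction)
    return values


# Hypothetical average values of five neighboring 8x8 blocks.
block_averages = [120, 122, 121, 125, 124]
differences = predictive_encode(block_averages)
recovered = predictive_decode(differences)
```

The differences here are -8, 2, -1, 4, and -1: small numbers that a variable-length code can represent with fewer bits than the original averages.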
In addition to the signal compression that is achieved by the encoding process itself, a substantial degree of intentional signal compression is achieved by a process of selecting a quantization step size, where the quantization intervals or steps are identified by an index. Quantization of the frequency coefficients corresponding to the higher spatial frequencies favors the creation of coefficient values of zero. The step size is chosen so that the human visual perception system is unlikely to notice the loss of a particular spatial frequency unless the coefficient value for that spatial frequency rises above the particular quantization level chosen. The statistical encoding of the expected runs of consecutive zero-valued coefficients corresponding to the higher-order coefficients accounts for considerable compression gain.
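The effect of the quantization step size may be seen in the following sketch, which assumes a single uniform step applied to every coefficient and truncation toward zero. These are simplifying assumptions; a practical MPEG quantizer weights each spatial frequency differently. The step value of 16 and the coefficient values are hypothetical.

```python
def quantize(coefficients, step):
    """Map each coefficient to the index of its quantization interval,
    truncating toward zero."""
    return [int(c / step) for c in coefficients]


def dequantize(indices, step):
    """Reconstruct approximate coefficient values from the indices."""
    return [i * step for i in indices]


# Hypothetical coefficients, ordered from low to high spatial frequency.
coefficients = [800.0, 45.0, -30.0, 12.0, -7.0, 3.0, 1.0, 0.5]
indices = quantize(coefficients, step=16)
approximation = dequantize(indices, step=16)
```

Every coefficient whose magnitude is below the step size becomes zero, so the small high-frequency values at the end of the list form a run of zeros.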
In order to cluster non-zero coefficients early in the series and to encode as many zero coefficients as possible following the last non-zero coefficient in the ordering, the coefficient sequence is organized in a specified orientation termed zigzag ordering. Zigzag ordering concentrates the highest spatial frequencies at the end of the series. Once the zigzag ordering has been performed, the encoder typically performs "run-length coding" on the AC coefficients. This process reduces each 8×8 block of DCT coefficients to a number of events represented by a non-zero coefficient and the number of preceding zero coefficients. Because the high-frequency coefficients are more likely to be zero, run-length coding results in additional video compression.
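Zigzag ordering and run-length coding may be sketched as follows. The sketch generates the conventional diagonal zigzag scan for an 8×8 block and then reduces the AC coefficients to (run, level) events, where the run is the number of preceding zero coefficients. The function names and the sample block are assumptions, and the end-of-block signalling of a real bitstream is omitted.

```python
def zigzag_order(n=8):
    """Return (row, column) positions of an n x n block in zigzag scan order."""
    def key(position):
        row, col = position
        diagonal = row + col
        # Odd diagonals run downward (row increasing), even ones upward.
        return (diagonal, row if diagonal % 2 else -row)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def zigzag_scan(block):
    """Flatten a square block of coefficients into zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def run_length_encode(ac_coefficients):
    """Reduce AC coefficients to (zero_run, level) events. Zeros after the
    last non-zero coefficient produce no event."""
    events = []
    run = 0
    for level in ac_coefficients:
        if level == 0:
            run += 1
        else:
            events.append((run, level))
            run = 0
    return events


# Hypothetical quantized block: a few low-frequency values, zeros elsewhere.
block = [[0] * 8 for _ in range(8)]
block[0][0] = 50   # zero-frequency term, excluded from run-length coding here
block[0][1] = 2
block[1][0] = -1
block[1][1] = 3

scanned = zigzag_scan(block)
events = run_length_encode(scanned[1:])
```

Sixty-three AC coefficients are reduced to three events, because the long tail of zeros after the last non-zero coefficient needs no events at all.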
The encoder may then perform Variable-Length Coding (VLC) on the resulting data. VLC is a reversible procedure for coding data that assigns shorter code words to frequent events and longer code words to less frequent events, thereby achieving additional video compression. Huffman encoding is a particularly well-known form of VLC that reduces the number of bits necessary to represent a data set without losing any information. The final compressed video data is then ready to be transmitted to a storage device or over a transmission medium for reception and decompression by a remotely located decoder. The MPEG standard specifies a particular syntax for a compressed bitstream. The MPEG video syntax comprises six layers, each of which supports either a signal processing function or a system function. The MPEG syntax layers correspond to a hierarchical structure. A "sequence" is the top layer of the video coding hierarchy and consists of a header and some number of "Groups-of-Pictures" (GOPs). The sequence header generally initializes the state of the decoder, which allows the decoder to decode any sequence without being affected by past decoding history. A GOP is a random access point, that is, it is the smallest coding unit that can be independently decoded within a sequence. A GOP typically consists of a header and some number of "pictures." The GOP header contains time and editing information. As discussed previously, there are four types of pictures: I-Pictures, P-Pictures, B-Pictures, and D-Pictures. Because of the picture dependencies, the order in which the pictures are transmitted, stored, or retrieved, is not the display order, but rather an order required by the decoder to properly decode the pictures in the bitstream. For example, a typical sequence of pictures, in display order, might be shown as follows:
_________________________________________________________________________________________
Picture type     I   B   B   P   B   B   P   B   B   P   B   B   I   B   B   P   B   B   P
_________________________________________________________________________________________
Display index    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
_________________________________________________________________________________________
By contrast, the bitstream order corresponding to the given display order would be as follows:
_________________________________________________________________________________________
Picture type     I   P   B   B   P   B   B   P   B   B   I   B   B   P   B   B   P   B   B
_________________________________________________________________________________________
Display index    0   3   1   2   6   4   5   9   7   8  12  10  11  15  13  14  18  16  17
_________________________________________________________________________________________
Because the B-Pictures depend on a subsequent I- or P-Picture in display order, the I- or P-Picture must be transmitted and decoded before the dependent B-Pictures.
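The re-ordering from display order to bitstream order may be expressed as a short routine. The sketch below assumes only the rule stated above, namely that each I- or P-Picture is moved ahead of the B-Pictures that precede it in display order; the function name is hypothetical.

```python
def display_to_bitstream_order(picture_types):
    """Given picture types in display order, return the display indices in
    the order in which the pictures would be placed in the bitstream."""
    order = []
    waiting_b = []
    for index, picture_type in enumerate(picture_types):
        if picture_type == "B":
            # A B-Picture must wait for the next I- or P-Picture.
            waiting_b.append(index)
        else:
            # Send the reference picture first, then the B-Pictures that
            # depend on it.
            order.append(index)
            order.extend(waiting_b)
            waiting_b = []
    # Any trailing B-Pictures have no later reference within this list.
    order.extend(waiting_b)
    return order


display_types = "I B B P B B P B B P B B I B B P B B P".split()
bitstream_order = display_to_bitstream_order(display_types)
```

Applied to the display sequence given above, the routine yields the bitstream order 0, 3, 1, 2, 6, 4, 5, 9, 7, 8, 12, 10, 11, 15, 13, 14, 18, 16, 17.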
Each of the "picture" portions of a GOP consists of a header and one or more "slices." The picture header contains time stamp, picture type, and coding information. A slice consists of an integral number of macroblocks from a picture and can be used by a video decoder to recover from decoding errors. If the bitstream becomes unreadable within a picture, the decoder will normally be able to recover by waiting for the next slice, without having to drop the entire picture. A slice also includes a header that contains position and quantizer scale information. "Blocks" are the basic coding unit, and the DCT is applied at this block level. Each block typically contains 64 component pixels arranged in an 8×8 order. The pixel values are not individually coded, but are components of the coded block. A macroblock is the basic unit for motion compensation and quantizer scale changes. As discussed previously, each macroblock consists of a header and six component 8×8 blocks: four blocks of luminance, one block of Cb chrominance, and one block of Cr chrominance. The macroblock header contains quantizer scale and motion compensation information.
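The six-layer hierarchy may be pictured as nested data structures. The following sketch is illustrative only: the class and field names are hypothetical and merely mirror the header contents mentioned in the text, not the actual bit-level syntax of the standard.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    # 64 coefficients arranged as an 8x8 block.
    coefficients: List[int] = field(default_factory=lambda: [0] * 64)


@dataclass
class Macroblock:
    quantizer_scale: int
    motion_vectors: List[Tuple[int, int]]
    # Four luminance blocks, one Cb block, one Cr block.
    blocks: List[Block] = field(
        default_factory=lambda: [Block() for _ in range(6)])


@dataclass
class Slice:
    position: int
    quantizer_scale: int
    macroblocks: List[Macroblock] = field(default_factory=list)


@dataclass
class Picture:
    time_stamp: int
    picture_type: str  # "I", "P", "B", or "D"
    slices: List[Slice] = field(default_factory=list)


@dataclass
class GroupOfPictures:
    time_code: str
    pictures: List[Picture] = field(default_factory=list)


@dataclass
class Sequence:
    gops: List[GroupOfPictures] = field(default_factory=list)


def count_blocks(sequence):
    """Walk every layer of the hierarchy and count the 8x8 blocks."""
    return sum(len(mb.blocks)
               for gop in sequence.gops
               for picture in gop.pictures
               for sl in picture.slices
               for mb in sl.macroblocks)


# A hypothetical sequence: one GOP, one I-Picture, one slice, two macroblocks.
macroblocks = [Macroblock(quantizer_scale=8, motion_vectors=[])
               for _ in range(2)]
picture = Picture(time_stamp=0, picture_type="I",
                  slices=[Slice(position=0, quantizer_scale=8,
                                macroblocks=macroblocks)])
sequence = Sequence(gops=[GroupOfPictures(time_code="00:00:00:00",
                                          pictures=[picture])])
total_blocks = count_blocks(sequence)
```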
VIDEO DECODING
The video decoding process is generally the inverse of the video encoding process and is employed to reconstruct a motion picture sequence from a compressed and encoded bitstream. The data in the bitstream is decoded according to a syntax that is itself defined by the data compression algorithm. The decoder must first identify the beginning of a coded picture, identify the type of picture, then decode each individual macroblock within a particular picture. If there are motion vectors and macroblock types (each of the picture types I, P, and B has its own macroblock types) present in the bitstream, they can be used to construct a prediction of the current macroblock based on past and future reference pictures that the decoder has already stored. Coefficient data is then inverse quantized and operated on by an inverse DCT process (IDCT) so as to transform the macroblock data from the frequency domain to data in the time and space domain.
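The reconstruction of one block from a stored reference picture may be sketched as below. The sketch assumes whole-sample motion vectors, a single reference picture, and 8-bit pixel values clamped to the range 0 to 255; these assumptions and the function names are introduced here for illustration and are not taken from the standard.

```python
def predict_block(reference, top, left, dy, dx, size):
    """Fetch the size x size region of the reference picture displaced by
    the motion vector (dy, dx) from the block position (top, left)."""
    return [row[left + dx:left + dx + size]
            for row in reference[top + dy:top + dy + size]]


def reconstruct_block(prediction, error):
    """Add the decoded error signal to the prediction and clamp each result
    to the assumed 8-bit pixel range."""
    return [[max(0, min(255, p + e)) for p, e in zip(p_row, e_row)]
            for p_row, e_row in zip(prediction, error)]


# Hypothetical 6x6 reference picture and a 2x2 block at (row 2, column 2)
# whose motion vector points one line up and one pixel to the right.
reference = [[r * 10 + c for c in range(6)] for r in range(6)]
prediction = predict_block(reference, top=2, left=2, dy=-1, dx=1, size=2)

# Error values as they might emerge from inverse quantization and the IDCT.
error = [[1, -20], [300, 0]]
pixels = reconstruct_block(prediction, error)
```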
After all of the macroblocks have been processed by the decoder, the picture reconstruction is complete. If a reconstructed picture is a reference picture (an I- or P-Picture), it replaces the oldest stored reference picture and is used as the new reference for subsequent pictures. As noted above, the pictures may also need to be re-ordered before they are displayed in accordance with their display order instead of their coding order. After the pictures are re-ordered, they may then be displayed on an appropriate output device.
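The handling of reference pictures and the re-ordering for display may be sketched together. The sketch assumes that the decoder keeps two reference pictures, that a B-Picture can be displayed as soon as it is decoded, and that a reference picture is held back until the next reference picture arrives. Pictures are represented only by their display index and type, and the function name is hypothetical.

```python
from collections import deque


def decode_order_to_display_order(coded_pictures):
    """coded_pictures: (display_index, picture_type) pairs in bitstream order.
    Returns the display indices in output order and the reference pictures
    still stored at the end."""
    references = deque(maxlen=2)  # appending a third drops the oldest
    held_reference = None
    display = []
    for index, picture_type in coded_pictures:
        if picture_type == "B":
            display.append(index)
        else:
            # A new reference releases the previously held one for display
            # and replaces the oldest stored reference picture.
            if held_reference is not None:
                display.append(held_reference)
            held_reference = index
            references.append(index)
    if held_reference is not None:
        display.append(held_reference)
    return display, list(references)


types = "I P B B P B B P B B I B B P B B P B B".split()
indices = [0, 3, 1, 2, 6, 4, 5, 9, 7, 8, 12, 10, 11, 15, 13, 14, 18, 16, 17]
display, stored = decode_order_to_display_order(list(zip(indices, types)))
```

Fed the bitstream order of the earlier example, the routine emits pictures 0 through 18 in ascending order and ends with pictures 15 and 18 as the two stored references.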
PRIOR ART DECODING SCHEMES
In FIG. 1, there is shown a typical and conventional video decoding and display system illustrated in block diagram form. An encoded system bitstream, containing video, audio, and other information, is typically written directly to a channel buffer 10 from a fixed rate channel 8. A synchronizer 12 receives the multiplexed system bitstream from the channel buffer 10 and pre-processes the system bitstream prior to its being input to a video decoder 14. Synchronization generally involves finding a unique pattern of bits, often termed sync codes or start codes, in the multiplexed system bitstream, and aligning the bitstream data following the sync code. The various groupings of bits making up the bitstream are often referred to as variable length symbols. These variable length symbols typically represent specific signal information in accordance with the syntax of the encoding and decoding algorithm employed, such as the MPEG standard.
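The search for start codes may be illustrated with a short routine. The sketch assumes the byte-aligned three-byte prefix 0x000001 used by MPEG start codes, followed by a one-byte code value; the function name and the sample data are hypothetical, and the sketch makes no attempt to interpret the code values it finds.

```python
START_CODE_PREFIX = b"\x00\x00\x01"


def find_start_codes(data):
    """Return (offset, code_value) for each start code prefix in data that
    is followed by at least one more byte."""
    found = []
    offset = data.find(START_CODE_PREFIX)
    while offset != -1 and offset + 3 < len(data):
        found.append((offset, data[offset + 3]))
        # Resume the search just past the prefix that was found.
        offset = data.find(START_CODE_PREFIX, offset + 3)
    return found


# Hypothetical buffer contents: a stray byte, a start code with value 0xB3,
# two payload bytes, and a second start code with value 0x00.
buffer_data = b"\xff\x00\x00\x01\xb3\x12\x34\x00\x00\x01\x00\xaa"
start_codes = find_start_codes(buffer_data)
```

Once an offset is known, the data following the start code can be aligned and passed on for decoding.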
In a conventional configuration, as illustrated in FIG. 1, the channel buffer 10 must have sufficient storage capacity to store the continuous stream of system bitstream data that is transmitted through the fixed rate channel 8. Also, the channel buffer 10 must have additional storage capacity to store bitstream data previously received from the fixed rate channel 8 that is temporarily buffered and awaiting eventual transfer to the synchronizer 12. At the appropriate time, the bitstream data stored in the channel buffer 10 is transferred to the synchronizer 12 and then to the video decoder 14. The video data component of the multiplexed system bitstream may then be decoded by the video decoder 14 and picture reconstruction subsequently performed. The video decoder 14 temporarily stores the pictures to be displayed for a period of time necessary for the decoder 14 to synchronize with the display controller 16. After synchronization between the video decoder 14 and the display controller 16 has been established, the display controller 16 must typically re-initialize to a state required to accept a subsequent reconstructed picture from the decoder 14. The display controller 16 then reads the frame to be displayed from the video decoder 14 for output to an appropriate output device 18.
It can be appreciated that processing delays associated with re-initializing the display controller 16 and synchronizing the display controller 16 with the video decoder 14 have the adverse effect of delaying further processing of bitstream data received from the fixed rate channel 8. During these periods of delay, the channel buffer 10 must accommodate the bitstream data being continuously received from the fixed rate channel 8 in order to prevent the loss of the incoming bitstream data. Moreover, these delays may result in the accumulation of additional frames which must be stored in the video decoder 14 while the display controller 16 re-initializes and synchronizes with the decoder 14. Thus, the processing delays inherent in a conventional video decoding scheme, as illustrated in FIG. 1, usually necessitate a substantial increase in the amount of memory allocated to both the channel buffer 10 and the video decoder 14.
It has been estimated that in a conventional television system, the time period required to synchronize the display controller 16 with the video decoder 14 is approximately 33 milliseconds. A time period of 33 milliseconds is roughly equivalent to the time it takes to display a single frame of video at a rate of 30 frames per second (National Television System Committee [NTSC] Standard display rate). Accordingly, the channel buffer 10 must have a memory capacity sufficient to accommodate the bitstream data received from the fixed rate channel 8 during this 33 millisecond delay. With a typical channel bit-rate of 8 megabits per second, the channel buffer 10 would have to store approximately 266 kilobits of excess bitstream data. It is anticipated that channel rates of 15 megabits per second may be appropriate in certain video decoding system configurations. At a channel rate of 15 megabits per second, the channel buffer 10 memory would have to be expanded to accommodate nearly 500 kilobits of additional bitstream data received from the fixed rate channel 8. Other processing delays, such as display controller 16 re-initialization, would result in the further accumulation of excess bitstream data in the channel buffer 10. The substantial increase in the amount of memory required to buffer this excess bitstream data in the channel buffer 10 during these processing delays would necessarily result in a significant increase in the size, complexity, and cost of the video decoding circuitry.
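The buffer figures quoted above follow directly from multiplying the channel rate by the delay. The short calculation below takes the delay to be exactly one frame period at 30 frames per second (about 33 milliseconds); the function name is hypothetical.

```python
def excess_buffer_bits(channel_rate_bps, delay_seconds):
    """Bits that arrive from a fixed rate channel during a processing delay."""
    return channel_rate_bps * delay_seconds


FRAME_PERIOD = 1 / 30  # seconds, roughly 33 milliseconds

excess_at_8_mbps = excess_buffer_bits(8_000_000, FRAME_PERIOD)
excess_at_15_mbps = excess_buffer_bits(15_000_000, FRAME_PERIOD)
```

At 8 megabits per second the result is roughly 266,667 bits, in line with the figure of approximately 266 kilobits, and at 15 megabits per second it is 500,000 bits.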
In view of the deficiencies inherent in conventional video decoding schemes discussed above, digital video signal transmission is still highly complex and expensive. Thus, there exists in the communications industry a keenly felt need to increase the efficiency of video decoders while minimizing both the complexity and cost of effective implementations. The present invention fulfills this need.