1. The Field of the Invention
The present invention relates to the field of digital video. More specifically, the present invention relates to reducing the memory bandwidth and memory footprint needed to decode and display MPEG video streams.
2. The Relevant Technology
High Definition Television (“HDTV”) is a type of television that provides much better resolution than televisions based on the National Television System Committee (“NTSC”) standard. Although there are a number of competing HDTV standards, all standards support a wider screen than NTSC and roughly twice the resolution. Sending uncompressed analog HDTV data requires a bandwidth of about 18 MHz. However, current terrestrial channel allocations are limited to 6 MHz. As a result, HDTV video frames are digitized and then compressed before they are transmitted and then decompressed when they reach a receiving device, such as an HDTV television.
One widely used compression method is based on the Moving Picture Experts Group standard and is commonly referred to as MPEG. MPEG employs interframe encoding, which means some of the frames are used as reference frames for other frames in compressed video data. An MPEG video bit stream includes I-frames, P-frames and B-frames. I-frames and P-frames can be used as a reference for other frames, hence they are known collectively as reference frames. I-frames, or “Intraframes,” are independent frames that may be encoded and decoded without referring to any other frames in the MPEG video bit stream. P-frames, or “Predictive” frames, are encoded and decoded using the previous reference frame, be it an I-frame or a P-frame. B-frames, or “Bi-directionally predictive” frames, are reproduced using reference frames that are the closest temporally previous to and/or subsequent to the B-frame. Since I-frames do not reference other frames for information, I-frames are typically substantially larger in size than P-frames and B-frames.
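By way of illustration only, the reference relationships described above can be sketched as follows. The function name `reference_frames` and the representation of a bit stream as a string of frame types in display order are assumptions made for this example and are not taken from the MPEG standard.

```python
def reference_frames(display_order):
    """Map each frame index to the indices of the frames it is predicted from.

    `display_order` is a string of frame types ('I', 'P' or 'B') in display
    order. This representation is hypothetical and chosen for illustration.
    """
    # Only I-frames and P-frames may serve as reference frames.
    anchors = [i for i, kind in enumerate(display_order) if kind in "IP"]
    refs = {}
    for i, kind in enumerate(display_order):
        previous = [a for a in anchors if a < i]
        following = [a for a in anchors if a > i]
        if kind == "I":
            refs[i] = []                     # decoded without any reference
        elif kind == "P":
            refs[i] = previous[-1:]          # the previous reference frame
        else:
            # B-frame: closest reference frame before and/or after it
            refs[i] = previous[-1:] + following[:1]
    return refs


if __name__ == "__main__":
    for index, sources in reference_frames("IBBPBBP").items():
        print(index, sources)
```

In this sketch, a B-frame located between two reference frames depends on both, which is why a decoder typically holds two reference images while B-frames are being decoded.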
An MPEG video frame having the YUV420 format includes regions designated as macroblocks having a size of 16 pixels by 16 lines. Within each macroblock, there are six 8×8 blocks of data, four for luminance components, and two for subsampled chrominance data.
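The macroblock layout just described accounts for the figure of 1.5 bytes per pixel commonly associated with the YUV420 format. The following minimal sketch works through that arithmetic; it assumes 8-bit samples, and the constant and function names are hypothetical.

```python
MACROBLOCK_SIDE = 16   # a macroblock covers 16 pixels by 16 lines
BLOCK_SIDE = 8         # each block holds 8x8 samples
BYTES_PER_SAMPLE = 1   # assumption: 8-bit samples


def blocks_per_macroblock():
    """Return (luminance blocks, chrominance blocks) for YUV420."""
    luma = (MACROBLOCK_SIDE // BLOCK_SIDE) ** 2   # four 8x8 Y blocks
    chroma = 2   # one U and one V block, each subsampled by two in both directions
    return luma, chroma


def bytes_per_pixel():
    """Average storage per pixel implied by the macroblock layout."""
    luma, chroma = blocks_per_macroblock()
    total = (luma + chroma) * BLOCK_SIDE * BLOCK_SIDE * BYTES_PER_SAMPLE
    return total / (MACROBLOCK_SIDE * MACROBLOCK_SIDE)


if __name__ == "__main__":
    print(blocks_per_macroblock())   # (4, 2)
    print(bytes_per_pixel())         # 1.5
```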
As shown in FIG. 1a, an MPEG encoding system, such as MPEG encoder 100, receives video data 104, which is a sequence of video images. MPEG encoder 100 typically includes discrete cosine transform (DCT) module 101, motion vector generation module 102 and a picture type determination module 103, which separate video data 104 into different requisite parts. DCT module 101 is used to transform blocks of the video data from the spatial domain into a frequency domain representation of the same blocks. Motion vector generation module 102 is used to generate motion vectors, which represent motion between macroblock regions in the frames of video data 104. Picture type determination module 103 determines which frames should be used as reference frames (I-frames). After being encoded, MPEG video bit stream 105 includes frequency coefficients 106, motion vectors 107, and header information 108, which specifies size, picture coding type, etc.
To reconstruct the original sequence of video images, inverse operations are performed, as illustrated by MPEG decoder 110 in FIG. 1b. Frequency coefficients 106 are dequantized and passed through inverse discrete cosine transform (IDCT) module 111, thus converting them back into spatial domain representations. Motion vector module 112 uses header information 108 and motion vectors 107 to recreate the macroblocks of P-frames and B-frames. The outputs from IDCT module 111 and motion vector module 112 are then summed by summer 113 to generate reconstructed output 114. Reconstructed output 114 is a sequence of video images similar to video data 104 from FIG. 1a, and can be displayed on a display device.
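The summation performed at the output of the decoder can be illustrated with a simplified sketch for a single 8×8 block. The sketch assumes an orthonormal DCT and omits quantization, entropy coding and the fetching of motion reconstruction data; the function names are hypothetical and do not correspond to the numbered modules of FIGS. 1a and 1b.

```python
import math

N = 8  # block size in samples


def _scale(k):
    # Normalization factors of the orthonormal DCT assumed in this sketch.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def _basis(position, frequency):
    return math.cos((2 * position + 1) * frequency * math.pi / (2 * N))


def dct_2d(block):
    """Forward transform: spatial samples to frequency coefficients."""
    return [[_scale(u) * _scale(v)
             * sum(block[x][y] * _basis(x, u) * _basis(y, v)
                   for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct_2d(coeffs):
    """Inverse transform: frequency coefficients to spatial samples."""
    return [[sum(_scale(u) * _scale(v) * coeffs[u][v]
                 * _basis(x, u) * _basis(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]


def reconstruct_block(prediction, residual_coeffs):
    """Sum the motion-compensated prediction and the decoded residual."""
    residual = idct_2d(residual_coeffs)
    return [[prediction[x][y] + residual[x][y] for y in range(N)]
            for x in range(N)]


if __name__ == "__main__":
    original = [[(3 * x + 5 * y) % 256 for y in range(N)] for x in range(N)]
    prediction = [[original[x][y] - (x - y) for y in range(N)] for x in range(N)]
    residual = [[original[x][y] - prediction[x][y] for y in range(N)]
                for x in range(N)]
    rebuilt = reconstruct_block(prediction, dct_2d(residual))
    error = max(abs(rebuilt[x][y] - original[x][y])
                for x in range(N) for y in range(N))
    print("largest reconstruction error:", error)
```

Because no quantization is modeled, the block is recovered to within floating-point error; in an actual decoder the dequantized coefficients only approximate the original residual.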
HDTV video frames consist of 1088 lines, each having 1920 pixels, which results in approximately two million pixels per frame. As alluded to previously, since MPEG uses a YUV420 color space, one pixel is represented using 1.5 bytes. Thus a single HDTV frame uses approximately 3 MB. Since two reference images are maintained in order to correctly decode B-frames, and double buffering is usually desired at the output of the MPEG decoder so that it can decode an image while the video output displays the previous image, this implies that approximately 12 MB of storage is needed for the frames of video data generated by the MPEG decoding process and the associated reference buffers. Similarly, a standard-resolution NTSC frame consists of 480 lines, each having 720 pixels, or approximately 350,000 pixels per frame. With the YUV420 format, this means that each NTSC frame uses about 520 KB of memory. As a result, the decoder and display device for processing and displaying NTSC video data encoded using MPEG require about 2.1 MB of storage for the frames of video data generated by the MPEG decoding process and the associated reference buffers.
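The storage figures above follow from simple arithmetic, which the following minimal sketch reproduces. It assumes four full-size frame buffers (two reference images plus two output buffers), as described above; the function names are hypothetical.

```python
BYTES_PER_PIXEL = 1.5   # YUV420 with 8-bit samples


def frame_bytes(width, height):
    """Storage for one decoded frame."""
    return int(width * height * BYTES_PER_PIXEL)


def decoder_storage_bytes(width, height, reference_frames=2, output_buffers=2):
    """Storage for the reference images plus the double-buffered output."""
    return (reference_frames + output_buffers) * frame_bytes(width, height)


if __name__ == "__main__":
    for name, (width, height) in {"HDTV": (1920, 1088), "NTSC": (720, 480)}.items():
        print(name, frame_bytes(width, height), decoder_storage_bytes(width, height))
```

For the HDTV case the sketch yields 3,133,440 bytes per frame and 12,533,760 bytes in total; for the NTSC case it yields 518,400 bytes per frame and 2,073,600 bytes in total, consistent with the rounded figures given above.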
In North America and other regions, video frames are conventionally sent at the rate of thirty frames per second. The memory bandwidth needed to store the output video data is approximately 90 MB/sec for HDTV and approximately 15.5 MB/sec for standard-resolution broadcasts. In addition, MPEG decoding requires that predictions be made from reference images. During periods of worst-case predictions, up to 4 times that amount of bandwidth may need to be supported (depending on the memory subsystem).
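The bandwidth figures can be checked in the same manner. The sketch below assumes thirty frames per second and 1.5 bytes per pixel, and treats the factor of four for worst-case predictions as an upper bound; the names used are hypothetical.

```python
FRAMES_PER_SECOND = 30
BYTES_PER_PIXEL = 1.5              # YUV420 with 8-bit samples
WORST_CASE_PREDICTION_FACTOR = 4   # upper bound stated above; depends on the memory subsystem


def output_bandwidth(width, height, fps=FRAMES_PER_SECOND):
    """Bytes per second written when storing the decoded output frames."""
    return int(width * height * BYTES_PER_PIXEL) * fps


def worst_case_prediction_bandwidth(width, height, fps=FRAMES_PER_SECOND):
    """Upper bound on bytes per second read while forming predictions."""
    return WORST_CASE_PREDICTION_FACTOR * output_bandwidth(width, height, fps)


if __name__ == "__main__":
    print("HDTV", output_bandwidth(1920, 1088), worst_case_prediction_bandwidth(1920, 1088))
    print("NTSC", output_bandwidth(720, 480), worst_case_prediction_bandwidth(720, 480))
```

The output bandwidth evaluates to 94,003,200 bytes per second for HDTV and 15,552,000 bytes per second for NTSC, in line with the approximate values of 90 MB/sec and 15.5 MB/sec.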
Due to the large installed base of NTSC televisions, it may often be the case that a video image having been compressed using MPEG and formatted for display on an HDTV device may need to be displayed on a lower resolution NTSC television. It may also be the case that any such video images may need to be displayed in a lower resolution, such as when using picture-in-picture functionality of a television. A conventional method for supporting this application is to fully decode the transmitted images at their native resolution, then resample the transmitted images to the required display resolution. However, decoding an MPEG video bit stream having full HDTV formatting and then resampling to a lower resolution wastes memory resources and computational resources, since the receiving device cannot display the full resolution of the image. As a result, certain methods could be used to reduce the memory footprint, memory throughput and the processing requirements for this application. FIGS. 1c and 1d illustrate methods that could be used to reduce the memory footprint, memory throughput and the processing requirements. It is noted that the following methods do not necessarily represent prior art with respect to the present invention, but are presented herein to illustrate the advantages of the present invention compared to other approaches that could be implemented.
One class of methods involves modifications to the video data before transmission, such as resampling to the desired display resolution or hierarchical encoding. All these methods can produce very good image quality. However, they are all limited in that they cannot support any and all desired output resolutions simultaneously in a broadcast environment, since the processing is performed prior to transmission, rather than at the decoding or display devices where the images are to be displayed. Also, most of these methods would involve non-standard profiles of MPEG video compression.
Another class of methods uses algorithms that are executed on the receiver. These methods attempt to reduce the size of the decompressed video images and the associated reference buffers. These reductions in size have an effect of reducing memory footprint for the buffers, reducing memory bandwidth for processing the decompressed video images, and reducing image resampling computational requirements. Most of these algorithms entail reducing the number of samples in frames in the horizontal and vertical directions by a factor of 2^N, where N is normally 1.
One method, as shown in FIG. 1c, involves resampling the video frame after the frame has been decompressed using MPEG decoder 110 and prior to storing the decompressed frame in memory. This method can reduce memory footprint by a factor of four if the video frame is subsampled by a factor of two in the horizontal and vertical directions. This involves subsampling motion vectors 107 by a factor of two, then upsampling fetched motion reconstruction data 115 by a factor of two in the horizontal and vertical directions. In a parallel operation, frequency coefficients 106 are dequantized and passed through IDCT module 111, which converts the coefficients back into spatial domain data 116. Spatial domain data 116 and the upsampled fetched motion reconstruction data 115 are then summed by summer 113. The output of summer 113 is then subsampled by a factor of two in each direction. This method is hindered by the fact that the output subsampling may require some extra buffering in order to allow vertical filtering. Also, for relatively static scenes or constant pans, the error terms coming from the IDCT are nearly zero, which results in effectively the same image data being upsampled and downsampled over many generations. This generational loss progressively degrades the image quality until an I-frame is decoded, at which point the image is refreshed. This results in a “beating” effect that is most noticeable and irritating to the viewer.
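The generational loss described above can be illustrated with a one-dimensional sketch in which the same row of samples is repeatedly upsampled and then subsampled, as would occur when the IDCT error terms are zero. The filters used here (linear interpolation for upsampling, pair averaging for subsampling) and the `detail` measure are assumptions chosen for illustration; the method of FIG. 1c does not prescribe particular filters.

```python
def upsample_2x(samples):
    """Double the sample count using linear interpolation (assumed filter)."""
    out = []
    for i, value in enumerate(samples):
        # Repeat the last sample at the right edge of the row.
        following = samples[min(i + 1, len(samples) - 1)]
        out.append(value)
        out.append((value + following) / 2.0)
    return out


def downsample_2x(samples):
    """Halve the sample count by averaging adjacent pairs (assumed filter)."""
    return [(samples[i] + samples[i + 1]) / 2.0
            for i in range(0, len(samples), 2)]


def detail(samples):
    """Sum of absolute differences between neighbors, a crude sharpness measure."""
    return sum(abs(a - b) for a, b in zip(samples, samples[1:]))


def generations(samples, count):
    """Return the detail measure before and after each up/down generation."""
    history = [detail(samples)]
    for _ in range(count):
        samples = downsample_2x(upsample_2x(samples))
        history.append(detail(samples))
    return history


if __name__ == "__main__":
    row = [0.0, 100.0] * 8   # a row with strong fine detail
    print(generations(row, 3))
```

With these assumed filters, each generation replaces a sample with a weighted average of itself and its neighbor, so the detail measure falls with every pass. In the decoder, this loss accumulates from one predicted frame to the next until an I-frame restores the image.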
Another conventional method, as shown in FIG. 1d, involves ignoring all high frequency coefficients and using a 4×4 IDCT instead of an 8×8 IDCT. Similar to the method in FIG. 1c, motion vectors 107 are downsampled. However, fetched motion reconstruction data 115 may be directly summed with spatial domain data 116 without requiring post processing of the summed result, which reduces the effect of generational loss described above in reference to FIG. 1c. This method reduces memory footprint by a factor of four and significantly reduces the number of computations. However, simply ignoring the high frequency IDCT components can produce some significant artifacts at the boundaries of blocks and macroblocks in the decoded image (otherwise known as “block” artifacts). These artifacts in turn can significantly affect subsequent images that use the previous ones as references. Also, the 4×4 IDCT is slightly different from the 8×8 IDCT and for some hardware implementations is not easily changed.
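The approach of FIG. 1d can be illustrated for a single block as follows. The sketch keeps only the 4×4 group of lowest-frequency coefficients from an 8×8 block and applies a 4×4 IDCT to them. It assumes an orthonormal DCT, for which the retained coefficients are scaled by one half so that the average brightness is preserved; that scaling and the function names are assumptions of this example.

```python
import math


def _scale(k, n):
    # Normalization factors of the orthonormal DCT assumed in this sketch.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def _basis(position, frequency, n):
    return math.cos((2 * position + 1) * frequency * math.pi / (2 * n))


def dct_2d(block):
    """Forward DCT of a square block of any size."""
    n = len(block)
    return [[_scale(u, n) * _scale(v, n)
             * sum(block[x][y] * _basis(x, u, n) * _basis(y, v, n)
                   for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]


def idct_2d(coeffs):
    """Inverse DCT of a square block of any size."""
    n = len(coeffs)
    return [[sum(_scale(u, n) * _scale(v, n) * coeffs[u][v]
                 * _basis(x, u, n) * _basis(y, v, n)
                 for u in range(n) for v in range(n))
             for y in range(n)] for x in range(n)]


def half_resolution_block(coeffs_8x8):
    """Decode 8x8 coefficients directly to 4x4 samples, ignoring high frequencies."""
    # The factor 0.5 (4/8) compensates for the smaller transform size.
    low = [[coeffs_8x8[u][v] * 0.5 for v in range(4)] for u in range(4)]
    return idct_2d(low)


if __name__ == "__main__":
    flat = [[100.0] * 8 for _ in range(8)]
    for row in half_resolution_block(dct_2d(flat)):
        print([round(value, 6) for value in row])
```

A uniform block is reproduced at half resolution without error, because all of its energy lies in the retained coefficients. Blocks with significant high-frequency content lose that content entirely, which is the source of the block artifacts noted above.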
In view of the foregoing, there exists a need for systems and methods for efficiently subsampling video data in preparation for displaying the video data on devices of lower resolution than what the video data was originally encoded for, thereby reducing the memory bandwidth and memory footprint needed to process the video data without appreciably reducing the quality of the output.