The present invention relates to a down converting decoder for down converting and decoding high resolution encoded video for display by a lower resolution receiver.
The international standard ISO/IEC 13818-2 entitled “Generic Coding of Moving Pictures and Associated Audio Information: Video” and “Guide to the Use of the ATSC Digital Television Standard” describe a system, known as MPEG-2, for encoding and decoding digital video data. The standard allows for the encoding of video over a wide range of resolutions, including higher resolutions commonly known as HDTV. According to this standard, digital video data is encoded as a series of code words in a complicated manner that causes the average length of the code words to be much smaller than would be the case if, for example, each pixel were coded as an eight bit value. This type of encoding is also known as data compression.
In the system described above, encoded pictures are made up of pixels. Each 8×8 array of pixels is known as a block, and a 2×2 array of these 8×8 blocks is termed a macroblock. Compression is achieved using the well known techniques of prediction (motion estimation in the encoder, motion compensation in the decoder), two dimensional discrete cosine transformation (DCT) which is performed on 8×8 blocks of pixels, quantization of DCT coefficients, and Huffman and run/level coding. However, prediction is not used for every picture. Thus, while P pictures are encoded with prediction from previous pictures, and while B pictures are encoded using prediction from either a previous or a subsequent picture, I pictures are encoded without prediction.
An MPEG-2 encoder 10 is shown in simplified form in FIG. 1. Data representing macroblocks of pixels are fed to both a subtractor 12 and a motion estimator 14. In the case of P pictures and B pictures, the motion estimator 14 compares each new macroblock to be encoded with the macroblocks in a reference picture previously stored in a reference picture memory 16. The motion estimator 14 finds the macroblock in the stored reference picture that most closely matches the new macroblock.
The motion estimator 14 reads this matching macroblock (known as a predicted macroblock) out of the reference picture memory 16 and sends it to the subtractor 12 which subtracts it, on a pixel by pixel basis, from the new macroblock entering the MPEG-2 encoder 10. The output of the subtractor 12 is an error, or residual, that represents the difference between the new macroblock being encoded and the predicted macroblock. This residual is often very small. The residual is transformed from the spatial domain by a two dimensional DCT 18. The DCT coefficients resulting from the two dimensional DCT 18 are then quantized by a quantization block 20 in a process that reduces the number of bits needed to represent each coefficient. Usually, many coefficients are effectively quantized to zero. The quantized DCT coefficients are Huffman and run/level coded by a coder 22 which further reduces the average number of bits per coefficient.
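The subtraction, transform, and quantization steps described above can be illustrated with a minimal Python sketch. It is not taken from the standard: the function names, the orthonormal DCT scaling, and the single uniform quantizer step are assumptions made only for illustration (MPEG-2 itself uses quantization matrices together with a quantizer scale).

```python
import math

N = 8  # block size used by the two dimensional DCT


def residual(new_block, predicted_block):
    """Pixel by pixel difference, as formed by the subtractor."""
    return [[n - p for n, p in zip(new_row, pred_row)]
            for new_row, pred_row in zip(new_block, predicted_block)]


def dct_2d(block):
    """Orthonormal two dimensional DCT-II of an N x N block (assumed scaling)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Hypothetical uniform quantizer: divide by one step size and round."""
    return [[int(round(coef / step)) for coef in row] for row in coeffs]
```

For a residual that is a constant 2 over the whole block, only the DC coefficient survives quantization with a step of 8, which mirrors the observation that many coefficients are quantized to zero.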
The motion estimator 14 also calculates a motion vector (mv) which represents the horizontal and vertical displacements of the predicted macroblock in the reference picture from the position of the new macroblock in the current picture being encoded. It should be noted that motion vectors may have ½ pixel resolution which is achieved by linear interpolation between adjacent pixels. The data encoded by the coder 22 are combined with the motion vector data from the motion estimator 14 and with other information (such as an indication of whether the picture is an I, P or B picture), and the combined data are transmitted to a receiver that includes an MPEG-2 decoder 30 (shown in FIG. 2 and discussed below).
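Half pixel prediction by linear interpolation can be sketched as follows. This is a simplified illustration, assuming that motion vector components are expressed in half pixel units, that the least significant bit marks the half pixel position, and that averages are rounded upward at the midpoint; the function name predict_pixel and the omission of picture boundary handling are choices made for this sketch.

```python
def predict_pixel(ref, y, x, mv_y, mv_x):
    """Return the predicted value for pixel (y, x) of the current picture.

    ref is a two dimensional list of reference pixels. mv_y and mv_x are
    in half pixel units (assumption). The caller must keep all accessed
    positions inside ref; no boundary handling is done here.
    """
    iy, half_y = y + (mv_y >> 1), mv_y & 1  # integer part, half pixel flag
    ix, half_x = x + (mv_x >> 1), mv_x & 1
    a = ref[iy][ix]
    if not half_x and not half_y:
        return a                              # full pixel position
    if half_x and not half_y:
        return (a + ref[iy][ix + 1] + 1) // 2  # horizontal average
    if half_y and not half_x:
        return (a + ref[iy + 1][ix] + 1) // 2  # vertical average
    # Half pixel in both directions: average the four neighbours.
    return (a + ref[iy][ix + 1] + ref[iy + 1][ix] + ref[iy + 1][ix + 1] + 2) // 4
```

Because the shift and mask operate on two's complement integers, a component of -1 resolves to the integer position one pixel back plus a half pixel offset, so negative displacements need no special case.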
For the case of P pictures, the quantized DCT coefficients from the quantization block 20 are also supplied to an internal decoder loop that represents a portion of the operation of the MPEG-2 decoder 30. Within this internal loop, the residual from the quantization block 20 is inverse quantized by an inverse quantization block 24 and is inverse DCT transformed by an inverse discrete cosine transform (IDCT) block 26. The predicted macroblock, that is read out of the reference picture memory 16 and that is supplied to the subtractor 12, is also added back to the output of the IDCT block 26 on a pixel by pixel basis by an adder 28, and the result is stored back into the reference picture memory 16 in order to serve as a macroblock of a reference picture for predicting subsequent pictures. The object of this internal loop is to have the data in the reference picture memory 16 of the MPEG-2 encoder 10 match the data in the reference picture memory of the MPEG-2 decoder 30. B pictures are not stored as reference pictures.
In the case of I pictures, no motion estimation occurs and the negative input to the subtractor 12 is forced to zero. In this case, the quantized DCT coefficients provided by the two dimensional DCT 18 represent transformed pixel values rather than transformed residual values, as is the case with P and B pictures. As in the case of P pictures, decoded I pictures are stored as reference pictures.
The MPEG-2 decoder 30 illustrated in FIG. 2 is a simplified showing of an MPEG-2 decoder. The decoding process implemented by the MPEG-2 decoder 30 can be thought of as the reverse of the encoding process implemented by the MPEG-2 encoder 10. Accordingly, the received encoded data is Huffman and run/level decoded by a Huffman and run/level decoder 32. Motion vectors and other information are parsed from the data stream flowing through the Huffman and run/level decoder 32. The motion vectors are fed to a motion compensator 34. Quantized DCT coefficients at the output of the Huffman and run/level decoder 32 are fed to an inverse quantization block 36 and then to an IDCT block 38 which transforms the inverse quantized DCT coefficients back into the spatial domain.
For P and B pictures, each motion vector is translated by the motion compensator 34 to a memory address in order to read a particular macroblock (predicted macroblock) out of a reference picture memory 42 which contains previously stored reference pictures. An adder 44 adds this predicted macroblock to the residual provided by the IDCT block 38 in order to form reconstructed pixel data. For I pictures, there is no prediction, so that the prediction provided to the adder 44 is forced to zero. For I and P pictures, the output of the adder 44 is fed back to the reference picture memory 42 to be stored as a reference picture for future predictions.
The MPEG-2 encoder 10 can encode sequences of progressive or interlaced pictures. For sequences of interlaced pictures, pictures may be encoded as field pictures or as frame pictures. For field pictures, one picture contains the odd lines of the raster, and the next picture contains the even lines of the raster. All encoder and decoder processing is done on fields. Thus, the DCT transform is performed on 8×8 blocks that contain all odd or all even numbered lines. These blocks are referred to as field DCT coded blocks.
On the other hand, for frame pictures, each picture contains both odd and even numbered lines of the raster. Macroblocks of frame pictures are encoded as frames in the sense that an encoded macroblock contains both odd and even lines. However, the DCT performed on the four blocks within each macroblock of a frame picture may be done in two different ways. Each of the four DCT transform blocks in a macroblock may contain both odd and even lines (frame DCT coded blocks), or alternatively two of the four DCT blocks in a macroblock may contain only the odd lines of the macroblock and the other two blocks may contain only the even lines of the macroblock (field DCT coded blocks). See ISO/IEC 13818-2, section 6.1.3, FIGS. 6-13 and 6-14. The coding decision as to which way to encode a picture may be made adaptively by the MPEG-2 encoder 10 based upon which method results in better data compression.
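The two ways of arranging the lines of a frame picture macroblock into four DCT blocks can be sketched as follows. The sketch assumes a 16×16 array of luminance samples with rows indexed from zero, and it assumes that the lines of the field containing row 0 go to the first two blocks; the function name split_macroblock is introduced only for illustration.

```python
def split_macroblock(mb, field_dct):
    """Split a 16 x 16 macroblock (list of 16 rows) into four 8 x 8 blocks.

    With field_dct False, each block holds consecutive lines (frame DCT
    coded blocks). With field_dct True, the lines of one field are gathered
    into the first two blocks and the lines of the other field into the
    last two blocks (field DCT coded blocks).
    """
    if field_dct:
        rows = mb[0::2] + mb[1::2]  # one field first, then the other field
    else:
        rows = mb
    top, bottom = rows[:8], rows[8:]
    return [
        [row[:8] for row in top],     # block 0: left half of upper rows
        [row[8:] for row in top],     # block 1: right half of upper rows
        [row[:8] for row in bottom],  # block 2: left half of lower rows
        [row[8:] for row in bottom],  # block 3: right half of lower rows
    ]
```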
Residual macroblocks in field pictures are field DCT coded and are predicted from a reference field. Residual macroblocks in frame pictures that are frame DCT coded are predicted from a reference frame. Residual macroblocks in frame pictures that are field DCT coded have two blocks predicted from one reference field and two blocks predicted from either the same or the other reference field.
For sequences of progressive pictures, all pictures are frame pictures with frame DCT coding and frame prediction.
MPEG-2, as described above, includes the encoding and decoding of video at high resolution (HDTV). In order to permit people to use their existing NTSC televisions so as to view HDTV transmitted programs, it is desirable to provide a decoder that decodes high resolution MPEG-2 encoded data as reduced resolution video data for display on existing NTSC televisions. (Reducing the resolution of television signals is often called down conversion decoding.) Accordingly, such a down converting decoder would allow the viewing of HDTV signals without requiring viewers to buy expensive HDTV displays.
There are known techniques for making a down converting decoder such that it requires less circuitry and is, therefore, cheaper than a decoder that outputs full HDTV resolution. One of these methods is disclosed in U.S. Pat. No. 5,262,854. The down conversion technique disclosed there is explained herein in connection with a down convertor 50 shown in FIG. 3. The down convertor 50 includes a Huffman and run/level decoder 52 and an inverse quantization block 54 which operate as previously described in connection with the Huffman and run/level decoder 32 and the inverse quantization block 36 of FIG. 2. However, instead of utilizing the 8×8 IDCT block 38 as shown in FIG. 2, the down convertor 50 employs a down sampler 56 which discards the forty-eight high order DCT coefficients of an 8×8 block and performs a 4×4 IDCT on the remaining 4×4 array of DCT coefficients. This process is usually referred to as DCT domain down sampling. The result of this down sampling is effectively a filtered and down sampled 4×4 block of residual samples (for P or B pictures) or pixels for I pictures.
For residual samples, a prediction is added by an adder 58 to the residual samples from the down sampler 56 in order to produce a decoded reduced resolution 4×4 block of pixels. This block is saved in a reference picture memory 60 for subsequent predictions. Accordingly, predictions will be made from a reduced resolution reference, while predictions made in the decoder loop within the encoder are made from full resolution reference pictures. This difference means that the prediction derived from the reduced resolution reference will differ by some amount from the corresponding prediction made by the encoder, resulting in error in the residual-plus-prediction sum provided by the adder 58 (this error is referred to herein as prediction error). Prediction error may increase as predictions are made upon predictions until the reference is refreshed by the next I picture.
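DCT domain down sampling as described above can be sketched in Python as follows. The sketch assumes orthonormal DCT scaling, under which the retained coefficients must be halved so that sample amplitudes are preserved when a 4×4 inverse transform replaces the 8×8 one; the function names and that scaling convention are assumptions for illustration and are not taken from the cited patent.

```python
import math


def idct_2d(coeffs):
    """Orthonormal two dimensional inverse DCT of an n x n coefficient array."""
    n = len(coeffs)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            total = 0.0
            for u in range(n):
                for v in range(n):
                    total += (scale(u) * scale(v) * coeffs[u][v]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[y][x] = total
    return out


def dct_domain_downsample(coeffs8):
    """Keep the 4 x 4 low order coefficients of an 8 x 8 block and invert them.

    The other forty-eight coefficients are discarded. The division by 2
    compensates for the change of transform size under the orthonormal
    scaling assumed here.
    """
    low = [[coef / 2.0 for coef in row[:4]] for row in coeffs8[:4]]
    return idct_2d(low)
```

As a check on the scaling, an 8×8 block of constant value 100 has a single orthonormal DC coefficient of 800, and the sketch returns a 4×4 block of constant value 100.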
A motion compensator 62 attempts to reduce this prediction error by using the full resolution motion vectors, even though the reference picture is at lower resolution. First, a portion of the reference picture that includes the predicted macroblock is read from the reference picture memory 60. This portion is selected based on all bits of the motion vector except the least significant bit. Second, this predicted macroblock is interpolated back to full resolution by a 2×2 prediction up sample filter 64. Third, using the full resolution motion vector (which may include ½ pixel resolution), a predicted full resolution macroblock is extracted from the up sampled portion based upon all of the bits of the motion vector. Fourth, a down sampler 66 performs a 2×2 down sampling on the extracted full resolution macroblock in order to match the resolution of the 4×4 IDCT output of the down sampler 56. In this way, the prediction from the reference picture memory 60 is up sampled to match the full resolution residual pixel structure, allowing the use of full resolution motion vectors. Then, the full resolution reference picture is down sampled prior to addition by the adder 58 in order to match the resolution of the down sampled residual from the down sampler 56.
There are several known good prediction up sampling/down sampling methods that tend to minimize the prediction error caused by up sampling reference pictures that have been down sampled with a 4×4 IDCT. These methods typically involve the use of a two dimensional filter having five to eight taps and tap values that vary both with the value of the motion vector for the predicted macroblock relative to the nearest macroblock boundaries in the reference picture, and with the position of the current pixel being interpolated within the predicted macroblock. Such a filter not only up samples the reduced resolution reference to full resolution and subsequently down samples in a single operation, but it can also include ½ pixel interpolation (when required due to an odd valued motion vector). (See, for example, “Minimal Error Drift in Frequency Scalability for Motion Compensated DCT Coding,” Mokry and Anastassiou, IEEE Transactions on Circuits and Systems for Video Technology, August 1994, and “Drift Minimization in Frequency Scaleable Coders Using Block Based Filtering,” Johnson and Princen, IEEE Workshop on Visual Signal Processing and Communication, Melbourne, Australia, September 1993.)
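The four steps above can be illustrated in one dimension with a deliberately simplified Python sketch. It assumes an integer motion vector in full resolution pixel units (½ pixel interpolation is omitted), sample repetition as the up sample filter, and pair averaging as the down sample filter. These filters and all function names are placeholders chosen for clarity; they are not the filters of the cited patent.

```python
def upsample2(samples):
    """Placeholder up sample filter: repeat every sample twice."""
    out = []
    for s in samples:
        out.extend([s, s])
    return out


def downsample2(samples):
    """Placeholder down sample filter: average each pair of samples."""
    return [(samples[i] + samples[i + 1]) / 2.0
            for i in range(0, len(samples), 2)]


def predict_reduced(ref_line, start, mv, width):
    """Form a reduced resolution prediction from a reduced resolution line.

    ref_line: one line of the reduced resolution reference picture.
    start:    full resolution position of the block being predicted.
    mv:       integer displacement in full resolution pixels (assumption).
    width:    full resolution block width, assumed even.
    The addressed region is assumed to lie inside ref_line.
    """
    pos = start + mv                    # full resolution prediction position
    first = pos // 2                    # step 1: coarse reduced resolution address
    portion = ref_line[first:first + width // 2 + 1]
    up = upsample2(portion)             # step 2: back to full resolution
    offset = pos - 2 * first            # 0 or 1: the bit dropped in step 1
    full = up[offset:offset + width]    # step 3: extract with the full vector
    return downsample2(full)            # step 4: match the residual resolution
```

An odd displacement yields a prediction that straddles two reference samples, which is the information that would be lost if the motion vector were simply halved and truncated.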
A more general derivation of minimum drift prediction filters by using the Moore-Penrose inverse of a block based down sampling filter is described in “Minimum Drift Architectures for 3-Layer Scalable DTV Decoding,” Vetro, Sun, DaGraca and Poon, IEEE Transactions on Consumer Electronics, August 1998. For methods of down sampling other than by the use of the four point IDCT, the Vetro paper describes how to determine the optimum tap values for the up sampling filter. This up sampling can also be combined with the linear interpolation and down sampling operations to form a single prediction filter.
The down convertor 50, while generally adequate for progressive pictures with frame DCT coded blocks, does not address problems that arise when attempting to down convert sequences of interlaced pictures with mixed frame and field DCT coded blocks. These problems arise with respect to vertical down sampling and vertical prediction filtering.
Also, vertical down sampling for field pictures is performed on incoming field coded blocks and for frame pictures is performed on a mix of field and frame coded blocks. In the case of a mix of field and frame down sampled blocks, a given required prediction may overlap both types of blocks. This complication may be resolved by converting all incoming blocks to either frames or fields before down sampling. This conversion will result in a consistent vertical structure for reference pictures so that the same prediction filter can always be used.
For example, it has been suggested that all incoming pictures be converted to frames before performing vertical down sampling (see “Frequency Domain Down Conversion of HDTV Using Adaptive Motion Compensation,” by Vetro, Sun, Bao and Poon, ICIP '97). Conversion to frames before performing vertical down sampling will result in better vertical resolution than would field based down sampling. However, frame based down sampling requires additional memory in the decoder because a first field must be stored when received in order to allow the second field to arrive so that frame blocks may be formed. Also, severe blocking artifacts in motion sequences may occur (see “Frequency Domain Down Conversion of HDTV Using an Optimal Motion Compensation Scheme,” by Vetro and Sun, Journal of Imaging Science and Technology, August 1998).
As suggested in the latter paper, conversion of all incoming pictures to fields before performing vertical down sampling avoids these problems. However, field macroblock processing generally produces a softer picture.
A third process vertically downsamples some macroblocks as fields and some as frames. The vertical downsampling of each macroblock is determined by the manner in which the corresponding macroblock was encoded. In other words, each macroblock is vertically downsampled as a frame if the encoder decided that macroblock was a frame, and each macroblock is vertically downsampled as a field if the encoder decided that macroblock was a field.
It may seem that this criterion of vertically downsampling each macroblock according to the manner in which it was encoded is correct. However, the encoder makes its decisions expecting a full resolution decoder. It has been observed that the encoder sometimes decides to encode a macroblock as a frame even though the macroblock contains field content. The downsampling decoders described above create visible artifacts when processing such macroblocks.
The present invention is directed to a decoder which decides whether each macroblock should be frame or field processed based upon which will produce the least artifacts. Errors are computed based upon several filters that can be used in the down conversion processing, with the filter producing the least error being selected to do the actual processing. These errors, for example, may be sum-squared-errors (SSEs).
In accordance with one aspect of the present invention, a method of downsampling a received picture to a lower resolution comprises a) horizontally downsampling the received picture; b) calculating at least first and second errors, wherein the first error is calculated based upon the received picture and a downsampled/upsampled version of the received picture derived from a first vertical downsampling filter, and wherein the second error is calculated based upon the received picture and a downsampled/upsampled version of the received picture derived from a second vertical downsampling filter; c) vertically downsampling the received picture using the first vertical downsampling filter if the first error is less than the second error; and, d) vertically downsampling the received picture using the second vertical downsampling filter if the second error is less than the first error.
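The error comparison in steps b) through d) can be illustrated with a minimal Python sketch that operates on a single column of samples. The two filters shown, frame based pair averaging and field based pair averaging, each followed by sample repetition as the up sampling step, are hypothetical stand-ins chosen for simplicity and are not the filters of the invention. The function names and the choice made when the two errors are equal are likewise assumptions, since the method as stated does not address equal errors.

```python
def frame_down_up(col):
    """Average vertically adjacent lines, then repeat to restore the length."""
    out = []
    for i in range(0, len(col), 2):
        avg = (col[i] + col[i + 1]) / 2.0
        out.extend([avg, avg])
    return out


def field_down_up(col):
    """Average adjacent lines of the same field, then restore the length.

    Lines i and i + 2 belong to the same field. The column length is
    assumed to be a multiple of four.
    """
    out = list(col)
    for i in range(0, len(col), 4):
        for parity in (0, 1):
            a, b = i + parity, i + parity + 2
            avg = (col[a] + col[b]) / 2.0
            out[a] = out[b] = avg
    return out


def sse(original, approximation):
    """Sum of squared errors between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(original, approximation))


def choose_filter(col):
    """Return the name and error of the candidate with the smaller SSE.

    Equal errors fall to the field filter, an arbitrary choice in this sketch.
    """
    e_frame = sse(col, frame_down_up(col))
    e_field = sse(col, field_down_up(col))
    return ("frame", e_frame) if e_frame < e_field else ("field", e_field)
```

With these stand-in filters, a column that alternates between 0 and 100 on successive lines (field content) selects the field filter with zero error, while a smooth ramp of 0, 10, 20 and so on selects the frame filter.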
In accordance with another aspect of the present invention, a downsampling apparatus downsamples a picture being processed and comprises an error calculation module and a filtering module. The error calculation module calculates first and second errors, where the first error is calculated based upon a difference between the picture being processed and a first version of the picture being processed derived from a first downsampling filter, and where the second error is calculated based upon a difference between the picture being processed and a second version of the picture being processed derived from a second downsampling filter. The filtering module downsamples the picture being processed using the first downsampling filter if the first error is less than the second error, and downsamples the picture being processed using the second downsampling filter if the second error is less than the first error.
In accordance with still another aspect of the present invention, a downsampling apparatus converts a received DCT coefficient macroblock to a reconstructed pixel block and comprises an IDCT module, a horizontal downsampler, a calculation module, a filtering module, and a motion compensator. The IDCT module performs an inverse DCT on the received DCT coefficient macroblock to produce a first intermediate block. The horizontal downsampler horizontally downsamples the first intermediate block to produce a second intermediate block. The calculation module calculates first and second errors. The first error is calculated based upon a difference between one of the first and second intermediate blocks and a first downsampled/upsampled version of the one of the first and second intermediate blocks derived from a first vertical downsampling filter, and the second error is calculated based upon a difference between the one of the first and second intermediate blocks and a second downsampled/upsampled version of the one of the first and second intermediate blocks derived from a second vertical downsampling filter. The filtering module vertically downsamples the second intermediate block using the first vertical downsampling filter if the first error is less than the second error, and vertically downsamples the second intermediate block using the second vertical downsampling filter if the second error is less than the first error. The motion compensator adds prediction reference pixels to the horizontally and vertically downsampled block, as appropriate, to form reconstructed pixels.