The present invention relates to a downconverting decoder for downconverting and decoding high resolution encoded video for display by a lower resolution receiver.
The international standard ISO/IEC 13818-2 (Generic Coding of Motion Pictures and Associated Audio Information: Video) and the "Guide to the use of the ATSC Digital Television Standard" describe a system, known as MPEG-2, for encoding and decoding digital video data. According to this system, digital video data is encoded as a series of code words in a complicated manner that causes the average length of the code words to be much smaller than would be the case if, for example, each pixel in every frame was coded as an eight bit value. This type of encoding is also known as data compression.
The standard allows for encoding of video over a wide range of resolutions, including higher resolutions commonly known as HDTV. In MPEG-2, encoded pictures are made up of pixels. Each 8×8 array of pixels is known as a block, and a 2×2 array of blocks is known as a macroblock. Compression is achieved by using well known techniques including (i) prediction (motion estimation in the encoder and motion compensation in the decoder), (ii) two dimensional discrete cosine transform (DCT) which is performed on 8×8 blocks of pixels, (iii) quantization of the resulting DCT coefficients, and (iv) Huffman and run/level coding. In MPEG-2 encoding, pictures which are encoded without prediction are referred to as I pictures, pictures which are encoded with prediction from previous pictures are referred to as P pictures, and pictures which are encoded with prediction from both previous and subsequent pictures are referred to as B pictures.
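As a one dimensional illustration of the transform pair involved (the dct/idct helpers and the sample pixel values below are illustrative only; MPEG-2 applies the two dimensional, separable version of this transform to 8×8 blocks):

```python
import math

def dct(x):
    # Orthonormal N-point DCT-II; a smooth input concentrates energy in the
    # low-order coefficients, which is what makes quantization effective.
    N = len(x)
    return [(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
            * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
            for k in range(N)]

def idct(X):
    # Matching inverse transform (DCT-III with the same scaling).
    N = len(X)
    return [sum((math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
                * X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for k in range(N))
            for n in range(N)]

# One 8-pixel row of a block: a smooth ramp.
pixels = [16, 24, 32, 40, 48, 56, 64, 72]
coeffs = dct(pixels)    # most energy lands in coeffs[0] and coeffs[1]
recon = idct(coeffs)    # round trip recovers the original pixels
```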
An MPEG-2 encoder 10 is shown in simplified form in FIG. 1. Data representing macroblocks of pixel values are fed to both a subtractor 12 and a motion estimator 14. In the case of P pictures and B pictures, the motion estimator 14 compares each new macroblock (i.e., a macroblock to be encoded) with the macroblocks in a reference picture previously stored in a reference picture memory 16. The motion estimator 14 finds the macroblock in the stored reference picture that most closely matches the new macroblock.
The motion estimator 14 reads this matching macroblock (known as a predicted macroblock) out of the reference picture memory 16 and sends it to the subtractor 12 which subtracts it, on a pixel by pixel basis, from the new macroblock entering the MPEG-2 encoder 10. The output of the subtractor 12 is an error, or residual, that represents the difference between the predicted macroblock and the new macroblock being encoded. This residual is often very small. The residual is transformed from the spatial domain by a two dimensional DCT 18. The DCT residual coefficients resulting from the two dimensional DCT 18 are then quantized by a quantization block 20 in a process that reduces the number of bits needed to represent each coefficient. Usually, many coefficients are effectively quantized to zero. The quantized DCT coefficients are Huffman and run/level coded by a coder 22 which further reduces the average number of bits per coefficient.
The motion estimator 14 also calculates a motion vector (mv) which represents the horizontal and vertical displacement of the predicted macroblock in the reference picture from the position of the new macroblock in the current picture being encoded. It should be noted that motion vectors may have ½ pixel resolution which is achieved by linear interpolation between adjacent pixels. The data encoded by the coder 22 are combined with the motion vector data from the motion estimator 14 and with other information (such as an indication of whether the picture is an I, P or B picture), and the combined data are transmitted to a receiver that includes an MPEG-2 decoder 30.
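The ½ pixel interpolation mentioned above can be sketched as follows; the integer rounding shown (round half up) is the commonly used form and is an assumption here, not a quotation of the standard:

```python
def half_pel(a, b):
    # Linear interpolation between two adjacent integer pixels; the +1
    # yields a round-half-up integer result (assumed rounding convention).
    return (a + b + 1) >> 1

# Half-pixel sample between pixel values 100 and 103.
mid = half_pel(100, 103)
```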
For the case of P pictures, the quantized DCT coefficients from the quantization block 20 are also supplied to an internal loop that represents the operation of the MPEG-2 decoder 30. Within this internal loop, the residual from the quantization block 20 is inverse quantized by an inverse quantization block 24 and is inverse DCT transformed by an inverse discrete cosine transform (IDCT) block 26. The predicted macroblock, that is read out of the reference picture memory 16 and that is supplied to the subtractor 12, is also added back to the output of the IDCT block 26 on a pixel by pixel basis by an adder 28, and the result is stored back into the reference picture memory 16 in order to serve as a macroblock of a reference picture for predicting subsequent pictures. The object of this internal loop is to have the data in the reference picture memory 16 of the MPEG-2 encoder 10 match the data in the reference picture memory of the MPEG-2 decoder 30. B pictures are not stored as reference pictures.
In the case of I pictures, no motion estimation occurs and the negative input to the subtractor 12 is forced to zero. In this case, the DCT coefficients produced by the two dimensional DCT 18 and quantized by the quantization block 20 represent transformed pixel values rather than residual values, as is the case with P and B pictures. As in the case of P pictures, decoded I pictures are stored as reference pictures.
The MPEG-2 decoder 30 illustrated in FIG. 2 is a simplified showing of an MPEG-2 decoder. The decoding process implemented by the MPEG-2 decoder 30 can be thought of as the reverse of the encoding process implemented by the MPEG-2 encoder 10. The received encoded data is Huffman and run/level decoded by a Huffman and run/level decoder 32. Motion vectors and other information are parsed from the data stream flowing through the Huffman and run/level decoder 32. The motion vectors are fed to a motion compensator 34. Quantized DCT coefficients at the output of the Huffman and run/level decoder 32 are fed to an inverse quantization block 36 and then to an IDCT block 38 which transforms the inverse quantized DCT coefficients back into the spatial domain.
For P and B pictures, each motion vector is translated by the motion compensator 34 to a memory address in order to read a particular macroblock (predicted macroblock) out of a reference picture memory 42 which contains previously stored reference pictures. An adder 44 adds this predicted macroblock to the residual provided by the IDCT block 38 in order to form reconstructed pixel data. For I pictures, there is no reference picture so that the prediction provided to the adder 44 is forced to zero. For I and P pictures, the output of the adder 44 is fed back to the reference picture memory 42 to be stored as a reference picture for future predictions.
The MPEG encoder 10 can encode sequences of progressive or interlaced pictures. For sequences of interlaced pictures, pictures may be encoded as field pictures or as frame pictures. For field pictures, one picture contains the odd lines of the raster, and the next picture contains the even lines of the raster. All encoder and decoder processing is done on fields. Thus, the DCT transform is performed on 8×8 blocks that contain all odd or all even numbered lines. These blocks are referred to as field DCT coded blocks.
On the other hand, for frame pictures, each picture contains both odd and even numbered lines of the raster. Macroblocks of frame pictures are encoded as frames in the sense that an encoded macroblock contains both odd and even lines. However, the DCT performed on the four blocks within each macroblock of a frame picture may be done in two different ways. Each of the four DCT transform blocks in a macroblock may contain both odd and even lines (frame DCT coded blocks), or alternatively two of the four DCT blocks in a macroblock may contain only the odd lines of the macroblock and the other two blocks may contain only the even lines of the macroblock (field DCT coded blocks). The coding decision as to which way to encode a picture may be made adaptively by the MPEG-2 encoder 10 based upon which method results in better data compression.
Residual macroblocks in field pictures are field DCT coded and are predicted from a reference field. Residual macroblocks in frame pictures that are frame DCT coded are predicted from a reference frame. Residual macroblocks in frame pictures that are field DCT coded have two blocks predicted from one reference field and two blocks predicted from either the same or a different reference field.
For sequences of progressive pictures, all pictures are frame pictures with frame DCT coding and frame prediction.
MPEG-2, as described above, includes the encoding and decoding of video at high resolution (HDTV). In order to permit people to use their existing NTSC televisions in order to view HDTV transmitted programs, it is desirable to provide a decoder that decodes high resolution MPEG-2 encoded data as reduced resolution video data for display on existing NTSC televisions. (Reducing the resolution of television signals is often called down conversion decoding.) Accordingly, such a downconverting decoder would allow the viewing of HDTV signals without requiring viewers to buy expensive HDTV displays.
There are known techniques for making such a downconverting decoder such that it requires less circuitry and is, therefore, cheaper than a decoder that outputs full HDTV resolution. One of these methods is disclosed in U.S. Pat. No. 5,262,854. The down conversion technique disclosed there is explained herein in connection with a down convertor 50 shown in FIG. 3. The down convertor 50 includes a Huffman and run/level decoder 52 and an inverse quantization block 54 which operate as previously described in connection with the Huffman and run/level decoder 32 and the inverse quantization block 36 of FIG. 2. However, instead of utilizing the 8×8 IDCT block 38 as shown in FIG. 2, the down convertor 50 employs a downsampler 56 which discards the forty-eight high order DCT coefficients of an 8×8 block and performs a 4×4 IDCT on the remaining 4×4 array of DCT coefficients. This process is usually referred to as DCT domain downsampling. The result of this downsampling is effectively a filtered and downsampled 4×4 block of residual samples (for P or B pictures) or pixels (for I pictures).
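The DCT domain downsampling performed by the downsampler 56 can be sketched in one dimension; the orthonormal transform helpers and the sqrt(4/8) rescaling are illustrative assumptions, not a description of any particular product:

```python
import math

def dct(x):
    # Orthonormal N-point DCT-II (illustrative helper).
    N = len(x)
    return [(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
            * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
            for k in range(N)]

def idct(X):
    N = len(X)
    return [sum((math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
                * X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for k in range(N))
            for n in range(N)]

def dct_downsample(x8):
    # 1-D analogue of DCT domain downsampling: transform 8 pixels, discard
    # the 4 high-order coefficients, rescale by sqrt(4/8) for the smaller
    # transform size, and apply a 4-point IDCT.
    X = dct(x8)
    return idct([c * math.sqrt(4 / 8) for c in X[:4]])

# A flat block downsamples to a flat block at the same level.
down = dct_downsample([100.0] * 8)
```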
For residual samples, a prediction is added by an adder 58 to the residual samples from the downsampler 56 in order to produce a decoded reduced resolution 4×4 block of pixels. This block is saved in a reference picture memory 60 for subsequent predictions. Accordingly, predictions will be made from a reduced resolution reference, while predictions made in the decoder loop within the encoder are made from full resolution reference pictures. This difference means that the prediction derived from the reduced resolution reference will differ by some amount from the corresponding prediction made by the encoder, resulting in error in the residual plus prediction sum provided by the adder 58 (this error is referred to herein as prediction error). This error may increase as predictions are made upon predictions until the reference is refreshed by the next I picture.
A motion compensator 62 attempts to reduce this prediction error by using the full resolution motion vectors, even though the reference picture is at lower resolution. First, a portion of the reference picture that includes the predicted macroblock is read from the reference picture memory 60. This portion is selected based on all bits of the motion vector except the least significant bit. This predicted macroblock is interpolated back to full resolution by a 2×2 prediction upsample filter 64. Using the full resolution motion vector (which may include ½ pixel resolution), a predicted full resolution macroblock is extracted from the upsampled portion based upon all of the bits of the motion vector. Then, a downsampler 66 performs a 2×2 downsampling on the extracted full resolution macroblock in order to match the resolution of the 4×4 IDCT output of the downsampler 56. In this way, the prediction from the reference picture memory 60 is upsampled to match the full resolution residual pixel structure allowing the use of full resolution motion vectors. Then, the full resolution reference picture is downsampled prior to addition by the adder 58 in order to match the resolution of the downsampled residual from the downsampler 56.
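A simple sketch of this upsample/extract/downsample sequence follows, substituting plain linear interpolation and pair averaging for the optimized filters 64 and 66 (the filter choices, sample values, and motion vector are illustrative assumptions):

```python
def upsample2(y):
    # 2x upsample by linear interpolation -- a crude stand-in for the
    # prediction upsample filter 64; the last pixel is repeated at the edge.
    p = []
    for i in range(len(y)):
        p.append(y[i])
        nxt = y[i + 1] if i + 1 < len(y) else y[i]
        p.append((y[i] + nxt) / 2)
    return p

def downsample2(p):
    # 2x downsample by pairwise averaging (stand-in for downsampler 66).
    return [(p[2 * i] + p[2 * i + 1]) / 2 for i in range(len(p) // 2)]

ref = [0, 2, 4, 6, 8, 10, 12, 14]   # reduced-resolution reference portion
full = upsample2(ref)               # approximate full-resolution pixels
mv = 3                              # full-resolution motion vector (integer part)
pred_full = full[mv:mv + 8]         # extract full-resolution predicted block
pred = downsample2(pred_full)       # reduced resolution again, for adder 58
```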
There are several known good prediction upsampling/downsampling methods that tend to minimize the prediction error caused by upsampling reference pictures that have been downsampled with a 4×4 IDCT. These methods typically involve use of a two dimensional filter having five to eight taps and tap values that vary both with the motion vector value for the predicted macroblock and the position of the current pixel being interpolated within the predicted macroblock. Such a filter not only upsamples the reduced resolution reference to full resolution and subsequently downsamples in a single operation, but it can also include additional ½ pixel interpolation (when required due to a fractional motion vector). (See, for example, "Minimal Error Drift in Frequency Scalability for Motion Compensated DCT Coding," Mokry and Anastassiou, IEEE Transactions on Circuits and Systems for Video Technology, August 1994, and "Drift Minimization in Frequency Scaleable Coders Using Block Based Filtering," Johnson and Princen, IEEE Workshop on Visual Signal Processing and Communication, Melbourne, Australia, September 1993.) The objective of such upsampling and downsampling is for the prediction upsampling filter to be a close spatial domain approximation to the effective filtering operation done by a 4×4 IDCT.
The following example is representative of the prediction upsampling/downsampling filter described in the Mokry and Johnson papers. This example is one dimensional but is easily extended to two dimensions. Let it be assumed that pixels y1 and pixels y2 as shown in FIG. 4 represent two adjacent blocks in a downsampled reference picture, and that the desired predicted block straddles the boundary between the two blocks. The pixels y1 are upsampled to the pixels p1 by using a four tap filter with a different set of tap values for each of the eight calculated pixels p1. The pixels y2 are likewise upsampled to the pixels p2 by using the same four tap filter arrangement. (If the motion vector requires ½ pixel interpolation, this interpolation is done using linear interpolation to calculate in-between pixel values based on the pixels p1 and p2.) From these sixteen pixels p1 and pixels p2, an upsampled prediction consisting of eight pixels q can be read using the full resolution motion vector. The pixels q are then filtered and downsampled to pixels q′ by an eight tap filter with a different set of tap values for each of the four pixels q′. The Johnson paper teaches how to determine the optimum tap values for these filters given that the reference picture was downsampled by a four point IDCT. The tap values are optimum in the sense that the prediction error is minimized. The Johnson and Mokry papers also show that the upsampling, linear interpolation, and downsampling filters can be combined into a single eight tap filter with tap values that depend on the motion vector value and the particular pixels q′ being calculated. Accordingly, this single eight tap filter allows four pixels q′ to be calculated directly from the eight pixels y1 and y2.
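Because both filtering stages are linear, they compose into a single matrix, which is the basis for the combined eight tap filter. The sketch below demonstrates the composition with deliberately trivial tap values (pixel repetition and pair averaging), not the papers' optimized taps:

```python
def matmul(A, B):
    # Plain matrix product of two lists-of-rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def apply(M, v):
    # Apply matrix M to vector v.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Illustrative tap values only: U upsamples 4 reference pixels to 8 by
# pixel repetition; D downsamples 8 back to 4 by pair averaging.
U = [[1 if j == i // 2 else 0 for j in range(4)] for i in range(8)]
D = [[0.5 if j in (2 * i, 2 * i + 1) else 0 for j in range(8)] for i in range(4)]

M = matmul(D, U)                    # the single combined filter
y = [10, 20, 30, 40]
one_step = apply(M, y)              # combined filter, one pass
two_step = apply(D, apply(U, y))    # explicit two-stage filtering
```

With these trivial taps the combined matrix happens to be the identity; with the papers' taps it is the motion-vector-dependent eight tap filter described above.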
The down convertor 50, while generally adequate for progressive pictures with frame DCT coded blocks, does not address problems that arise when attempting to down convert sequences of interlaced pictures with mixed frame and field DCT coded blocks. These problems arise, for the most part, with respect to vertical prediction upsampling, and are described below in a one dimensional vertical context. Thus, for the purpose of this description, a full resolution block refers to an eight pixel vertical column with a downsampled block having a corresponding vertical column of four pixels.
Let it be assumed that an eight point vertical column of pixels as shown in column 70 of FIG. 5 is transformed into DCT coefficients by an encoder utilizing an eight point DCT transform operation. A downconverting decoder discards the four high order coefficients for each block and performs a four point IDCT on the remaining coefficients (DCT domain downsampling). The spatial relationship between the original pixels x and the decoded pixels y is shown by columns 70 and 72. The pixels y represent the stored reference picture.
Prediction upsampling/downsampling methods, such as those previously referenced (Mokry, Johnson), which operate on DCT domain downsampled reference pictures, result in the spatial relationships shown in FIG. 6, where the reference pixels y are first upsampled to produce upsampled reference pixels p (these approximate the original pixels x) and the upsampled reference pixels p are then downsampled to produce downsampled reference pixels q. These methods attempt to effectively reverse the DCT domain downsampling with a minimal or small error due to the discarding of the high order DCT coefficients when the 4×4 IDCT is performed. The objective is for the prediction upsampling filter to be a close spatial domain approximation to the effective filtering operation done by a 4×4 IDCT.
The typical operation of such a filter operating vertically is explained as follows. A portion of the lower resolution reference picture consisting of two pixel blocks (e.g., the y1 and y2 pixel blocks of column 80) overlapped by the desired predicted block is accessed. As shown in column 82, these two pixel blocks are upsampled and filtered to approximate the full resolution reference so that the pixels p1 and p2 approximate full resolution pixels x. Then, the pixels p1 and p2 are filtered and downsampled to produce pixels q as shown in column 84. The pixels q form the predicted block that is supplied to the adder 58.
This upsampling/downsampling process can either be a two step filtering process, or the pixels q can be directly calculated from the pixels y using, for example, an eight tap filter whose filter coefficients vary with the motion vector value and the particular pixels q being calculated, as described in the Johnson and Mokry papers. As shown in FIG. 7, prediction upsampling/downsampling filters can also include additional ½ pixel interpolation (approximation of pixel values between original pixels x) when the motion vector is fractional.
It is noted that, as a result of DCT domain downsampling, the pixels y are effectively located halfway between the original pixels x. This spatial relationship has important implications because, as previously explained, the DCT blocks may be frame or field encoded. For example, if it is assumed that a full resolution frame consisting of fields A and B is encoded by the encoder, and if these fields are field DCT encoded, the DCT domain downsampling must be performed by a down conversion decoder separately on each field block. The resulting vertical spatial relationship of pixels in the downsampled fields a and b with respect to pixels in the original fields A and B is shown in FIG. 8, where the original encoded fields A and B are shown in column 90 and the downsampled fields a and b are shown in column 92. It should be noted that the pixels b are not evenly spaced between the pixels a.
On the other hand, with frame DCT encoding, pixels from fields A and B are combined together into DCT blocks by the encoder. DCT domain downsampling on these frame DCT coded blocks results in the pixel spatial relationship shown in FIG. 9, where the original frame DCT encoded fields A and B are shown in column 94 and the downsampled frame c is shown in column 96. It should be noted that the pixels c are evenly spaced.
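The vertical sample positions of the two structures can be computed directly, assuming that 2:1 DCT domain downsampling places each output sample midway between a pair of input lines (a simplification of the actual filtering; the line count is illustrative):

```python
LINES = 16                              # full-resolution frame lines 0..15
field_A = list(range(0, LINES, 2))      # even frame lines (field A)
field_B = list(range(1, LINES, 2))      # odd frame lines (field B)

def midpoints(lines):
    # Each downsampled sample sits midway between a pair of input lines.
    return [(lines[2 * i] + lines[2 * i + 1]) / 2 for i in range(len(lines) // 2)]

a = midpoints(field_A)                  # field DCT coding: each field separately
b = midpoints(field_B)
ab = sorted(a + b)                      # a/b structure of FIG. 8
c = midpoints(list(range(LINES)))       # frame DCT coding: c structure of FIG. 9

ab_gaps = [j - i for i, j in zip(ab, ab[1:])]   # uneven spacing
c_gaps = [j - i for i, j in zip(c, c[1:])]      # even spacing
```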
According to the MPEG-2 standard, a picture may be encoded with all macroblocks containing frame DCT coded blocks, with all macroblocks containing field DCT coded blocks, or with a mix of field and frame coded macroblocks. Therefore, performing DCT domain downsampling as shown by the prior art results in reference pictures that have a varying pixel structure. An entire reference picture may have the a/b structure shown in column 92 or the c structure shown in column 96. On the other hand, a reference picture may be composed of macroblocks, some having the a/b structure of column 92 and others having the c structure of column 96.
When forming a predicted macroblock, as previously explained, the reference picture must be upsampled so that it matches its original full resolution structure. The prediction upsampling operation is made more complicated because the two different reference picture pixel structures shown in columns 92 and 96 require different upsampling processes. Because the pixels in the c structured reference picture shown in column 96 have resulted from DCT domain downsampling of a frame, prediction upsample filtering must be performed on the reference macroblock as a frame to derive the A and B fields together. However, because the pixels in the a/b structured reference pictures shown in column 92 have resulted from DCT domain downsampling of separate fields, prediction upsample filtering must be performed separately on each field (a and b) of the reference macroblock in order to derive the A and then B fields shown in column 90.
A further complication is introduced when reference blocks have a mixed macroblock pixel structure because predicted macroblocks from the reference picture may straddle stored reference macroblocks, some having the c structure and some having the a/b structure. In this case, two different prediction upsample processes would have to be executed for different parts of the same predicted macroblock.
Moreover, a particular disadvantage of using the c structure shown in column 96 for reference pictures becomes apparent when it is necessary to do field prediction from a c structured reference, where the A/B structured full resolution reference contains high vertical frequencies. For example, if it is assumed that at full resolution the A/B reference is entirely composed of alternating black (A field) and white (B field) lines, a c structured downsampled reference would be composed of pixels that are approximately gray due to the mixing of the A and B pixels that occurs during filtering and downsampling. However, an a/b structured reference would have all black pixels for the a field and all white pixels for the b field because each field is filtered and downsampled separately. If the encoder decides to do a field prediction from the A field, a decoder with a c structured reference would read a prediction consisting of gray pixels. However, a decoder with an a/b structured reference would read a much more accurate prediction from the a field consisting of black pixels. Thus, the a/b structure avoids the field "mixing" in the decoder that occurs in the c structure.
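The black/white example above can be checked numerically, using pairwise averaging as a crude stand-in for the downsampling filter (the pixel values and line count are illustrative):

```python
# Full-resolution frame: field A all black (0), field B all white (255),
# interleaved as 16 lines A, B, A, B, ...
frame = [0, 255] * 8
A = frame[0::2]
B = frame[1::2]

def avg_pairs(v):
    # Pairwise averaging as a stand-in for filtering and 2:1 downsampling.
    return [(v[2 * i] + v[2 * i + 1]) / 2 for i in range(len(v) // 2)]

c = avg_pairs(frame)    # c structure: A and B pixels mix into gray
a = avg_pairs(A)        # a/b structure: fields downsampled separately
b = avg_pairs(B)
```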
The downconverting decoder of the present invention overcomes one or more of the problems inherent in the prior art.
In accordance with the present invention, a method of downconverting received frame and field coded DCT coefficient blocks to reconstructed pixel field blocks, wherein the frame and field coded DCT coefficient blocks have motion vectors associated therewith, comprises the following steps: a) converting the received frame coded DCT coefficient blocks to converted field coded DCT coefficient blocks and performing an IDCT on the converted field coded DCT coefficient blocks to produce residual or pixel field blocks; b) directly performing an IDCT on the received field coded DCT coefficient blocks to produce residual or pixel field blocks; c) selecting reference pixel blocks based upon the motion vectors, upsampling the reference pixel blocks, and downsampling at least a portion of the upsampled reference blocks to form a prediction; and, d) adding the prediction to the residual field blocks to form reconstructed field blocks.
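Steps a) and b) can be sketched in one dimension. An orthonormal DCT and a spatial domain route for the frame-to-field conversion are assumed here; an actual implementation may perform the conversion directly in the coefficient domain. The example verifies that a converted frame coded column downconverts to the same field blocks as a directly received field coded column:

```python
import math

def dct(x):
    # Orthonormal N-point DCT-II (illustrative helper).
    N = len(x)
    return [(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
            * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
            for k in range(N)]

def idct(X):
    N = len(X)
    return [sum((math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
                * X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for k in range(N))
            for n in range(N)]

def downconvert(X8):
    # Downconverting IDCT: keep the 4 low-order coefficients, rescale for
    # the change of transform size, apply a 4-point IDCT.
    return idct([c * math.sqrt(4 / 8) for c in X8[:4]])

# Full-resolution 16-line frame column; even lines are field A, odd field B.
col = [float((7 * n * n + 3 * n) % 50) for n in range(16)]

# Step b) field coded path: the encoder sent an 8-point DCT per field column.
fieldA_direct = downconvert(dct(col[0::2]))
fieldB_direct = downconvert(dct(col[1::2]))

# Step a) frame coded path: the encoder sent 8-point DCTs of two stacked
# frame blocks; convert to field coded coefficients by inverse transforming,
# de-interleaving the lines into fields, and re-transforming per field.
frame_pixels = idct(dct(col[:8])) + idct(dct(col[8:]))
fieldA_conv = downconvert(dct(frame_pixels[0::2]))
fieldB_conv = downconvert(dct(frame_pixels[1::2]))
```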
In a more detailed aspect of the present invention, an apparatus for downconverting received frame and field coded DCT coefficient blocks to reconstructed pixel field blocks comprises an IDCT and a motion compensator. The IDCT is arranged to convert the received frame coded DCT coefficient blocks to converted field coded DCT coefficient blocks and to perform an IDCT on the converted field coded DCT coefficient blocks and on the received field coded DCT coefficient blocks in order to produce downconverted pixel related field blocks. The motion compensator is arranged to apply motion compensation, as appropriate, to the downconverted pixel related field blocks in order to produce the reconstructed pixel field blocks.
In a further more detailed aspect of the present invention, an apparatus for downconverting received frame and field coded DCT coefficient blocks to downconverted pixel related field blocks comprises first and second IDCTs. The first IDCT is arranged to convert the received frame coded DCT coefficient blocks to converted field coded DCT coefficient blocks and to perform a downconverting IDCT on the converted field coded DCT coefficient blocks in order to produce first downconverted pixel related field blocks. The second IDCT is arranged to directly perform a downconverting IDCT on the received field coded DCT coefficient blocks in order to produce second downconverted pixel related field blocks.