This application claims subject matter similar to the subject matter disclosed in U.S. patent application Ser. No. 09/106,367 filed Jun. 29, 1998.
1. Technical Field of the Invention
The present invention relates to a down converting decoder for down converting and decoding high resolution encoded video for display by a lower resolution receiver.
2. Background of the Invention
The international standard ISO/IEC 13818-2 entitled “Generic Coding of Moving Pictures and Associated Audio Information: Video” and “Guide to the Use of the ATSC Digital Television Standard” describe a system, known as MPEG-2, for encoding and decoding digital video data. The standard allows for the encoding of video over a wide range of resolutions, including higher resolutions commonly known as HDTV. According to this standard, digital video data is encoded as a series of code words in a complicated manner that causes the average length of the code words to be much smaller than would be the case if, for example, each pixel were coded as an 8 bit value. This type of encoding is also known as data compression.
In the system described above, encoded pictures are made up of pixels. Each 8×8 array of pixels is known as a block, and a 2×2 array of these 8×8 blocks is termed a macroblock. Compression is achieved using the well known techniques of prediction (motion estimation in the encoder, motion compensation in the decoder), two dimensional discrete cosine transformation (DCT) which is performed on 8×8 blocks of pixels, quantization of DCT coefficients, and Huffman and run/level coding. However, prediction is not used for every picture. Thus, while P pictures are encoded with prediction from previous pictures, and while B pictures are encoded using prediction from either a previous or a subsequent picture, I pictures are encoded without prediction.
An MPEG-2 encoder is shown in simplified form in FIG. 1. Data representing macroblocks of pixels are fed to both a subtractor 12 and a motion estimator 14. In the case of P pictures and B pictures, the motion estimator 14 compares each new macroblock to be encoded with the macroblocks in a reference picture previously stored in a reference picture memory 16. The motion estimator 14 finds the macroblock in the stored reference picture that most closely matches the new macroblock.
The motion estimator 14 reads this matching macroblock (known as a predicted macroblock) out of the reference picture memory 16 and sends it to the subtractor 12 which subtracts it, on a pixel by pixel basis, from the new macroblock entering the MPEG-2 encoder 10. The output of the subtractor 12 is an error, or residual, that represents the difference between the predicted macroblock and the new macroblock being encoded. This residual is often very small. The residual is transformed from the spatial domain by a two dimensional DCT 18. The DCT coefficients resulting from the two dimensional DCT 18 are then quantized by a quantization block 20 in a process that reduces the number of bits needed to represent each coefficient. Usually, many coefficients are effectively quantized to zero. The quantized DCT coefficients are Huffman and run/level coded by a coder 22 which further reduces the average number of bits per coefficient.
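The forward path just described (subtract, transform, quantize) can be illustrated with a minimal sketch. It assumes an orthonormal DCT and a single uniform quantizer step; the function names and the step size of 4 are hypothetical and are not taken from the standard, which uses weighting matrices and a scale code.

```python
import math

N = 8  # block size used by MPEG-2


def dct_1d(x):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)))
    return out


def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    size = len(block)
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d([rows[r][c] for r in range(size)]) for c in range(size)]
    # cols[c][k] is vertical frequency k of column c; transpose back.
    return [[cols[c][k] for c in range(size)] for k in range(size)]


def quantize(coeffs, step):
    """Uniform quantization with one hypothetical step size."""
    return [[int(round(v / step)) for v in row] for row in coeffs]


# The new macroblock data and its prediction differ by a constant offset of 2.
new_block = [[102 for _ in range(N)] for _ in range(N)]
predicted = [[100 for _ in range(N)] for _ in range(N)]

# Subtractor 12: pixel by pixel difference (the residual).
residual = [[new_block[r][c] - predicted[r][c] for c in range(N)]
            for r in range(N)]

# DCT 18 followed by quantization block 20.
levels = quantize(dct_2d(residual), step=4)
nonzero = sum(1 for row in levels for v in row if v != 0)
```

In this example only the DC level survives quantization, which is the situation the run/level coder is designed to exploit.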
The motion estimator 14 also calculates a motion vector (mv) which represents the horizontal and vertical displacements of the predicted macroblock in the reference picture from the position of the new macroblock in the current picture being encoded. It should be noted that motion vectors may have ½ pixel resolution which is achieved by linear interpolation between adjacent pixels. The data encoded by the coder 22 are combined with the motion vector data from the motion estimator 14 and with other information (such as an indication of whether the picture is an I, P or B picture), and the combined data are transmitted to a receiver that includes an MPEG-2 decoder 30 (shown in FIG. 2 and discussed below).
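Half pixel motion vectors can be illustrated in one dimension. The sketch below assumes that the displacement is expressed in half pixel units and that the average of two neighbors is rounded upward; the function name and the sample values are hypothetical.

```python
def predict_row(reference, start_half_pel, length):
    """Read `length` predicted pixels from a 1D reference row.

    `start_half_pel` is the displacement in half pixel units, so an odd
    value lands midway between two reference pixels.
    """
    whole, half = divmod(start_half_pel, 2)
    out = []
    for i in range(length):
        a = reference[whole + i]
        if half:
            b = reference[whole + i + 1]
            out.append((a + b + 1) // 2)  # linear interpolation, rounded
        else:
            out.append(a)
    return out


reference_row = [10, 20, 30, 40, 50, 60]
full_pel = predict_row(reference_row, 2, 4)  # displacement of 1 pixel
half_pel = predict_row(reference_row, 3, 4)  # displacement of 1.5 pixels
```

The least significant bit of the displacement selects interpolation, and the remaining bits select the whole pixel position.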
For the case of P pictures, the quantized DCT coefficients from the quantization block 20 are also supplied to an internal decoder loop that represents a portion of the operation of the MPEG-2 decoder 30. Within this internal loop, the residual from the quantization block 20 is inverse quantized by an inverse quantization block 24 and is inverse DCT transformed by an inverse discrete cosine transform (IDCT) block 26. The predicted macroblock, that is read out of the reference picture memory 16 and that is supplied to the subtractor 12, is also added back to the output of the IDCT block 26 on a pixel by pixel basis by an adder 28, and the result is stored back into the reference picture memory 16 in order to serve as a macroblock of a reference picture for predicting subsequent pictures. The object of this internal loop is to have the data in the reference picture memory 16 of the MPEG-2 encoder 10 match the data in the reference picture memory of the MPEG-2 decoder 30. B pictures are not stored as reference pictures.
In the case of I pictures, no motion estimation occurs and the negative input to the subtractor 12 is forced to zero. In this case, the quantized DCT coefficients provided by the two dimensional DCT 18 represent transformed pixel values rather than residual values, as is the case with P and B pictures. As in the case of P pictures, decoded I pictures are stored as reference pictures.
The MPEG-2 decoder 30 illustrated in FIG. 2 is a simplified showing of an MPEG-2 decoder. The decoding process implemented by the MPEG-2 decoder 30 can be thought of as the reverse of the encoding process implemented by the MPEG-2 encoder 10. Accordingly, the received encoded data is Huffman and run/level decoded by a Huffman and run/level decoder 32. Motion vectors and other information are parsed from the data stream flowing through the Huffman and run/level decoder 32. The motion vectors are fed to a motion compensator 34. Quantized DCT coefficients at the output of the Huffman and run/level decoder 32 are fed to an inverse quantization block 36 and then to an IDCT block 38 which transforms the inverse quantized DCT coefficients back into the spatial domain.
For P and B pictures, each motion vector is translated by the motion compensator 34 to a memory address in order to read a particular macroblock (predicted macroblock) out of a reference picture memory 42 which contains previously stored reference pictures. An adder 44 adds this predicted macroblock to the residual provided by the IDCT block 38 in order to form reconstructed pixel data. For I pictures, there is no prediction, so that the prediction provided to the adder 44 is forced to zero. For I and P pictures, the output of the adder 44 is fed back to the reference picture memory 42 to be stored as a reference picture for future predictions.
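The reconstruction rule above can be sketched as follows. The clipping to the 8 bit range and the list used as a stand-in for the reference picture memory 42 are assumptions made for illustration, as are all function names.

```python
def clip8(value):
    """Limit a reconstructed sample to the 8 bit range (assumed here)."""
    return max(0, min(255, value))


def reconstruct(residual, prediction=None):
    """Adder 44: add prediction and residual pixel by pixel.

    prediction=None models an I picture, where the prediction is zero.
    """
    if prediction is None:
        prediction = [[0] * len(row) for row in residual]
    return [[clip8(r + p) for r, p in zip(res_row, pred_row)]
            for res_row, pred_row in zip(residual, prediction)]


reference_memory = []  # hypothetical stand-in for reference picture memory 42


def decode_block(picture_type, residual, prediction=None):
    """Reconstruct a block and store it only if it is part of an I or P picture."""
    used = None if picture_type == "I" else prediction
    pixels = reconstruct(residual, used)
    if picture_type in ("I", "P"):  # B pictures are not kept as references
        reference_memory.append(pixels)
    return pixels


i_block = decode_block("I", [[120, 130], [140, 150]])
p_block = decode_block("P", [[3, -2], [0, 200]], prediction=i_block)
b_block = decode_block("B", [[1, 1], [1, 1]], prediction=p_block)
```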
The MPEG encoder 10 can encode sequences of progressive or interlaced pictures. For sequences of interlaced pictures, pictures may be encoded as field pictures or as frame pictures. For field pictures, one picture contains the odd lines of the raster, and the next picture contains the even lines of the raster. All encoder and decoder processing is done on fields. Thus, the DCT transform is performed on 8×8 blocks that contain all odd or all even numbered lines. These blocks are referred to as field DCT coded blocks.
On the other hand, for frame pictures, each picture contains both odd and even numbered lines of the raster. Macroblocks of frame pictures are encoded as frames in the sense that an encoded macroblock contains both odd and even lines. However, the DCT performed on the four blocks within each macroblock of a frame picture may be done in two different ways. Each of the four DCT transform blocks in a macroblock may contain both odd and even lines (frame DCT coded blocks), or alternatively two of the four DCT blocks in a macroblock may contain only the odd lines of the macroblock and the other two blocks may contain only the even lines of the macroblock (field DCT coded blocks). See ISO/IEC 13818-2, section 6.1.3, FIGS. 6-13 and 6-14. The coding decision as to which way to encode a picture may be made adaptively by the MPEG-2 encoder 10 based upon which method results in better data compression.
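The two block organizations of a frame picture macroblock can be made concrete with a short sketch. It assumes a 16×16 luminance macroblock held as a list of lines indexed from 0, so that lines 0, 2, 4, ... form one field and lines 1, 3, 5, ... form the other; the function name is hypothetical.

```python
def split_macroblock(mb, field_dct):
    """Return the four 8x8 luminance blocks of a 16x16 macroblock.

    With field_dct=True the lines are regrouped so that the first two
    blocks hold one field and the last two blocks hold the other field.
    """
    lines = mb[0::2] + mb[1::2] if field_dct else mb
    blocks = []
    for top in (0, 8):
        for left in (0, 8):
            blocks.append([row[left:left + 8] for row in lines[top:top + 8]])
    return blocks


# Every pixel carries its own line number so the grouping is visible.
macroblock = [[line] * 16 for line in range(16)]

frame_blocks = split_macroblock(macroblock, field_dct=False)
field_blocks = split_macroblock(macroblock, field_dct=True)

frame_lines_in_block0 = [row[0] for row in frame_blocks[0]]
field_lines_in_block0 = [row[0] for row in field_blocks[0]]
field_lines_in_block2 = [row[0] for row in field_blocks[2]]
```

A frame DCT coded block therefore covers eight consecutive lines, while a field DCT coded block covers eight lines of one field spread over sixteen frame lines.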
Residual macroblocks in field pictures are field DCT coded and are predicted from a reference field. Residual macroblocks in frame pictures that are frame DCT coded are predicted from a reference frame. Residual macroblocks in frame pictures that are field DCT coded have two blocks predicted from one reference field and two blocks predicted from either the same or the other reference field.
For sequences of progressive pictures, all pictures are frame pictures with frame DCT coding and frame prediction.
MPEG-2, as described above, includes the encoding and decoding of video at high resolution (HDTV). In order to permit people to use their existing NTSC televisions so as to view HDTV transmitted programs, it is desirable to provide a decoder that decodes high resolution MPEG-2 encoded data as reduced resolution video data for display on existing NTSC televisions. (Reducing the resolution of television signals is often called down conversion decoding.) Accordingly, such a down converting decoder would allow the viewing of HDTV signals without requiring viewers to buy expensive HDTV displays.
There are known techniques for making a down converting decoder such that it requires less circuitry and is, therefore, cheaper than a decoder that outputs full HDTV resolution. One of these methods is disclosed in U.S. Pat. No. 5,262,854. The down conversion technique disclosed there is explained herein in connection with a down convertor 50 shown in FIG. 3. The down convertor 50 includes a Huffman and run/level decoder 52 and an inverse quantization block 54 which operate as previously described in connection with the Huffman and run/level decoder 32 and the inverse quantization block 36 of FIG. 2. However, instead of utilizing the 8×8 IDCT block 38 as shown in FIG. 2, the down convertor 50 employs a down sampler 56 which discards the forty-eight high order DCT coefficients of an 8×8 block and performs a 4×4 IDCT on the remaining 4×4 array of DCT coefficients. This process is usually referred to as DCT domain down sampling. The result of this down sampling is effectively a filtered and down sampled 4×4 block of residual samples (for P or B pictures) or pixels for I pictures.
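DCT domain down sampling can be sketched in a few lines. The sketch assumes orthonormal DCT matrices, in which case the retained 4×4 coefficients must be scaled by 1/2 (a factor of 1/√2 per dimension) to preserve brightness; that scaling convention and all function names are assumptions, not details taken from the cited patent.

```python
import math


def dct_matrix(n):
    """Orthonormal n point DCT-II matrix C, so that coeffs = C * samples."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


C8 = dct_matrix(8)
C4 = dct_matrix(4)


def dct8x8(block):
    """Forward 8x8 DCT, used here only to create test coefficients."""
    return matmul(matmul(C8, block), transpose(C8))


def downsample_dct(coeffs8x8):
    """Discard the 48 high order coefficients and apply a 4x4 IDCT."""
    low = [[coeffs8x8[r][c] / 2.0 for c in range(4)] for r in range(4)]
    return matmul(matmul(transpose(C4), low), C4)


flat = [[50.0] * 8 for _ in range(8)]
small = downsample_dct(dct8x8(flat))

ramp = [[10.0 * c for c in range(8)] for _ in range(8)]
small_ramp = downsample_dct(dct8x8(ramp))
```

A flat block stays flat at the same level, while the horizontal ramp becomes a 4×4 block whose values rise from left to right, showing that the operation filters and decimates in a single step.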
For residual samples, a prediction is added by an adder 58 to the residual samples from the down sampler 56 in order to produce a decoded reduced resolution 4×4 block of pixels. This block is saved in a reference picture memory 60 for subsequent predictions. Accordingly, predictions will be made from a reduced resolution reference, while predictions made in the decoder loop within the encoder are made from full resolution reference pictures. This difference means that the prediction derived from the reduced resolution reference will differ by some amount from the corresponding prediction made by the encoder, resulting in error in the residual-plus-prediction sum provided by the adder 58 (this error is referred to herein as prediction error). Prediction error may increase as predictions are made upon predictions until the reference is refreshed by the next I picture.
A motion compensator 62 attempts to reduce this prediction error by using the full resolution motion vectors, even though the reference picture is at lower resolution. First, a portion of the reference picture that includes the predicted macroblock is read from the reference picture memory 60. This portion is selected based on all bits of the motion vector except the least significant bit. This predicted macroblock is interpolated back to full resolution by a 2×2 prediction up sample filter 64. Using the full resolution motion vector (which may include ½ pixel resolution), a predicted full resolution macroblock is extracted from the up sampled portion based upon all of the bits of the motion vector. Then, a down sampler 66 performs a 2×2 down sampling on the extracted full resolution macroblock in order to match the resolution of the 4×4 IDCT output of the down sampler 56. In this way, the prediction from the reference picture memory 60 is up sampled to match the full resolution residual pixel structure allowing the use of full resolution motion vectors. Then, the full resolution reference picture is down sampled prior to addition by the adder 58 in order to match the resolution of the down sampled residual from the down sampler 56.
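The up sample, extract, and down sample sequence can be sketched in one dimension. The filters here are deliberately simple placeholders (pixel repetition for up sampling and pair averaging for down sampling) and are assumptions, not the filters of the cited patent; the sketch also up samples the whole row rather than a selected portion. The motion vector is assumed to be in half pixel units at full resolution.

```python
def upsample2(row):
    """Placeholder up sampler: repeat each reduced resolution pixel twice."""
    return [p for p in row for _ in range(2)]


def extract(full_row, mv_half_pel, length):
    """Read `length` full resolution pixels at a half pixel unit offset."""
    whole, half = divmod(mv_half_pel, 2)
    out = []
    for i in range(length):
        a = full_row[whole + i]
        b = full_row[whole + i + 1] if half else a
        out.append((a + b) / 2.0)  # linear interpolation when half is set
    return out


def downsample2(row):
    """Placeholder down sampler: average each pair of pixels."""
    return [(row[i] + row[i + 1]) / 2.0 for i in range(0, len(row), 2)]


def predict_reduced(reference_row, mv_half_pel, length_full=8):
    """Form a reduced resolution prediction using a full resolution vector."""
    full = upsample2(reference_row)
    block = extract(full, mv_half_pel, length_full)
    return downsample2(block)


reference_row = [10, 20, 30, 40, 50, 60, 70, 80]
aligned = predict_reduced(reference_row, mv_half_pel=0)
shifted = predict_reduced(reference_row, mv_half_pel=2)  # one full-res pixel
```

A displacement of one full resolution pixel corresponds to half a pixel in the reduced resolution reference, which is why the shifted prediction falls midway between stored values.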
There are several known good prediction up sampling/down sampling methods that tend to minimize the prediction error caused by up sampling reference pictures that have been down sampled with a 4×4 IDCT. These methods typically involve use of a two dimensional filter having five to eight taps and tap values that vary both with the motion vector value for the predicted macroblock relative to the nearest macroblock boundaries in the reference picture, and with the position of the current pixel being interpolated within the predicted macroblock. Such a filter not only up samples the reduced resolution reference to full resolution and subsequently down samples in a single operation, but it can also include ½ pixel interpolation (when required due to an odd valued motion vector). (See, for example, “Minimal Error Drift in Frequency Scalability for Motion Compensated DCT Coding,” Mokry and Anastassiou, IEEE Transactions on Circuits and Systems for Video Technology, August 1994, and “Drift Minimization in Frequency Scaleable Coders Using Block Based Filtering,” Johnson and Princen, IEEE Workshop on Visual Signal Processing and Communication, Melbourne, Australia, September 1993.)
A more general derivation of minimum drift prediction filters by using the Moore-Penrose inverse of a block based down sampling filter is described in “Minimum Drift Architectures for 3-Layer Scalable DTV Decoding,” Vetro, Sun, DaGraca and Poon, IEEE Transactions on Consumer Electronics, August 1998.
The following example is representative of the prediction up sampling/down sampling filter described in the Mokry and Johnson papers. This example is a one dimensional example but is easily extended to two dimensions. Let it be assumed that pixels y1 and pixels y2 as shown in FIG. 4 represent two adjacent blocks in a down sampled reference picture, and that the desired predicted block straddles the boundary between the two blocks. The pixels y1 are up sampled to the pixels p1 by using a four tap filter with a different set of tap values for each of the eight calculated pixels p1. The pixels y2 are likewise up sampled to the pixels p2 by using the same four tap filter. (If the motion vector requires ½ pixel interpolation, this interpolation is done using linear interpolation to calculate in between pixel values based on the pixels p1 and p2.) From these sixteen pixels p1 and pixels p2, an up sampled prediction consisting of eight pixels q can be read using the full resolution motion vector. The pixels q are then filtered and down sampled to pixels q′ by an eight tap filter with a different set of tap values for each of the four pixels q′. The Johnson paper teaches how to determine the optimum tap values for these filters given that the reference picture was down sampled by a four point IDCT. The tap values are optimum in the sense that the prediction error is minimized. The Johnson and Mokry papers also show that the up sampling, linear interpolation, and down sampling filters can be combined into a single eight tap filter with tap values that depend on the motion vector value relative to the nearest macroblock boundaries in the reference picture, and that depend on the particular pixels q′ being calculated. Accordingly, this single eight tap filter allows four pixels q′ to be calculated directly from the eight pixels y1 and y2.
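The combination of the three stages into one filter can be sketched with small matrices. The sketch assumes orthonormal DCT scaling, models the four point IDCT down sampler as a 4×8 matrix D, and takes the up sampler U to be the Moore-Penrose inverse of D, which for this particular D equals 2·Dᵀ. The tap values in the cited papers are not reproduced here, and all names are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal n point DCT-II matrix."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


C4, C8 = dct_matrix(4), dct_matrix(8)
KEEP = [[1.0 if r == c else 0.0 for c in range(8)] for r in range(4)]

# D: 8 full resolution pixels -> 4 reduced resolution pixels.
D = [[v / math.sqrt(2.0) for v in row]
     for row in matmul(matmul(transpose(C4), KEEP), C8)]
# U: Moore-Penrose inverse of D; since D * D^T = I/2 it equals 2 * D^T.
U = [[2.0 * v for v in row] for row in transpose(D)]


def selection(offset_half_pel):
    """8x16 matrix reading the 8 pixels q from the 16 pixels [p1, p2].

    Valid offsets are 0..16 half pixel units (16 reads exactly p2).
    """
    whole, half = divmod(offset_half_pel, 2)
    s = [[0.0] * 16 for _ in range(8)]
    for i in range(8):
        if half:
            s[i][whole + i] = 0.5
            s[i][whole + i + 1] = 0.5
        else:
            s[i][whole + i] = 1.0
    return s


def combined_filter(offset_half_pel):
    """4x8 matrix mapping the 8 pixels [y1, y2] directly to 4 pixels q'."""
    zeros = [[0.0] * 4 for _ in range(8)]
    up_both = ([u + z for u, z in zip(U, zeros)]
               + [z + u for z, u in zip(zeros, U)])  # 16x8 block diagonal
    return matmul(matmul(D, selection(offset_half_pel)), up_both)


y = [[v] for v in (10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0)]
q_aligned = [row[0] for row in matmul(combined_filter(0), y)]
q_next = [row[0] for row in matmul(combined_filter(16), y)]
straddle_taps = combined_filter(7)
```

Each row of the combined matrix holds the eight taps for one output pixel q′, and a different offset yields a different set of taps, consistent with the dependence on motion vector and pixel position described above.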
For methods of down sampling other than the four point IDCT, the Vetro paper describes how to determine the optimum tap values for the up sampling filter. This up sampling can also be combined with the linear interpolation and down sampling operations to form a single prediction filter.
The down convertor 50, while generally adequate for progressive pictures with frame DCT coded blocks, does not address problems that arise when attempting to down convert sequences of interlaced pictures with mixed frame and field DCT coded blocks. These problems arise with respect to vertical down sampling and vertical prediction filtering.
Let it be assumed that horizontal down sampling is performed in the DCT domain using a four point horizontal IDCT. Vertical down sampling may also utilize a four point IDCT or some other method. For field pictures, the vertical down sampling operation is then performed on incoming field coded blocks. For frame pictures, the vertical operation is performed on a mix of field and frame coded blocks. Thus, reference pictures may have been down sampled on a field basis, a frame basis, or a mix of both. As previously explained, low drift prediction filtering may be derived from the down sampling filter. If different reference pictures are down sampled differently, they will require different matching prediction filters.
Worse yet is the case of a reference picture containing a mix of field and frame down sampled blocks. A given required prediction may overlap both types of blocks. This complication may be resolved by converting all incoming blocks to either frames or fields before down sampling. This conversion will result in a consistent vertical structure for reference pictures so that the same prediction filter can always be used.
It has been suggested that all incoming pictures be converted to frames before performing vertical down sampling (see “Frequency Domain Down Conversion of HDTV Using Adaptive Motion Compensation,” by Vetro, Sun, Bao and Poon, ICIP '97). Conversion to frames before performing vertical down sampling will result in better vertical resolution than would field based down sampling. However, frame based down sampling requires additional memory in the decoder because a first field must be stored when received in order to allow the second field to arrive so that frame blocks may be formed. Also, severe blocking artifacts in motion sequences may occur (see “Frequency Domain Down Conversion of HDTV Using an Optimal Motion Compensation Scheme,” by Vetro and Sun, Journal of Imaging Science and Technology, August 1998).
An alternative, that would avoid these problems and that is suggested in the latter paper, is to convert all incoming pictures to fields before performing vertical down sampling. Therefore, the present invention described herein always uses field based processing such that incoming blocks which are frame coded are first converted to fields before vertical down sampling.
Also, it is well known that four point IDCT down sampling may cause visible artifacts. For progressive pictures, the degree of visibility is usually acceptable for both horizontal and vertical processing. For interlaced pictures using field based vertical down sampling, however, these artifacts may be much more visible. Thus, the present invention implements a technique other than the four point IDCT for its field based vertical down sampling (a four point IDCT is still employed for horizontal down sampling).
In accordance with one aspect of the present invention, a method of down converting received frame and field DCT coded blocks to reconstructed pixel field blocks is provided wherein each of the received frame and field DCT coded blocks contains N×N values. The method comprises the following steps: a) converting the received frame DCT coded blocks to converted field DCT coded blocks; b) performing a horizontal M point IDCT, a vertical N point IDCT, vertical spatial filtering, and down sampling on the received field DCT coded blocks and on the converted field DCT coded blocks in order to produce residual and pixel field blocks as appropriate, wherein at least the vertical spatial filtering and down sampling encompasses more than N points, and wherein N > M; and, c) adding prediction reference pixels to the residual field blocks, as appropriate, in order to form reconstructed pixel field blocks.
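The notion of vertical spatial filtering and down sampling that encompasses more than N points can be illustrated for a single column of field samples. The sketch assumes that the vertical N point IDCT has already been applied, that the filter is a symmetric five tap low pass kernel centered on every second line, and that the blocks above and below supply the samples beyond the block edges. The tap values and function name are hypothetical and are not taken from this disclosure.

```python
def vertical_filter_downsample(above, current, below, taps):
    """Filter one column across block boundaries, keeping every second line.

    Each of `above`, `current`, and `below` is a list of N spatial samples
    from vertically adjacent field blocks; `taps` is an odd length kernel.
    Returns N/2 filtered samples for `current`.
    """
    n = len(current)
    column = above + current + below
    reach = len(taps) // 2
    out = []
    for line in range(0, n, 2):       # down sample by 2
        center = n + line             # index of this line within `column`
        acc = 0.0
        for t, weight in enumerate(taps):
            acc += weight * column[center + t - reach]
        out.append(acc)
    return out


taps = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]  # hypothetical low pass kernel

flat = vertical_filter_downsample([7.0] * 8, [7.0] * 8, [7.0] * 8, taps)

# Changing only the block above alters the first output of the current block,
# showing that the filter support extends beyond the N lines of one block.
edge = vertical_filter_downsample([23.0] * 8, [7.0] * 8, [7.0] * 8, taps)
```

With five taps centered on lines 0, 2, 4 and 6 of an eight line block, the support runs from two lines above the block to the first line below it, so more than N = 8 points contribute to the four outputs.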
In accordance with another aspect of the present invention, a method of decoding a received first DCT coefficient block to a reconstructed field pixel block comprises the following steps: a) applying a vertical operator and a horizontal operator to the first DCT coefficient block in order to produce intermediate residual or pixel values, wherein the vertical operator is applied concurrently to the first DCT coefficient block and to coefficients in second and third DCT coefficient blocks, wherein the second DCT coefficient block is above the first DCT coefficient block, wherein the third DCT coefficient block is below the first DCT coefficient block, and wherein the horizontal operator is applied to the first DCT coefficient block but not concurrently to the second and third DCT coefficient blocks; and, b) adding prediction reference pixels to the intermediate residual values, as appropriate, to form reconstructed pixels.
In accordance with still another aspect of the present invention, a method of decoding a received first DCT coefficient macroblock, having frame DCT coded blocks, to reconstructed field pixel blocks comprises the following steps: a) applying a vertical operator and a horizontal operator to the first DCT coefficient macroblock in order to produce intermediate residual or pixel values, wherein the vertical operator is applied concurrently to the first DCT coefficient macroblock and to coefficients in second and third DCT coefficient macroblocks, wherein the second DCT coefficient macroblock is above the first DCT coefficient macroblock, wherein the third DCT coefficient macroblock is below the first DCT coefficient macroblock, and wherein the horizontal operator is applied to each block of the first DCT coefficient macroblock but not concurrently to the second and third DCT coefficient macroblocks; and, b) adding prediction reference pixels to the intermediate residual values, as appropriate, to form reconstructed pixels.
In accordance with yet another aspect of the present invention, an apparatus arranged to reconstruct pixels from a target DCT coefficient macroblock comprises a vertical operator, a horizontal operator, and an adder. The vertical operator has sufficient size to be applied concurrently to the target DCT coefficient macroblock and an adjacent DCT coefficient macroblock. The horizontal operator is arranged to horizontally filter the target DCT coefficient macroblock in order to produce intermediate pixel values in conjunction with the vertical filter. The adder is arranged to add prediction reference pixels to the intermediate pixel values in order to form reconstructed pixels.