1. Field of the Invention
The present invention relates to the transmission and storage of images, and, in particular, to interframe video encoding and decoding.
2. Description of the Related Art
Image processing generally employs two compression techniques: intraframe compression and interframe compression. Intraframe compression compresses the information within a single image, and includes techniques such as the discrete cosine transform. Interframe compression exploits the temporal redundancy between sequential image frames. Frequently, parts of an image in one frame are merely a translation in the x-y plane of the same image portion from a previous frame. Accordingly, the position of the translated portion can be communicated by transmitting the previous frame along with a motion vector specifying the translation of that portion. By not transmitting the entire second frame, such a system substantially reduces the number of bits that must be transmitted.
All the video coding standards, including H.261, MPEG-1, MPEG-2, H.263, and very likely the new MPEG-4 standard, employ motion predictive interframe coding to remove temporal redundancy. The MPEG standards employ three types of pictures: intrapictures (I), predicted pictures (P), and bidirectionally interpolated pictures (B). Intrapictures generally serve as reference frames with only moderate compression. Predicted pictures are coded with reference to a past picture, which may be an intrapicture or another predicted picture, and are generally used as a reference for future predicted pictures. Bidirectional pictures provide the highest amount of compression, but require both a past and a future reference for prediction. Bidirectional pictures are usually not used as a reference.
Motion compensation ("motion estimation") is explained with reference to FIGS. 1 and 2. FIG. 1 illustrates a current (second) frame 12 that is to be predicted using a previous (first) frame 10. The first image 10 may or may not immediately precede the second image 12. FIG. 2 illustrates a conventional encoding system 20.
The encoding system 20 receives the first and second video image frames 10 and 12, respectively, and generates motion vectors to encode the second image 12. The images are stored in an encoder frame memory 22. In a motion estimator 36, the video signals of the second frame 12 are compared to the video signals of the first frame 10 to determine the location of portions of the first frame 10 that correspond to portions of the second frame 12.
Because the motion vector informs a conventional decoding system 40 where to find a particular block within the first image 10, the first image 10 must be transmitted as a reference image to the conventional decoding system 40. Before transmission, the first image 10 is compressed by performing a number of functions on a block-by-block basis (I-frame coding).
The conventional encoding system 20 receives the first video image frame 10. A transformer 24 intraframe transforms the first video image 10. The transformer 24 uses standard transformation techniques, such as the discrete cosine transform (DCT). A quantizer 26 quantizes the output of the transformer 24. The output of the quantizer is variable length coded by a variable length coder 28. In turn, the output of the variable length coder 28 is input to a bit stream generator 30, which outputs an encoded bit stream.
The output of the quantizer 26 is also input into an inverse quantizer 32. The inverse quantizer output is inverse transformed by an inverse transformer 34. The output of the inverse transformer 34 is stored in the frame memory 22. All blocks of the image frame are sequentially stored in the frame memory 22.
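The intraframe path described above (transformer 24, quantizer 26, inverse quantizer 32, inverse transformer 34) can be illustrated with a minimal sketch. It is not the encoder's actual implementation: it assumes a one-dimensional orthonormal DCT over a short row of samples rather than two-dimensional blocks, a single uniform quantizer step, and it omits variable length coding. All function names are hypothetical.

```python
import math

def dct(samples):
    """Orthonormal 1-D DCT-II of a list of samples (illustrative only)."""
    n = len(samples)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        total = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, x in enumerate(samples))
        out.append(scale * total)
    return out

def idct(coeffs):
    """Inverse of dct() above (orthonormal 1-D DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1 if k == 0 else 2) / n)
            total += scale * c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(total)
    return out

def quantize(coeffs, step):
    """Uniform quantizer: map each coefficient to an integer level."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Inverse quantizer: map integer levels back to coefficient values."""
    return [level * step for level in levels]

def intra_round_trip(samples, step):
    """Return the quantized levels and the samples the frame memory would hold."""
    levels = quantize(dct(samples), step)
    return levels, idct(dequantize(levels, step))

levels, stored = intra_round_trip([8, 8, 8, 8], step=4)
print(levels, [round(v) for v in stored])
```

The point of the sketch is that the frame memory holds the result of the inverse path, not the original samples, so that the encoder predicts from the same data a decoder would have.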
When the second frame 12 is input to the conventional encoding system 20, the motion estimator 36 produces motion vectors identifying the blocks in the frame memory 22 that most closely match blocks in the second image frame 12. The motion estimator 36 compares the pixels of a selected block 11 to the pixels of a corresponding, but larger, search area 13 within the first frame 10 to determine the block of the first frame 10 that most closely matches the selected block 11 of the second frame 12. The match may be determined using standard pattern matching techniques. If a match is indicated, the location of the matched block ("the motion compensation block") within the search area relative to the location of the block selected from the second image 12 provides a motion vector indicating the displacement of the current block with respect to the previous block. Once they are calculated by the motion estimator 36, the motion vectors are sent to the bit stream generator 30, where they are converted into a bit stream and output from the encoding system 20.
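The block search performed by the motion estimator 36 can be sketched as follows. The text says only that "standard pattern matching techniques" may be used, so this sketch assumes one common choice: an exhaustive search minimizing the sum of absolute differences (SAD). Frames are assumed to be lists of rows of integer pixel values, and the function names and the `search` parameter are hypothetical.

```python
def sad(cur, cx, cy, ref, rx, ry, size):
    """Sum of absolute differences between a block of cur and a block of ref."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total

def find_motion_vector(ref, cur, bx, by, size, search):
    """Search ref within +/- search pixels of (bx, by) for the best match.

    Returns (dx, dy, cost), where (dx, dy) is the position of the matched
    block in ref relative to the position of the selected block in cur.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, bx, by, ref, rx, ry, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best

# A 2x2 patch at (x=1, y=2) in the first frame moves to (x=3, y=4) in the second.
first = [[0] * 6 for _ in range(6)]
second = [[0] * 6 for _ in range(6)]
for (dy, dx), value in {(0, 0): 10, (0, 1): 20, (1, 0): 30, (1, 1): 40}.items():
    first[2 + dy][1 + dx] = value
    second[4 + dy][3 + dx] = value

print(find_motion_vector(first, second, bx=3, by=4, size=2, search=2))
```

An exhaustive search is the simplest option to reason about; practical encoders often use faster, approximate search patterns.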
Additionally, an adder 38 subtracts the motion compensation block, selected from frame memory 22 by the motion vector, from an actual image block in the second image frame 12. A resulting error block is input to the transformer 24. A transformed error block output by the transformer 24 is input to the quantizer 26. The quantizer 26 outputs a transformed and quantized motion error block to the variable length coder 28. Subsequently, the bit stream generator 30 produces a bit stream resulting from the variable length coder 28 acting on that input.
The transformed and quantized error block is also input to the inverse quantizer 32. The inverse transformer 34 then applies to the output of the inverse quantizer 32 the inverse of the transform applied by the transformer 24, yielding a decoded error block. An adder 41 combines the decoded error block with the motion compensation block selected from the frame memory to form a reconstructed P block resembling the original block in the second image frame 12 which was input to the conventional encoding system 20. This reconstructed P block is stored in the frame memory 22 for future image predictions.
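The error-block path through adder 38 and adder 41 can be sketched as below. To keep the example short, the transform stage is omitted and the error block is quantized directly with a uniform step; that simplification, the step size, and all function names are assumptions made for illustration only.

```python
def subtract_blocks(actual, predicted):
    """Adder 38: error block = actual block minus motion compensation block."""
    return [[a - p for a, p in zip(ra, rp)] for ra, rp in zip(actual, predicted)]

def add_blocks(predicted, error):
    """Adder 41: reconstructed block = motion compensation block plus error."""
    return [[p + e for p, e in zip(rp, re)] for rp, re in zip(predicted, error)]

def quantize(block, step):
    """Uniform quantizer applied directly to the error block (no transform)."""
    return [[round(v / step) for v in row] for row in block]

def dequantize(levels, step):
    """Inverse quantizer."""
    return [[level * step for level in row] for row in levels]

def encode_p_block(actual, predicted, step):
    """Return (levels to transmit, reconstructed block to store in frame memory)."""
    levels = quantize(subtract_blocks(actual, predicted), step)
    reconstructed = add_blocks(predicted, dequantize(levels, step))
    return levels, reconstructed

actual = [[100, 104], [98, 90]]
predicted = [[97, 100], [99, 97]]
levels, reconstructed = encode_p_block(actual, predicted, step=4)
print(levels)         # quantized error block
print(reconstructed)  # close to, but not identical to, the actual block
```

Note that the stored block differs slightly from the actual block because quantization is lossy; storing the reconstructed block rather than the original keeps the encoder's reference consistent with what a decoder can rebuild.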
The MPEG standard also supports bidirectionally predicted pictures. For example, assume that successive frames 1 2 3 4 are to be transmitted, where frame 1 is the I picture, frames 2 and 3 are B pictures, and frame 4 is a P picture. Frame 4 is predicted as described above by calculating one motion vector and error image per block with respect to frame 1. Frames 2 and 3 are bidirectionally predicted so that they incorporate information from both past (e.g., frame 1) and future frames (e.g., frame 4).
Two motion vectors and one error block are transmitted for each bidirectionally predicted block. The first motion vector for frame 2, in this example, is the motion vector computed with respect to I frame 1. The second motion vector is calculated with respect to P frame 4. The two motion vectors are used to generate two predicted motion compensation blocks for B frame 2. The two predicted blocks calculated with respect to frames 1 and 4 are averaged together to generate an average predicted block. The difference between the average predicted block and the corresponding actual block from B frame 2 represents the error block for B frame 2.
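The bidirectional prediction just described can be sketched as follows. The two predicted blocks are assumed to have already been fetched using the two motion vectors. The text specifies only that the blocks are averaged; rounding halves upward is an assumption of this sketch, as are the function names.

```python
def average_blocks(past_pred, future_pred):
    """Average two predicted blocks element by element (halves rounded up)."""
    return [[(a + b + 1) // 2 for a, b in zip(ra, rb)]
            for ra, rb in zip(past_pred, future_pred)]

def b_error_block(actual, past_pred, future_pred):
    """Encoder side: error block = actual block minus averaged prediction."""
    averaged = average_blocks(past_pred, future_pred)
    return [[x - p for x, p in zip(rx, rp)] for rx, rp in zip(actual, averaged)]

def b_reconstruct(error, past_pred, future_pred):
    """Decoder side: reconstructed block = averaged prediction plus error block."""
    averaged = average_blocks(past_pred, future_pred)
    return [[p + e for p, e in zip(rp, re)] for rp, re in zip(averaged, error)]

past_pred = [[10, 20], [30, 40]]    # block selected from I frame 1
future_pred = [[12, 21], [30, 44]]  # block selected from P frame 4
actual = [[12, 20], [31, 40]]       # block of B frame 2

error = b_error_block(actual, past_pred, future_pred)
print(error)
print(b_reconstruct(error, past_pred, future_pred) == actual)
```

Because the error block is left unquantized here, the reconstruction is exact; with quantization it would only approximate the actual block.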
B frame 3 is compressed in a similar manner by calculating two motion vectors with respect to I frame 1 and P frame 4, averaging the two predicted blocks and computing an error image with respect to the I and P frames. The information derived from these four frames is transmitted by the conventional encoding system 20 in the following order: I frame 1, P frame 4, B frame 2, B frame 3, or more specifically on a block basis: I frame 1, P frame 4 motion vector and error block, B frame 2 motion vectors and error block, B frame 3 motion vectors and error block. Note that if the B frames are predicted from two P frames (i.e., P1 B2 B3 P4), the frame information would be transmitted as follows: P1 P4 B2 B3. Further, those skilled in the art will recognize that there are many ways of encoding the B frames, including: intracoded with no motion vectors, forward predicted and backward predicted (the latter two requiring only one motion vector).
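The reordering from display order to transmission order can be sketched as below. The sketch assumes each picture is represented as a (type, number) pair and that every B picture depends on the next I or P picture in display order; the function name is hypothetical.

```python
def transmission_order(display_order):
    """Reorder pictures so each reference precedes the B pictures that need it."""
    out = []
    pending_b = []
    for picture in display_order:
        if picture[0] == "B":
            # Hold B pictures until their future reference has been emitted.
            pending_b.append(picture)
        else:
            out.append(picture)
            out.extend(pending_b)
            pending_b = []
    # Assumption: trailing B pictures with no future reference are emitted last.
    out.extend(pending_b)
    return out

print(transmission_order([("I", 1), ("B", 2), ("B", 3), ("P", 4)]))
print(transmission_order([("P", 1), ("B", 2), ("B", 3), ("P", 4)]))
```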
Referring to FIG. 3, a conventional decoding system 40 receives a bit stream in the format output by the encoding system 20. This bit stream is parsed by a bit stream parser 42. When a formatted or coded first frame 10 is received by the conventional decoding system 40, its coded blocks are sent to a variable length coding (VLC) decoder 46. Each block of the first image 10 is decoded by the VLC decoder 46 and output to an inverse quantizer 48. Subsequently, an inverse transformer 50 applies to the output of the inverse quantizer 48 the inverse of the transformation performed by the transformer 24, producing a reconstructed first frame block. The reconstructed block is output from the conventional decoding system 40 and also stored in a frame memory 44.
After all of the bits in the bit stream corresponding to the first image frame blocks are decoded by the conventional decoding system 40, the conventional decoding system 40 receives bits corresponding to the motion vector and error block for the second image frame 12. If the bit stream parser 42 determines that information in the bit stream corresponds to a motion vector, the motion vector or motion vector information is sent to the frame memory 44. The motion vector determines what block in frame memory 44 is required to predict a block in the second image frame 12.
When the bit stream parser 42 parses an error block for the second image frame 12, that information is sent to the VLC decoder 46, followed by the inverse quantizer 48 and the inverse transformer 50. An adder 52 combines the resulting decoded error block with the block selected by the motion vector and retrieved from the frame memory 44. The adder 52 thus produces a reconstructed block for the second image frame 12. The reconstructed block is then output by the conventional decoding system 40 and stored in the frame memory 44 for future decoding.
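The decoder-side reconstruction performed by adder 52 can be sketched as follows. The sketch assumes the error block has already passed through the VLC decoder, inverse quantizer, and inverse transformer, that the frame memory holds a frame as a list of rows, and that the motion vector gives the position of the reference block relative to the block being decoded. The function names are hypothetical.

```python
def fetch_block(frame, x, y, size):
    """Read a size x size block whose top-left corner is at (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]

def reconstruct_block(frame_memory, bx, by, motion_vector, error_block):
    """Adder 52: block selected by the motion vector plus decoded error block."""
    dx, dy = motion_vector
    size = len(error_block)
    predicted = fetch_block(frame_memory, bx + dx, by + dy, size)
    return [[p + e for p, e in zip(rp, re)]
            for rp, re in zip(predicted, error_block)]

# Reference frame with easily recognizable values: pixel (x, y) holds 10*y + x.
frame_memory = [[10 * y + x for x in range(4)] for y in range(4)]
block = reconstruct_block(frame_memory, bx=2, by=2,
                          motion_vector=(-1, -2),
                          error_block=[[1, -1], [0, 2]])
print(block)
```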
In order to calculate the B frames, the P frame must also be stored in frame memory 44, as above. The first motion vector for a B frame selects, in this example, a predicted block from the stored I or P frame. The second motion vector selects a predicted block from the stored P or I frame. These two predicted blocks are added together and divided by 2 to calculate an average bidirectionally predicted block. Those skilled in the art will recognize that a bidirectionally predicted block may also be interpolated from two successive P frames. The bidirectionally interpolated block is then added to the error block for the B frame to generate a reconstructed block for the B frame. This process is continued for all the blocks in the B frame.
The above-described techniques require storage of a full frame in frame memory to compute each P frame, and two frames to compute each B frame. The cost of memory predominates in the cost of conventional MPEG-2 decoders. Pearlstein et al. and Bao et al. have considered a low-cost HDTV down-conversion decoder that decodes the HDTV bitstream and converts it to a standard-definition television bitrate. See L. Pearlstein et al., "An SDTV Decoder with HDTV Capability: An All Format ATV Decoder", 137th SMPTE Proceedings, Sep. 6-9, 1995, pp. 422-434, and J. Bao et al., "HDTV Down-Conversion Decoder", International Conference on Consumer Electronics, 1996. The common theme of these approaches is downsampling the reference frame for storage and upsampling the frames when they must be used in calculations. This approach leads to a serious drawback called prediction drift. Because downsampling discards much information, the motion prediction loop in the conventional decoding system 40 cannot keep track of the motion prediction loop in the conventional encoding system 20. The error accumulates, and the picture blurs as the predicted frames are further away from the intra-coded frame. This leads to a pulsing artifact as the picture deteriorates between two intra-coded frames and then suddenly becomes clear again when the next intra-coded frame is reached.
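Prediction drift can be illustrated with a toy simulation. It does not model the cited decoders; it assumes a one-dimensional "frame" whose content shifts circularly by one sample per frame, an encoder whose full-resolution prediction is exact (so every error block is zero), and a decoder that stores its reference at half resolution by averaging sample pairs and upsamples by repetition. All names are hypothetical.

```python
def downsample(row):
    """Store the reference at half resolution by averaging sample pairs."""
    return [(row[i] + row[i + 1]) // 2 for i in range(0, len(row), 2)]

def upsample(half):
    """Restore full length by repeating each stored sample."""
    out = []
    for value in half:
        out.extend([value, value])
    return out

def shift_right(row):
    """Motion model: circular shift of one sample per frame."""
    return row[-1:] + row[:-1]

def simulate_drift(intra_frame, num_predicted):
    """Return the absolute decoding error per frame, starting with the I frame."""
    actual = intra_frame
    stored = downsample(intra_frame)  # reduced-size frame memory
    errors = [0]                      # the I frame itself is decoded exactly
    for _ in range(num_predicted):
        nxt = shift_right(actual)
        # Encoder: full-resolution prediction is exact, so the error block is zero.
        residual = [a - p for a, p in zip(nxt, shift_right(actual))]
        # Decoder: predicts from the upsampled, reduced-resolution reference.
        predicted = shift_right(upsample(stored))
        reconstructed = [p + r for p, r in zip(predicted, residual)]
        errors.append(sum(abs(a - b) for a, b in zip(nxt, reconstructed)))
        stored = downsample(reconstructed)
        actual = nxt
    return errors

print(simulate_drift([0, 0, 0, 0, 16, 16, 16, 16], num_predicted=4))
```

In this toy case the error is zero at the intra-coded frame and grows as predicted frames get farther from it, while the sharp edge in the signal is progressively smeared. A new intra-coded frame would reset the error to zero, which corresponds to the pulsing artifact described above.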
Alternative proposals suggest the use of a "sprite." A sprite is a large reference image that is often the background of a scene. It can be static or dynamic. Alternatively, Long-Term Frame Memory (LTFM) employs an extra frame memory to store a frame (perhaps the first frame) after a scene change. This frame is used as an extra reference frame for motion compensation. Both methods have been reported to result in significant coding efficiency improvement. However, the significant increase in cost from the extra memory may be a critical obstacle to making these techniques practical.
Accordingly, it is desired to provide an interframe coding technique that minimizes the use of frame memory while at the same time maintaining high picture quality.