The present invention relates to video compression and decompression, and more particularly to motion compensation/estimation methods and systems incorporating such methods.
Many methods exist to compress images/video, such as JPEG and JPEG2000 for still images and H.26x/MPEGx for video sequences. In addition to compressing individual images, video compression also exploits temporal redundancy by using motion compensation.
In block-based compression a picture (frame, image) is decomposed into (macro)blocks where, typically, each macroblock contains a certain number of 8×8 blocks, depending upon the chroma-format used. For example, in the case of 4:2:0 chroma-format a macroblock is made up of four 8×8 luminance blocks (i.e., 16×16 block of pixels) and two 8×8 chrominance blocks (each located as a subsampling of the 16×16 block of pixels).
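The 4:2:0 macroblock layout described above can be illustrated with a minimal sketch. The function name `luma_blocks` and the block ordering (top-left, top-right, bottom-left, bottom-right) are assumptions made for illustration only; the macroblock is modeled as a plain list of 16 rows of 16 luminance samples.

```python
def luma_blocks(mb):
    """Split a 16x16 luminance macroblock into four 8x8 blocks.

    Ordering (an assumption of this sketch): top-left, top-right,
    bottom-left, bottom-right.
    """
    return [[row[x:x + 8] for row in mb[y:y + 8]]
            for y in (0, 8) for x in (0, 8)]


# Synthetic macroblock whose sample at row y, column x has value 16*y + x.
mb = [[16 * y + x for x in range(16)] for y in range(16)]
blocks = luma_blocks(mb)

# In 4:2:0 the macroblock also carries two 8x8 chrominance blocks,
# so it holds six 8x8 blocks in total.
blocks_per_macroblock = len(blocks) + 2
```
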
Motion compensation relates successive pictures in a sequence by block-based motion estimation and prediction. That is, for a block, BLK, in a current (being encoded) picture motion estimation finds a predictive reference block, BLKref, in a prior picture by searching corresponding-sized blocks of pixels, BLKk, in the prior picture and taking $\mathrm{BLK}_{ref} = \arg\min_{\mathrm{BLK}_k} \lVert \mathrm{BLK} - \mathrm{BLK}_k \rVert$ where the distance measure ∥.∥ can be any convenient metric such as sum-of-absolute-differences (SAD). The reference block can be simply described by the displacement (motion vector, MV) of the location of BLKref in the prior picture from the location of BLK in the current picture. The prior picture should be a reconstruction of a compressed picture so a decoder will produce the correct result; see FIG. 2 which shows motion compensation (MC) with motion estimation (ME) plus reconstruction that includes block transform (e.g. DCT) and quantization (Q) for further compression and variable length coding (VLC). The search computation can be simplified by restriction to a search window, such as a limit on the magnitude of the motion vector.
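A minimal sketch of the block search described above, assuming SAD as the distance measure and a square search window that limits each motion vector component to `search_range` pixels. The function names, the list-of-rows picture representation, and the rule that candidates falling outside the picture are skipped are all assumptions of this sketch.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """SAD between the size x size block of `cur` at (bx, by) and the block
    of `ref` at (rx, ry); x is the column index and y is the row index."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Return (best_mv, best_sad) for the block of `cur` at (bx, by).

    Every displacement (dx, dy) with |dx|, |dy| <= search_range whose
    candidate block lies fully inside `ref` is examined.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_sad = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block would fall outside the picture
            cost = sad(cur, ref, bx, by, rx, ry, size)
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad


# Synthetic 16x16 prior picture with a distinct value at every pixel.
ref = [[16 * y + x for x in range(16)] for y in range(16)]
# Current picture: the block at (x, y) shows what the prior picture had
# at (x + 2, y + 1), clamped at the picture border.
cur = [[ref[min(y + 1, 15)][min(x + 2, 15)] for x in range(16)]
       for y in range(16)]

mv, cost = full_search(cur, ref, bx=4, by=4, size=4, search_range=3)
```

Here the returned motion vector is the displacement of the reference block from the current block, so the example recovers (2, 1) with a SAD of zero.
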
Also, in the block-based coding methods, such as H.26x/MPEGx, pictures in a sequence are encoded into one of three picture types: I-pictures (intra-coded), P-pictures (predictive), and B-pictures (bidirectional, interpolative). The coding of an I-picture is independent of other pictures (and thus any image coding method could be applied). A P-picture is first predicted from its reference picture (a previous I- or P-picture) using the macroblock-based forward motion estimation and prediction; then the motion-compensated difference picture (residual/texture) plus the associated motion vectors are encoded. A B-picture has two reference pictures (see FIG. 3), and supports three basic modes: forward prediction, backward prediction, and bi-directional prediction which is a pixel-by-pixel average of the forward and backward predictions.
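The bi-directional mode can be sketched as a pixel-by-pixel average of two prediction blocks. The function name is hypothetical, and the rounding rule (halves rounded upward via the `+ 1` before the shift) is an assumption of this sketch, since the text only states that an average is taken.

```python
def bidirectional_prediction(forward, backward):
    """Average two equally sized prediction blocks pixel by pixel.

    The +1 before the shift rounds halves upward; this rounding rule is an
    assumption of the sketch, not something the text specifies.
    """
    return [[(f + b + 1) >> 1 for f, b in zip(frow, brow)]
            for frow, brow in zip(forward, backward)]


fwd = [[10, 20], [30, 40]]   # forward prediction block
bwd = [[12, 23], [30, 41]]   # backward prediction block
pred = bidirectional_prediction(fwd, bwd)
```
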
Further, MPEG2, MPEG4, H.26L, . . . also support interlaced pictures. Each interlaced video frame consists of two fields sampled at different times separated by the field period time interval. The lines of the two fields of a frame interleave so that two consecutive lines of the frame belong to alternate fields. The fields are called the top field (TF) and the bottom field (BF); see heuristic FIG. 4. In particular, if the frame lines are numbered from top to bottom and starting with 0, then the top field consists of the even-numbered lines and the bottom field consists of the odd-numbered lines.
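The even/odd line convention above can be sketched directly with slicing. The function names are hypothetical, and a frame is modeled as a list of lines numbered from 0 at the top.

```python
def split_fields(frame):
    """Return (top_field, bottom_field) of an interlaced frame.

    Lines are numbered from 0 at the top: even-numbered lines form the
    top field and odd-numbered lines form the bottom field.
    """
    return frame[0::2], frame[1::2]


def interleave_fields(top, bottom):
    """Rebuild a frame by alternating top-field and bottom-field lines."""
    frame = []
    for t, b in zip(top, bottom):
        frame.append(t)
        frame.append(b)
    return frame


frame = [[0, 0], [1, 1], [2, 2], [3, 3]]   # each line filled with its number
top, bottom = split_fields(frame)
```
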
To support interlaced video, MPEG-2 provides a choice of two picture structures: frame picture and field picture. In frame picture each interlaced field pair is interleaved together into a frame that is then divided into macroblocks and encoded. In field picture the top and bottom interlaced fields are encoded separately. Frame picture is more common in consumer products.
MPEG-2 also includes several motion compensation (MC) prediction modes. The two MC prediction modes that are primarily used are frame prediction for frame picture and field prediction for frame picture. In frame prediction for frame picture, the motion estimation (ME) of the current (being encoded) macroblock, BLK, is carried out in the search window, SW, of the reference picture; see FIG. 5a. In field prediction for frame picture the current macroblock is split into top-field pixels, BLKT, and bottom-field pixels, BLKB, and ME is carried out separately in the top-field of the search window, SWT, and the bottom-field of the search window, SWB, for both BLKT and BLKB; see FIG. 5b. 
To decide upon the optimal coding mode, MPEG-2 motion estimation always has to compute all five possibilities: frame ME of BLK over SW, field ME of BLKT over SWT and also over SWB plus field ME of BLKB over SWT and also over SWB. The ME finds the best match reference frame/field block (represented by MV, the corresponding motion vector) and the corresponding prediction error. Prediction errors are usually measured by sum-of-absolute-differences (SAD) defined between M×N blocks B1 of picture p1 and B2 of picture p2 as:

$$\mathrm{SAD}(B_1,B_2)=\sum_{i=0}^{M-1}\sum_{j=0}^{N-1}\left|f_{p_1}(x_1+i,\,y_1+j)-f_{p_2}(x_2+i,\,y_2+j)\right|$$

where fp(w,z) denotes the pixel value at pixel location (w,z) of picture p, and (x1,y1) and (x2,y2) are the locations of the upper left corners of blocks B1 and B2, respectively, so the motion vector is (x2−x1,y2−y1). The x coordinate indicates distance from left to right (column number) and the y coordinate indicates distance from top to bottom (row number).
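The five searches listed above can be sketched as follows, using the SAD as defined in the text. Everything named here is hypothetical: the function names, the dictionary keys, the use of the same `rng` limit for frame and field searches, and the requirement that the macroblock row and size be even. A small 4×4 "macroblock" is used only to keep the example short.

```python
def sad_block(block, ref, rx, ry):
    """SAD between `block` (list of rows) and the same-sized region of `ref`
    whose upper-left corner is at column rx, row ry."""
    return sum(abs(p - ref[ry + j][rx + i])
               for j, row in enumerate(block)
               for i, p in enumerate(row))


def search(block, ref, x0, y0, rng):
    """Full search around (x0, y0); returns (sad, (dx, dy)) of the best match."""
    h, w = len(block), len(block[0])
    best = None
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            rx, ry = x0 + dx, y0 + dy
            if 0 <= rx <= len(ref[0]) - w and 0 <= ry <= len(ref) - h:
                cand = (sad_block(block, ref, rx, ry), (dx, dy))
                if best is None or cand[0] < best[0]:
                    best = cand
    return best


def five_candidates(cur, ref, bx, by, size, rng):
    """Run the frame search and the four field searches for one macroblock.

    (bx, by) is the macroblock's upper-left corner in frame coordinates;
    by and size are assumed even so that the field split is clean.
    """
    blk = [row[bx:bx + size] for row in cur[by:by + size]]
    blk_t, blk_b = blk[0::2], blk[1::2]   # BLKT and BLKB
    ref_t, ref_b = ref[0::2], ref[1::2]   # top and bottom reference fields
    fy = by // 2                          # block row in field coordinates
    return {
        "FRM": search(blk, ref, bx, by, rng),        # BLK over SW
        "T_on_T": search(blk_t, ref_t, bx, fy, rng),  # BLKT over SWT
        "T_on_B": search(blk_t, ref_b, bx, fy, rng),  # BLKT over SWB
        "B_on_T": search(blk_b, ref_t, bx, fy, rng),  # BLKB over SWT
        "B_on_B": search(blk_b, ref_b, bx, fy, rng),  # BLKB over SWB
    }


# Static scene: the current frame equals the reference frame.
ref = [[16 * y + x for x in range(16)] for y in range(16)]
cur = [row[:] for row in ref]
cands = five_candidates(cur, ref, bx=4, by=4, size=4, rng=2)
```

For this static scene the frame search and the same-parity field searches find a perfect match at zero displacement, while the opposite-parity field searches leave a nonzero error.
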
Using the prediction errors for the five possibilities, the motion estimation decides the motion mode (MM) for the current block, Opt_MM. In the case of a frame picture, MM for the current block can be either frame prediction (FRM_MM) or field prediction (FLD_MM). When Opt_MM equals FLD_MM, the motion mode for the top field BLKT (Opt_Field_MMT) can be either TOP or BOT, indicating that the prediction is picked from SWT or SWB, respectively. Similarly, the motion mode for the bottom field BLKB (Opt_Field_MMB) can also be either TOP or BOT. Note that the motion estimation computations can be full search (examining all possible reference block locations in SW, SWT, or SWB) or any other form of simplified or limited search. Thus, for a (macro)block BLK in frame picture structure, the encoder encodes the values of the variables Opt_MM and, if needed, Opt_Field_MMT plus Opt_Field_MMB, together with the corresponding motion vector(s): MVFRM, or MVT plus MVB.
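One possible decision rule is sketched below. The text does not state how the five prediction errors are compared, so the rule used here is an assumption: the field cost is taken as the sum of the best top-field SAD and the best bottom-field SAD, frame prediction wins ties, and TOP wins ties within a field. The function name and argument names are hypothetical; the returned keys reuse the variable names from the text.

```python
def decide_motion_mode(sad_frm, sad_t_swt, sad_t_swb, sad_b_swt, sad_b_swb):
    """Pick Opt_MM and, for field prediction, the per-field modes.

    Assumed rule (not taken from the text): compare the frame SAD with the
    sum of the best BLKT SAD and the best BLKB SAD; ties favor FRM_MM and TOP.
    """
    opt_field_mmt = "TOP" if sad_t_swt <= sad_t_swb else "BOT"
    opt_field_mmb = "TOP" if sad_b_swt <= sad_b_swb else "BOT"
    sad_fld = min(sad_t_swt, sad_t_swb) + min(sad_b_swt, sad_b_swb)
    if sad_frm <= sad_fld:
        return {"Opt_MM": "FRM_MM"}
    return {"Opt_MM": "FLD_MM",
            "Opt_Field_MMT": opt_field_mmt,
            "Opt_Field_MMB": opt_field_mmb}


field_case = decide_motion_mode(100, 30, 60, 50, 40)   # field cost 30 + 40
frame_case = decide_motion_mode(50, 30, 60, 50, 40)    # frame SAD is lower
```
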
The motion estimation is usually the most computationally expensive portion of video encoding. However, different motion estimation designs are allowed in video compression, and they have significant impacts on the cost and quality of the end product. It is a problem to perform motion estimation with low complexity computations while maintaining sufficient picture quality.
An 8×8 DCT (discrete cosine transform) or wavelet or integer transform may be used to convert the blocks of pixel values into frequency domain blocks of transform coefficients for energy compaction and quantization; this permits reduction of the number of bits required to represent the blocks. FIG. 2 depicts block-based video encoding using DCT and motion compensation. The DCT-coefficient blocks are quantized, scanned into a 1-D sequence, encoded by using variable length coding (VLC), and put into the transmitted bitstream. The frame buffer contains the reconstructed prior frame used for reference blocks. The motion vectors are also encoded and transmitted along with overhead information. A decoder just reverses the encoder operations and reconstructs motion-compensated blocks by using the motion vectors to locate reference blocks in previously-decoded frames (pictures).
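The transform, quantization, and scan steps can be sketched as below. Several choices are assumptions of this sketch rather than details from the text: the orthonormal scaling of the DCT, a single uniform quantizer step for all coefficients, and the zigzag scan order (the text only says the coefficients are scanned into a 1-D sequence). The direct four-loop DCT is written for readability, not speed.

```python
import math

N = 8


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct O(N^4) form)."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for v in range(N):          # vertical frequency
        for u in range(N):      # horizontal frequency
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[v][u] = c(u) * c(v) * s
    return out


def quantize(coeffs, step):
    """Uniform quantization with one step size (an assumption of the sketch)."""
    return [[int(round(value / step)) for value in row] for row in coeffs]


def zigzag_order(n=N):
    """(row, col) positions in zigzag order, walking the anti-diagonals."""
    def key(pos):
        r, col = pos
        s = r + col
        return (s, r if s % 2 else -r)
    return sorted(((r, col) for r in range(n) for col in range(n)), key=key)


flat = [[100] * N for _ in range(N)]           # block with no detail
q = quantize(dct_2d(flat), step=16)
sequence = [q[r][col] for r, col in zigzag_order()]
```

For the flat block all of the energy lands in the single DC coefficient, which illustrates the energy compaction mentioned above: after the scan, the 1-D sequence is one nonzero value followed by zeros.
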