1. Field of the Invention
The present invention relates generally to the temporal compression of digital video data by motion compensation. More specifically, the present invention relates to the encoding and decoding of motion vectors used to predict a new video frame by translating constituent portions of a reference video frame.
2. Description of the Related Art
With the rapid growth of digital media in the marketplace, the need to develop more efficient and more accurate methods for compressing the attendant large data files continues to receive much attention. Digital video data in particular require extensive storage space and large bandwidth for remote transmissions. A video sequence consists of individual frames that are arrays of pixels with color values associated with each pixel. For example, each frame might be a 720 by 480 array of pixels with component values for each of three colors (red, green, blue) ranging between 0 and 255 at each pixel. Since 8 bits are required to express each color value, if this sequence is 30 minutes long and comprises an industry-standard 30 frames per second, the raw digital data for the sequence will take up 3×8×720×480×30×60×30=447,897,600,000 bits or approximately 56 gigabytes, excluding the capacity needed for audio. Given the limited capacity of most portable storage media and the limited bandwidth of many transmission channels, such a video sequence requires significant compression in order to achieve widespread availability in the marketplace.
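The raw-size figure above can be checked with a short calculation. The following Python sketch uses only the numbers given in the text; the function and variable names are illustrative, and a gigabyte is assumed to mean 10^9 bytes.

```python
# Raw size of an uncompressed RGB video sequence, using the figures from the text.

def raw_video_bits(width, height, channels, bits_per_value, fps, duration_s):
    """Number of bits needed to store every color value of every frame."""
    return width * height * channels * bits_per_value * fps * duration_s

total_bits = raw_video_bits(
    width=720, height=480,   # pixels per frame
    channels=3,              # red, green, blue
    bits_per_value=8,        # each color value ranges over 0..255
    fps=30,                  # frames per second
    duration_s=30 * 60,      # 30 minutes expressed in seconds
)

# Assumption: one gigabyte is taken as 10**9 bytes.
total_gigabytes = total_bits / 8 / 1e9

print(total_bits)        # 447897600000
print(total_gigabytes)   # 55.9872
```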
Existing video compression strategies seek to reduce the bits required by removing redundancies within the video data. Video data generally contain both spatial and temporal redundancies, where spatial redundancy is due to color similarities within a single frame and temporal redundancy is due to the persistence of some objects or other image features over time and thus across two or more frames. A variety of methods for eliminating spatial redundancies have been introduced, including the techniques established by the JPEG standards body. Existing methods for reducing temporal redundancy involve encoding some subset of a sequence of frames as reference frames and attempting to describe interspersed frames as variations of one or more reference frames. Such methods considerably reduce the amount of information required for the non-reference frames and thus compress the video data beyond what is achievable by simply removing spatial redundancies.
While many of the same objects appear in neighboring frames of a video sequence, the positions of some of these objects may change due to either camera movement or activity within the scene. As a result, an effective means for matching objects between frames must take motion into account. This strategy is commonly referred to as motion compensation. Many existing technologies for temporal compression, including the MPEG-1, MPEG-2, and MPEG-4 standards, compensate for motion by breaking a frame into a grid of square blocks (generally 16×16 pixels or 8×8 pixels) and searching for square blocks in a reference frame that provide the best match for each of these blocks. Other proposed techniques break a frame into a plurality of other constituent parts, or segments, and conduct a similar matching process between a new frame and a reference frame. Since the matching block or segment in the reference frame will often not occupy the same relative position as the block or segment in the new frame due to motion, a displacement vector is used to record the amount of offset in the horizontal and vertical directions. A prediction for the new frame image can be made using only data for the reference frame and a displacement vector, or motion vector, for each block or segment. Since the new frame is unlikely to be perfectly reconstructed by this prediction, a residue or difference between actual data and the prediction must also be recorded. Compression is nevertheless achieved because encoding both the motion vectors for each block and the residue requires fewer bits than encoding the raw data for the new frame directly.
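The block-matching search described above can be illustrated with a minimal Python sketch. It assumes single-channel frames stored as lists of rows, an exhaustive search over a small window, and a sum-of-absolute-differences matching cost; these choices and all function names are assumptions made for illustration and are not prescribed by the text or by the cited standards.

```python
def sad(new, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the block of `new` whose top-left
    corner is (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(new[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total

def best_motion_vector(new, ref, bx, by, size, search):
    """Try every displacement within +/- `search` pixels and return the
    (dx, dy) with the lowest cost, together with that cost."""
    height, width = len(ref), len(ref[0])
    best, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that would fall outside the reference frame.
            inside = (0 <= bx + dx and bx + dx + size <= width and
                      0 <= by + dy and by + dy + size <= height)
            if not inside:
                continue
            cost = sad(new, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost

# Hypothetical 12x12 test frames: the new frame shows the reference content
# shifted 2 pixels right and 1 pixel down.
ref_frame = [[17 * x + 29 * y for x in range(12)] for y in range(12)]
new_frame = [[17 * (x - 2) + 29 * (y - 1) for x in range(12)] for y in range(12)]

vector, cost = best_motion_vector(new_frame, ref_frame, bx=4, by=4, size=4, search=2)
print(vector, cost)  # (-2, -1) 0
```

In this sketch the vector points from the block in the new frame to its match in the reference frame, so content that moved right and down yields a negative offset. A nonzero cost at the best displacement corresponds to the residue that would also have to be coded.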
A variety of techniques have been proposed for subdividing a frame into constituent blocks or segments and for determining motion vectors corresponding to these blocks or segments for the purpose of predicting a new frame using one or more reference frames. See Prakash I, Prakash II, and Prakash III for a more complete discussion of segmentation and motion matching of segments. Once a subdivision into blocks or segments has been carried out and motion vectors providing the most accurate prediction have been determined, an efficient method for encoding the motion vectors must be applied in order to realize the potential gains of this compression technique. While directly coding each motion vector for each block or segment individually may save bits over coding a new frame without temporal compression, many more bits may be conserved by further exploiting correlations among the motions of the plurality of blocks or segments. For instance, if neighboring blocks or segments move in a similar fashion, then there is no need to treat their motion vectors completely independently, and in fact bits may be saved by coding these vectors in a dependent way.
A standard adaptation of the MPEG block-matching technique for generating motion vectors is to predict motion vectors based on known motions of neighboring blocks and to encode an error correction vector. For instance, in a typical encoder/decoder compression system, it is desirable for the encoder to transmit as few bits as possible to the decoder while providing it with sufficient information to reconstruct a close approximation of the original image. Proceeding through the grid of blocks in raster-scan order, the decoder can predict a motion vector for a current block based on the previously coded vector for the neighboring block to the left of the current block. The encoder can perform the same prediction, compute the difference between the actual motion vector and this predicted motion vector, and encode and send only the difference to the decoder. If the neighboring blocks have similar motion vectors, this difference vector is likely to be close to zero and will thus on average consume fewer bits than the actual motion vector for the current block.
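The left-neighbor prediction described above can be sketched in Python as follows. The sketch assumes motion vectors are integer (x, y) pairs arranged in a grid of rows, and that the first block of each row, which has no left neighbor, is predicted as (0, 0); that boundary rule and the function names are assumptions for illustration only.

```python
def left_predictor(vectors, row, col):
    """Predict the vector at (row, col) from the already-coded block to its left.
    Assumption: the first block in a row is predicted as (0, 0)."""
    if col == 0:
        return (0, 0)
    return vectors[row][col - 1]

def encode_differences(vectors):
    """Encoder side: replace each vector by its difference from the prediction."""
    diffs = []
    for r, row in enumerate(vectors):
        diff_row = []
        for c, (vx, vy) in enumerate(row):
            px, py = left_predictor(vectors, r, c)
            diff_row.append((vx - px, vy - py))
        diffs.append(diff_row)
    return diffs

def decode_differences(diffs):
    """Decoder side: rebuild the vectors in raster-scan order, making the same
    prediction from vectors that have already been reconstructed."""
    vectors = []
    for r, diff_row in enumerate(diffs):
        vectors.append([])
        for c, (ex, ey) in enumerate(diff_row):
            px, py = left_predictor(vectors, r, c)
            vectors[r].append((px + ex, py + ey))
    return vectors

# Hypothetical motion field for a 2x3 grid of blocks.
field = [[(3, 1), (3, 1), (4, 1)],
         [(0, 0), (0, -1), (0, -1)]]

diffs = encode_differences(field)
print(diffs)  # [[(3, 1), (0, 0), (1, 0)], [(0, 0), (0, -1), (0, 0)]]
print(decode_differences(diffs) == field)  # True
```

Where neighboring blocks move alike, most differences are zero or near zero, which is what allows an entropy coder to spend fewer bits on them than on the original vectors.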
Variations on the above strategy for compressing motion vectors for blocks by predicting from neighbors have been proposed. For example, if the coding proceeds through blocks in raster-scan order, then a given block will typically border one block to the left and a plurality of blocks above whose motion vectors have already been coded. The vectors of this plurality of bordering blocks might be averaged to predict a motion vector for the current block. Alternatively, the closest matching vector among these neighboring blocks may be used as a prediction. These predictive techniques have also been used within an MPEG-based macroblock/subblock motion compensation strategy, as seen for instance in U.S. Pat. No. 6,289,049 to Hyun Mun Kim et al. In this strategy, motion matching is carried out first for each 16×16 macroblock in a frame-wide grid, then the resulting vectors are used to narrow the search range for each of four 8×8 blocks comprising a macroblock. Predictions for the 8×8 blocks may then be made with respect to other previously coded 8×8 blocks either within the same macroblock or in adjacent macroblocks.
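The two neighbor-based predictors mentioned above, averaging and closest match, can be sketched in Python as follows. The choice of bordering blocks (left, above-left, above, above-right), the rounding of the average, the (0, 0) fallback when no neighbor exists, and the idea of returning the chosen neighbor's index so that it could be signaled to a decoder are all assumptions made for this illustration; the function names are hypothetical.

```python
def coded_neighbors(vectors, row, col):
    """Vectors of bordering blocks already coded in raster-scan order:
    the block to the left and the three blocks in the row above."""
    candidates = [(row, col - 1), (row - 1, col - 1), (row - 1, col), (row - 1, col + 1)]
    cols = len(vectors[0])
    return [vectors[r][c] for r, c in candidates if r >= 0 and 0 <= c < cols]

def average_predictor(neighbors):
    """Component-wise mean of the neighbor vectors, rounded with Python's
    round() (halves go to the nearest even integer)."""
    if not neighbors:
        return (0, 0)  # assumed fallback when nothing has been coded yet
    n = len(neighbors)
    return (round(sum(v[0] for v in neighbors) / n),
            round(sum(v[1] for v in neighbors) / n))

def closest_predictor(neighbors, actual):
    """Index and value of the neighbor vector nearest to the actual vector,
    measured by squared Euclidean distance (first one wins on ties)."""
    if not neighbors:
        return None, (0, 0)
    def sqdist(v):
        return (v[0] - actual[0]) ** 2 + (v[1] - actual[1]) ** 2
    index = min(range(len(neighbors)), key=lambda i: sqdist(neighbors[i]))
    return index, neighbors[index]

# Hypothetical motion field; the block at row 1, column 1 is being coded.
field = [[(2, 0), (4, 0), (6, 2)],
         [(4, 2), (6, 1), (0, 0)]]

neighbors = coded_neighbors(field, 1, 1)
actual = field[1][1]
print(neighbors)                             # [(4, 2), (2, 0), (4, 0), (6, 2)]
print(average_predictor(neighbors))          # (4, 1)
print(closest_predictor(neighbors, actual))  # (3, (6, 2))
```

With the averaging predictor the difference to be coded here is (2, 0); with the closest-match predictor it is (0, -1), at the cost of identifying which neighbor was used.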
Some other methods for conserving bits in the coding of motion vectors appear in the related art. U.S. Pat. No. 6,178,265 to Siamack Haghighi discloses a strategy that consists of histogramming all of the motion vectors for a given frame, using the histogram to select a subset of dominant motion vectors that represent clusters of actual motion vectors, and mapping actual motion vectors to the closest dominant motion vector before encoding them. In “Motion-compensated 3-D subband coding with multiresolution representation of motion parameters,” Proc. IEEE Int. Conf. Image Processing, Vol. II, Austin, Tex., 1994, pp. 250-254, Jens-Rainer Ohm discusses a multiresolution technique for representing motion vectors. In this paper, after motion vectors have been estimated hierarchically using a control grid structure, they are coded using a Laplacian pyramid structure. U.S. Pat. No. 6,163,575 to Jacek Nieweglowski et al. discloses a method for coding motion information in a segment-based motion compensation scheme. This approach employs a linear motion vector field model, which provides several coefficients describing the motion of each segment rather than single motion vectors. Segments are merged and coefficients are dropped whenever possible to conserve bits in coding the motion information.
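The general idea of histogramming motion vectors and mapping each one to a dominant vector can be illustrated with a simplified Python sketch. It is not the method of the cited patent: it assumes that the dominant vectors are simply the k most frequent vectors in the frame and that closeness is measured by squared Euclidean distance. The function names and the parameter k are hypothetical.

```python
from collections import Counter

def dominant_vectors(vectors, k):
    """Histogram the frame's motion vectors and keep the k most frequent.
    Assumption: frequency alone decides which vectors are dominant."""
    return [v for v, _count in Counter(vectors).most_common(k)]

def map_to_dominant(vectors, dominant):
    """Replace every vector by the nearest dominant vector."""
    def nearest(v):
        return min(dominant, key=lambda d: (d[0] - v[0]) ** 2 + (d[1] - v[1]) ** 2)
    return [nearest(v) for v in vectors]

# Hypothetical motion vectors for the nine blocks of one frame.
frame_vectors = [(2, 0)] * 4 + [(-3, 1)] * 3 + [(2, 1), (-3, 0)]

dominant = dominant_vectors(frame_vectors, k=2)
mapped = map_to_dominant(frame_vectors, dominant)
print(dominant)  # [(2, 0), (-3, 1)]
print(mapped[-2:])  # [(2, 0), (-3, 1)]
```

After mapping, only two distinct vectors remain to be coded, at the price of a slightly less accurate prediction for the two blocks whose vectors were changed.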