1. Field of the Invention
The present invention relates generally to video compression algorithms and circuits in digital imaging and video systems and more specifically to circuitry for supporting motion estimation processing for a sequence of video frames.
2. Description of Related Art
Video consists of a series of still images. For still images, the image data tends to have a high degree of spatial redundancy. Capturing the movement of a three-dimensional object through time in a video sequence requires an analysis of this spatial redundancy. In a first image of a sequence, a spatial projection of an object is captured in a first region of the first image. Since the projection comprises pixels from the object, correlation within the image is expected. If the object is moving during a sequence of images, it will yield a spatial projection in a second region of the next image. Thus, a high degree of temporal redundancy between neighboring images is also expected. That is, there is a strong correlation between pixels in the first region of the first image and pixels in the second region of the next image. The goal of video compression algorithms is to exploit both the spatial and the temporal redundancy within an image sequence for optimum compression.
Most video coders use a two-stage process to achieve good compression. The first stage uses a method that exploits temporal redundancy between frames. The output of this stage is followed by a coding method that exploits spatial redundancy within each frame. One might expect that an ideal processor for reducing temporal redundancy is one that tracks every pixel from frame to frame. However, this is computationally intensive, and such methods do not provide reliable tracking due to the presence of noise in the frames. Instead of tracking individual pixels from frame to frame, video coding standards provide for tracking of information for pixel regions called macroblocks. Macroblocks are typically 16 pixels by 16 pixels or 8 pixels by 8 pixels in size, which provides a good compromise between efficient reduction of temporal redundancy and moderate computational requirements.
Two contiguous frames in a video sequence can be denoted frame (t−1) and frame (t). In the first stage, frame (t) is segmented into nonoverlapping 16×16 or 8×8 macroblocks and for each macroblock in frame (t), a corresponding macroblock is determined in frame (t−1). Using the corresponding macroblock from frame (t−1), a temporal redundancy reduction processor generates a representation for frame (t), called a difference frame, that contains only the changes between the two frames. If the two frames have a high degree of temporal redundancy, then the difference frame has a large number of pixels that have values near zero. Otherwise, if frame (t) is very different than frame (t−1), then the temporal redundancy reduction processor may fail to find corresponding regions between the two frames.
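By way of illustration only, the segmentation of a frame into nonoverlapping macroblocks may be sketched as follows. The function name macroblock_origins and the 640 by 480 frame size used in the example are assumptions chosen for illustration; each macroblock is identified by the coordinates of its top-left corner.

```python
def macroblock_origins(width, height, size=16):
    """List the top-left (x, y) corner of every nonoverlapping size x size block.

    Hypothetical helper: partial blocks at the right or bottom edge are
    ignored, which is a simplifying assumption of this sketch.
    """
    return [
        (x, y)
        for y in range(0, height - size + 1, size)
        for x in range(0, width - size + 1, size)
    ]


# Example: a 640 x 480 frame split into 16 x 16 macroblocks.
origins = macroblock_origins(640, 480, 16)
print(len(origins), origins[0], origins[-1])
```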
The output of the first stage is a difference frame in which pixels are spatially correlated. Thus, one can use a processor that exploits this spatial redundancy to yield a compressed representation for frame (t). Well-known video coding standards use discrete cosine transform (DCT) coding methods for reducing spatial redundancy.
The process of computing changes among successive frames by establishing correspondence between the frames is called temporal prediction with motion compensation. Motion compensation is the process of compensating for the displacement of moving objects from one frame to another. If a temporal redundancy reduction processor employs motion compensation, its output can be expressed as e(x, y, t) = I(x, y, t) − I(x−u, y−v, t−1), where I(x, y, t) are pixel values at spatial location (x, y) in frame (t) and I(x−u, y−v, t−1) are corresponding pixel values at spatial location (x−u, y−v) in frame (t−1). The relative motion of a macroblock from one frame to another is specified by the coordinates (u, v).
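The motion-compensated output e(x, y, t) for a single macroblock may be sketched as follows. This is a minimal sketch under stated assumptions: frames are represented as lists of rows indexed as frame[y][x], the function name residual_block is hypothetical, and no bounds checking is performed, so the displaced block is assumed to lie entirely within frame (t−1).

```python
def residual_block(curr, prev, x0, y0, size, u, v):
    """Compute e(x, y, t) = I(x, y, t) - I(x - u, y - v, t - 1) for one block.

    curr is frame (t), prev is frame (t-1), (x0, y0) is the top-left corner
    of the block in frame (t), and (u, v) is the relative motion of the block.
    Assumption: the displaced block stays inside prev (no bounds checks).
    """
    return [
        [curr[y][x] - prev[y - v][x - u] for x in range(x0, x0 + size)]
        for y in range(y0, y0 + size)
    ]


# A bright 2 x 2 patch moves one pixel to the right between the two frames.
prev = [[0, 0, 0, 0, 0],
        [0, 9, 9, 0, 0],
        [0, 9, 9, 0, 0],
        [0, 0, 0, 0, 0]]
curr = [[0, 0, 0, 0, 0],
        [0, 0, 9, 9, 0],
        [0, 0, 9, 9, 0],
        [0, 0, 0, 0, 0]]

compensated = residual_block(curr, prev, 2, 1, 2, u=1, v=0)
uncompensated = residual_block(curr, prev, 2, 1, 2, u=0, v=0)
print(compensated, uncompensated)
```

In this example the residual is all zeros when the correct motion (u, v) = (1, 0) is applied, and it is nonzero when no motion is assumed.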
Generally, motion compensation is preceded by motion estimation. Motion estimation is the process of finding corresponding pixels among frames. One of the most compute-intensive operations in interframe coding is the motion estimation process. Given a reference picture in a frame and an N pixel by M pixel block in a current picture, the objective of motion estimation is to determine the N pixel by M pixel block in the reference picture that best matches (according to a given criterion) the characteristics of the block of the current picture. The current picture is an image or frame at time t. The reference picture is an image or frame at either a past time t−n (for forward estimation) or a future time t+k (for backward estimation).
The location of a block region is usually given by the (x, y) coordinates of its top-left corner. The coordinates (x+u, y+v) specify the location of the best matching block in the reference picture. The vector from (x, y) to (x+u, y+v) is called the motion vector associated with the block at location (x, y). If the motion vector is expressed in relative coordinates, the motion vector is simply expressed as (u, v).
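One way to compute (u, v) is an exhaustive search, sketched below. The sketch is offered as an assumption-laden illustration, not as the method of any particular standard: the matching criterion is taken to be the sum of absolute differences (SAD), N is taken as the block width and M as the block height, and the names sad, full_search, and search_range are hypothetical. The sketch follows the convention of the preceding paragraph, in which the best matching block is located at (x+u, y+v) in the reference picture.

```python
def sad(curr, ref, x, y, u, v, n, m):
    """Sum of absolute differences between the n x m block at (x, y) in curr
    and the n x m block at (x + u, y + v) in ref. Frames are indexed [y][x]."""
    return sum(
        abs(curr[y + j][x + i] - ref[y + v + j][x + u + i])
        for j in range(m)
        for i in range(n)
    )


def full_search(curr, ref, x, y, n, m, search_range):
    """Return ((u, v), cost) minimizing SAD over all displacements within
    +/- search_range pixels. Ties keep the first candidate found."""
    height, width = len(ref), len(ref[0])
    best_vector, best_cost = None, None
    for v in range(-search_range, search_range + 1):
        for u in range(-search_range, search_range + 1):
            # Skip candidate blocks that fall outside the reference picture.
            if x + u < 0 or y + v < 0 or x + u + n > width or y + v + m > height:
                continue
            cost = sad(curr, ref, x, y, u, v, n, m)
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (u, v), cost
    return best_vector, best_cost


# The 2 x 2 patch at (2, 1) in the current picture sat at (1, 1) in the reference.
ref = [[0, 0, 0, 0, 0],
       [0, 9, 9, 0, 0],
       [0, 9, 9, 0, 0],
       [0, 0, 0, 0, 0]]
cur = [[0, 0, 0, 0, 0],
       [0, 0, 9, 9, 0],
       [0, 0, 9, 9, 0],
       [0, 0, 0, 0, 0]]

vector, cost = full_search(cur, ref, x=2, y=1, n=2, m=2, search_range=1)
print(vector, cost)
```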
Various methods for computing (u, v) are known in the art. Typically, these methods have a high computational complexity and are applied to image data stored in memory within a computer system. The motion estimation calculations are done by a processor one operation at a time by extracting pixels of the frames from memory, computing the differences between them, storing the differences back to memory, and using the differences to determine the motion vector. This process can take hundreds of millions of separate operations just to complete motion estimation on one 640 pixel by 480 pixel frame. This computational cost is prohibitive for many operating environments, frame sizes, and frame rates. The high cost of motion estimation is a limitation on the usefulness and availability of full motion video on computer systems. What are needed are methods and circuitry for high speed, massively parallel motion estimation computations which overcome the limitations of the prior art, thereby allowing video compression schemes to drastically reduce the amount of data required to represent full motion digital video and the overall time required for motion estimation processing.
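The order of magnitude of this cost may be checked with a rough count. The figures below are assumptions chosen for illustration and do not appear in the foregoing description: an exhaustive search over displacements of up to 7 pixels in each direction, 16 by 16 macroblocks, and three elementary operations (subtract, absolute value, accumulate) per pixel comparison. Memory accesses are not counted.

```python
def full_search_ops(width, height, block, search_range, ops_per_pixel=3):
    """Rough operation count for exhaustive block matching on one frame.

    Hypothetical cost model: every block is compared against every
    displacement in a (2 * search_range + 1) squared window, and each
    pixel comparison costs ops_per_pixel elementary operations.
    """
    blocks = (width // block) * (height // block)
    candidates = (2 * search_range + 1) ** 2
    pixel_comparisons = blocks * candidates * block * block
    return pixel_comparisons * ops_per_pixel


ops = full_search_ops(640, 480, block=16, search_range=7)
print(f"{ops:,} operations per frame")
```

Under these assumptions the count is on the order of two hundred million operations for a single 640 pixel by 480 pixel frame.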
An embodiment of the present invention is an image sensor array having a plurality of pixels, each one of the plurality of pixels including a first register to store a first pixel value of a first frame, a second register to store a second pixel value of a second frame, the second register corresponding to the first register, and a subtractor coupled to the first and second registers to produce a difference between the first pixel value and the second pixel value.
Another embodiment of the present invention is a pixel of an image sensor array, the image sensor array having a plurality of pixels, each pixel having a first register to store a first pixel value of a first frame, a second register to store a second pixel value of a second frame, the second register corresponding to the first register, and a subtractor coupled to the first and second registers to produce a difference between the first pixel value and the second pixel value.
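The behavior of a single pixel of such an array may be modeled in software as follows. This is a behavioral sketch only, not a description of the circuitry: the class name PixelCell and its method names are hypothetical, and the difference is taken as the second pixel value minus the first pixel value, which is an assumption about the direction of subtraction.

```python
from dataclasses import dataclass


@dataclass
class PixelCell:
    """Software stand-in for one pixel: two registers and a subtractor."""
    first_register: int = 0   # pixel value of the first frame
    second_register: int = 0  # pixel value of the second frame

    def store_first(self, value):
        self.first_register = value

    def store_second(self, value):
        self.second_register = value

    def difference(self):
        # Assumed direction: second pixel value minus first pixel value.
        return self.second_register - self.first_register


cell = PixelCell()
cell.store_first(120)   # value captured in the first frame
cell.store_second(127)  # value captured in the second frame
delta = cell.difference()
print(delta)
```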
Another embodiment of the present invention is a method of supporting motion estimation processing in an image sensor array of a video camera. The image sensor array has a plurality of pixels, wherein the plurality of pixels are divided into a plurality of N pixel by M pixel blocks. The steps include capturing a first frame and storing pixel values of the first frame in a plurality of first registers, capturing a second frame and storing pixel values of the second frame in a plurality of second registers, the second registers corresponding to the first registers, subtracting the first registers from the second registers in parallel to produce a plurality of differences, and shifting and accumulating the differences to obtain a total divergence for each block of the image sensor array.
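The steps of the method may be sketched in software as follows. Several assumptions are made for illustration: the per-pixel subtraction, which the array performs in parallel, is emulated sequentially; the shifting and accumulating of differences is reduced to a plain summation over each block; and the total divergence is taken to be the sum of the absolute values of the differences. The function name block_divergences is hypothetical, with N taken as the block width and M as the block height.

```python
def block_divergences(first_frame, second_frame, n, m):
    """Return {(x0, y0): total divergence} for each n x m block.

    Frames are lists of rows indexed [y][x]. Partial blocks at the frame
    edges are ignored (a simplifying assumption of this sketch).
    """
    height, width = len(first_frame), len(first_frame[0])

    # Subtract the first registers from the second registers, pixel by pixel
    # (a sequential stand-in for the parallel per-pixel subtractors).
    diffs = [
        [second - first for first, second in zip(first_row, second_row)]
        for first_row, second_row in zip(first_frame, second_frame)
    ]

    # Accumulate the differences of each block into a single total.
    # Assumption: absolute values are summed.
    totals = {}
    for y0 in range(0, height - m + 1, m):
        for x0 in range(0, width - n + 1, n):
            totals[(x0, y0)] = sum(
                abs(diffs[y][x])
                for y in range(y0, y0 + m)
                for x in range(x0, x0 + n)
            )
    return totals


first = [[10, 10, 10, 10],
         [10, 10, 10, 10],
         [10, 10, 10, 10],
         [10, 10, 10, 10]]
second = [[13, 10, 10, 10],
          [10, 10, 10, 10],
          [10, 10, 10, 10],
          [10, 10, 10, 8]]

totals = block_divergences(first, second, n=2, m=2)
print(totals)
```

In this example only the top-left and bottom-right blocks report a nonzero total divergence, since those are the only blocks in which a pixel value changed between the two frames.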