Motion estimation for video sequences is typically performed using a block motion model. Most video standards use a translational block motion model, wherein a block in the current frame is correlated with pixels in the next frame corresponding to possible translated positions of the block. A search for the best matching block in the next frame is performed. The vector displacement of the identified best matching block in the next frame, relative to the location of the corresponding block in the current frame, represents the block motion.
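The translational search described above can be sketched as follows. This is a minimal illustration, not the method of any particular standard: the sum of absolute differences (SAD) matching criterion, the exhaustive search window, and the names `sad` and `match_block` are assumptions introduced here. Frames are plain lists of rows of integer pixel values.

```python
def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences between a block and its displaced candidate."""
    return sum(
        abs(cur[top + r][left + c] - ref[top + dy + r][left + dx + c])
        for r in range(size)
        for c in range(size)
    )


def match_block(cur, ref, top, left, size=4, radius=3):
    """Exhaustive translational search.

    Returns ((dy, dx), cost) for the displacement with the lowest SAD,
    or (None, None) if no candidate fits inside the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, top, left, dy, dx, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost
```

The returned pair (dy, dx) is the two-parameter block motion vector; the cost of the search grows with the square of the assumed search radius.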
Block motion models other than translational block motion models have been proposed to compensate for inter-frame object rotation and perspective effects. These models can be more accurate than translational block motion models because a larger parameter space is searched, accounting for changes in block shape in addition to block translation. The transformation parameters are obtained directly from the best matching shape, that is, the shape that minimizes the block prediction error. However, these models require more parameters than a simple translational block motion model, which requires just two.
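To make the difference in parameter count concrete, the translational model can be written alongside two commonly used parametric models. The affine and projective forms below are illustrative assumptions; the text above does not name specific models, and the symbols are introduced here only for the comparison.

```latex
\begin{align*}
\text{Translational (2 parameters):}\quad
  & x' = x + d_x, \qquad y' = y + d_y \\
\text{Affine (6 parameters):}\quad
  & x' = a_1 x + a_2 y + a_3, \qquad y' = a_4 x + a_5 y + a_6 \\
\text{Projective (8 parameters):}\quad
  & x' = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + 1}, \qquad
    y' = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + 1}
\end{align*}
```

Here (x, y) is a pixel position in the current block and (x', y') is its position in the reference frame. Setting a_1 = a_5 = 1 and a_2 = a_4 = 0 reduces the affine model to the translational one.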
In parametric block matching, the distorted block shapes in the reference frames are related by a parametric transformation to the block to be matched in the current frame. However, parametric block matching motion estimation methods ignore the geometrical relationships that exist in the case of a calibrated multiple view image sequence. An example of a calibrated multiple view image sequence is an image sequence captured by a pre-calibrated camera of a rigid object on a rotating turntable. Another example is an image sequence captured by multiple calibrated cameras of the same static scene. These image sequences differ from general video sequences in that the objects/cameras/images are related by a known geometry.
Motion estimation methods that do not rely upon parametric block searching have been devised in order to take advantage of the known geometric relationships of multiple view image sequences to achieve improved compression performance. These methods take advantage of the fact that the displacement of a point from one view to the next depends only on its depth, once the internal and external camera parameters are known. These methods typically partition the frame to be predicted into square blocks. If a block in an intermediate view is assumed to have a constant depth Zblock, then by varying Zblock, displacement locations of the given block within a reference view are obtained.
The depth parameter Zblock that leads to the best match is selected as the motion descriptor for that block. However, the assumption that all pixels within a block have the same depth limits the accuracy of the model.
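The constant-depth block search can be sketched as follows, under several assumptions that are not taken from the text above: a pinhole camera with focal length `f` and principal point `(cx, cy)` shared by both views, a known rotation `R` and translation `t` from the current view to the reference view, nearest-neighbour sampling, and SAD as the matching criterion. The names `reproject` and `best_block_depth` are hypothetical.

```python
def reproject(u, v, depth, f, cx, cy, R, t):
    """Map pixel (u, v) at the given depth into the reference view."""
    # Back-project to a 3-D point in the current camera's coordinates.
    X = ((u - cx) * depth / f, (v - cy) * depth / f, depth)
    # Rigid transform into the reference camera's coordinates.
    Xr = [sum(R[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]
    # Perspective projection (same intrinsics assumed for both views).
    return f * Xr[0] / Xr[2] + cx, f * Xr[1] / Xr[2] + cy


def block_cost(cur, ref, top, left, size, depth, f, cx, cy, R, t):
    """SAD for one candidate depth, or None if the block leaves the frame."""
    height, width = len(ref), len(ref[0])
    cost = 0
    for r in range(top, top + size):
        for c in range(left, left + size):
            u2, v2 = reproject(c, r, depth, f, cx, cy, R, t)
            x, y = round(u2), round(v2)  # nearest-neighbour sampling
            if not (0 <= x < width and 0 <= y < height):
                return None
            cost += abs(cur[r][c] - ref[y][x])
    return cost


def best_block_depth(cur, ref, top, left, size, depths, f, cx, cy, R, t):
    """Return (depth, cost) for the candidate depth with the lowest SAD."""
    best = None
    for z in depths:
        cost = block_cost(cur, ref, top, left, size, z, f, cx, cy, R, t)
        if cost is not None and (best is None or cost < best[1]):
            best = (z, cost)
    return best
```

Because every pixel in the block shares the single candidate depth, the search is one-dimensional; the selected depth plays the role that the two-component motion vector plays in translational block matching.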