Motion estimation is the process of determining motion vectors that describe the transformation from one picture to another, usually from adjacent frames in a video sequence. Motion estimation is typically based on an assumption that image values (brightness, color, etc., expressed in a suitable color space) remain constant over time, though their position in the image may change.
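The constancy assumption above is often written as a single equation. The notation is introduced here for illustration only and does not appear in the original text: I(x, y, t) is the value of the element at position (x, y) at time t, and (u, v) is the displacement of that element between two consecutive frames.

```latex
I(x + u,\; y + v,\; t + 1) = I(x,\; y,\; t)
```

Under this reading, motion estimation amounts to finding, for each element or group of elements, a displacement (u, v) for which the equality holds as closely as possible.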
The motion vectors may relate to the whole image (global motion estimation) or to specific parts of it, such as rectangular blocks, arbitrarily shaped patches, or even each individual element of the image. The map of all motion vectors (“motion map”) can thus have a different resolution from the image/frames to which it refers. If motion estimation calculates a motion vector for each element of the image (e.g., for each pixel of a video frame), the motion map (“accurate” or “dense” motion map) will have the same resolution as the image to which it refers.
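As a rough illustration of how the resolution of a motion map depends on the granularity of the estimation, the sketch below computes the number of motion vectors for a frame split into rectangular blocks. The function name `motion_map_shape`, the frame size, and the 16x16 block size are assumptions chosen for the example, not values taken from the text.

```python
def motion_map_shape(frame_h, frame_w, block_h, block_w):
    """Return (rows, cols) of a motion map with one vector per block.

    Partial blocks at the frame border are counted as full blocks
    (ceiling division), which is an assumption of this sketch.
    """
    rows = -(-frame_h // block_h)  # ceiling division
    cols = -(-frame_w // block_w)
    return rows, cols


# One vector for the whole frame (global motion estimation).
global_shape = motion_map_shape(1080, 1920, 1080, 1920)
# One vector per 16x16 block (hypothetical block size).
block_shape = motion_map_shape(1080, 1920, 16, 16)
# One vector per pixel (dense motion map): same resolution as the frame.
dense_shape = motion_map_shape(1080, 1920, 1, 1)
```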
Motion maps are helpful for a variety of applications.
First, they can notably improve the compression rate of video encoding, since they make it possible to produce a rendition of a frame based on a previous reference frame already known to the decoder (“motion compensation”). This avoids retransmitting information that can be reused from previous frames: the decoder can generate settings for a given element in the current frame based on the settings of the element in the reference frame to which the motion vector points. In fact, basic motion estimation and motion compensation techniques have been employed in conventional video codecs (e.g., MPEG family codecs or other frequency-transform based/block-based codecs) in order to account for the movement of an object across multiple sequential frames of a moving picture. For example, using block motion compensation (BMC), the frames can be partitioned into blocks of pixels. Each block B in the current frame can be predicted based on a block B0 of equal size in a reference frame. The position of block B0 in the reference frame relative to the position of B in the current frame can be encoded as a motion vector. In such cases, the motion vector indicates the opposite of the estimated x and y movement of the block of pixels: it points from B to B0, while the movement is from B0 to B. The motion vector is typically encoded with sub-pixel precision (i.e., it can also specify movements of fractions of a pixel) so that the encoder can also capture subtle movements of less than a full pixel. In MPEG family codecs, the blocks are not transformed other than being shifted to the position of the predicted block, and additional encoded information can indicate the differences between block B0 and block B.
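The block motion compensation steps described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular codec: it assumes frames stored as 2D lists, integer-valued motion vectors (sub-pixel precision is omitted), and hypothetical helper names `predict_block`, `residual`, and `reconstruct`.

```python
def predict_block(reference, top, left, size, mv):
    """Return block B0 of the reference frame that the motion vector points to.

    (top, left) is the position of block B in the current frame and
    mv = (dy, dx) is the integer offset from B to B0.
    """
    dy, dx = mv
    r0, c0 = top + dy, left + dx
    if r0 < 0 or c0 < 0 or r0 + size > len(reference) or c0 + size > len(reference[0]):
        raise ValueError("motion vector points outside the reference frame")
    return [row[c0:c0 + size] for row in reference[r0:r0 + size]]


def residual(block_b, block_b0):
    """Element-wise differences B - B0, sent in addition to the vector."""
    return [[b - p for b, p in zip(rb, rp)] for rb, rp in zip(block_b, block_b0)]


def reconstruct(block_b0, res):
    """Decoder side: shifted block B0 plus the residual gives back B."""
    return [[p + r for p, r in zip(rp, rr)] for rp, rr in zip(block_b0, res)]


# Toy 4x4 frames: a 2x2 object moves one pixel down and one pixel right,
# and one of its pixels changes value from 6 to 7.
reference = [[9, 8, 0, 0],
             [7, 6, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
current = [[0, 0, 0, 0],
           [0, 9, 8, 0],
           [0, 7, 7, 0],
           [0, 0, 0, 0]]

block_b = [row[1:3] for row in current[1:3]]
# The movement is (+1, +1); the vector points the opposite way, from B to B0.
block_b0 = predict_block(reference, top=1, left=1, size=2, mv=(-1, -1))
res = residual(block_b, block_b0)
decoded = reconstruct(block_b0, res)
```

In this toy case the residual is zero everywhere except for the one pixel that changed, which is why transmitting a vector plus a residual can be cheaper than transmitting the block itself.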
In addition to video encoding, many other applications can benefit from motion estimation, ranging from robotics (a dense motion field can help estimate the z-order of an image, i.e., a z-map associated with the image that conveys depth) to professional movie post-production/visual effects.
Estimating accurate/dense motion maps is very complex, so conventional motion estimation techniques rely either on block matching (a small region of the current frame is compared with similarly sized regions in the reference frame, typically oversampled to allow for sub-pixel motion estimation, until a vector that minimizes some error criterion is chosen) or on optical flow methods (the image is preprocessed to extract a few hundred features, then the algorithm tries to identify the precise motion of the features and calculates a dense motion map through interpolation).
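The block matching approach mentioned above can be illustrated with an exhaustive search. The sketch makes several assumptions that are not in the text: the error criterion is the sum of absolute differences (SAD), the search is limited to integer offsets (no oversampling for sub-pixel precision), and the names `sad`, `block_match`, and `search_range` are hypothetical.

```python
def sad(current, reference, top, left, dy, dx, size):
    """Sum of absolute differences between block B and a candidate block."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(current[top + r][left + c]
                         - reference[top + dy + r][left + dx + c])
    return total


def block_match(current, reference, top, left, size, search_range):
    """Try every integer offset within +/- search_range and keep the best.

    Returns ((dy, dx), cost) for the candidate with the lowest SAD;
    the first candidate found wins in case of a tie.
    """
    height, width = len(reference), len(reference[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            r0, c0 = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if r0 < 0 or c0 < 0 or r0 + size > height or c0 + size > width:
                continue
            cost = sad(current, reference, top, left, dy, dx, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost


# Toy frames: a 2x2 object moves one pixel down and one pixel right.
reference = [[9, 8, 0, 0],
             [7, 6, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
current = [[0, 0, 0, 0],
           [0, 9, 8, 0],
           [0, 7, 6, 0],
           [0, 0, 0, 0]]

best_mv, best_cost = block_match(current, reference, top=1, left=1,
                                 size=2, search_range=1)
```

The cost of the search grows with the square of the search range for every block, which gives a sense of why estimating one vector per element in this way quickly becomes expensive.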
Motion maps are just specific examples of what we define as “auxiliary maps”, i.e., maps of auxiliary information associated with a signal (which can be a 2D image, a 3D volumetric image, a 3D signal including both space and time-based dimensions, or even a signal featuring more than three dimensions) such that, for given portions of the signal (e.g., in the case of dense auxiliary maps, for every plane element of the signal), the auxiliary map specifies suitable information and/or meta-information associated with that portion/element. In the case of motion maps, such auxiliary information is represented by the coordinates of the motion vector and by additional meta-information related to the motion vector.
Aside from motion maps, other non-limiting examples of auxiliary maps are z-maps (which provide, for every portion/element of the signal, information relative to the depth of field/distance from the observer), simplified motion fields (which provide simplified information on the motion of every portion/element of the signal, e.g. highly quantized motion information suitable to distinguish between what moves with a motion within a given range of movements vs. what is still or moves with a movement outside of the range), class maps (which provide, for every portion/element of the signal, information relative to what class it belongs to, e.g., distinguishing in medical imaging between plane elements belonging to bones, soft tissues, fluids, metals, etc.), and so forth.
One of the key characteristics of auxiliary maps is that they present fairly homogeneous areas separated by sharp discontinuities, and it is often inappropriate to modify their resolution (e.g., obtaining a more accurate map starting from a lower resolution one, or vice versa) by means of interpolation or other standard upsampling/downsampling techniques. For instance, in a video it would be inappropriate to define the motion of an element at the transition between two zones moving in different ways by means of a motion vector calculated by interpolating the two different motions, since the interpolation would likely lead to a movement that has nothing to do with either of the two. Similarly, in a medical image it would be inappropriate to define the value of an element at the transition between bone and soft tissue by interpolating the two corresponding classes, since the interpolated class would likely have no meaning in that context.
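The problem with interpolation at a discontinuity can be shown on a single row of a motion map. In the sketch below, the two upsampling functions, the factor of 2, and the example vectors are assumptions made for illustration; sample replication is used only as a simple contrast to linear interpolation and is not presented as the recommended method.

```python
def upsample_nearest(row, factor):
    """Repeat each motion vector `factor` times (no new values are created)."""
    return [v for v in row for _ in range(factor)]


def upsample_linear(row, factor):
    """Linearly interpolate between neighbouring vectors.

    The last sample has no right-hand neighbour, so it is replicated.
    """
    out = []
    for i, v in enumerate(row):
        nxt = row[i + 1] if i + 1 < len(row) else v
        for k in range(factor):
            t = k / factor
            out.append(tuple((1 - t) * a + t * b for a, b in zip(v, nxt)))
    return out


# Two motion zones: the left one moves 4 pixels to the right,
# the right one moves 4 pixels to the left.
row = [(4, 0), (4, 0), (-4, 0), (-4, 0)]

nearest = upsample_nearest(row, 2)
linear = upsample_linear(row, 2)
# Vectors produced by interpolation that belong to neither motion zone.
spurious = [v for v in linear if v not in row]
```

Here linear interpolation produces a zero vector at the boundary, describing an element that stands still even though both neighbouring zones are moving. The same reasoning applies to class maps, where averaging two class labels yields a value that corresponds to no class at all.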