Motion estimation in the known art is the process of determining motion vectors that describe the transformation from one picture to another, usually between adjacent pictures in a video sequence. Motion estimation is typically based on an assumption that image values (e.g., brightness, color, etc., expressed in a suitable color space) remain constant over time, whereas their position in the image may change.
In known methods such as MPEG (Moving Picture Experts Group) methods, motion vectors may relate to the whole image (global motion estimation), to specific parts such as rectangular blocks, or even to each individual element of the image. The map of all motion vectors (“motion map”) can thus possess a different resolution from the image/frames to which it refers. When motion estimation calculates a motion vector for each element of the image (e.g., for each pixel of a frame of a video), the motion map (“accurate” or “dense” motion map) will have the same resolution as the image to which it refers.
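By way of a non-limiting illustration of how the resolution of a motion map relates to the resolution of the image, the following minimal sketch counts the motion vectors in a block-based map and in a dense map. The function name, the 1920x1080 frame size and the 16x16 block size are assumptions chosen for the example and are not taken from any specific codec.

```python
import math

def motion_map_size(width, height, block_size):
    """Return (columns, rows) of a motion map holding one vector per block.

    block_size=1 corresponds to a dense map with one vector per pixel.
    """
    return (math.ceil(width / block_size), math.ceil(height / block_size))

# Hypothetical frame size, used only for illustration.
frame_width, frame_height = 1920, 1080

block_map = motion_map_size(frame_width, frame_height, 16)  # one vector per 16x16 block
dense_map = motion_map_size(frame_width, frame_height, 1)   # one vector per pixel

print(block_map)  # (120, 68)
print(dense_map)  # (1920, 1080)
```

In this sketch the block-based map is much smaller than the frame, whereas the dense map has exactly the same resolution as the frame.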
Motion maps are helpful for a variety of applications.
First, they can notably improve the compression rate of video encoding, since they make it possible to produce a rendition of an image based on a reference (e.g., in known methods, a previous reference image of the same sequence) already known to the decoder (“motion compensation”). This avoids the need to transmit again the information that can be reused from previous images: the decoder can generate settings for the given element in the current image based on settings of the element in the reference image to which the motion vector points. In fact, basic motion estimation and motion compensation techniques have been employed in conventional video codecs (e.g., MPEG family codecs or other frequency-transform based/block-based codecs) in order to account for movement of an object in a moving picture of multiple sequential images.
For example, using block motion compensation (BMC), the image is partitioned into blocks of elements (“pixels”). Each block B in the current image is predicted based on a block B0 of equal size in a reference image. The position of the block B0 in the reference image with respect to the position of B in the current image (“offset”) is typically encoded as a motion vector with two coordinates. In these cases, the motion vector indicates the opposite of the estimated x and y movement of the block of pixels (in particular, it indicates the opposite of the movement since it points from B to B0, while the movement is from B0 to B). The motion vector is typically encoded by using two integer coordinates with sub-pixel precision (i.e., it can also specify movements of fractions of a pixel, typically in steps of ¼ of a pixel), so that the encoder can also capture subtle movements of less than a full pixel. According to MPEG family codecs, the blocks are not transformed other than being shifted to the position of the predicted block, and additional encoded information can indicate differences between block B0 and block B.
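The block prediction described above can be sketched as follows, purely by way of a non-limiting example. The sketch assumes that the motion vector is stored as two integers in quarter-pixel units pointing from B to B0, and that fractional positions are sampled with bilinear interpolation. Bilinear sampling is a simplification adopted here for brevity (conventional codecs generally use longer interpolation filters), and all function and parameter names are hypothetical.

```python
def sample_bilinear(ref, x, y):
    """Sample image ref (a list of rows) at fractional position (x, y).

    Positions outside the image are clamped to the nearest border sample.
    """
    h, w = len(ref), len(ref[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = ref[y0][x0] * (1 - fx) + ref[y0][x1] * fx
    bottom = ref[y1][x0] * (1 - fx) + ref[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def predict_block(ref, bx, by, size, mv_q):
    """Predict block B with top-left corner (bx, by) in the current image.

    mv_q = (mvx, mvy) holds integer coordinates in quarter-pixel units
    (an assumed encoding) and points from B to B0 in the reference image.
    """
    dx, dy = mv_q[0] / 4.0, mv_q[1] / 4.0
    return [[sample_bilinear(ref, bx + i + dx, by + j + dy)
             for i in range(size)]
            for j in range(size)]

# Reference image: a horizontal ramp, so that fractional sampling is easy to check.
ref = [[10 * x for x in range(8)] for _ in range(4)]

# A vector of (-6, 0) quarter pixels means B0 lies 1.5 pixels to the left of B.
pred = predict_block(ref, bx=2, by=1, size=2, mv_q=(-6, 0))
print(pred)  # [[5.0, 15.0], [5.0, 15.0]]
```

In an encoder of this kind, the differences between the actual block B and the prediction would then be transmitted as residual data.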
In addition to video encoding, there are also many other applications that can benefit from motion estimation, ranging from robotics (a dense motion field can help identify objects and/or estimate the z-order of an image, i.e., a z-map associated with the image that conveys a sense of depth) to professional movie post-production/visual effects.
Estimating accurate/dense motion maps that describe the motion of each image element is very complex, so conventional motion estimation techniques try to limit both the computational load and the amount of information required to describe motion. State-of-the-art techniques are usually based on either block matching methods or optical flow methods.
In block matching methods (typically aimed at applications that require very fast processing and a limited amount of motion information, such as video encoding), a small square region of the current image is compared with similarly sized regions in the reference image, which is typically oversampled in order to allow for sub-pixel motion estimation, until an offset motion vector that minimizes some error criterion is chosen.
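A minimal, non-limiting sketch of such a block matching search is given below. It assumes an exhaustive search over whole-pixel offsets and uses the sum of absolute differences (SAD) as the error criterion. Both choices are assumptions made for illustration, and the oversampling of the reference image for sub-pixel estimation is omitted. All names are hypothetical.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between a block of cur and a block of ref."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
    return total

def block_match(cur, ref, bx, by, size, search):
    """Exhaustively search offsets within +/- search pixels.

    Returns ((dx, dy), cost) for the offset with the lowest SAD; the offset
    points from the block in cur to the best-matching block in ref.
    """
    h, w = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidate blocks that fall outside the reference image.
            if rx < 0 or ry < 0 or rx + size > w or ry + size > h:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

def texture(x, y):
    """Synthetic, non-repeating pattern used only to build test images."""
    return 3 * x * x + 5 * y * y + x * y

# The current image shows the reference content moved 2 pixels right, 1 pixel down.
ref = [[texture(x, y) for x in range(12)] for y in range(12)]
cur = [[texture(x - 2, y - 1) for x in range(12)] for y in range(12)]

mv, cost = block_match(cur, ref, bx=4, by=4, size=4, search=3)
print(mv, cost)  # (-2, -1) 0
```

Consistent with the convention of offset motion vectors, the vector found is the opposite of the movement: the content moved by (+2, +1), and the vector points back by (-2, -1) to its position in the reference image.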
In optical flow methods (typically aimed at applications that require precise description of motion even at the expense of speed and amount of motion information, such as special effects and video editing), the image is preprocessed so as to extract a number of features; then the algorithm tries to identify the precise motion of the features and calculates a dense motion map (i.e., one offset motion vector for each image element) through interpolation.
Known encoding techniques based on block motion compensation and on offset motion vectors using integer coordinates (i.e., coordinates with fixed precision, such as ⅛ of a pixel) have several important drawbacks, suitably addressed by novel methods described herein. First, the borders of moving objects are poorly described by blocks, generating artifacts that must be corrected with residual data (or that corrupt the rendition of the image obtained via motion compensation). Second, the use of offset coordinates with a given sub-pixel precision typically requires buffering an upsampled rendition (e.g., a very high resolution version) of the reference image at the given sub-pixel resolution; as a consequence, capturing very subtle movements (e.g., 1/128 of a pixel, important for instance in the case of high frame-rate video signals or in the case of complex movements such as a 1% zoom with 2-degree rotation) is not feasible due to memory limitations. Third, in the case of large objects with a consistent movement (e.g., a large background), some bit-rate is wasted because multiple correlated (and not necessarily identical) motion vectors must be encoded and transmitted. Lastly, these well-known methods cope poorly with more complex movements (e.g., rotation, zoom, perspective changes, etc.), which are imperfectly described by translation movements of blocks.
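The memory limitation mentioned as the second drawback can be illustrated with a short, non-limiting calculation. The sketch assumes a single 8-bit plane of a 1920x1080 reference image that is upsampled in both dimensions by the inverse of the sub-pixel step. The frame size, bit depth and function name are assumptions chosen for the example.

```python
def upsampled_buffer_bytes(width, height, steps_per_pixel, bytes_per_sample=1):
    """Bytes needed to buffer one image plane upsampled to sub-pixel resolution.

    steps_per_pixel is the number of sub-pixel positions per pixel along each
    axis (4 for quarter-pixel precision, 128 for 1/128-pixel precision).
    """
    return (width * steps_per_pixel) * (height * steps_per_pixel) * bytes_per_sample

# Hypothetical 1920x1080 plane with one byte per sample.
quarter_pel = upsampled_buffer_bytes(1920, 1080, 4)
fine_pel = upsampled_buffer_bytes(1920, 1080, 128)

print(quarter_pel)  # 33177600     (about 33 MB)
print(fine_pel)     # 33973862400  (about 34 GB)
```

Under these assumptions, the buffer grows with the square of the number of sub-pixel steps, so moving from ¼-pixel to 1/128-pixel precision multiplies the memory requirement by 1024.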
Motion maps are just specific examples of what we define as “auxiliary maps”, i.e., maps of auxiliary information associated with a signal such that, for given portions of the signal (e.g., in the case of accurate/dense auxiliary maps, for every plane element of the signal), the auxiliary map specifies suitable information and/or meta-information associated with that portion/element. The signal can be without limitation an audio signal, a 2D image, a 3D volumetric image, a 3D signal including both space and time-based dimensions, or even a signal featuring more than three dimensions. In the case of motion maps for video, this auxiliary information corresponds to the information on motion of each portion of the image and to additional meta-information related to the motion vector (e.g., confidence level, statistical precision, etc.).
Aside from motion maps, other non-limiting examples of auxiliary maps are z-maps (which provide, for every portion/element of the signal, information relative to the depth of field/distance from the observer), simplified motion fields (which provide simplified information on the motion of every portion/element of the signal, e.g. highly quantized motion information suitable to distinguish between what moves with a motion within a given range of movements vs. what is still or moves with a movement outside of the range), class maps (which provide, for every portion/element of the signal, information relative to what class it belongs to, e.g., distinguishing in medical imaging between plane elements belonging to bones, soft tissues, fluids, metals, etc.), and so forth.
One of the key characteristics of auxiliary maps is that they present fairly homogeneous areas separated by sharp discontinuities, and it is often inappropriate to modify their resolution (e.g., obtaining a more accurate map starting from a lower resolution one, or vice versa) by leveraging interpolation techniques or other standard upsampling/downsampling techniques. For instance, in a video it would be inappropriate to define the motion of an element at the transition between two objects moving in different ways by means of a motion vector calculated by interpolating the two different motions, since the interpolation would likely lead to a movement that has nothing to do with either of the two movements. In a similar fashion, in a medical image it would be inappropriate to define the value of an element at the transition between a bone and a soft tissue by means of interpolating the two corresponding classes, since the class corresponding to the interpolated value would likely have no meaning in that context.
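The following non-limiting sketch illustrates why interpolation is problematic at such discontinuities. The numeric class codes and the one-dimensional maps are hypothetical and chosen only for the example. Nearest-neighbor replication is shown merely as a contrast that preserves valid values; it is not presented as the method described herein.

```python
# Hypothetical numeric codes for a class map (not taken from any standard).
CLASSES = {0: "bone", 1: "soft tissue", 2: "fluid"}

def upsample_linear_1d(values):
    """Insert the average of each pair of neighbors between them."""
    out = []
    for a, b in zip(values, values[1:]):
        out.extend([a, (a + b) / 2])
    out.append(values[-1])
    return out

def upsample_nearest_1d(values):
    """Repeat each value twice, so that only existing values appear."""
    out = []
    for v in values:
        out.extend([v, v])
    return out

# A row of a class map with a sharp transition from bone (0) to fluid (2).
class_row = [0, 0, 2, 2]
linear = upsample_linear_1d(class_row)
nearest = upsample_nearest_1d(class_row)

print(linear)   # [0, 0.0, 0, 1.0, 2, 2.0, 2]  -> 1.0 would read as "soft tissue"
print(nearest)  # [0, 0, 0, 0, 2, 2, 2, 2]

# Two objects moving in opposite directions: averaging their vectors at the
# boundary yields "no motion", which describes neither object.
left_mv, right_mv = (4, 0), (-4, 0)
blended = tuple((a + b) / 2 for a, b in zip(left_mv, right_mv))
print(blended)  # (0.0, 0.0)
```

In both cases the interpolated value is a legal number but carries a meaning (a third tissue class, a stationary element) that is unrelated to either side of the discontinuity.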