Next-generation high-resolution display devices will have a display resolution that likely exceeds even the high-definition video formats now prevalent, such as 1080i, 720p, and even 1080p. There will thus likely be the need for spatial upconversion and/or format conversion techniques to display content on such next-generation devices, as well as noise reduction and image enhancement techniques. In addition, these devices may well utilize frame rates that exceed those used in existing display devices, highlighting the need for temporal upconversion as well.
Both spatial and temporal upconversion, as well as noise reduction and image enhancement, benefit from accurate motion estimation. Techniques that use motion estimation have been shown to outperform those that simply use single-frame image processing methods. Motion estimation for purposes of motion-compensated temporal video filtering or upconversion, as just described, is more rigorous than motion estimation used for video compression (e.g. MPEG, H.264), in that motion-compensated temporal video filtering or upconversion requires an estimate of the “true motion”, i.e. the two-dimensional flow arising from the projection of three-dimensional motion in the scene. In other words, whereas video coding standards are merely concerned with finding an optimal motion vector that minimizes the residual information that needs to be encoded, the goal of motion-compensated temporal video filtering or upconversion is to find a motion vector that corresponds to actual motion (translational, rotational, etc.) in a frame.
Though motion vectors may relate to the whole image, more often they relate to small regions of the image, such as rectangular blocks, arbitrary shapes, boundaries of objects, or even individual pixels. Motion vectors may be represented by a translational model or many other models that approximate the motion of a real video camera, such as rotation, translation, or zoom. There are various methods for finding motion vectors. One of the popular methods is block-matching, in which a frame is subdivided into rectangular blocks of pixels, such as 4×4, 4×8, 8×8, 16×16, etc., and a motion vector (or displacement vector) is estimated for each block by searching for the closest-matching block, within a pre-defined search region, of a subsequent or preceding frame. Block-matching algorithms make use of certain evaluation metrics such as mean square error (MSE), sum of absolute differences (SAD), sum of squared differences (SSD), etc. to determine whether a given block in a reference frame matches a search block in a current frame. A reference image block is found to be a matching block by applying a motion vector with integer-pixel accuracy or sub-pixel accuracy. Different searching strategies such as cross search, full search, spiral search, or three-step search may also be utilized to evaluate possible candidate motion vectors over a predetermined neighborhood search window to find the motion vector.
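The block-matching procedure described above can be illustrated with a minimal sketch, assuming integer-pixel accuracy, a full search, and SAD as the matching metric. The function names (`sad`, `full_search`), the synthetic `pattern` used to build the two frames, and the block size and search radius are hypothetical choices made for illustration only; they are not taken from any particular prior art method.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in
    `cur` and the block displaced by (dx, dy) in `ref`."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n=4, radius=2):
    """Evaluate every integer displacement within +/- radius and return
    (dx, dy, cost) for the candidate with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates whose block would leave the reference frame.
            inside = (0 <= bx + dx and bx + dx + n <= width
                      and 0 <= by + dy and by + dy + n <= height)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def pattern(x, y):
    """Hypothetical synthetic image content (not from the source)."""
    return x * x + 3 * y * y + x * y


SIZE = 12
ref_frame = [[pattern(x, y) for x in range(SIZE)] for y in range(SIZE)]
# The current frame shows the same content displaced by (2, 1) pixels.
cur_frame = [[pattern(x + 2, y + 1) for x in range(SIZE)] for y in range(SIZE)]

motion = full_search(cur_frame, ref_frame, bx=4, by=4)
```

Even in this small sketch the cost structure is visible: a search radius of 2 requires 25 SAD evaluations per block, and the count grows quadratically with the radius.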
Numerous variations of this method exist, which may differ in their definition of the size and placement of blocks, the method of searching, the criterion for matching blocks in the current and reference frame, and several other aspects. Methods based on block matching are prevalent in practical video processing applications. However, a major challenge for practical implementation in hardware products is the large computational cost associated with most basic full-search block matching methods. A large number of prior art methods are focused on reducing the search space for block matching in order to reduce computational cost. The high computational cost of search-based block matching continues to be a problem.
In particular, given the prevalence of the block-matching technique in which pixels of an image are grouped together into regions as small as 4×4 pixels, high-quality encoding techniques may define motion vectors at a sub-pixel resolution. For example, a motion vector associated with a 4×4 pixel block (the highest block resolution) would be able to distinguish motion at a single-pixel resolution in the actual image. Alternatively, if a larger block size were used, say an 8×8 block, single pixel resolution for defining motion between blocks could require eighth-pixel accuracy or greater in a motion vector. As can easily be appreciated, block matching motion vector compression techniques require high computational cost when sub-pixel estimation is used.
Theoretical and experimental analyses have established that sub-pixel accuracy has a significant impact on the performance of motion compensation. Sub-pixel accuracy is mainly achieved through interpolation. Various methods of performing interpolative up-sampling in the spatial domain or frequency domain have been proposed over the years. One major concern in implementing interpolative sub-pixel methods, however, is the computational cost. For example, to achieve one-eighth pixel accuracy, an image-processing system needs to handle the storage and manipulation of data arrays that are 64 times larger than those needed for integer-pixel motion estimation, since the sampling density increases eightfold in each of the two spatial dimensions.
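The factor of 64 follows from multiplying the sample count by 8 horizontally and by 8 vertically. The sketch below checks that arithmetic and shows one simple way of obtaining a sub-pixel sample. Bilinear interpolation is used here purely as an assumed example of spatial-domain interpolation; the function names and the 1920×1080 frame size are likewise illustrative assumptions.

```python
import math


def subpel_sample_count(width, height, factor):
    """Number of samples when each axis is up-sampled by `factor`
    (factor=8 corresponds to one-eighth pixel accuracy)."""
    return (width * factor) * (height * factor)


def bilinear_sample(frame, x, y):
    """Interpolate `frame` at a fractional position (x, y) using the four
    surrounding integer-pixel samples, clamping at the frame border."""
    height, width = len(frame), len(frame[0])
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = frame[y0][x0] * (1 - fx) + frame[y0][x1] * fx
    bottom = frame[y1][x0] * (1 - fx) + frame[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


# Ratio of up-sampled to integer-pixel storage for an assumed 1920x1080 frame.
ratio = subpel_sample_count(1920, 1080, 8) // subpel_sample_count(1920, 1080, 1)

# One sample located one-eighth of a pixel to the right of the top-left pixel.
tiny = [[0, 8], [16, 24]]
eighth_pel = bilinear_sample(tiny, 0.125, 0.0)
```

Interpolating on demand, as `bilinear_sample` does, trades the storage cost for repeated computation; either way the number of candidate positions to evaluate grows by the same factor.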
Gradient-based motion estimation is another important class of motion estimation methods. In gradient-based motion estimation, local motion is estimated using local spatial and temporal image derivative values (gradients) in a local analysis window, which together correlate with motion in an image. Gradient-based methods have not been used frequently in practical applications. One reason may be that gradient-based methods are often applied on a pixel-by-pixel basis, to estimate a dense optical flow field. Also, most gradient-based optical flow estimation techniques require iterative optimization over all the image data across entire frames. Such algorithms pose computational challenges that are intractable for practical hardware implementation. Another challenge with basic gradient-based techniques is that they are only suitable for estimating small motion vectors (or small displacements). Hence, coarse-to-fine strategies, as well as iterative optimization, are often invoked.
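As a rough illustration of how spatial and temporal gradients in a local analysis window constrain motion, the following sketch solves a least-squares problem of the Lucas-Kanade type for a single window. This is one well-known instance of the gradient-based class and is an assumption here, not necessarily the variant any given prior art method uses. Each pixel contributes the linearized constraint Ix·u + Iy·v + It = 0. The function name, window size, and the quadratic test pattern are hypothetical; the quadratic pattern is chosen deliberately because the linearization is exact for it, which real images do not guarantee.

```python
def local_gradient_motion(f1, f2, cx, cy, half=2):
    """Estimate one displacement (u, v) for the window centred at (cx, cy)
    by least squares over the constraints Ix*u + Iy*v + It = 0."""
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            # Central-difference spatial gradients, averaged over both frames.
            ix = 0.25 * ((f1[y][x + 1] - f1[y][x - 1]) + (f2[y][x + 1] - f2[y][x - 1]))
            iy = 0.25 * ((f1[y + 1][x] - f1[y - 1][x]) + (f2[y + 1][x] - f2[y - 1][x]))
            it = f2[y][x] - f1[y][x]  # temporal gradient
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        return None  # gradients do not constrain both motion components
    u = (sxy * syt - syy * sxt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


def smooth(x, y):
    """Hypothetical smooth test pattern (quadratic in x and y)."""
    return x * x + 2.0 * y * y + x * y


SIZE = 9
TRUE_U, TRUE_V = 0.25, -0.5  # small sub-pixel displacement
frame1 = [[smooth(x, y) for x in range(SIZE)] for y in range(SIZE)]
frame2 = [[smooth(x - TRUE_U, y - TRUE_V) for x in range(SIZE)] for y in range(SIZE)]

estimate = local_gradient_motion(frame1, frame2, cx=4, cy=4)
```

No search is involved, and the result has sub-pixel precision without interpolation. The `None` branch corresponds to windows whose gradients all point in one direction, where the data leave one motion component undetermined.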
Motion estimation is a very challenging problem for other reasons as well. The assumptions used in many motion models do not hold exactly at all image locations. For example, a basic assumption is that the color or brightness of a pixel or block of pixels is preserved from one video frame to the next. Another well-known problem is that the data may not sufficiently constrain the motion model to arrive at a reliable solution. A further problem is posed by occlusions, i.e. areas in one image that do not appear in the other image. Yet another is noise, such as camera noise or compression noise. Hence, there is a strong need for robustness.
To overcome some of these challenges, it is often beneficial to utilize the concept of spatial consistency or coherency, which states that real-world surfaces have a spatial extent and areas on a single surface are likely to be moving with the same or very similar motion. The spatial extent of object surfaces is often larger than the extent of the single pixel or pixel block for which motion has to be estimated. Therefore, local motion vectors that model the motion of single pixels or small pixel blocks are often similar to their neighboring motion vectors. This leads to the introduction of the well-known motion smoothness constraint, used very commonly in prior art methods. However, the assumption of spatial consistency does not hold at motion boundaries, which often coincide with object boundaries. This often leads to motion fields that are overly smooth at object boundaries. Recent approaches have used more advanced forms of the spatial smoothness constraint that allow breaking smoothness at motion boundaries. Other approaches are robust estimation techniques, which allow for multiple motions to exist in areas of the image where smoothness would otherwise be enforced.
Likewise, it is often beneficial to utilize the concept of temporal consistency or coherency, which states that real-world objects have inertia, and that their motion may not change significantly from one video frame to the next. Therefore, motion vectors that were estimated in previous frames may be of significant help in estimating a motion vector in a new frame. This assumption can be incorporated into a motion estimation algorithm in various ways. The most common technique in existing practical applications is simply to use motion vectors in previous frames as predictors for estimation of a motion vector in the current frame. Subsequently, the predicted motion vector is updated based on block matching, using a local search technique in a restricted area.
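The predictor-then-refine scheme just described can be sketched as follows, assuming SAD as the matching metric and a search restricted to ±1 pixel around the predicted vector. The function names, the synthetic `pattern`, and the specific predictor and displacement values are hypothetical and serve only to illustrate the idea.

```python
def block_sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences for the n x n block at (bx, by),
    compared against the reference block displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def refine_prediction(cur, ref, bx, by, predicted, n=4, radius=1):
    """Search only a small neighbourhood around the predicted vector and
    return (dx, dy, cost) for the best candidate found."""
    px, py = predicted
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(py - radius, py + radius + 1):
        for dx in range(px - radius, px + radius + 1):
            # Skip candidates whose block would leave the reference frame.
            inside = (0 <= bx + dx and bx + dx + n <= width
                      and 0 <= by + dy and by + dy + n <= height)
            if not inside:
                continue
            cost = block_sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def pattern(x, y):
    """Hypothetical synthetic image content (not from the source)."""
    return x * x + 3 * y * y + x * y


SIZE = 12
ref_frame = [[pattern(x, y) for x in range(SIZE)] for y in range(SIZE)]
# The block's motion was (2, 1) in the previous frame and is (3, 1) now.
previous_vector = (2, 1)
cur_frame = [[pattern(x + 3, y + 1) for x in range(SIZE)] for y in range(SIZE)]

refined = refine_prediction(cur_frame, ref_frame, bx=4, by=4,
                            predicted=previous_vector)
```

With a radius of 1 only 9 candidates are evaluated, whereas an unguided full search covering displacements up to 3 pixels would need 49. The saving depends on the motion actually changing little between frames, which is the temporal consistency assumption itself.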
Therefore, there is still a need for more robust, efficient, and accurate methods for estimating motion vector fields for video processing.
The foregoing and other objectives, features, and advantages of the invention will be more readily understood upon consideration of the following detailed description of the invention, taken in conjunction with the accompanying drawings.