Motion estimation, the identification of motion in a sequence of images, frames or video fields, is well known. Existing methods of motion estimation typically consider two or more frames from a sequence and create a set of motion vectors that represents the 2D translational motion of image areas from one frame to the next. One possible technique for motion estimation is a motion search, in which a previous frame is searched to find the area of the image that best matches a particular area in the current frame. The difference in position between the two matching areas gives the motion vector at the current position.
Different systems have different requirements of the motion estimator. In a compression system such as a video encoder, the requirement is to form the most compact representation of a frame, by reference to a previous frame from the sequence. The requirement is generally to find motion vectors which give the best matches between areas of pixels in the current frame and the reference frame, as this leads to the most compact encoding. While the resulting motion vectors are usually representative of the actual motion of objects in the scene, there is no requirement that this is always the case. In other applications, such as object tracking or frame rate conversion, it is more important that the motion vectors represent the true motion of the scene, even if other distortions in the video mean that the pixels in the corresponding image areas are not always the best possible match to each other. By applying appropriate constraints during the motion search procedure, the results can be guided towards “best pixel match” or “true motion” as necessary. Collectively, the set of motion vectors in a frame is known as the motion vector field for that frame. Note that use of the term “vector field” should not be confused with the use of “field” or “video field” to describe the data in an interlaced video sequence, as described below.
While many approaches to motion estimation exist, a common implementation is that of a block based motion estimator. The invention disclosed in this patent will be described by showing how it can be used with a block based motion estimator, although the principles of the invention may also be applied to motion estimators of other types. In a block based motion estimator, frames are subdivided, typically into a regular grid of rectangular areas known as blocks or macroblocks. In a motion search procedure, each block's pixel data is compared with pixel data from various candidate locations in the previous frame and a scoring function is computed for each candidate. The relative position of the candidate with the best score gives the motion vector at the current block position.
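By way of illustration, such a block based motion search may be sketched as follows. This is a simplified example in Python, assuming greyscale frames held as NumPy arrays; the exhaustive search window, the block size and all function names are illustrative choices, not part of the disclosed invention.

```python
import numpy as np

def block_motion_search(prev, curr, bx, by, block=16, radius=8):
    """Exhaustively search a window in `prev` for the best match to the
    block whose top-left corner is (bx, by) in `curr`.

    prev, curr : 2D greyscale frames as NumPy arrays.
    Returns the (dx, dy) offset, relative to the current block position,
    of the previous-frame area with the lowest SAD score.
    """
    target = curr[by:by + block, bx:bx + block].astype(np.int32)
    best_score, best_vec = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = bx + dx, by + dy
            # Skip candidate areas that fall outside the previous frame.
            if x < 0 or y < 0 or x + block > prev.shape[1] or y + block > prev.shape[0]:
                continue
            candidate = prev[y:y + block, x:x + block].astype(np.int32)
            score = np.abs(target - candidate).sum()  # scoring function: SAD
            if best_score is None or score < best_score:
                best_score, best_vec = score, (dx, dy)
    return best_vec
```

Note that the offset returned points from the current block to the matching area in the previous frame; an object that has moved right and down therefore yields a negative offset under this convention.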
FIG. 1 illustrates a typical example of a block matching motion estimator. In all the figures, including FIG. 1, motion vectors are shown with the head of the arrow at the centre of the block to which the vector corresponds. The frames are divided into blocks, and an object 101 in the previous frame has moved to position 102 in the current frame. The previous position of the object is shown superimposed on the current frame as 103. Motion estimation is performed for blocks rather than for objects, where a block of pixels in the current frame is matched with a block sized pixel area in the previous frame which is not necessarily block aligned. For example, block 104 is partially overlapped by the moving object 102, and has contents as illustrated at 105. Motion estimation for block 104, if it performs well, will find the pixel data area 106 in the previous frame, which can also be seen to contain the pixels illustrated in 105, i.e. a good match has been found. Superimposed back onto the current frame, the matching pixel data area is at 107. The motion vector associated with block 104 is therefore as illustrated by arrow 108.
Rather than exhaustively consider every possible location, many block based motion estimators select their output motion vector by testing a set of motion vector candidates with a scoring function such as a sum of absolute differences (SAD) or mean of squared differences (MSD), to identify motion vectors which give the lowest error block matches. FIG. 2 illustrates the candidate evaluation process for the block 201 in the current frame which has pixel contents shown in 211. In this simple example system, three motion vector candidates 206, 207 and 208 are considered which correspond to candidate pixel data areas at locations 202, 203 and 204 in the previous frame. The pixel contents of these pixel data areas can be seen in 212, 213 and 214 respectively. It is apparent that the pixel data at location 202 provides the best match for block 201 and should therefore be selected as the best match/lowest difference candidate. Superimposed back onto the current frame, the matching pixel data area is at 205 and the associated motion vector is 206.
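The candidate evaluation process described above may be sketched as follows, again as an illustrative Python fragment assuming greyscale NumPy frames. Both a SAD and an MSD scoring function are shown; the names and the in-bounds candidate positions are assumptions of the example.

```python
import numpy as np

def sad(a, b):
    # Sum of absolute differences between two equal-sized pixel areas.
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def msd(a, b):
    # Mean of squared differences; penalises large pixel errors more heavily.
    d = a.astype(np.int32) - b.astype(np.int32)
    return float((d * d).mean())

def best_candidate(prev, curr, bx, by, candidates, block=16, score=sad):
    """Return the candidate vector (dx, dy) whose pixel area in `prev`
    best matches the block at (bx, by) in `curr` under `score`.
    Candidates are assumed to reference areas lying inside the frame."""
    target = curr[by:by + block, bx:bx + block]

    def candidate_score(v):
        dx, dy = v
        area = prev[by + dy:by + dy + block, bx + dx:bx + dx + block]
        return score(target, area)

    return min(candidates, key=candidate_score)
```

In the example of FIG. 2, the three candidate vectors would form the `candidates` list, and the lowest-scoring candidate (206 in the figure) would be selected.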
Motion vectors are known to be highly correlated both spatially and temporally with vectors in adjacent blocks, so these neighbouring vectors are often used as the basis for the set of candidate motion vectors considered in the motion estimation for a particular block. A random element may also be incorporated into the candidates to allow the system to adapt as the motion in the video changes. Where a block has motion that is not simply predicted by its neighbours, a system may rely on random perturbation of vector candidates, known as jitter. This works well for slowly changing vector fields, but tends not to allow the motion estimator to converge rapidly to a new vector where it is very different to its neighbours. A system relying on randomness may wander towards the correct motion over time, but is prone to becoming stuck in local minima, or converging so slowly that the motion has changed again by the time it gets there. It is therefore desirable to introduce candidates that more accurately predict new and changing motion, or otherwise to refine the selection of candidate motion vectors, so as to increase the speed of convergence of the vector field. The number of candidate motion vectors tested for each block is often a compromise between choosing a set large enough to identify true motion and/or provide good matches with a low residual error, while being small enough to minimize computational expense.
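A conventional candidate set of the kind described above might be assembled as in the following sketch: the vectors of spatially adjacent blocks, the vector this block carried in the previous frame, and a small number of randomly jittered variants. The parameter names and jitter range are illustrative assumptions.

```python
import random

def candidate_set(spatial_neighbours, temporal, jitter=2, n_random=2, seed=None):
    """Assemble motion vector candidates for one block.

    spatial_neighbours : vectors of adjacent blocks in the current frame.
    temporal           : the vector this block had in the previous frame.
    Random jitter around the first candidate lets the estimator adapt
    slowly to changing motion, at the cost of slow convergence.
    """
    rng = random.Random(seed)
    # De-duplicate while preserving order.
    candidates = list(dict.fromkeys(list(spatial_neighbours) + [temporal]))
    base = candidates[0] if candidates else (0, 0)
    for _ in range(n_random):
        dx = rng.randint(-jitter, jitter)
        dy = rng.randint(-jitter, jitter)
        candidates.append((base[0] + dx, base[1] + dy))
    return candidates
```

Each candidate would then be scored as in the earlier evaluation step; the compromise discussed above corresponds to the choice of `n_random` and the number of neighbours included.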
Video sequences typically comprise a series of non interlaced frames of video data, or a series of interlaced fields of video data. An interlaced sequence is composed of fields which each carry data on alternate lines of a display, such that a first field will carry data for alternate lines, and a second field will carry data for the missing lines. The fields are thus spaced both temporally and spatially. Every alternate field in a sequence will carry data at the same spatial locations.
Not all video sequences consist of “real” images such as may be produced by a video camera. Applications such as games, virtual reality environments, Computer Aided Design (CAD) systems, etc., typically output a series of images which may be referred to as artificially generated video sequences.
In computer graphics, and particularly in 3D computer graphics, a number of coordinate systems are commonly used. FIG. 8 shows three important coordinate systems. The world space is a space with an arbitrary origin, 800, in which a camera (or eye) point, 810, a screen position, 820, and three objects, 830, 840, and 850, are shown in plan view. The direction in which the camera is pointing is shown as 860. An initial step in rendering this scene is to transform the objects into the camera space. In the camera space, also shown in plan view, the camera is at the origin and points along the z axis. The screen, 820, is perpendicular to the view direction. A second step projects the objects into screen space, where the x,y position of an object on the screen depends not only on its x,y position, but also on its z coordinate in the camera space. This is therefore a perspective projection, which helps to give the scene a “three dimensional” appearance.
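The two transformation steps described above may be sketched numerically as follows. This illustrative Python fragment assumes a simple "look-at" camera with the view direction along the camera z axis, as in FIG. 8, and a pinhole projection with an assumed focal length; real graphics systems typically express the same steps as 4x4 homogeneous matrices.

```python
import numpy as np

def look_at(camera_pos, target, up=(0.0, 1.0, 0.0)):
    """Build a world-to-camera transform (rotation R and translation t)
    so that the camera sits at the origin looking along +z."""
    f = np.asarray(target, float) - np.asarray(camera_pos, float)
    f = f / np.linalg.norm(f)               # view direction: camera z axis
    r = np.cross(np.asarray(up, float), f)  # camera x axis
    r = r / np.linalg.norm(r)
    u = np.cross(f, r)                      # camera y axis
    R = np.stack([r, u, f])                 # rows are the camera basis vectors
    t = -R @ np.asarray(camera_pos, float)
    return R, t

def project(point_world, R, t, focal=1.0):
    """Transform a world-space point into camera space, then perspective
    project onto the screen: x and y are scaled by focal / z."""
    p = R @ np.asarray(point_world, float) + t  # step 1: camera space
    return (focal * p[0] / p[2], focal * p[1] / p[2])  # step 2: screen space
```

The division by z is what makes the projection a perspective one: two points with different world-space x,y coordinates but proportionally different depths can land at the same screen position.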
In a motion estimation system processing a conventional video sequence, the movement of an object is considered to be the distance that the object's representation on the display screen moves between frames. The motion estimation process occurs entirely in screen space. In reality, the motion of an object on the display screen is determined by the motion of the object in the world space, the projection of that motion onto the screen, and also upon any change in the position and orientation of the camera. This is true for both video sequences and artificially generated sequences, but can present a particular problem in artificially generated sequences such as 3D games, where rapid motion is often combined with sudden changes in view direction. These camera movements cannot easily be predicted by the motion estimator, and motion estimation performance suffers as a result.
In order to render an artificial scene, the graphics engine responsible for creating the sequence of frames must have knowledge about objects in the scene as well as details about the camera position and orientation. While the position and motion of objects in a scene are usually unavailable outside of the graphics engine, it is common for graphics engines to provide an API (application programming interface) which allows some information to be made available to other applications. Conveniently, many APIs provide details of the camera location and orientation, often in the form of matrices describing the transformation from world to camera space, and the projection into screen space. It is also often possible to access depth (or ‘Z’) buffer information, which stores the depths of objects at each pixel position in the screen space rendered image.
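Given such camera matrices for two consecutive frames, together with the depth buffer value at a pixel, the screen-space displacement of a static scene point under a camera movement can be computed directly, as in the following sketch. The 3x3 rotation plus translation form, the focal length, and the function name are simplifying assumptions of this example; APIs more commonly supply 4x4 homogeneous matrices.

```python
import numpy as np

def predict_camera_motion_vector(sx, sy, depth, R_old, t_old, R_new, t_new,
                                 focal=1.0):
    """Predict where a static point seen at screen position (sx, sy), with
    camera-space depth `depth`, moves when the camera changes from the
    old pose (R_old, t_old) to the new pose (R_new, t_new).

    R_*, t_* are world-to-camera rotations and translations. The returned
    (dx, dy) screen displacement can serve as a motion vector candidate.
    """
    # Unproject: recover the camera-space point from screen position and depth.
    p_cam = np.array([sx * depth / focal, sy * depth / focal, depth])
    # Back to world space, inverting p_cam = R_old @ p_world + t_old.
    p_world = R_old.T @ (p_cam - t_old)
    # Into the new camera space, then project onto the screen.
    q = R_new @ p_world + t_new
    nx, ny = focal * q[0] / q[2], focal * q[1] / q[2]
    return (nx - sx, ny - sy)
```

A displacement computed this way depends only on the camera change and the depth buffer, so it can supply candidates for exactly the sudden view-direction changes that neighbour-based prediction handles poorly.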
Where the video sequence has been produced using a conventional 2D video camera, camera position and depth information is not normally available. Nevertheless, if this information, or an approximation to it, can be produced, then this invention may still be used to improve motion estimation. Possible approaches to approximating camera location, orientation and distance to objects in a scene may be derived using “Structure from Motion” techniques in the field of Computer Vision.