1. Field
The following relates to a method and apparatus for motion estimation for use in video sequences and, in particular, to methods associated with the introduction of candidate motion vectors taken from an external source.
2. Related Art
Motion estimation is used in various video techniques, and a wide range of methods for motion estimation are well known. One common method, known as block based motion estimation, will be used for illustration purposes in this document.
Block based motion estimation generally takes two or more consecutive frames from a video sequence and subdivides them into multiple regions known as blocks or macroblocks. In a motion search procedure, pixel data in each block in a frame is compared with pixel data from various candidate locations in a previous frame. The relative position of the candidate that gives the best match gives a vector that describes the motion in the scene at that block position. Collectively, the set of motion vectors at each block position in a frame is known as the motion vector field for that frame.
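By way of illustration only, the motion search described above can be sketched as an exhaustive search over a small window. The function names, the use of a sum of absolute differences as the match measure, the `radius` parameter and the sign convention (a vector `(dx, dy)` for a block at `(bx, by)` refers to the previous-frame area at `(bx - dx, by - dy)`) are assumptions made for this sketch and are not taken from any particular system.

```python
def sad(cur, prev, bx, by, px, py, size):
    """Sum of absolute differences between the block at (bx, by) in the
    current frame and the block-sized area at (px, py) in the previous frame."""
    return sum(
        abs(cur[by + j][bx + i] - prev[py + j][px + i])
        for j in range(size)
        for i in range(size)
    )


def motion_vector_field(cur, prev, size, radius):
    """Exhaustive search: for every block, test every offset within `radius`
    and keep the vector whose previous-frame area matches best."""
    height, width = len(cur), len(cur[0])
    field = {}
    for by in range(0, height - size + 1, size):
        for bx in range(0, width - size + 1, size):
            best = None
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Assumed convention: the vector points from the
                    # previous-frame area to the current block.
                    px, py = bx - dx, by - dy
                    if not (0 <= px <= width - size and 0 <= py <= height - size):
                        continue  # candidate area falls outside the frame
                    cost = sad(cur, prev, bx, by, px, py, size)
                    # Lower error wins; ties go to the shorter vector.
                    key = (cost, abs(dx) + abs(dy))
                    if best is None or key < best[0]:
                        best = (key, (dx, dy))
            field[(bx, by)] = best[1]
    return field
```

The returned dictionary, keyed by block position, corresponds to the motion vector field for the frame. The previous-frame area is addressed at pixel granularity, so it need not be block aligned.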
Video sequences typically comprise a series of non interlaced frames of video data, or a series of interlaced fields of video data. The interlaced sequences are produced by fields which carry data on alternate lines of a display, such that a first field will carry data for alternate lines, and a second field will carry data for the missing lines. The fields are thus spaced both temporally and spatially. Every alternate field in a sequence will carry data at the same spatial locations.
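The relationship between frames and interlaced fields may be illustrated with the following minimal sketch. The function names are hypothetical, and the assumption that even-numbered fields carry the even-numbered lines is made purely for the example; real sources may use either field order.

```python
def split_into_fields(frame):
    """Split a frame (a list of rows) into two fields.
    Each field is a list of (line_number, row) pairs on alternate lines."""
    even = [(y, row) for y, row in enumerate(frame) if y % 2 == 0]
    odd = [(y, row) for y, row in enumerate(frame) if y % 2 == 1]
    return even, odd


def lines_carried(field_index, height):
    """Display lines carried by the field at position `field_index` in a
    sequence, assuming even-numbered fields carry even-numbered lines."""
    return list(range(field_index % 2, height, 2))
```

Under this assumption, `lines_carried(0, h)` and `lines_carried(2, h)` are identical, reflecting that every alternate field carries data at the same spatial locations, whereas adjacent fields carry complementary lines.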
FIG. 1 illustrates a typical example of a block matching motion estimator. In all the figures, including FIG. 1, motion vectors are shown with the head of the arrow at the centre of the block to which the vector corresponds. The frames are divided into blocks, and an object 101 in the previous frame has moved to position 102 in the current frame. The previous position of the object is shown superimposed on the current frame as 103. Motion estimation is performed for blocks rather than for objects: a block of pixels in the current frame is matched with a block-sized pixel area in the previous frame, which is not necessarily block aligned. For example, block 104 is partially overlapped by the moving object 102, and has contents as illustrated at 105. Motion estimation for block 104, if it performs well, will find the pixel data area 106 in the previous frame, which can also be seen to contain the pixels illustrated in 105, i.e. a good match has been found. Superimposed back onto the current frame, the matching pixel data area is at 107. The motion vector associated with block 104 is therefore as illustrated by arrow 108.
Many block based motion estimators select their output motion vector by testing a set of motion vector candidates for a block using a method such as a sum of absolute differences (SAD) or mean of squared differences (MSD), to identify motion vectors which give the lowest error block matches. FIG. 2 illustrates the candidate evaluation process for the block 201 in the current frame which has pixel contents shown in 211. In this simple example system, three motion vector candidates 206, 207 and 208 are considered which correspond to candidate pixel data areas at locations 202, 203 and 204 in the previous frame. The pixel contents of these pixel data areas can be seen in 212, 213 and 214 respectively. It is apparent that the pixel data at location 202 provides the best match for block 201 and should therefore be selected as the best match/lowest difference candidate. Superimposed back onto the current frame, the matching pixel data area is at 205 and the associated motion vector is 206.
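The candidate evaluation process can be sketched as follows. This is an illustrative example only: the function names and the `metric` argument are hypothetical, candidate areas are assumed to lie wholly within the previous frame, and a vector `(dx, dy)` is assumed to refer to the previous-frame area at `(bx - dx, by - dy)`.

```python
def block_error(cur, prev, bx, by, vec, size, metric="sad"):
    """Matching error between the current block at (bx, by) and the
    previous-frame pixel data area addressed by the candidate vector."""
    dx, dy = vec
    px, py = bx - dx, by - dy  # assumed convention for locating the area
    diffs = [
        cur[by + j][bx + i] - prev[py + j][px + i]
        for j in range(size)
        for i in range(size)
    ]
    if metric == "sad":  # sum of absolute differences
        return sum(abs(d) for d in diffs)
    if metric == "msd":  # mean of squared differences
        return sum(d * d for d in diffs) / len(diffs)
    raise ValueError("unknown metric: " + metric)


def best_candidate(cur, prev, bx, by, candidates, size, metric="sad"):
    """Return the candidate vector giving the lowest error block match."""
    return min(
        candidates,
        key=lambda vec: block_error(cur, prev, bx, by, vec, size, metric),
    )
```

With three candidates, as in the example of FIG. 2, `best_candidate` evaluates three pixel data areas and returns the vector whose area differs least from the block.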
Different systems have different requirements of motion estimation. In a video encoder application, for example, the requirement is to form the most compact representation of a frame, by using motion vectors to reference pixel data from a previous frame from the sequence. These motion vectors generally focus on providing the “closest match” to a block of pixel data (or the lowest residual error), and while the resulting motion vectors are usually representative of the actual motion of objects in the scene, there is no requirement that this is always the case. In other applications, such as de-interlacing or frame rate conversion, where objects in the frame must be interpolated at intermediate positions between their locations in the source frames, it is more important that the motion vectors represent the “true motion” of objects in the scene, even if other distortions in the video mean that those vectors do not always give the closest match (or lowest residual error) between blocks of pixel data. By applying appropriate constraints to the candidate motion vectors during motion search, the results can be guided towards “closest match” or “true motion” as necessary.
Motion estimation and the vector fields produced can be generated using vastly different levels of computational resources. Encoders used by broadcasters, or for movie distribution, for example, may dedicate significant computational resources or extended offline processing time to producing vector fields of the highest quality. Conversely, many consumer level video pipelines, particularly those in handheld devices, must operate in real time and with significant limitations on the amount of computational resource (i.e. bandwidth, power and time) allocated to motion estimation. Consequently, in these systems it is impractical to apply exhaustive search and intensive optimization processes, and this typically results in sub-optimal motion vector fields being produced.
One common approach to achieving the highest quality motion vector field within a computational resource limited environment is to identify and test a small set of motion vector candidates for each block. The challenge is in identifying the smallest possible set of vector candidates while still retaining a high probability that the set includes one or more vector candidates that provide either a close pixel match or a true motion match, as required. Improving the set of candidate motion vectors either allows fewer motion vectors to be tested (improving efficiency) or increases the likelihood of a close pixel match or a true motion match being found (improving quality).
Motion vectors are known to be highly correlated both spatially and temporally with vectors in adjacent blocks, so these neighbouring vectors are often used as the basis of a motion estimator's set of vector candidates. A pseudo-random element may also be incorporated into the candidates to allow the system to improve its matches, or to adapt as the motion in the video changes. Where a block has motion that is not simply predicted by its neighbours, the pseudo-random perturbation of vector candidates can often predict the changes in motion. This method works well for slowly changing vector fields, but tends not to allow the motion estimator to detect or converge quickly upon new motions that are significantly different to the motion vector candidates stored in neighbouring blocks. A system relying on pseudo-randomness may wander towards the new motion over time, but is prone to becoming stuck in local minima, or converging so slowly that the motion has changed again by the time it gets there.
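One possible way of assembling such a candidate set is sketched below. The choice of neighbours (left and above in the current frame, co-located in the previous frame), the inclusion of the zero vector, the parameters `n_random` and `max_step`, and the representation of a vector field as a dictionary keyed by block position are all assumptions made for illustration.

```python
import random


def candidate_set(field_cur, field_prev, bx, by, size, rng,
                  n_random=2, max_step=1):
    """Build a small candidate set for the block at (bx, by) from
    neighbouring vectors plus pseudo-random perturbations of them."""
    candidates = [(0, 0)]  # zero vector, included here by assumption
    # Spatial neighbours already estimated in the current frame.
    for pos in ((bx - size, by), (bx, by - size)):
        if pos in field_cur:
            candidates.append(field_cur[pos])
    # Temporal neighbour: the same block position in the previous field.
    if (bx, by) in field_prev:
        candidates.append(field_prev[(bx, by)])
    # Pseudo-random element: small perturbations of existing candidates.
    base = list(candidates)
    for _ in range(n_random):
        vx, vy = rng.choice(base)
        candidates.append((vx + rng.randint(-max_step, max_step),
                           vy + rng.randint(-max_step, max_step)))
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(candidates))
```

Because each perturbation is limited to `max_step` around an existing candidate, a set built in this way can only approach a significantly different new motion a small step at a time, which illustrates the slow convergence described above.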
FIG. 3 shows a simplified example of a conventional video pipeline architecture. In this simplified architecture the decode block 310 decodes a compressed input bit stream 300 into a set of motion vectors 312 and a residual 311. When these are combined in the picture builder 313, a sequence of output images is produced. This sequence of output images may consist of either progressive frames or interlaced fields, depending upon the nature of the source. Interlaced fields are converted into progressive frames by a deinterlacer 320. High quality deinterlacing typically performs motion estimation 321 followed by a picture build 322 process in a manner which will be well known to those skilled in the art. Optionally, overlays 330 such as subtitles and/or a user interface may be added to the video sequence by first identifying the location of the overlay pixels to produce an overlay mask 331 and then, in the region defined by the overlay mask, compositing the overlay's pixel data and the original video pixel data in an overlay compositing engine 332. Finally, frame rate conversion 340 is performed to convert the input frame rate of the video sequence to the output frame rate required by the display 301. Frame rate conversion 340 typically requires motion estimation 341 and picture build 342 processes in a manner which will be well known to those skilled in the art.
It is apparent from the simple example system shown in FIG. 3 that multiple motion estimation processes, and other vector sources that are not motion estimators, may exist in a typical video pipeline. The decode block's motion vector field 312 will contain closest match/lowest error vectors that have typically been determined using significant computational resources external to the video pipeline. Motion estimation 321 in the deinterlace block 320 must determine true motion vectors using field data, and the motion estimation 341 in the frame rate conversion block 340 must determine true motion vectors using frame data. It is also possible for the overlay mask 331 to define regions of the frame with known motion vectors (e.g. either a static or animated overlay). Performing motion estimation at each of these locations seems inherently wasteful, and it is proposed that a motion estimator later in the pipeline could be improved by using the motion vectors generated earlier in the pipeline.
Conventional video pipeline systems tend not to reuse motion vectors, partly because the individual blocks (such as decode, deinterlace, overlays and frame rate conversion) tend to be designed independently, often coming from different vendors. Each of these blocks has little to no visibility of the internal workings of the other blocks in the video pipeline. More crucially, video pipeline systems tend not to reuse motion vectors from elsewhere in the pipeline because of the different requirements of the motion estimators available. For example, the motion vector field available at 312, containing vectors produced to identify the closest matching pixels (or lowest residual), has no requirement that the motion vectors represent the true motion of objects in the scene and is therefore a potentially poor source of vector candidates for both the deinterlace motion estimator 321 and the frame rate conversion motion estimator 341. While the deinterlace motion estimator 321 and the frame rate conversion motion estimator 341 both produce motion vector fields looking for the true motion of objects in the scene, one is working on field data and the other on frame data, giving rise to different, albeit related, motion vector fields.
FIG. 4 illustrates the issues that would be faced by a conventional video pipeline if it were to try to reuse motion vectors from another motion estimator in the pipeline or from an external source. Two consecutive frames 410 and 420 from a video sequence are shown at time instances t=−1 and t=0 respectively. In these frames an aeroplane located at 411 is flying diagonally down and right across the scene to location 421. Simultaneously the entire background (sun 412 and sky 413) is panning left behind the aeroplane (with the sun moving to location 422 and the sky moving to location 423). A true motion vector field for these frames is shown as 430, where the true motion of the aeroplane 411 is shown by the closed head arrow motion vectors, and the true motion of the background sun 412 and sky 413 is shown by the open head arrow motion vectors. Contrast these true motion vectors with a representative motion vector field 440 produced by a motion estimator requiring the closest matching pixels (with zero motion vectors shown as black dots). While some motion vectors happen to be the same as or similar to the true motion of the objects in the frame, there are clearly significant differences between the two motion vector fields. Specifically, the lack of detail in the sky regions 413 and 423 and the lack of detail in the centre of the sun object 412 and 422 have led to the selection of motion vectors that provide very good pixel matches between the two frames but which are not representative of the true motion of the objects. Also note that a typical encoder considers some vectors as being “easier” or “cheaper” to encode than others. This may lead to the selection of a motion vector that is not necessarily the best pixel match but can be represented in a more compact form.
For example, the block 424 containing the back of the plane 421 in frame 420 is matched sufficiently well by an area of pixels 414 containing the front of the plane 411 in frame 410, that the shorter vector is selected, even though the vector found is not representative of the true motion of the plane.
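The preference of an encoder for "cheaper" vectors can be illustrated by adding a vector cost term to the matching error. The following is a minimal sketch only: `vector_cost` is a hypothetical proxy that treats shorter vectors as cheaper, and the weighting `lam` is an assumed tuning parameter. Real encoders commonly code a vector relative to a prediction, which this sketch ignores.

```python
def vector_cost(vec):
    """Hypothetical proxy for the cost of encoding a vector:
    the sum of the absolute values of its components."""
    dx, dy = vec
    return abs(dx) + abs(dy)


def select_encoder_vector(candidates, lam):
    """Pick a vector from (vector, match_error) pairs by minimising
    match_error + lam * vector_cost(vector)."""
    best = min(candidates, key=lambda c: c[1] + lam * vector_cost(c[0]))
    return best[0]
```

For instance, given the pairs `((6, 4), 10)` and `((1, 0), 14)`, a weighting of `lam = 0` selects the better pixel match `(6, 4)`, whereas `lam = 2` gives costs of 30 and 16 respectively and selects the shorter vector `(1, 0)`, even though it is the poorer match.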
The differences in the requirements of the various motion vector fields present significant risks to their re-use in the later stages of a conventional video pipeline. Motion estimators looking for the true motion of objects typically include the quality of the pixel match as part of their vector candidate evaluation process. It is therefore inherently risky to include motion vectors that have very good pixel matches but do not represent true motion as they can confuse the vector selection process. Similarly, it may be wasteful to include motion vector candidates that identify the true motion of objects in a motion estimator trying to find the closest match (or lowest residual) because that motion vector candidate may have been better used in finding a local minimum or testing a vector candidate that could be represented in a more compact form.
The risks and inefficiencies of including motion vector candidates from an external source vector field are typically so high in a conventional video pipeline that such candidates are not used. One common exception is a transcoder, which will be well known to those skilled in the art. In a transcoder it is often known that the motion vector field present in the decode block will be suitable for use in a subsequent encode block. As the motion vectors provided by the decode process and the motion vectors required by the encode process have the same requirements, it is common for the decoder's motion vectors to be used directly by the encoder without requiring further motion estimation. The transcoder is a special case where an entire motion estimation process can be saved by integration of blocks in the video pipeline. When the decoder motion vectors are suitable for use in the encoder, there is no risk in using them directly, without further motion estimation. The absence of a second motion estimator in the system means that transcoder applications are outside the scope of this invention.