Digital video transmission, in particular streaming over communication channels such as the Internet, generally requires the video to be encoded prior to transmission. It is preferable to compress digital video in a way that minimises bandwidth usage whilst still delivering smooth video of adequate quality.
A number of widely used standards exist, such as MPEG-1 and MPEG-2, which specify the form of digital video encoding. However, these standards do not constrain every aspect of converting image sequences of digital video between uncompressed and compressed formats. Any of a number of proprietary functions may therefore be implemented within the framework of these established standards. As such, there is an opportunity for the design of encoder methods and systems to be modified and improved.
One area of digital video encoding is intra-frame coding, in which compression is applied to the information of a single frame of a video image sequence. This generally includes the application of techniques known in the art, such as:
a) shifting the frame from the RGB into the YCbCr colour space and reducing the chrominance information by up to a quarter;
b) applying a DCT to the frame and applying a quantisation matrix;
c) run-length amplitude/variable length coding the frame; and
d) using rate control to prevent buffer underflow/overflow.
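By way of illustration only, step a) above might be sketched as follows. The function names are hypothetical, the conversion coefficients are the full-range BT.601 values used by JFIF (an assumption; the text does not specify which variant is intended), and the subsampling shown is a simple 2×2 average that keeps one chrominance sample for every four pixels.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr.

    Assumption: full-range BT.601 coefficients as used by JFIF.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample_420(plane):
    """Average each 2x2 group of chrominance samples into one sample.

    Assumption: the plane is a list of equal-length rows whose width and
    height are both even. The result holds a quarter of the input samples.
    """
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]
```

The luminance plane is left at full resolution; only the Cb and Cr planes would be passed through a function such as `subsample_420`.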
Another area of digital video encoding is inter-frame coding, in which similarities between successive frames of an image sequence (i.e. the temporal redundancies) are exploited. In particular, an encoder can forward predict future frames (P-frames) from a starting intra-frame (I-frame). In addition, bi-directional interpolated prediction frames (B-frames) can be used to forward and backward interpolate the frames of a video sequence. In each case, encoding the temporal prediction information generally involves the use of a technique known as motion estimation.
Motion estimation involves comparing frames in a sequence and representing the change between the frames such that only the portions that are different from one frame to another need be transmitted. This analysis involves determining how portions of an image may have moved over time, between frames—a so-called ‘motion search’ or ‘motion estimation’. For example, for a video sequence showing an airplane moving across a uniformly blue sky, a motion search will be conducted to determine how the portion of the image containing the airplane changes from one frame to the next. If a video sequence involves a camera-pan of a stationary environment, for example of a garden, then there will be uniform change in displacement of almost every image portion in the same direction. If the video to be encoded displays players on a sports field heading in different directions at different speeds, each portion of one frame to the next will need to be tracked individually. In all cases, determining how individual image portions have moved allows as much image information as possible to be carried forward from one video frame to the next, reducing the amount of ‘new’ information that needs to be transmitted.
A macroblock represents a fundamental ‘portion’ of a video frame. Macroblocks are usually 16×16 pixels in size, although other block sizes (e.g. 8×16, 16×8, 8×8, 4×8, 8×4, and 4×4) are possible by regularly sub-dividing the fundamental 16×16 macroblock. Motion search is conducted within the YCbCr colour space on each luminance macroblock, one macroblock at a time, starting at the top left-hand macroblock, proceeding row-wise left to right, then top to bottom. For each macroblock, a two-dimensional spatial search is carried out to determine how each macroblock has changed its position over a series of frames. The change in the position of a macroblock from one frame to the next is encoded as a motion vector. Thus motion vectors can be used in mapping the spatial displacement of macroblocks from one video frame to the next.
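The scan order described above can be sketched as a short generator. This is a minimal illustration only: the name `macroblock_origins` is hypothetical, and the frame dimensions are assumed to be exact multiples of the macroblock size.

```python
def macroblock_origins(frame_width, frame_height, size=16):
    """Yield the (x, y) top-left corner of each macroblock in scan order.

    Order: start at the top-left macroblock, proceed left to right along
    a row, then move down to the next row.
    Assumption: frame_width and frame_height are multiples of size.
    """
    for y in range(0, frame_height, size):
        for x in range(0, frame_width, size):
            yield (x, y)
```

A motion search would then be carried out once for each origin yielded, using the luminance samples of the corresponding macroblock.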
How this spatial search is conducted is one of the aspects not constrained by the MPEG-1 or MPEG-2 standards, and this is the subject matter to which the present invention particularly relates.
When conducting a search, it is necessary to determine whether a good enough match has been made between one macroblock and the next. The quality of a match may be determined by calculating the difference between two macroblocks. One well-known measure of the difference is termed the ‘sum of absolute differences’ (SAD), the result of which is generally referred to as ‘distortion’. A challenge is to find the minimum distortion that will yield the appropriate motion vector for a given macroblock.
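As a minimal sketch of the measure just described, the sum of absolute differences between two equally sized blocks can be computed as shown below. The function name `sad` and the representation of a block as a list of rows of luminance values are assumptions made for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks.

    Each block is a list of rows of luminance samples. A result of 0
    means the blocks are identical; larger values mean more distortion.
    """
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )
```

For a 16×16 macroblock this amounts to 256 subtractions and absolute values per candidate position, which is why the number of positions examined dominates the cost of a motion search.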
The spatial search is usually confined to a small area surrounding the macroblock for which the search is being carried out. This is because it is computationally too expensive for an encoder to search the entirety of a frame for a match, especially if encoding needs to be performed in real-time. In addition, since larger motion vectors can take up significantly more bandwidth than smaller motion vectors, it can be better to transmit a smaller motion vector to a relatively distorted match than a larger motion vector to a better match.
For this reason, the conventional range of possible movement of a macroblock from one frame to the next is confined. For example, a standard 16×16 macroblock is generally confined to +/−16 pixels in the vertical and horizontal directions—corresponding to a search area of 48×48 pixels or a 33×33 search extent. In a so-called ‘exhaustive motion search’, every location within the 33×33 search extent is searched and the result yielding the minimum distortion is selected.
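The exhaustive search described above can be sketched as follows, under stated assumptions: frames are lists of rows of luminance values, distortion is measured by SAD, candidate positions falling outside the reference frame are skipped, and all function names are hypothetical. With the default range of 16 pixels, up to 33 × 33 = 1089 candidate positions are evaluated for each macroblock.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (left, top)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def exhaustive_search(cur, ref, top, left, size=16, search_range=16):
    """Test every displacement within +/- search_range pixels.

    Returns ((dx, dy), distortion) for the candidate in the reference
    frame that best matches the macroblock at (left, top) in the current
    frame. Assumption: candidates outside the frame are simply skipped.
    """
    height, width = len(ref), len(ref[0])
    target = get_block(cur, top, left, size)
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue  # candidate block would fall outside the frame
            cost = sad(target, get_block(ref, t, l, size))
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

The number of SAD evaluations is fixed by the search range rather than by the image content, so the running time of this sketch is predictable but high.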
Whilst the exhaustive motion search is comprehensive, it is not necessarily considered to be appropriate for applications for which the efficiency of encoding is a priority. As a result, less comprehensive motion searches that yield relatively good results tend to be more desirable. An example of such a non-comprehensive search is a ‘diamond motion search’.
The conventional diamond motion search is based on the premise that image portions within a video sequence will usually travel a very short distance, or no distance at all, from one frame to the next. As a result, the nearby locations surrounding a macroblock are searched first to see which yields a minimal distortion. From this it can be inferred which area is the most promising for further searching. Further searching is conducted in a similar manner, and so the diamond motion search gradually ‘zeroes in’ on a low distortion area.
In particular, the locations above, below, to the left and to the right of a macroblock (in a ‘diamond shape’) are analysed first. If the best location is the position below the original macroblock, then the next iteration of the search is conducted around that location, and the search continues until a better match cannot be found, i.e. until it appears that the best motion vector for a macroblock has been located.
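The iteration described above might be sketched as follows. This is an illustrative simplification rather than any particular encoder's implementation: it assumes a step of one pixel, SAD as the distortion measure, frames held as lists of rows of luminance values, and hypothetical function names throughout.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (left, top)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def diamond_search(cur, ref, top, left, size=16, search_range=16,
                   start=(0, 0)):
    """Repeatedly step to the best of the four neighbouring positions.

    Returns ((dx, dy), distortion). The search stops as soon as no
    neighbour improves on the current centre, so it may settle on a
    match that is not the best one within the search range.
    """
    height, width = len(ref), len(ref[0])
    target = get_block(cur, top, left, size)

    def cost_at(dx, dy):
        t, l = top + dy, left + dx
        if abs(dx) > search_range or abs(dy) > search_range:
            return None  # outside the permitted search range
        if t < 0 or l < 0 or t + size > height or l + size > width:
            return None  # outside the reference frame
        return sad(target, get_block(ref, t, l, size))

    best, best_cost = start, cost_at(*start)
    if best_cost is None:
        raise ValueError("start position lies outside the search area")
    while True:
        cx, cy = best
        improved = False
        # Examine the positions above, below, left and right of the centre.
        for ox, oy in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            cost = cost_at(cx + ox, cy + oy)
            if cost is not None and cost < best_cost:
                best, best_cost, improved = (cx + ox, cy + oy), cost, True
        if not improved:
            return best, best_cost
```

The `start` parameter is included so that a search could begin from a position other than zero displacement; the number of iterations depends on the image content.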
Whilst the diamond motion search is efficient, it suffers from the drawback of being susceptible to local minima of the distortion. This is because initially unpromising search locations are ignored thereafter, so the search may terminate prematurely, before the true best match has been found. Furthermore, the diamond motion search requires a check to be made following each iteration to determine the best candidate direction for further searching, which can be computationally demanding. In addition, it is a poor technique for video sequences containing rapidly moving objects for which macroblock translation may be large in comparison to the effective search area.
Another technique used to minimise the amount of information transmitted between frames is motion prediction. The information that this technique aims to minimise is that relating to motion vectors. It works on the assumption that a number of contiguous macroblocks within a frame are likely to have similar motion vectors. For example, in the above example in which there is a camera-pan across a stationary environment, the motion vectors of virtually all the macroblocks will be highly correlated. As a result, motion prediction takes into account the macroblocks for which motion vectors have already been calculated. Thus a ‘predicted motion vector’ can be used for subsequent macroblocks. To minimise bandwidth usage, the difference between the actual motion vector for that subsequent macroblock, and the predicted motion vector is transmitted. In particular, a ‘global motion vector’ may be set from which individual motion vectors deviate.
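A minimal sketch of this differential scheme is given below. The text does not state how the predicted motion vector is derived, so the component-wise median of already-computed neighbouring vectors used here is purely an assumption for illustration, as are the function names.

```python
def predict_mv(neighbour_mvs):
    """Predict a motion vector from vectors already calculated nearby.

    Assumption: the prediction is the component-wise median of the
    neighbours (upper median when the count is even), or (0, 0) when no
    neighbours are available.
    """
    if not neighbour_mvs:
        return (0, 0)
    xs = sorted(mv[0] for mv in neighbour_mvs)
    ys = sorted(mv[1] for mv in neighbour_mvs)
    mid = len(neighbour_mvs) // 2
    return (xs[mid], ys[mid])


def mv_difference(actual, predicted):
    """Encoder side: the difference that is transmitted."""
    return (actual[0] - predicted[0], actual[1] - predicted[1])


def reconstruct_mv(difference, predicted):
    """Decoder side: recover the actual vector from the difference."""
    return (difference[0] + predicted[0], difference[1] + predicted[1])
```

In a camera-pan, where neighbouring vectors are nearly identical, most differences would be at or near (0, 0) and therefore cheap to encode, provided that the decoder forms the same prediction from the same neighbours.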
Motion prediction may be used to seed macroblocks to be searched, and in the example of the diamond motion search, a preferred starting location for the diamond motion search may be decided as a result of the outcome of motion prediction for previous macroblocks. Whilst this approach generally yields an efficient outcome, this is not always the case. In particular, motion prediction is not necessarily desirable where the movements of adjacent macroblocks are not highly correlated, for example where many image objects are moving in different directions.
The development of technology in this area is focussed on providing more efficient algorithms that can determine a very good match for a macroblock whilst not necessarily needing to conduct an extensive search. This is so that a video stream encoder can perform encoding quickly and efficiently. However, this approach can be detrimental to the quality of a video stream in a number of cases in which there is quick and/or non-correlated movement of objects within a video sequence. This is not satisfactory within an operational environment in which broadcast quality is of great importance.
In particular, when a motion search is conducted, the number of steps taken to find the best match can vary significantly depending on the nature of the frames being encoded. This leads to non-determinism within such ‘efficient’ algorithms that is not necessarily suitable for encoding tasks demanding a guaranteed video quality, especially for applications such as real-time high-definition television broadcasting.
On the one hand, conducting an exhaustive motion search can guarantee the best possible quality encoding (therefore making the best use of the channel bandwidth). On the other hand, the more comprehensive the motion search, the more computationally expensive it is, and so the longer encoding can take. This is a significant consideration when there are hard time limits imposed on encoding tasks, for example during real-time encoding.
One solution to this problem is to provide hardware that is capable of conducting the computationally expensive calculations within the time constraints—for example, motion searching may be parallelised. However, such hardware comes at a cost of greater complexity and ‘silicon real-estate’ on a circuit board—and so a greater cost in financial terms as well.
Therefore, there is a need for a method and system for motion search that provides an optimal trade-off between hardware costs, timing constraints, quality of video, and bandwidth limitations.