Video compression is used in many current and emerging products. It has found applications in video-conferencing, video streaming, serial storage media, high definition television (HDTV), and broadcast television. These applications benefit from video compression because they may require less storage space for archived video information, less bandwidth for the transmission of the video information from one point to another, or a combination of both.
Over the years, several standards for video compression have emerged, such as the video-coding standards recommended by the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T), namely H.261, H.262, H.263 and the emerging H.264 standard, and the standards recommended by the International Organization for Standardization and the International Electrotechnical Commission (ISO/IEC), namely MPEG-1, MPEG-2 and MPEG-4. These standards allow interoperability between systems designed by different manufacturers.
Video is composed of a stream of individual pictures (or frames) made up of discrete areas known as picture elements or pixels. The pixels are organised into lines for display on a CRT or the like. Each pixel is represented as a set of values corresponding to the intensity levels of the luminance and chrominance components of a particular area of the picture. Compression is based mainly on the recognition that much of the information in one frame is present in the next frame and, therefore, by providing a signal based on the changes from frame to frame a much-reduced bandwidth is required. For the purpose of efficient coding of video, the pictures or frames can often be partitioned into individual blocks of 16 by 16 luminance pixels and blocks of 8 by 8 chrominance pixels, where a block of 16 by 16 luminance pixels and its corresponding two blocks of 8 by 8 chrominance pixels is called a “macroblock”. This practice simplifies the processing which needs to be done at each stage of the algorithm by an encoder or decoder. To encode a macroblock (or sub-macroblock partition) using motion-compensated prediction, an estimation is made of the amount of motion that is present in the block relative to the decoded pixel data in one or more reference frames (usually recently decoded frames) and the appropriate manner in which to convey the information from which the current frame may be reconstructed. The residual signal, which is the difference between the original pixel data for the macroblock (or sub-macroblock partition) and its prediction, is spatially transformed and the resulting transform coefficients are quantized before being entropy coded. The basic processing blocks of an encoder are a motion estimator/compensator/predictor, a transform, a quantizer and an entropy coder.
Motion vectors are transmitted in the bitstream in order to convey information about the motion within a video sequence and provide an efficient coded representation of the video. Each motion vector conveys the translational motion information for a rectangular block in the current picture with respect to a set of blocks from previously coded and stored reference pictures. For efficient compression, motion vectors are coded differentially by forming a prediction for each motion vector based on motion information of previously coded neighbouring blocks and then transmitting only the difference between the actual motion vector and its prediction. The motion vector prediction is formed identically in both the encoder and decoder. In the encoder, the motion vector difference value to transmit is computed by subtracting the prediction from the actual motion vector. In the decoder, the decoded motion vector difference value is added to the motion vector prediction value in order to compute the actual motion vector.
In general, motion vector values of spatially adjacent blocks (neighbours) that have been previously decoded are used as the basis to form the motion vector prediction of a given block. The order of blocks in a bitstream generally follows a raster-scan order, which begins with the upper-leftmost block in each picture, and proceeds horizontally from left to right across each row of the picture, with the rows being ordered sequentially from top to bottom. Therefore, motion vector values from spatially adjacent blocks that precede the current block in this raster scan order are located in the row of blocks above the current block, as well as to the left of the current block. Median prediction using 3 neighbours has become popular in several recent video standards since it has shown strong correlation with the motion vector being predicted, with moderate complexity. Most commonly, the motion vectors from the blocks immediately above, left, and above-right of the motion block with the vector being predicted are used as the inputs to the median operator to generate the predicted motion vector. The left block is defined to be the block containing the pixel immediately to the left of the leftmost pixel in the current block. The above block is the block containing the pixel immediately above the uppermost pixel in the current block. The above-right block is the block containing the pixel immediately above and to the right of the upper-rightmost pixel in the current block. Finally, the above-left block is the block containing the pixel immediately above and to the left of the upper-leftmost pixel in the current block.
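The neighbour definitions above can be restated in pixel coordinates. The following is a minimal sketch, assuming a coordinate system with its origin at the top-left pixel of the picture, x increasing to the right and y increasing downward. The function names `neighbour_pixels` and `is_inside_picture` are hypothetical. The sketch also assumes that "leftmost pixel" and "uppermost pixel" refer to the upper-left corner pixel of the current block, which the text does not state explicitly.

```python
def neighbour_pixels(x, y, width):
    """Return the pixel that identifies each neighbouring block.

    (x, y) is the upper-leftmost pixel of the current block and `width`
    is the block width in pixels (assumed coordinate convention: origin at
    the top-left of the picture, y increasing downward).
    """
    return {
        "left": (x - 1, y),                  # immediately left of the block
        "above": (x, y - 1),                 # immediately above the block
        "above_right": (x + width, y - 1),   # above and right of the upper-right pixel
        "above_left": (x - 1, y - 1),        # above and left of the upper-left pixel
    }


def is_inside_picture(pixel, pic_width, pic_height):
    """A neighbour can only be available if its pixel lies inside the picture."""
    px, py = pixel
    return 0 <= px < pic_width and 0 <= py < pic_height


# A 16x16 block whose upper-left pixel is at (16, 16) in a 64x48 picture.
pixels = neighbour_pixels(16, 16, 16)
inside = {name: is_inside_picture(p, 64, 48) for name, p in pixels.items()}
```

For a block in the top row or leftmost column, some of these pixels fall outside the picture, so the corresponding neighbour is not available for prediction.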
As an illustrative example of median motion vector prediction, consider the following array of blocks and their corresponding motion vector values.
    Above-Left: (9, 7)  |  Above: (8, 5)    |  Above-Right: (0, −2)
    Left: (6, 4)        |  Current: (9, 6)  |

Each motion vector is expressed with the horizontal component followed by the vertical component of the vector. At the encoder, the predicted motion vector is computed and then the difference between the prediction and the current vector, which in this case is (9, 6), is transmitted in the bitstream. Assuming that the motion vectors from the left, above, and above-right blocks are used to form the prediction, the predicted motion vector is computed by taking a component-wise median of these input motion vectors as follows:

    Horizontal Component = Median(6, 8, 0) = 6
    Vertical Component = Median(4, 5, −2) = 4

Therefore the predicted motion vector is (6, 4) and the difference motion vector that is transmitted in the bitstream is computed as (9, 6) − (6, 4) = (3, 2).
At the decoder, the predicted motion vector is computed identically, since the motion vectors that are used as input to the prediction have already been decoded due to the raster-scan order transmission. The current motion vector is reconstructed by adding the difference motion vector to the predicted motion vector.

    MV = Predicted MV + Difference MV = (6, 4) + (3, 2) = (9, 6)

Note that the motion vector prediction can be formed in a number of different ways depending on the specific coding standard being employed as well as the availability of neighbouring blocks. For example, in the H.264 video coding standard, if the above-right block is not available because it is beyond the picture boundaries, it may be replaced by the above-left block. In another example, if the current block is in the top row of the picture, only the left motion vector is available for prediction and this value is used directly as the predicted motion vector.
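The encoder and decoder steps of the example above can be sketched as a short round trip. This is an illustration only: the function names (`median3`, `predict_mv`, `encode_mv`, `decode_mv`) are hypothetical, and motion vectors are represented as (horizontal, vertical) tuples.

```python
def median3(a, b, c):
    """Median of three scalar values."""
    return sorted((a, b, c))[1]


def predict_mv(left, above, above_right):
    """Component-wise median of three neighbouring motion vectors."""
    return (median3(left[0], above[0], above_right[0]),
            median3(left[1], above[1], above_right[1]))


def encode_mv(actual, prediction):
    """Encoder side: the difference to transmit is actual minus prediction."""
    return (actual[0] - prediction[0], actual[1] - prediction[1])


def decode_mv(difference, prediction):
    """Decoder side: the actual vector is prediction plus decoded difference."""
    return (prediction[0] + difference[0], prediction[1] + difference[1])


# Values from the example: left, above, above-right neighbours and current block.
left, above, above_right = (6, 4), (8, 5), (0, -2)
current = (9, 6)

prediction = predict_mv(left, above, above_right)   # formed identically on both sides
difference = encode_mv(current, prediction)         # transmitted in the bitstream
reconstructed = decode_mv(difference, prediction)   # recovered at the decoder
```

Because both sides form the same prediction from already decoded neighbours, only the difference needs to be transmitted for the decoder to recover the current vector exactly.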
In the popular MPEG-2 video coding standard, 3 types of pictures, known as Intra (I), Predicted (P) and Bi-directional (B) pictures, are allowed. These picture types are differentiated by the availability of options for forming motion-compensated predictions. In I-pictures, motion-compensated prediction is not permitted. Only Intra coding, which does not use prediction from any other picture, is permitted. In P-pictures, each macroblock can be Intra-coded or coded using motion-compensated prediction from a single block in the previously coded picture that is also temporally previous in capture/display order. This type of prediction is referred to as uni-prediction, since only one block is used to form the prediction. Furthermore, it is referred to as forward prediction, since the current picture is being predicted from a picture that precedes it temporally. Finally, in B-pictures, motion-compensated predictions can additionally be derived from one temporally subsequent picture that has already been coded. This is referred to as backward prediction, since the current picture is being predicted from a picture that follows it temporally, and it requires that the coding order of pictures be different from the display order. Also, a motion-compensated prediction block can be formed by averaging the samples from 2 reference blocks, one from the previous picture and one from the subsequent picture. This averaging of two blocks is referred to as bi-prediction, and since the predictions are derived from two different temporal directions, it is also referred to as bi-directional. To summarize, in B-pictures, the prediction of each block can either be derived from a single block in the temporally previous picture (forward uni-prediction), a single block in the temporally subsequent picture (backward uni-prediction) or the average of two blocks, one from each of these two pictures (bi-directional bi-prediction).
The recent H.264 video coding standard allows similar prediction modes, but the use of reference pictures is much more flexible and generalized. First, the available prediction modes are not required to be the same for an entire picture, as in MPEG-2, but can be changed from slice to slice, where each slice contains a subset of the macroblocks of a picture. Thus, the H.264 standard refers to I-, P-, and B-slices, rather than I-, P-, and B-pictures, since different slice types can be mixed within a picture. Similar to the MPEG-2 standard, in I-slices, all blocks are Intra-coded without reference to any other picture in the video sequence. In P-slices, blocks can be Intra-coded or coded with motion-compensated prediction from a single block in a previously coded picture (uni-prediction). In B-slices, bi-prediction, where a block is predicted from two blocks from previously coded pictures, is additionally permitted. However, the constraints in the MPEG-2 standard, which restrict which previously coded pictures can be used to predict the current picture, are greatly relaxed in the H.264 standard.
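The averaging step of bi-prediction can be sketched as follows. This is a minimal illustration rather than the normative MPEG-2 procedure: the function name `bi_predict`, the sample values, and the round-half-up rule used for odd sums are assumptions of the sketch.

```python
def bi_predict(forward_block, backward_block):
    """Average two equally sized reference blocks sample by sample.

    Each block is a list of rows of non-negative integer samples.
    Rounding half up via (a + b + 1) // 2 is an assumption of this sketch.
    """
    return [
        [(f + b + 1) // 2 for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(forward_block, backward_block)
    ]


# Hypothetical 2x2 blocks taken from the previous and the subsequent picture.
forward = [[100, 102], [104, 106]]
backward = [[110, 103], [100, 106]]
averaged = bi_predict(forward, backward)
```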
In MPEG-2, a maximum of one reference picture is available for predicting P-pictures, and a maximum of two reference pictures is available for predicting B-pictures. However, H.264 specifies a generalized buffer of multiple previously coded pictures that have been designated as reference pictures and are identically stored in both the encoder and decoder from which predictions can be derived for both P- and B-slices. For each motion-compensated prediction, a reference picture selection is included in the bitstream along with the spatial motion vector information. The reference picture selection specifies the picture from the set of available pictures from which the motion-compensated prediction is derived. Depending on the relationship between the coding order of pictures and the display order of the pictures, different possibilities exist for the prediction of each picture in terms of the temporal direction of the available reference pictures. In the most general case, the reference picture buffer contains multiple pictures that are temporally previous to the current picture in display order, as well as multiple pictures that are temporally subsequent to the current picture in display order. In this general case, a bi-predictive block in a B-slice can be derived from one block in each of the two temporal directions (as in the MPEG-2 case), or from two blocks from temporally previous reference pictures (possibly the same picture), or two blocks from temporally subsequent reference pictures (possibly the same picture).
For the purpose of coding the reference picture selections, the pictures in the reference picture buffer are organized into two ordered sets of pictures. It is possible that the same reference picture is included in both sets. In the terminology of the H.264 standard, these ordered sets of pictures are referred to as List 0 and List 1. The reference picture from which a motion-compensated prediction is derived is specified by transmitting an index into one of these ordered sets in the bitstream. In P-slices, only the List 0 set may be used. All motion-compensated blocks use uni-prediction from List 0. In B-slices, both List 0 and List 1 can be used. Each block can either be uni-predicted from a single picture in List 0 or from a single picture in List 1, or a block can be bi-predicted by selecting one picture from List 0 and one picture from List 1. Since the same pictures may appear in both lists, it is possible that in the bi-predictive case, the List 0 picture and the List 1 picture are actually the same picture. Thus, it is also possible that both pictures used to form a bi-predicted block are in the same temporal direction from the current picture. Most commonly, List 0 is used for referencing pictures that are temporally previous, and List 1 is used for reference pictures that are temporally subsequent, but in many cases, the use of the two lists is not restricted to following this convention.
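The relationship between slice types, the two lists, and the transmitted index can be sketched as a small lookup. Everything named here is hypothetical: the `ALLOWED_LISTS` table, the `resolve_reference` function, and the particular ordering of display order numbers within each list are illustrative assumptions and do not reproduce the list construction rules of the standard.

```python
# Which reference picture lists each slice type may use.
ALLOWED_LISTS = {"I": (), "P": (0,), "B": (0, 1)}


def resolve_reference(ref_lists, slice_type, list_number, index):
    """Map a transmitted (list, index) pair to a reference picture.

    `ref_lists` maps a list number to an ordered list of display order
    numbers; the same picture may appear in both lists.
    """
    if list_number not in ALLOWED_LISTS[slice_type]:
        raise ValueError(f"List {list_number} is not usable in {slice_type}-slices")
    return ref_lists[list_number][index]


# Hypothetical buffer contents while coding picture #8.
ref_lists = {0: [6, 3, 9], 1: [9, 6, 3]}

# A bi-predicted block in a B-slice: one reference from each list.
ref0 = resolve_reference(ref_lists, "B", 0, 0)
ref1 = resolve_reference(ref_lists, "B", 1, 1)
same_picture = ref0 == ref1
```

In this hypothetical configuration both indices resolve to picture #6, which illustrates how the two references of a bi-predicted block can be the same picture and therefore lie in the same temporal direction.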
Prior standards that used motion vector prediction did not define a flexible multiple reference picture buffer as in H.264, and did not organize these pictures into two ordered sets of reference pictures. Thus, special consideration of the new cases that occur in H.264 relating to the temporal direction and list usage of the neighbouring motion vectors used to generate motion vector predictions must be made. This consideration should strike a balance between complexity and coding efficiency.
The prior art method of specifying motion vector prediction in a codec with two lists of reference pictures and a flexible reference picture buffer is described in the document “Joint Final Committee Draft (JFCD) of Joint Video Specification (ITU-T Rec. H.264|ISO/IEC 14496-10 AVC)” by the Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. In the prior art method, motion vector prediction in B-slices is specified in a way that entails high computational complexity and complex data-dependencies. Motion vectors from neighbouring blocks are selected based on the temporal direction of the reference pictures used with these motion vectors in relation to the temporal direction of the reference picture used with the motion vector being predicted. Only forward motion vectors (referencing temporally previous pictures) are used to predict forward motion vectors, and only backward motion vectors (referencing temporally subsequent pictures) are used to predict backward motion vectors. If a neighbouring block does not have a motion vector referring to a reference picture in the same temporal direction as the current motion vector, both components of the motion vector predictor for that neighbouring block are set to zero for the purpose of prediction. If there is only one neighbouring motion vector that uses a reference picture in the same temporal direction as the current vector, the motion vector prediction is set equal to this motion vector and other neighbours are ignored. The prediction for a vector in either of the two lists could come from a neighbouring prediction using the same list, or the opposite list, depending on the temporal direction and relative temporal distances of the reference pictures.
Furthermore, in the prior-art method, a special case known as “scaled motion vector prediction” is used to generate the prediction for a List 1 motion vector for bi-predicted blocks in which both the List 0 and List 1 reference pictures are in the same temporal direction. In this case, the prediction of the List 1 motion vector is computed by temporally scaling the decoded List 0 motion vector from the same block, based on the relative temporal distances between the reference pictures to which these two motion vectors refer.
Prediction in P-slices, where only uni-prediction using List 0 is used, is much simpler. Here, temporal direction has no effect on the generation of the median motion vector prediction. Three spatial neighbours are used regardless of the temporal direction of the reference pictures that they refer to with respect to the temporal direction of the reference picture referred to by the current block.
The following examples illustrate the motion vector prediction process in B-slices in the prior art method. In the figures below, each motion vector is expressed using the following notation:

    [ListNumber]: (Horizontal Component, Vertical Component), #ReferencePicture

where ListNumber is either 0 or 1, and indicates the list of reference pictures to which the motion vector refers, and ReferencePicture indicates the display order number of the reference picture used to generate the motion-compensated prediction for the block.

    Above                 |  Above-Right
    [0]: (5, 0), #6       |  [0]: (5, 2), #6
    [1]: N/A              |  [1]: N/A

    Left                  |  Current
    [0]: (9, 1), #3       |  [0]: refers to #6
    [1]: (−4, 0), #9      |  [1]: refers to #9

In this example, the above and above-right blocks only use List 0 prediction, so there are no List 1 motion vectors in these blocks. In the current block being predicted, the actual motion vector value is unknown in the decoder and not relevant for generating the prediction, but the reference pictures used for the prediction are required. The display order number of the current picture is also needed, since this determines the temporal direction in which each of the neighbouring motion vectors points. In this example, assume that the current picture is #8.
Since the current picture is #8, lower numbered pictures are temporally previous and higher numbered pictures are temporally subsequent to the current picture. The List 0 motion vector in the current block refers to picture #6, which is temporally previous (forward prediction), so, in the prior-art method, only neighbouring motion vectors that also refer to temporally previous pictures will be used to form the prediction of the current List 0 motion vector. In this case, all of the List 0 motion vectors from the above, above-right, and left blocks use forward prediction, so these 3 motion vectors are used to compute the median motion vector. The resulting median motion vector is equal to (5, 1), which is the component-wise median of (9, 1), (5, 0), and (5, 2).
The List 1 motion vector in the current block refers to picture #9, which is a temporally subsequent picture (backward prediction). Thus, only backward motion vectors from the neighbouring blocks are used to compute the median motion vector. In this case, only the left block contains a motion vector (−4, 0) that is in the same temporal direction. Thus, the motion vector prediction value is equal to (−4, 0).
A second illustrative example of the prior-art motion vector prediction method is given in the figure below.
    Above                 |  Above-Right
    [0]: (5, 0), #6       |  [0]: (5, 2), #6
    [1]: (6, −1), #6      |  [1]: N/A

    Left                  |  Current
    [0]: (12, 0), #3      |  [0]: refers to #6
    [1]: (9, 1), #6       |  [1]: refers to #3

In this example, assume that the current picture being encoded is #9 in display order. Note that in this example, all of the motion vectors refer to temporally previous reference pictures. The List 0 motion vector in the current block refers to picture #6. The above and left blocks both contain 2 motion vectors that refer to pictures in the same temporal direction as the List 0 vector in the current block, so additional criteria are specified in the prior art to select between these. In the case that exists in the above block where both motion vectors refer to the same picture as the current motion vector, the prior art specifies that the List 0 motion vector (5, 0) be selected. From the left block, both motion vectors refer to pictures that are in the same temporal direction, but refer to different reference pictures. The prior art specifies that the motion vector referring to the temporally closest picture be selected, which in this case is the List 1 motion vector (9, 1). The above-right block contains one motion vector that is in the same temporal direction as the current motion vector. This vector, with a value of (5, 2), is also used to compute the median motion vector. Therefore, the resulting median motion vector for the prediction of the List 0 motion vector of the current block is equal to (5, 1), which is the component-wise median of (9, 1), (5, 0), and (5, 2).
In this example of the prior-art method, the prediction of the List 1 motion vector in the current block uses the special case of scaled motion vector prediction instead of median prediction. In this case, the prediction for the List 1 motion vector is computed by scaling the List 0 motion vector from the current block, based on the relative temporal distances of the reference pictures referred to by these two motion vectors. The temporal distance from the current picture (#9) to the picture referred to by the List 0 motion vector (#6) is equal to 3. The temporal distance from the current picture to the picture referred to by the List 1 motion vector (#3) is equal to 6. Since the temporal distance for the List 1 motion vector is double that of the List 0 motion vector, the scaled motion vector prediction of the List 1 motion vector is equal to the reconstructed List 0 motion vector multiplied by 2.
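The scaling arithmetic described above can be sketched as follows. The function name `scale_mv` is hypothetical, and the reconstructed List 0 motion vector (4, −2) is an assumed value chosen for illustration, since the example does not state it. The text does not describe how non-integer results are rounded, so the use of Python's built-in `round` is also an assumption.

```python
def scale_mv(mv_list0, current_pic, ref_pic_list0, ref_pic_list1):
    """Scale a List 0 motion vector by the ratio of temporal distances.

    Pictures are identified by display order number. The division below is
    the operation that makes this special case costly in the prior art.
    """
    d0 = current_pic - ref_pic_list0   # temporal distance for List 0
    d1 = current_pic - ref_pic_list1   # temporal distance for List 1
    if d0 == 0:
        raise ValueError("List 0 reference must differ from the current picture")
    return tuple(round(c * d1 / d0) for c in mv_list0)


# Current picture #9, List 0 refers to #6 (distance 3), List 1 refers to #3 (distance 6).
assumed_list0_mv = (4, -2)   # hypothetical reconstructed List 0 vector
list1_prediction = scale_mv(assumed_list0_mv, 9, 6, 3)
```

With these distances the scale factor is 6 / 3 = 2, so the predicted List 1 vector is twice the assumed List 0 vector.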
A third illustrative example of the prior-art motion vector prediction method is given in the figure below.
    Above                 |  Above-Right
    [0]: (5, 0), #6       |  [0]: (5, 2), #6
    [1]: (−6, −1), #9     |  [1]: (−8, −3), #9

    Left                  |  Current
    [0]: (12, 0), #3      |  [0]: refers to #6
    [1]: (9, 1), #6       |  [1]: refers to #9

In this example, assume that the current picture being encoded is #8 in display order. The List 0 motion vector in the current block refers to picture #6, which is a temporally previous picture. The left block contains 2 motion vectors that refer to pictures in the same temporal direction as the List 0 vector in the current block, so additional criteria are specified in the prior art to select between these. As in the previous example, the prior art specifies that the motion vector from the left block that refers to the temporally closest picture will be selected, which in this case is the List 1 motion vector (9, 1). The above and above-right blocks each only contain one motion vector that refers to a temporally previous reference picture, so these motion vectors are selected for input to the median filter. The resulting median motion vector is equal to (5, 1), which is the component-wise median of (9, 1), (5, 0), and (5, 2).
For the prediction of the List 1 motion vector, which refers to reference picture #9, only motion vectors that refer to temporally subsequent reference pictures are selected. Since the left block does not contain such a motion vector, a motion vector with value (0, 0) is used as input to the median operator in its place. The above and above-right blocks each only contain one motion vector that refers to a temporally subsequent reference picture, so these motion vectors are selected for input to the median filter. The resulting median motion vector is equal to (−6, −1), which is the component-wise median of (0, 0), (−6, −1), and (−8, −3).
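The selection rules walked through in the three examples can be collected into one sketch. It is a simplified reading of the prose above, not the procedure of the draft standard itself: the function names and the data layout (each block maps a list number to a `(motion_vector, reference_picture)` pair or `None`) are hypothetical, and resolving every equal-distance tie in favour of List 0 is an assumption that generalizes the "same picture" rule from the second example. The scaled prediction special case is not covered.

```python
def median3(a, b, c):
    return sorted((a, b, c))[1]


def select_candidate(block, target_ref, current_pic):
    """Pick one motion vector from a neighbouring block, or None.

    Only vectors whose reference picture lies in the same temporal direction
    as `target_ref` qualify; the temporally closest one wins, and ties go to
    the lower list number (assumption).
    """
    candidates = []
    for list_number in sorted(block):
        entry = block[list_number]
        if entry is None:
            continue
        mv, ref = entry
        if (ref < current_pic) == (target_ref < current_pic):
            candidates.append((abs(current_pic - ref), list_number, mv))
    return min(candidates)[2] if candidates else None


def predict_prior_art(left, above, above_right, target_ref, current_pic):
    selected = [select_candidate(b, target_ref, current_pic)
                for b in (left, above, above_right)]
    available = [mv for mv in selected if mv is not None]
    if len(available) == 1:
        return available[0]          # single usable neighbour is used directly
    padded = [mv if mv is not None else (0, 0) for mv in selected]
    return (median3(*(mv[0] for mv in padded)),
            median3(*(mv[1] for mv in padded)))


# Third example: current picture #8.
left = {0: ((12, 0), 3), 1: ((9, 1), 6)}
above = {0: ((5, 0), 6), 1: ((-6, -1), 9)}
above_right = {0: ((5, 2), 6), 1: ((-8, -3), 9)}

list0_prediction = predict_prior_art(left, above, above_right, 6, 8)
list1_prediction = predict_prior_art(left, above, above_right, 9, 8)
```

Even in this reduced form, each neighbour requires direction tests and distance comparisons for both lists before the median can be taken.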
The above examples illustrate the prior-art method used for selecting the neighbouring motion vectors that are used to form the motion vector prediction in the draft H.264 video coding standard. The major disadvantage of this method is that its complexity is high. The selection of the neighbouring motion vectors based on their temporal direction and relative temporal distances requires a large number of conditions to be tested in order to determine which vectors will be used to form the prediction. Moreover, the computation of the scaled motion vector prediction requires complex division operations to be performed in order to temporally scale the decoded List 0 motion vector to generate the prediction for the List 1 motion vector. Finally, the fact that the motion vector predictions of each list are dependent upon the motion vectors in the other list requires that the motion vectors for both lists in each partition of a macroblock be decoded sequentially, rather than computing all of the List 0 motion vectors for an entire macroblock, followed by all of the List 1 motion vectors for that macroblock (or vice-versa).
It is an object of the present invention to provide a method of selecting neighbouring motion vectors for use in motion vector prediction to obviate or mitigate some of the above-presented disadvantages.