It is well known that multi-hypothesis motion compensation, as described in a first prior art approach, can provide considerable benefits within motion compensated video encoders and decoders. More specifically, bi-predictive (B) slices (or bi-predictive pictures in older standards and recommendations), which consider two hypotheses, are in general the most efficiently coded slices within the CODEC for the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group-4 (MPEG-4) Part 10 Advanced Video Coding (AVC) standard/International Telecommunication Union, Telecommunication Sector (ITU-T) H.264 recommendation (hereinafter the “MPEG-4 AVC standard”). This behavior is due to these slices being able to exploit the temporal correlation that exists within a sequence more efficiently by linearly combining two or more hypotheses, so as to reduce their associated error. In B slices, a macroblock or block is coded in such a way that it can be predicted either by one prediction (list0 or list1) or by the linear combination of two predictions (list0 and list1), while associated weights for each list can provide additional benefits in the presence of fades or cross-fades. To perform this prediction, the decoder requires only that one or two motion vectors (MVs), depending on the prediction type, and their associated references (one for each associated list) are transmitted within the bitstream, or that these are inferred, as in the case of direct modes.
In most encoders, such as, for example, the Joint Video Team (JVT) reference software version JM7.4, motion estimation for B slices (and for multi-hypothesis coding in general) apparently considers each candidate reference within the available prediction lists separately and makes no specific assumption for bi-prediction (or multi-hypothesis prediction, respectively). For each prediction list, the encoder calculates the best single-prediction motion vectors and then, using these candidate motion vectors, generates an additional set of bi-predictive candidates. These candidates are later used within a final mode decision, which decides which mode (single prediction, multiple prediction, or even intra) will be used.
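The candidate generation described above can be illustrated with a minimal sketch. It assumes that `pred0` and `pred1` are the motion compensated blocks already found by the independent list0 and list1 searches; the function names and the use of plain SAD as the only cost are hypothetical simplifications, since a practical mode decision would typically also account for the bits needed to signal each mode.

```python
from typing import Dict, List, Tuple

Block = List[List[int]]


def block_sad(a: Block, b: Block) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def average(p0: Block, p1: Block) -> Block:
    """Equal-weight bi-prediction using round-half-up integer averaging."""
    return [[(u + v + 1) // 2 for u, v in zip(r0, r1)] for r0, r1 in zip(p0, p1)]


def choose_mode(cur: Block, pred0: Block, pred1: Block) -> Tuple[str, Dict[str, int]]:
    """Compare the two single-list candidates with the derived bi-predictive one."""
    candidates = {"list0": pred0, "list1": pred1, "bi": average(pred0, pred1)}
    costs = {name: block_sad(cur, pred) for name, pred in candidates.items()}
    # Ties resolve to the first entry in insertion order (list0, list1, bi).
    return min(costs, key=costs.get), costs
```

Note that the bi-predictive candidate here simply reuses the vectors found for each list on its own, so its cost is only as good as those independently chosen vectors allow.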
Turning to FIG. 1, a video encoder that uses bi-prediction is indicated generally by the reference numeral 100. A non-inverting input of a combiner 105, a first input of a Mode Decision (MD) & Motion Compensation (MC) 175, a first input of a motion estimator (ME) 165, and a first input of a motion estimator 170 are available as inputs to the video encoder. An output of the combiner 105 is connected in signal communication with an input of a transformer 110. An output of the transformer 110 is connected in signal communication with an input of a quantizer 115. An output of the quantizer 115 is connected in signal communication with an input of a variable length coder (VLC) 120. An output of the VLC 120 is available as an output of the video encoder 100.
The output of the quantizer 115 is also connected in signal communication with an input of an inverse quantizer 125. An output of the inverse quantizer 125 is connected in signal communication with an input of an inverse transformer 130. An output of the inverse transformer 130 is connected in signal communication with a first non-inverting input of a combiner 180. An output of the combiner 180 is connected in signal communication with an input of a loop filter 135. An output of the loop filter 135 is connected in signal communication with an input of a reference picture store 140. An output of the reference picture store 140 is connected in signal communication with an input of a List0 reference buffer 145 and with an input of a List1 reference buffer 150. A first output of the List0 reference buffer 145 is connected in signal communication with a first input of a multiplier 155. A first output of the List1 reference buffer 150 is connected in signal communication with a first input of a multiplier 160. A second output of the List0 reference buffer 145 and a second output of the List1 reference buffer 150 are connected in signal communication with a second input of the MD&MC 175. An output of the multiplier 155 is connected in signal communication with a second input of the motion estimator 165. An output of the multiplier 160 is connected in signal communication with a second input of the motion estimator 170. A first output of the MD&MC 175 is connected in signal communication with an inverting input of the combiner 105. A second output of the MD&MC 175 is connected in signal communication with a second non-inverting input of the combiner 180.
The above method is based on the assumption that these motion vectors are good enough to be used within bi-prediction. Unfortunately, this assumption is not always true, potentially resulting in a significant loss in efficiency. This is particularly true in the presence of cross-fades (dissolves), where overlapping objects from these images may have considerably different luminance characteristics and possibly different motion, so that considering each list separately could result in relatively poor performance. Therefore, it would be highly desirable to be able to jointly consider the available candidate references within the motion estimation phase, which could result in higher coding efficiency. On the other hand, this does not imply that each reference should not be considered separately, since single prediction could still provide good results, especially when considering that only a single set of motion vectors (mv0 and mv1 for references x and y, respectively) needs to be transmitted in such a case, which is very important at low bitrates.
It is well known that motion estimation for a single candidate is itself rather computationally expensive. If a full search approach were used with a search window of (±N,±M), (2N+1)×(2M+1) checking points would have to be tested. The brute force and, in a sense, optimal approach for bi-prediction would require (2N+1)^2×(2M+1)^2 checking points, which is prohibitive for any architecture. In the more general multi-hypothesis (k-prediction) case, (2N+1)^k×(2M+1)^k checking points would need to be tested. An alternative, considerably simpler, architecture was presented in the above-mentioned first prior art approach, where instead of the brute force method, an iterative approach was used in which each hypothesis was examined and refined sequentially by considering the previously estimated hypothesis.
This method could, for a bi-predictive case, be summarized as follows. Assume that the current picture is z, and that the two references under consideration are pictures x and y. For these pictures, weights a and b have been respectively selected for weighted prediction (i.e., for normal bi-prediction a=b=½). mv0 and mv1 are the motion vectors needed for motion compensation corresponding to the x and y references (or their weighted counterparts), respectively. For simplicity, weighting offsets are ignored in this process, although similar considerations could apply. In the following procedure, the SAD (sum of absolute differences) is used as the distortion measure.
Step 1. Set mv0=mv′0=mv1=mv′1=0.
Step 2. Form reference picture as ax.
Step 3. Perform motion estimation in ax to refine motion vectors mv0 using distortion SAD=|z−ax(mv0)−by(mv′1)|.
Step 4. Set mv′0=mv0.
Step 5. Form reference picture as by.
Step 6. Perform motion estimation in by to find motion vectors mv1 using distortion SAD=|z−by(mv1)−ax(mv′0)|.
Step 7. If (mv1==mv′1) exit.
Step 8. Set mv′1=mv1.
Step 9. Refine motion vectors mv0 in ax using distortion SAD=|z−ax(mv0)−by(mv′1)|.
Step 10. If (mv0==mv′0) exit.
Step 11. Set mv′0=mv0.
Step 12. Go to Step 6.
This may be generalized for the multiple hypothesis case. The problem with this method is that it may still require a large number of iterations. Moreover, while it is very likely that performance would be improved, it is also possible that the final prediction may not be the best possible, especially if the scheme is trapped at a local minimum. The implementation of this architecture is also rather complicated, especially considering that it is necessary to reconstruct a new hypothesis using motion compensation for every iteration of the algorithm. An alternative but rather similar approach was proposed in the above-mentioned first prior art approach, where the initial zero motion vectors in Step 1 were replaced with the motion vectors generated by considering each list independently.
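The iterative procedure of Steps 1 through 12 might be sketched as follows for a single block. This is an illustrative sketch under stated assumptions rather than the prior art implementation: pictures are lists of rows, vectors are whole-pixel (dx, dy) pairs, each refinement is a full search inside a ±`search` window that the caller keeps inside the picture, and the names `joint_sad`, `refine`, `iterative_bipred`, `init0`, `init1`, and the `max_iters` safety cap are all hypothetical. Passing independently estimated vectors as `init0` and `init1` corresponds to the alternative initialization mentioned above.

```python
from typing import List, Tuple

Frame = List[List[float]]
MV = Tuple[int, int]  # (dx, dy) in whole pixels


def joint_sad(z: Frame, x: Frame, y: Frame, bx: int, by: int, size: int,
              mv0: MV, mv1: MV, a: float, b: float) -> float:
    """SAD = sum of |z - a*x(mv0) - b*y(mv1)| over one size-by-size block."""
    total = 0.0
    for r in range(size):
        for c in range(size):
            p0 = x[by + mv0[1] + r][bx + mv0[0] + c]
            p1 = y[by + mv1[1] + r][bx + mv1[0] + c]
            total += abs(z[by + r][bx + c] - a * p0 - b * p1)
    return total


def refine(z: Frame, x: Frame, y: Frame, bx: int, by: int, size: int,
           search: int, start: MV, fixed: MV, refine_first: bool,
           a: float, b: float) -> MV:
    """Full search for one vector while the other hypothesis stays fixed."""
    def cost(mv: MV) -> float:
        mv0, mv1 = (mv, fixed) if refine_first else (fixed, mv)
        return joint_sad(z, x, y, bx, by, size, mv0, mv1, a, b)

    best_mv, best_cost = start, cost(start)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            trial = cost((dx, dy))
            if trial < best_cost:  # strict comparison: ties keep the current vector
                best_mv, best_cost = (dx, dy), trial
    return best_mv


def iterative_bipred(z: Frame, x: Frame, y: Frame, bx: int, by: int, size: int,
                     search: int, a: float = 0.5, b: float = 0.5,
                     init0: MV = (0, 0), init1: MV = (0, 0),
                     max_iters: int = 16) -> Tuple[MV, MV]:
    def run(start: MV, fixed: MV, first: bool) -> MV:
        return refine(z, x, y, bx, by, size, search, start, fixed, first, a, b)

    mv0 = prev0 = init0               # Step 1 (zero vectors by default)
    mv1 = prev1 = init1
    mv0 = run(mv0, prev1, True)       # Steps 2-3: search in a*x, y fixed
    prev0 = mv0                       # Step 4
    for _ in range(max_iters):        # hypothetical cap, not part of the steps
        mv1 = run(mv1, prev0, False)  # Steps 5-6: search in b*y, x fixed
        if mv1 == prev1:              # Step 7
            break
        prev1 = mv1                   # Step 8
        mv0 = run(mv0, prev1, True)   # Step 9
        if mv0 == prev0:              # Step 10
            break
        prev0 = mv0                   # Step 11; Step 12 loops back to Step 6
    return mv0, mv1
```

Because each call only lowers (or keeps) the joint SAD along one vector at a time, the result is never worse than the starting point, but it can still settle at a local minimum, which is the limitation noted above.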
Video compression encoders and decoders gain much of their compression efficiency by forming a prediction of the current picture (or slice) P_current that is to be encoded, and by additionally encoding the difference between this prediction and the current picture. The more closely correlated the prediction is to the current picture, the fewer the bits that are needed to compress that picture. It is therefore desirable for the best possible picture prediction to be formed. This prediction can be generated through either spatial prediction methods (intra coding) or temporal methods (inter coding).
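The relationship between prediction quality and the difference signal can be illustrated with a minimal sketch. It assumes a lossless path (the transform, quantization, and entropy coding stages are deliberately left out), and the helper names `residual`, `reconstruct`, and `energy` are hypothetical.

```python
from typing import List

Block = List[List[int]]


def residual(cur: Block, pred: Block) -> Block:
    """Difference signal that the encoder would go on to code."""
    return [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(cur, pred)]


def reconstruct(pred: Block, res: Block) -> Block:
    """Decoder side: add the decoded difference back onto the prediction."""
    return [[p + d for p, d in zip(rp, rd)] for rp, rd in zip(pred, res)]


def energy(res: Block) -> int:
    """Sum of squared residual samples, a rough proxy for coding cost."""
    return sum(d * d for row in res for d in row)
```

A prediction that tracks the current picture closely leaves a low-energy residual, which is the reason the quality of the prediction matters so much for the final bit count.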
Temporal prediction methods employ motion compensated techniques in order to generate the prediction reference. This is usually done by dividing the source image into non-overlapping blocks of size N×M and by finding the best match within a reference picture P_reference, using motion estimation techniques. This best match is associated with a set of motion parameters that are also encoded within the bitstream. Newer standards, such as the MPEG-4 AVC standard, also allow the consideration of multiple reference pictures for the estimation and selection of the best prediction, by signaling the index of the reference used together with the motion parameters. Such multi-reference encoders and decoders use a reference buffer, where each potential candidate reference is stored and accessed during encoding or decoding.
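A minimal sketch of block matching against a single reference picture is given below. It assumes whole-pixel vectors, square blocks, a full search over a ±`search` window, and SAD as the matching criterion; the names `sad` and `full_search` are hypothetical, and candidates that would read outside the reference picture are simply skipped.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]


def sad(cur: Frame, ref: Frame, bx: int, by: int, dx: int, dy: int, size: int) -> int:
    """SAD between the current block and the reference block displaced by (dx, dy)."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
    return total


def full_search(cur: Frame, ref: Frame, bx: int, by: int, size: int,
                search: int) -> Tuple[Tuple[int, int], Optional[int]]:
    """Test every displacement in the window and keep the lowest-SAD one."""
    height, width = len(ref), len(ref[0])
    best_mv: Tuple[int, int] = (0, 0)
    best_cost: Optional[int] = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + size <= height
                      and 0 <= bx + dx and bx + dx + size <= width)
            if not inside:
                continue  # skip candidates that leave the reference picture
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

With several reference pictures, the same search would be repeated per reference, and the index of the winning reference would be kept alongside the vector.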
An alternative method that can considerably improve performance is to consider not only a single reference picture at a time, but also combinations of multiple hypotheses, as is done in particular for bi-predictive (B) picture/slice coding. Here the prediction may be generated either by considering a single reference selected from a set of multiple references, or by linearly combining (i.e., performing a weighted average of) two available references. The latter requires that, if necessary, two different motion parameter sets, one corresponding to each reference, are estimated and transmitted. This concept can be generalized for encoders that consider more than two hypotheses, as is described in the above-mentioned first prior art approach. Other options that could also improve performance include weighted prediction, as suggested in a second prior art approach, where a different weighting factor can be applied to each hypothesis, and more complicated motion models, such as global motion compensation techniques.
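The linear combination of two hypotheses can be sketched as a per-sample weighted average. The function name `bi_predict`, the floating-point weights, and the 8-bit clipping range are assumptions made for illustration; an actual codec would define this operation in integer arithmetic with its own rounding and offset rules.

```python
from typing import List

Block = List[List[int]]


def bi_predict(pred0: Block, pred1: Block, a: float = 0.5, b: float = 0.5,
               max_value: int = 255) -> Block:
    """Weighted combination a*pred0 + b*pred1, rounded and clipped per sample."""
    out: Block = []
    for row0, row1 in zip(pred0, pred1):
        row = []
        for p0, p1 in zip(row0, row1):
            value = int(a * p0 + b * p1 + 0.5)  # round half up for non-negative sums
            row.append(min(max(value, 0), max_value))
        out.append(row)
    return out
```

With the default a=b=½ this reduces to a plain average of the two references, while unequal weights such as a=0.75 and b=0.25 model a cross-fade in which the first reference dominates.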
Although the consideration of multi-hypothesis motion compensation could considerably improve the performance of a video codec, proper estimation of the motion parameters for this case is a very difficult problem. In particular, the optimal solution could be found by examining all possible prediction combinations with the available references, that is, by examining, for each possible motion vector in a reference, all other motion vectors and their combinations in the remaining references, which is computationally prohibitive.
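As a rough illustration of why such an exhaustive joint search is impractical, the following sketch counts the checking points for a full search window of (±N,±M) when one vector per hypothesis is searched jointly. The function name `checking_points` and the example window of ±16 are hypothetical choices.

```python
def checking_points(n: int, m: int, hypotheses: int = 1) -> int:
    """Candidate count for a joint full search: ((2N+1) * (2M+1)) ** hypotheses."""
    return ((2 * n + 1) * (2 * m + 1)) ** hypotheses


single = checking_points(16, 16)         # 33 * 33 = 1089 positions
joint_bi = checking_points(16, 16, 2)    # 1089 ** 2 = 1185921 combinations
```

Under these assumptions, a joint search over two references already tests more than a thousand times as many points per block as a single-reference search, and the count grows exponentially with each additional hypothesis.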