H.264, also known as Moving Picture Experts Group-4 (MPEG-4) Advanced Video Coding (AVC), is the state-of-the-art video coding standard. It is a hybrid codec that exploits redundancy both between frames and within a frame, and it combines a number of compression techniques to achieve good compression efficiency. The output of the encoding process is video coding layer (VCL) data, which is further encapsulated into network abstraction layer (NAL) units prior to transmission or storage.
H.264 is block-based, i.e. a video frame is processed in macroblock (MB) units, which are 16×16 pixel blocks that may be further divided into sub-macroblocks (sMB). In order to minimize the amount of data to be coded, a technique called motion compensation (MC) is applied to each non-intra pixel block. MC uses previously reconstructed pixel values in neighboring frames to predict the pixel values of the current pixel block as closely as possible. To obtain a prediction for the current pixel block, an area in the reference frame that is similar to the current pixel block is signaled in the bitstream. The final reconstruction is obtained by adding the predicted pixel values to the residual pixel values. In order to find the best match for the current pixel block in a reference frame, a motion search is usually performed at the encoder side. The search tries to find the lowest sum of squared differences (SSD) or sum of absolute differences (SAD) between the current pixel block and possible reference pixel blocks. The outcome of the motion search is a reference index signaling which reference frame is referred to and an offset vector, called a motion vector (MV), pointing to the reference area. MVs are an important and costly component of the video bitstream. For video coded with a high quantization parameter (QP), they can account for more than 50% of the bitrate.
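The motion search described above can be illustrated with a minimal full-search sketch based on SAD. The function names, the list-of-lists frame layout and the integer-pixel search window are assumptions made for illustration only; practical encoders typically use faster search patterns and sub-pixel refinement.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def extract_block(frame, top, left, size):
    """Return the size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def full_search(current, reference, top, left, size, search_range):
    """Exhaustive integer-pixel motion search (hypothetical helper).

    Returns (mv_x, mv_y, cost) for the offset with the lowest SAD inside
    a +/- search_range window, or None if no candidate fits in the frame.
    """
    height, width = len(reference), len(reference[0])
    target = extract_block(current, top, left, size)
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall partly outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(target, extract_block(reference, y, x, size))
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

The returned offset plays the role of the MV; replacing `sad` with a sum of squared differences would give the SSD variant of the same search.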
Motion Vector Coding
MVs are not coded directly into the bitstream since there are redundancies to exploit between MVs. Neighboring MVs are often highly correlated, and MVs with similar length and direction tend to cluster together. These clustered MVs can correspond to local motion, where an object is moving, or to global motion, such as panning. For each MV to be coded, an MV prediction is performed first so that only the difference between the MV and the MV predictor is coded, which reduces the amount of data. In H.264, a median predictor is generated by taking the median value of the MVs from the pixel blocks to the left, above and top-right. The process is carried out separately for the horizontal and vertical MV components.
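As a minimal sketch of the median prediction described above, the following computes the component-wise median of three neighboring MVs and the difference that would be coded. The function names and the (x, y) tuple representation are assumptions for illustration; the special cases that H.264 defines for unavailable neighbors and particular partition shapes are not modeled.

```python
def median_mv_predictor(mv_left, mv_above, mv_above_right):
    """Component-wise median of three neighboring MVs given as (x, y)."""
    neighbors = (mv_left, mv_above, mv_above_right)
    xs = sorted(mv[0] for mv in neighbors)
    ys = sorted(mv[1] for mv in neighbors)
    # The middle element of three sorted values is the median.
    return (xs[1], ys[1])


def mv_difference(mv, predictor):
    """Difference between the MV and its predictor, i.e. what gets coded."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])


# Illustrative values only, not taken from any figure.
predictor = median_mv_predictor((4, -2), (6, 0), (5, 7))
difference = mv_difference((5, 1), predictor)
```

Because the horizontal and vertical medians are taken independently, the resulting predictor need not equal any one of the three neighboring MVs.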
Multi-View Video Coding (MVC)
While “traditional” video services provide video in a single representation, i.e. from a fixed camera position, multi-view video representations have recently gained significant importance. A multi-view representation represents the content from different camera perspectives or views, a particular case being “stereoscopic video”, where the scene is captured by two cameras separated by the same or a similar distance as the human eyes. By using suitable display technologies to present the “stereoscopic” content, a perception of depth can be provided to the viewer.
MVC is a video coding standard that can be used to compress multi-view video representations. High compression efficiency is achieved by eliminating redundant information between different layers. MVC is based on the AVC standard and consequently MVC shares most of the AVC structure.
MVC Reference Picture List
The major difference between MVC and AVC is the reference picture list handling process. A reference picture list is a collection of pictures that can be used for prediction. The pictures are normally sorted in an order based on how close they are to the current frame. In AVC, all the reference pictures in the list are from the same view. In MVC, apart from reference pictures from the same view, there are also reference pictures from other views. Hence the first step of the MVC reference picture list construction process is exactly the same as in AVC, and the difference is that inter-view reference pictures are appended afterwards. For complexity reasons, MVC only allows frames at the same time instance from other views to be added to the list.
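The two-step list construction described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Picture` type and `build_reference_list` are hypothetical names, sorting is by absolute POC distance only, and the distinct initialization rules for P and B slices, long-term references and list reordering commands are not modeled.

```python
from typing import NamedTuple


class Picture(NamedTuple):
    view_id: int  # which camera view the picture belongs to
    poc: int      # picture order count (display order)


def build_reference_list(decoded, current):
    """Sketch of an MVC-style reference picture list for `current`."""
    # Step 1 (as in AVC): same-view pictures, closest in POC first.
    same_view = sorted(
        (p for p in decoded
         if p.view_id == current.view_id and p.poc != current.poc),
        key=lambda p: abs(current.poc - p.poc),
    )
    # Step 2 (MVC only): append pictures from other views, restricted
    # to the same time instance as the current picture.
    inter_view = [
        p for p in decoded
        if p.view_id != current.view_id and p.poc == current.poc
    ]
    return same_view + inter_view
```

For example, with decoded pictures at POC 0 to 2 in view 0 and POC 0 to 1 in view 1, the list for the picture at POC 2 in view 1 would hold the two same-view pictures followed by the view 0 picture at POC 2.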
High Efficiency Video Coding (HEVC)
HEVC is a next-generation video coding standard that is currently undergoing standardization. HEVC aims to substantially improve coding efficiency compared to AVC, especially for high-resolution video sequences. The initial focus of the HEVC development is on mono video, i.e. a single view.
Motion Vector Competition
The median MV predictor in H.264 is inefficient in many cases. VCEG Contribution [1] described a new technique denoted motion vector competition. The key concept of this technique is to take the MVs from neighboring pixel blocks, which are often highly correlated with the current MV, to form a list of candidate MVs. The neighboring pixel blocks can be either spatial neighbors, i.e. in the same frame, or temporal neighbors, i.e. in different frames. These candidate MVs are scaled according to their temporal distance to their respective reference frames. Only one candidate MV from the list is selected as the predictor based on rate-distortion (RD) criteria, and the corresponding index entry in the list is transmitted in the bitstream. Motion vector competition generally improves video coding performance compared to median MV prediction and is therefore suggested for use in HEVC.
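The selection step can be illustrated with a rough sketch. Note the assumption: a real encoder would evaluate a full rate-distortion cost, whereas here the magnitude of the MV difference stands in as a crude proxy for the rate term. The function name `select_predictor` and the cost measure are hypothetical and are not taken from the cited contribution.

```python
def select_predictor(mv, candidates):
    """Pick the candidate index that minimizes a stand-in cost.

    Assumption: |dx| + |dy| of the MV difference is used as a rough
    proxy for the bits needed to code it; this is not a true RD cost.
    Returns (index, mv_difference); both would be signaled.
    """
    def cost(candidate):
        return abs(mv[0] - candidate[0]) + abs(mv[1] - candidate[1])

    index = min(range(len(candidates)), key=lambda i: cost(candidates[i]))
    chosen = candidates[index]
    return index, (mv[0] - chosen[0], mv[1] - chosen[1])


# Illustrative values only: three (already scaled) candidate MVs.
index, difference = select_predictor((5, 1), [(0, 0), (4, 2), (6, 1)])
```

The decoder can rebuild the same candidate list, so the transmitted index and MV difference are enough to recover the MV.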
In motion vector competition, the selected candidate MVs generally need to be scaled before being put into the candidate list, since a candidate MV does not necessarily have the same reference distance as the current pixel block for which the MV prediction is made. The term “reference distance” refers to the difference in picture order count (POC) between the frame with the MV and the frame that the MV points to. In FIG. 1, there are seven frames marked by POC 0-6, which is the display order of a video sequence. In the example, the frames with POC equal to 0, 1, 3, 4, 5 and 6 are already coded. The frame with POC=2 is the current frame that is to be coded, and the pixel block in the middle of frame 2 is the current pixel block, where the pixel blocks above it are already coded. The current pixel block is testing an inter prediction mode that uses reference areas from frame 0. Three candidate MV predictors are shown in the figure: MV B from a spatially neighboring pixel block in the current frame, and MV A and MV C from temporally collocated blocks before and after the current frame, respectively. A scaling factor is applied to these candidate MV predictors before they are adopted into the candidate list. The scaling factor formula is:
$$\mathrm{scaling} = \frac{\mathrm{CurrDistance}}{\mathrm{RfDistance}} = \frac{\mathrm{CurrPOC} - \mathrm{CurrRfPOC}}{\mathrm{RfPOC} - \mathrm{RfRfPOC}}$$
In FIG. 1, CurrDistance=2−0=2. RfDistance equals 1−0=1, 2−0=2 and 3−6=−3 for MV A, B and C, respectively. Therefore the scaling factors for MV A, B and C are 2/1=2, 2/2=1 and −⅔, respectively. Each candidate MV predictor is scaled up or down according to the calculated scaling factor. These scaled MV predictors are shown at the bottom of FIG. 1.
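The arithmetic of the FIG. 1 example can be reproduced with a short sketch. The function names are assumptions for illustration, and exact fractions are used for clarity; a practical codec would more likely use fixed-point integer arithmetic with rounding and clipping.

```python
from fractions import Fraction


def scaling_factor(curr_poc, curr_ref_poc, cand_poc, cand_ref_poc):
    """CurrDistance / RfDistance, both expressed as POC differences."""
    curr_distance = curr_poc - curr_ref_poc
    cand_distance = cand_poc - cand_ref_poc
    return Fraction(curr_distance, cand_distance)


def scale_mv(mv, factor):
    """Scale both components of an (x, y) MV by the given factor."""
    return (mv[0] * factor, mv[1] * factor)


# POC values from the FIG. 1 example: the current block is in frame 2
# and refers to frame 0.
factor_a = scaling_factor(2, 0, 1, 0)  # MV A: frame 1 -> frame 0
factor_b = scaling_factor(2, 0, 2, 0)  # MV B: frame 2 -> frame 0
factor_c = scaling_factor(2, 0, 3, 6)  # MV C: frame 3 -> frame 6
```

The negative factor for MV C reflects that it points forward in display order while the current block refers backward, so the scaled predictor is mirrored as well as shortened.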
The motion vector competition described above and proposed for HEVC works well for mono video. However, when applying motion vector competition to multi-view sequences in HEVC, or indeed in MVC, problems can occur.
For instance, when applying motion vector competition to a multi-view video sequence, a motion vector can point to a frame with the same POC but in another view, or a candidate MV predictor could point to a frame with the same POC in another view. In these cases, the numerator or the denominator, respectively, of the scaling formula presented above is zero. This results in a zero scaling factor or an undefined scaling factor, respectively.
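The two degenerate cases can be made concrete with a small sketch that only exposes the problem and does not propose a remedy. The function name `try_scaling_factor` is hypothetical, and views are not modeled: only the POC values matter here, which is exactly why inter-view references break the formula.

```python
from fractions import Fraction


def try_scaling_factor(curr_poc, curr_ref_poc, cand_poc, cand_ref_poc):
    """Return the POC-based scaling factor, or None when it is undefined."""
    curr_distance = curr_poc - curr_ref_poc
    cand_distance = cand_poc - cand_ref_poc
    if cand_distance == 0:
        # The candidate MV points to a frame with the same POC (for
        # example in another view): division by zero, no valid factor.
        return None
    return Fraction(curr_distance, cand_distance)


# Current MV points to the same POC in another view: zero numerator.
zero_case = try_scaling_factor(2, 2, 1, 0)
# Candidate MV points to the same POC in another view: zero denominator.
undefined_case = try_scaling_factor(2, 0, 2, 2)
```

In the first case every scaled candidate collapses to the zero vector regardless of its original value; in the second, no scaled candidate can be computed at all.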
Furthermore, suboptimal compression performance can occur when candidate MV predictors can be selected not only from spatially and temporally neighboring MVs but also from MVs in other views.
There is, thus, a need for an efficient handling of motion vectors that is adapted for usage in connection with multi-view video.