In video coding there is often a significant amount of temporal correlation across pictures/frames. Most video coding standards, including the upcoming High Efficiency Video Coding (HEVC) standard, exploit this temporal correlation to achieve better compression efficiency for video bitstreams. Some terms used with respect to HEVC are provided in the paragraphs that follow.
A picture is an array of luma samples in monochrome format or an array of luma samples and two corresponding arrays of chroma samples in 4:2:0, 4:2:2, and 4:4:4 colour format.
A coding block is an N×N block of samples for some value of N. The division of a coding tree block into coding blocks is a partitioning.
A coding tree block is an N×N block of samples for some value of N. The division of one of the arrays that compose a picture that has three sample arrays, or of the array that composes a picture in monochrome format or a picture that is coded using three separate colour planes, into coding tree blocks is a partitioning.
A coding tree unit (CTU) is a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate colour planes and syntax structures used to code the samples. The division of a slice into coding tree units is a partitioning.
A coding unit (CU) is a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate colour planes and syntax structures used to code the samples. The division of a coding tree unit into coding units is a partitioning.
Prediction is defined as an embodiment of the prediction process.
A prediction block is a rectangular M×N block on which the same prediction is applied. The division of a coding block into prediction blocks is a partitioning.
A prediction process is the use of a predictor to provide an estimate of the data element (e.g. sample value or motion vector) currently being decoded.
A prediction unit (PU) is a prediction block of luma samples, two corresponding prediction blocks of chroma samples of a picture that has three sample arrays, or a prediction block of samples of a monochrome picture or a picture that is coded using three separate colour planes and syntax structures used to predict the prediction block samples.
A predictor is a combination of specified values or previously decoded data elements (e.g. sample value or motion vector) used in the decoding process of subsequent data elements.
A tile is an integer number of coding tree blocks co-occurring in one column and one row, ordered consecutively in coding tree block raster scan of the tile. The division of each picture into tiles is a partitioning. Tiles in a picture are ordered consecutively in tile raster scan of the picture.
A tile scan is a specific sequential ordering of coding tree blocks partitioning a picture. The tile scan order traverses the coding tree blocks in coding tree block raster scan within a tile and traverses tiles in tile raster scan within a picture. Although a slice contains coding tree blocks that are consecutive in coding tree block raster scan of a tile, these coding tree blocks are not necessarily consecutive in coding tree block raster scan of the picture.
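The difference between tile scan and picture raster scan can be illustrated with a minimal sketch. The function name `tile_scan_order` and the representation of tile boundaries as lists of coding tree block (CTB) column and row offsets are assumptions made for this illustration only; they are not taken from the HEVC specification.

```python
def tile_scan_order(pic_width_in_ctbs, col_bounds, row_bounds):
    """Return picture-raster CTB addresses listed in tile scan order.

    col_bounds / row_bounds are hypothetical lists of tile boundaries in
    CTB units, e.g. [0, 2, 4] describes two tile columns of width 2.
    """
    order = []
    # Tiles are visited in tile raster scan: tile rows, then tile columns.
    for r0, r1 in zip(row_bounds, row_bounds[1:]):
        for c0, c1 in zip(col_bounds, col_bounds[1:]):
            # Inside a tile, CTBs are visited in CTB raster scan of the tile.
            for y in range(r0, r1):
                for x in range(c0, c1):
                    order.append(y * pic_width_in_ctbs + x)
    return order


# A picture of 4x2 CTBs split into two side-by-side tiles of 2x2 CTBs.
order = tile_scan_order(4, [0, 2, 4], [0, 2])
print(order)
```

In this example the first tile contributes picture-raster addresses 0, 1, 4, 5, which are consecutive in the tile scan but not in the raster scan of the picture.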
A slice is an integer number of coding tree blocks ordered consecutively in the tile scan. The division of each picture into slices is a partitioning. The coding tree block addresses are derived from the first coding tree block address in a slice (as represented in the slice header).
A B slice or a bi-predictive slice is a slice that may be decoded using intra prediction or inter prediction using at most two motion vectors and reference indices to predict the sample values of each block.
A P slice or a predictive slice is a slice that may be decoded using intra prediction or inter prediction using at most one motion vector and reference index to predict the sample values of each block.
A reference picture list is a list of reference pictures that is used for uni-prediction of a P or B slice. For the decoding process of a P slice, there is one reference picture list. For the decoding process of a B slice, there are two reference picture lists (list 0 and list 1).
A reference picture list 0 is a reference picture list used for inter prediction of a P or B slice. All inter prediction used for P slices uses reference picture list 0. Reference picture list 0 is one of two reference picture lists used for bi-prediction for a B slice, with the other being reference picture list 1.
A reference picture list 1 is a reference picture list used for bi-prediction of a B slice. Reference picture list 1 is one of two reference picture lists used for bi-prediction for a B slice, with the other being reference picture list 0.
A reference index is an index into a reference picture list.
A picture order count (POC) is a variable that is associated with each picture that indicates the position of the associated picture in output order relative to the output order positions of the other pictures in the same coded video sequence.
A long-term reference picture is a picture that is marked as “used for long-term reference”.
To exploit the temporal correlation in a video sequence, a picture is first partitioned into smaller collections of pixels. In HEVC such a collection of pixels is referred to as a prediction unit. A video encoder then searches previously transmitted pictures for a collection of pixels that is closest to the current prediction unit under consideration. The encoder instructs the decoder to use this closest collection of pixels as an initial estimate for the current prediction unit. It may then transmit residue information to improve this estimate. The instruction to use an initial estimate is conveyed to the decoder by means of a signal that contains a pointer to this collection of pixels in the reference picture. More specifically, the pointer information contains an index into a list of reference pictures, called the reference index, and the spatial displacement vector (or motion vector) with respect to the current prediction unit. In some examples, the spatial displacement vector is not an integer value, and as such, the initial estimate corresponds to a representation (for example, an interpolated version) of the collection of pixels.
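The search for the closest collection of pixels can be sketched as an exhaustive block-matching search. This is only an illustration under stated assumptions: the cost measure (sum of absolute differences), the integer-only displacements, and the helper names `sad`, `full_search` and `residual` are choices made here, and a real encoder is free to search differently.

```python
def sad(cur, ref, x, y):
    """Sum of absolute differences between block `cur` and the
    same-size block of picture `ref` whose top-left corner is (x, y)."""
    return sum(abs(cur[j][i] - ref[y + j][x + i])
               for j in range(len(cur)) for i in range(len(cur[0])))


def full_search(cur, ref, bx, by, search_range):
    """Try every integer displacement within +/- search_range around the
    block position (bx, by) and return (mv_x, mv_y, cost) of the best one."""
    bh, bw = len(cur), len(cur[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = bx + dx, by + dy
            # Skip candidates that fall outside the reference picture.
            if x < 0 or y < 0 or x + bw > len(ref[0]) or y + bh > len(ref):
                continue
            cost = sad(cur, ref, x, y)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def residual(cur, ref, x, y):
    """Residue the encoder may transmit to refine the initial estimate."""
    return [[cur[j][i] - ref[y + j][x + i] for i in range(len(cur[0]))]
            for j in range(len(cur))]


# Toy 6x6 reference picture; the current 2x2 block sits at (2, 2) in the
# current picture and matches the reference content at (3, 2).
ref = [[y * 6 + x for x in range(6)] for y in range(6)]
cur = [row[3:5] for row in ref[2:4]]
mv_x, mv_y, cost = full_search(cur, ref, bx=2, by=2, search_range=2)
```

Here the motion vector found is the displacement from the block position to the matching reference block, and the residual against that block is all zeros.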
To achieve better compression efficiency, an encoder may alternatively identify two collections of pixels in one or more reference pictures and instruct the decoder to use a linear combination of the two collections of pixels as an initial estimate of the current prediction unit. The encoder then needs to transmit two corresponding pointers to the decoder, each containing a reference index into a list and a motion vector. In general, a linear combination of one or more collections of pixels in previously decoded pictures is used to exploit the temporal correlation in a video sequence.
When one temporal collection of pixels is used to obtain the initial estimate, we refer to the estimation process as uni-prediction. When two temporal collections of pixels are used to obtain the initial estimate, we refer to the estimation process as bi-prediction. To distinguish between the uni-prediction and bi-prediction cases, an encoder transmits an indicator to the decoder. In HEVC this indicator is called the inter-prediction mode. Using this motion information, a decoder may construct an initial estimate of the prediction unit under consideration.
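The two cases can be contrasted in a short sketch. The text only states that a linear combination is used; the equal weights with rounding shown here, and the function names `uni_predict` and `bi_predict`, are assumptions chosen for illustration.

```python
def uni_predict(block):
    """Uni-prediction: the initial estimate is the single referenced block."""
    return [row[:] for row in block]


def bi_predict(block0, block1):
    """Bi-prediction: combine two referenced blocks sample by sample.
    Equal weights with rounding, (a + b + 1) >> 1, are assumed here."""
    return [[(a + b + 1) >> 1 for a, b in zip(r0, r1)]
            for r0, r1 in zip(block0, block1)]


estimate = bi_predict([[10, 20], [30, 40]], [[13, 20], [31, 44]])
```

With these inputs, each sample of `estimate` is the rounded average of the two co-positioned reference samples.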
To summarize, the motion information assigned to each prediction unit within HEVC consists of the following three pieces of information:
    the inter-prediction mode
    the reference indices (for list 0 and/or list 1). In an example, list 0 is a first list of reference pictures, and list 1 is a second list of reference pictures, which may have a same combination or a different combination of values than the first list.
    the motion vector (for list 0 and/or list 1)
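One way to picture these three pieces of information is as a small record per prediction unit. The sketch below is hypothetical: the names `InterPredMode` and `MotionInfo`, the field names, and the consistency rule are assumptions for illustration and do not reproduce HEVC syntax elements.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class InterPredMode(Enum):
    UNI_L0 = 0   # uni-prediction from list 0
    UNI_L1 = 1   # uni-prediction from list 1
    BI = 2       # bi-prediction from list 0 and list 1


@dataclass(frozen=True)
class MotionInfo:
    mode: InterPredMode
    ref_idx_l0: Optional[int] = None
    mv_l0: Optional[Tuple[int, int]] = None
    ref_idx_l1: Optional[int] = None
    mv_l1: Optional[Tuple[int, int]] = None

    def is_consistent(self) -> bool:
        """A list's reference index and motion vector must both be present
        when the mode uses that list, and both absent otherwise."""
        def ok(used, ref_idx, mv):
            if used:
                return ref_idx is not None and mv is not None
            return ref_idx is None and mv is None

        uses_l0 = self.mode in (InterPredMode.UNI_L0, InterPredMode.BI)
        uses_l1 = self.mode in (InterPredMode.UNI_L1, InterPredMode.BI)
        return (ok(uses_l0, self.ref_idx_l0, self.mv_l0)
                and ok(uses_l1, self.ref_idx_l1, self.mv_l1))


uni = MotionInfo(InterPredMode.UNI_L0, ref_idx_l0=0, mv_l0=(3, -1))
incomplete_bi = MotionInfo(InterPredMode.BI, ref_idx_l0=0, mv_l0=(3, -1))
```

The record `uni` is consistent, whereas `incomplete_bi` is not, because a bi-predicted unit also needs a reference index and motion vector for list 1.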
It is desirable to communicate this motion information to the decoder using a small number of bits. It is often observed that the motion information carried by prediction units is spatially correlated, i.e. a prediction unit will carry the same or similar motion information as its spatially neighboring prediction units. For example, a large object such as a bus undergoing translational motion within a video sequence and spanning several prediction units in a picture/frame will typically cover several prediction units carrying the same motion information. This type of correlation is also observed in co-located prediction units of previously decoded pictures. Often it is bit-efficient for the encoder to instruct the decoder to copy the motion information from one of these spatial or temporal neighbors. In HEVC, this process of copying motion information may be referred to as the merge mode of signaling motion information.
At other times the motion vector may be spatially and/or temporally correlated, but there exist pictures other than the ones pointed to by the spatial/temporal neighbors which carry higher quality pixel reconstructions corresponding to the prediction unit under consideration. In such an event, the encoder explicitly signals all the motion information except the motion vector information to the decoder. For signaling the motion vector information, the encoder instructs the decoder to use one of the neighboring spatial/temporal motion vectors as an initial estimate and then sends a refinement motion vector delta to the decoder.
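The copying idea can be sketched as follows. The candidate ordering, the duplicate pruning, the limit of five candidates, the tuple layout `(mode, ref_idx, mv)` and the function names are all assumptions for this illustration; the actual HEVC candidate derivation is more involved.

```python
def build_merge_candidates(neighbors, max_candidates=5):
    """Collect the distinct, available motion information of spatial or
    temporal neighbors, in the order given (hypothetical ordering)."""
    candidates = []
    for info in neighbors:
        if info is not None and info not in candidates:
            candidates.append(info)
        if len(candidates) == max_candidates:
            break
    return candidates


def decode_merge(neighbors, merge_idx):
    """Decoder side: rebuild the same candidate list and copy the motion
    information selected by the transmitted index."""
    return build_merge_candidates(neighbors)[merge_idx]


# None marks an unavailable neighbor (e.g. intra coded or outside the picture).
neighbors = [None,
             ("uni_l0", 0, (4, 0)),
             ("uni_l0", 0, (4, 0)),    # duplicate of the previous neighbor
             ("uni_l0", 1, (2, -1))]
copied = decode_merge(neighbors, merge_idx=1)
```

Only the small index `merge_idx` has to be transmitted; the inter-prediction mode, reference index and motion vector are all inherited from the selected neighbor.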
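The predictor-plus-delta signaling described above can be sketched as a round trip. The selection rule (smallest sum of absolute delta components) and the names `encode_mv` and `decode_mv` are assumptions for illustration; an encoder may choose the predictor by any criterion.

```python
def encode_mv(mv, predictors):
    """Encoder side: choose the neighboring motion vector that leaves the
    smallest delta and return (predictor index, motion vector delta)."""
    def cost(p):
        return abs(mv[0] - p[0]) + abs(mv[1] - p[1])

    idx = min(range(len(predictors)), key=lambda i: cost(predictors[i]))
    p = predictors[idx]
    return idx, (mv[0] - p[0], mv[1] - p[1])


def decode_mv(idx, delta, predictors):
    """Decoder side: initial estimate from the indicated neighbor,
    refined by the transmitted delta."""
    p = predictors[idx]
    return (p[0] + delta[0], p[1] + delta[1])


predictors = [(4, 0), (0, 0)]            # hypothetical neighboring motion vectors
idx, delta = encode_mv((5, -1), predictors)
reconstructed = decode_mv(idx, delta, predictors)
```

In this example the delta is much smaller in magnitude than the motion vector itself, which is what makes the scheme bit-efficient when neighboring motion is correlated.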
In summary, for bit efficiency HEVC uses two possible signaling modes for motion information:
    Merge Mode
    Explicit signaling along with advanced motion vector