Transmission and storage of video sequences are employed in many applications including TV broadcasts, internet video streaming services and video conferencing.
Video sequences in a raw format require a very large amount of data to be represented, as each second of a sequence may consist of tens of individual frames, each frame is typically represented by at least 8 bits per pixel, and each frame contains hundreds of thousands of pixels or more. In order to minimize the transmission and storage costs, video compression is applied to the raw video data. The aim is to represent the original information with as little capacity as possible, that is, with as few bits as possible. The reduction of the capacity needed to represent a video sequence will affect the video quality of the compressed sequence, that is, its similarity to the original uncompressed video sequence.
State-of-the-art video encoders, such as those conforming to AVC/H.264, utilize four main processes to achieve the maximum level of video compression while achieving a desired level of video quality for the compressed video sequence: prediction, transformation, quantization and entropy coding.
The prediction process exploits the temporal and spatial redundancy found in video sequences to greatly reduce the capacity required to represent the data. The mechanism used to predict data is known to both encoder and decoder, thus only an error signal, or residual, must be sent to the decoder to reconstruct the original signal. This process is typically performed on blocks of data (e.g. 8×8 pixels) rather than entire frames. The prediction is typically performed against already reconstructed frames or blocks of pixels belonging to the same frame. The prediction against already reconstructed frames is motion compensated and may use motion vectors directed forward or backward in time from frames selected to provide better prediction. The motion vectors themselves may be prediction encoded.
The transformation process aims to exploit the correlation present in the residual signals. It does so by concentrating the energy of the signal into few coefficients. Thus the transform coefficients typically require fewer bits to be represented than the pixels of the residual. H.264 uses 4×4 and 8×8 integer type transforms based on the Discrete Cosine Transform (DCT).
The capacity required to represent the data at the output of the transformation process may still be too high for many applications. Moreover, it is not possible to modify the transformation process in order to achieve the desired level of capacity for the compressed signal. The quantization process addresses this by allowing a further reduction of the capacity needed to represent the signal. It should be noted that this process is destructive, i.e. the reconstructed sequence will differ from the original. The possible range of values for the signal at the output of the transformation process is divided into intervals, and each interval is assigned a quantization value. Each transform coefficient is then assigned the quantization value of the interval into which it falls.
The entropy coding process takes all the non-zero quantized transform coefficients and processes them so that they are efficiently represented as a stream of bits. This requires reading, or scanning, the transform coefficients in a certain order to minimize the capacity required to represent the compressed video sequence.
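The scanning step may be illustrated by the following sketch. The text above does not name a particular scan, so a zigzag scan over a 4×4 block is assumed here as one common choice; the function names zigzag_order and scan and the sample block are hypothetical.

```python
def zigzag_order(n: int):
    """Return (row, col) positions of an n x n block in zigzag order.

    Positions are grouped by anti-diagonal (row + col); the traversal
    direction alternates from one diagonal to the next.
    """
    def key(rc):
        r, c = rc
        diagonal = r + c
        return (diagonal, r if diagonal % 2 else -r)

    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def scan(block):
    """Read a square block of quantized coefficients in zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


# Hypothetical quantized block: energy concentrated in the top-left corner.
block = [
    [7, -1, 0, 0],
    [2,  0, 0, 0],
    [0,  0, 0, 0],
    [0,  0, 0, 0],
]

scanned = scan(block)
print(scanned[:4])  # [7, -1, 2, 0]
```

With this ordering the non-zero coefficients are grouped at the start of the scanned sequence and are followed by a single run of zeros, which an entropy coder can represent compactly.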
The above description applies to a video encoder; a video decoder will perform all of the above processes in roughly reverse order. In particular, the transformation process on the decoder side will require the use of the inverse transform being used on the encoder. Similarly, entropy coding becomes entropy decoding and the quantization process becomes scaling. The prediction process is typically performed in the same exact fashion on both encoder and decoder.
The present invention relates to the prediction part of the coding and decoding process.
A key aspect of motion compensated prediction is management of reference pictures, which are previously coded pictures that may be used for prediction of further coded pictures.
In an existing scheme, these reference pictures are, for the purpose of motion compensation, organized in lists, either in a single list for the case of single picture predictive (P) coding, also referred to as unipredictive coding, or into two lists, for the case of two-picture bipredictive (B) coding. The lists are commonly referred to as L0 (list 0) and L1 (list 1). The composition of L0 and L1 determines selection choices of reference pictures that are available for prediction, where selection of just one reference from one list leads to P prediction, while selecting a pair, where a reference is selected from each of the two lists, leads to B prediction. Note that the bipredictive motion compensation is not only used for predicting from pictures from different temporal directions (past and future), but is also used for predicting from two reference pictures from the same direction. Composition of and ordering within each list is usually signaled in the slice header of the video bit-stream, which determines the available choice in selecting reference pictures for motion compensation.
Depending on which stage of coding is performed, reference pictures are usually identified either by their index in one of the lists L0 or L1 (where the same picture, if it appears in both, can have a different index in those two lists), or by their Picture Order Count (POC) numbers, which normally correspond to the order in which they are supposed to be displayed (and not necessarily decoded). Here, for the sake of simpler representation and without specifying a specific rule for assigning indices, they will be uniquely identified as R_refidx, where refidx=0, . . . , r−1, and where r is the number of available reference pictures for the specific current picture.
An example of selecting pictures from lists L0 and L1 is provided in the following. In the case that L0 and L1 are limited to two elements each, and if there are only two references R0 and R1, i.e. r=2, of which one is in the past and the other is in the future relative to the current picture, the lists would commonly be set to L0={0,1} and L1={1,0} (using the notation defined above, so that the lists contain unique reference indices). The process of selecting pictures for motion compensation is then as depicted in FIG. 1. For the bipredictive case 4 choices are available (2 pictures in L0 × 2 pictures in L1), while for the unipredictive case only L0 is used, so there are 2 choices.
Selection of pictures for bipredictive motion compensation modes is commonly signaled by first encoding the selection of bipredictive or unipredictive mode, and then the indices of each selected picture in the corresponding list—first L0, and then L1 for bipredictive, or only L0 for unipredictive. If unipredictive mode is signaled with binary 0, and bipredictive mode with binary 1, the codewords corresponding to all choices are shown in Table 1. Note that two choices (“Bi- from R0 and R1” and “Bi- from R1 and R0”) consist of the same selection of pictures, just in a different order.
TABLE 1

                        Codeword
Choice                  Bi- or Uni-   L0   L1
Uni- from R0            0             0
Uni- from R1            0             1
Bi- from R0 and R1      1             0    0
Bi- from R0 and R0      1             0    1
Bi- from R1 and R1      1             1    0
Bi- from R1 and R0      1             1    1
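The construction of the codewords in Table 1 may be sketched as follows, purely for illustration. The sketch assumes the two-element lists L0={0,1} and L1={1,0} of the example, so that each list index fits in a single bit; the function name codewords and the variable names are hypothetical.

```python
# Each list maps a list index to a reference picture number (R0, R1).
L0 = [0, 1]
L1 = [1, 0]


def codewords(l0, l1):
    """Enumerate codewords: mode bit, then L0 index, then L1 index (bi only).

    Assumption: each list has at most two entries, so one bit per index.
    """
    table = {}
    for i0, ref in enumerate(l0):
        table[f"0{i0}"] = f"Uni- from R{ref}"
    for i0, ref0 in enumerate(l0):
        for i1, ref1 in enumerate(l1):
            table[f"1{i0}{i1}"] = f"Bi- from R{ref0} and R{ref1}"
    return table


table = codewords(L0, L1)
for word, choice in table.items():
    print(word, choice)

# Distinct unordered picture pairs among the four bipredictive codewords.
bi_pairs = {
    frozenset((ref0, ref1)) if ref0 != ref1 else (ref0,)
    for ref0 in L0 for ref1 in L1
}
print(len(bi_pairs))  # 3
```

The final count shows that the four bipredictive codewords cover only three distinct selections of pictures, since codewords 100 and 111 select the same pair in a different order.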
The bipredictive motion is described with a motion vector (MV) field composed of two components, here denoted as Mvf0 and Mvf1, where to a bipredicted picture block B two motion vectors are assigned—Mvf0(B) and Mvf1(B). Motion vectors in Mvf0 point to the references in L0, while motion vectors in Mvf1 point to the references in L1.
Each motion vector is encoded differentially with respect to its predictor. The predictor can be derived in various ways. One common method is Motion Vector Competition (MVC), also known as Advanced Motion Vector Prediction (AMVP), where a list of predictor candidates is constructed by collecting the motion vectors of previously processed blocks in a predefined order. Based on a minimal coding cost criterion, the encoder then selects one of the predictors and transmits its index (from the list of predictors) in the bit-stream. The difference between the currently encoded vector and the selected predictor is subsequently encoded in the bit-stream.
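The predictor selection and differential encoding described above may be sketched as follows. This is a simplified illustration: the coding cost is approximated by the sum of the absolute components of the difference, which is an assumption standing in for an actual bit-cost measure, and the names encode_mv and decode_mv and the candidate values are hypothetical.

```python
def encode_mv(mv, candidates):
    """Pick the cheapest predictor and return (predictor index, difference).

    Assumption: cost is |dx| + |dy|, a stand-in for the true bit cost.
    """
    def cost(idx):
        px, py = candidates[idx]
        return abs(mv[0] - px) + abs(mv[1] - py)

    idx = min(range(len(candidates)), key=cost)
    px, py = candidates[idx]
    return idx, (mv[0] - px, mv[1] - py)


def decode_mv(idx, mvd, candidates):
    """Rebuild the motion vector from the predictor index and difference."""
    px, py = candidates[idx]
    return (px + mvd[0], py + mvd[1])


# Hypothetical candidate list built from previously processed blocks.
candidates = [(4, -2), (0, 0), (5, 1)]
mv = (5, 0)

idx, mvd = encode_mv(mv, candidates)
print(idx, mvd)                         # 2 (0, -1)
print(decode_mv(idx, mvd, candidates))  # (5, 0)
```

Because the decoder constructs the same candidate list in the same predefined order, the index and the difference are sufficient to recover the original motion vector.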
In the following, two examples are given that illustrate the difference between choices when selecting direction of prediction for motion compensation, and also help to introduce the formalization that leads to this invention. Note that in the examples for the sake of clarity only the bipredictive cases will be considered.