Field of the Invention
The invention relates to a method and device for encoding a sequence of digital images and a method and device for decoding a corresponding bitstream.
The invention belongs to the field of digital signal processing, and in particular to the field of video compression using motion compensation to reduce spatial and temporal redundancies in video streams.
Description of the Related Art
Many video compression formats, for example H.263, H.264, MPEG-1, MPEG-2, MPEG-4 and SVC, use block-based discrete cosine transform (DCT) and motion compensation to remove spatial and temporal redundancies. They can be referred to as predictive video formats. Each frame or image of the video signal is divided into slices which are encoded and can be decoded independently. A slice is typically a rectangular portion of the frame, or more generally, a portion of an image. A slice may comprise an entire image of the video sequence. Further, each slice is divided into macroblocks (MBs), and each macroblock is further divided into blocks, typically blocks of 8×8 pixels. The encoded frames are of two types: temporally predicted frames (either predicted from one reference frame, called P-frames, or predicted from two reference frames, called B-frames) and non-temporally predicted frames (called Intra frames or I-frames).
Temporal prediction consists in finding in a reference frame, either a previous or a future frame of the video sequence, an image portion or reference area which is the closest to the block to encode. This step is known as motion estimation. Next, the difference between the block to encode and the reference portion is encoded (motion compensation), along with an item of motion information relative to the motion vector which indicates the reference area to use for motion compensation.
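By way of illustration only, the motion estimation and motion compensation steps described above can be sketched as an exhaustive block-matching search. The sum of absolute differences (SAD) criterion, the full search strategy and the function names `sad` and `estimate_motion` are assumptions made for this sketch; the formats cited above do not mandate any particular search method.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in
    `cur` and the block displaced by (dx, dy) in `ref`."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def estimate_motion(cur, ref, bx, by, n, search):
    """Try every displacement in [-search, search] and keep the one with the
    lowest SAD. Returns the motion vector and the residual block."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip reference areas that fall outside the reference frame.
            if bx + dx < 0 or by + dy < 0:
                continue
            if bx + dx + n > width or by + dy + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    _, dx, dy = best
    # Motion compensation: only the difference with the reference area is kept.
    residual = [
        [cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x] for x in range(n)]
        for y in range(n)
    ]
    return (dx, dy), residual
```

In this sketch frames are lists of rows of integer sample values, and the motion vector is expressed in whole pixels; sub-pixel refinement is omitted.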
In order to further reduce the cost of encoding motion information, it has been proposed to encode a motion vector by difference from a motion vector predictor, typically computed from the motion vectors of the blocks surrounding the block to encode.
In H.264, motion vectors are encoded with respect to a median predictor computed from the motion vectors situated in a causal neighbourhood of the block to encode, for example from the blocks situated above and to the left of the block to encode. Only the difference, also called residual motion vector, between the median predictor and the current block motion vector is encoded.
The encoding using residual motion vectors saves some bitrate, but necessitates that the decoder performs the same computation of the motion vector predictor in order to decode the value of the motion vector of a block to decode.
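The symmetry between encoder and decoder can be illustrated with a minimal sketch. It assumes a component-wise median over exactly three neighbouring motion vectors; the function names are hypothetical and the choice of neighbours is left to the caller.

```python
def median_predictor(neighbours):
    """Component-wise median of three neighbouring motion vectors."""
    xs = sorted(v[0] for v in neighbours)
    ys = sorted(v[1] for v in neighbours)
    return (xs[1], ys[1])


def encode_mv(mv, neighbours):
    """Encoder side: only the residual motion vector is transmitted."""
    px, py = median_predictor(neighbours)
    return (mv[0] - px, mv[1] - py)


def decode_mv(residual, neighbours):
    """Decoder side: recompute the same predictor, then add the residual."""
    px, py = median_predictor(neighbours)
    return (residual[0] + px, residual[1] + py)
```

Decoding recovers the original motion vector only if the decoder sees the same neighbouring motion vectors as the encoder did.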
Recently, further improvements have been proposed, such as using a plurality of possible motion vector predictors. This method, called motion vector competition, consists in determining between several motion vector predictors or candidates which motion vector predictor minimizes the encoding cost, typically a rate-distortion cost, of the residual motion information. The residual motion information comprises the residual motion vector, i.e. the difference between the actual motion vector of the block to encode and the selected motion vector predictor, and an item of information indicating the selected motion vector predictor, such as for example an encoded value of the index of the selected motion vector predictor.
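A minimal sketch of motion vector competition follows. As a simplifying assumption, the cost is a rate-only proxy: the number of bits needed to write each residual component as a signed Exp-Golomb code. An actual encoder would typically use a rate-distortion cost and would also count the bits of the predictor index. The names `se_bits` and `choose_predictor` are hypothetical.

```python
def se_bits(v):
    """Length in bits of the signed Exp-Golomb code of integer v."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1


def choose_predictor(mv, candidates):
    """Return (index, residual) of the candidate giving the cheapest residual."""
    best_index, best_cost, best_residual = None, None, None
    for index, pred in enumerate(candidates):
        residual = (mv[0] - pred[0], mv[1] - pred[1])
        cost = se_bits(residual[0]) + se_bits(residual[1])
        if best_cost is None or cost < best_cost:
            best_index, best_cost, best_residual = index, cost, residual
    return best_index, best_residual
```

The returned index is the item of information that has to be signalled alongside the residual motion vector.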
In High Efficiency Video Coding (HEVC), currently in the course of standardization, it has been proposed to use a plurality of motion vector predictors as schematically illustrated in FIG. 1: three so-called spatial motion vector predictors V1, V2 and V3 taken from blocks situated in the neighbourhood of the block to encode, a median motion vector predictor computed based on the components of the three spatial motion vector predictors V1, V2 and V3, and a temporal motion vector predictor V0 which is the motion vector of the co-located block in a previous image of the sequence (e.g. the block of image N−1 located at the same spatial position as block ‘Being coded’ of image N). Currently in HEVC the three spatial motion vector predictors are taken from the block situated to the left of the block to encode (V3), the block situated above (V2) and from one of the blocks situated at the respective corners of the block to encode, according to a predetermined rule of availability. This motion vector predictor selection scheme is called Advanced Motion Vector Prediction (AMVP). In the example of FIG. 1, the vector V1 of the block situated above left is selected.
Finally, a set of 5 motion vector predictor candidates mixing spatial predictors and temporal predictors is obtained. In order to reduce the overhead of signaling the motion vector predictor in the bitstream, the set of motion vector predictors is reduced by eliminating the duplicated motion vectors, i.e. the motion vectors which have the same value. For example, in the illustration of FIG. 1, V1 and V2 are equal, and V0 and V3 are also equal, so only two of them should be kept as motion vector prediction candidates, for example V0 and V1. In this case, only one bit is necessary to indicate the index of the motion vector predictor to the decoder.
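The reduction described above can be sketched as follows. The numeric vector values are invented for illustration, and a fixed-length index code is assumed; the actual index binarization may differ.

```python
def reduce_predictors(candidates):
    """Remove duplicated motion vectors, keeping the first occurrence."""
    reduced = []
    for v in candidates:
        if v not in reduced:
            reduced.append(v)
    return reduced


def index_bits(count):
    """Bits of a fixed-length index among `count` candidates (0 if only one)."""
    return (count - 1).bit_length()


# Hypothetical values reproducing the situation of FIG. 1:
# V1 equals V2, and V0 equals V3.
V0, V1, V2, V3 = (-2, 4), (3, 1), (3, 1), (-2, 4)
# Component-wise median of the three spatial predictors.
median = (sorted([V1[0], V2[0], V3[0]])[1], sorted([V1[1], V2[1], V3[1]])[1])

reduced = reduce_predictors([V0, V1, V2, V3, median])
bits = index_bits(len(reduced))
```

With these values the median coincides with V1, so the five candidates collapse to V0 and V1 and the index fits on one bit.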
A further reduction of the set of motion vector predictors, based on the values of the predictors, is possible. Once the best motion vector predictor is selected and the motion vector residual is computed, it is possible to further eliminate from the prediction set the candidates which would not have been selected, knowing the motion vector residual and the cost optimization criterion of the encoder. A sufficient reduction of the set of predictors leads to a gain in the signaling overhead, since the indication of the selected motion vector predictor can be encoded using fewer bits. At the limit, the set of candidates can be reduced to 1, for example if all motion vector predictors are equal, and therefore it is not necessary to insert any information relative to the selected motion vector predictor in the bitstream.
To summarize, the encoding of motion vectors by difference with a motion vector predictor, along with the reduction of the number of motion vector predictor candidates, leads to a compression gain. However, as explained above, for a given block to encode, the reduction of the number of motion vector predictor candidates is based on the values taken by the motion vector predictors of the set, in particular the values of the motion vectors of the neighbouring blocks and of the motion vector of the co-located block. Also, the decoder needs to be able to apply the same analysis of the set of possible motion vector predictors as the encoder, in order to deduce the number of bits used for indicating the selected motion vector predictor, to decode the index of the motion vector predictor and finally to decode the motion vector using the motion vector residual received. Referring to the example of FIG. 1, the set of motion vector predictors of the block ‘being coded’ is reduced by the encoder to V0 and V1, so the index is encoded on a single bit. If the block of image N−1 is lost during transmission, the decoder cannot obtain the value of V0, and therefore cannot find out that V0 and V3 are equal. Therefore, the decoder cannot find how many bits were used for encoding the index of the motion vector predictor for the block ‘being coded’, and consequently the decoder cannot correctly parse the data for the slice because it cannot find where the index encoding stops and the encoding of video data starts.
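The parsing ambiguity can be made concrete with a short sketch, in which a lost predictor is represented by `None`. The function name, the numeric vector values and the fixed-length index code are assumptions made for illustration.

```python
def possible_index_bits(candidates):
    """Set of index lengths (in bits) consistent with what the decoder knows.

    `candidates` is a non-empty list of motion vectors; None marks a
    predictor whose value was lost during transmission.
    """
    known = {v for v in candidates if v is not None}
    lost = sum(1 for v in candidates if v is None)
    # Every lost predictor may duplicate another one or be a new value.
    fewest = max(len(known), 1)
    most = len(known) + lost
    return {(count - 1).bit_length() for count in range(fewest, most + 1)}


# Hypothetical values for FIG. 1: V1 == V2 == median, and V0 == V3.
V1, V2, V3, median = (3, 1), (3, 1), (-2, 4), (3, 1)

intact = possible_index_bits([(-2, 4), V1, V2, V3, median])
v0_lost = possible_index_bits([None, V1, V2, V3, median])
```

When all values are available, a single index length is possible. When V0 is lost, the decoder cannot tell whether the reduced set holds two or three candidates, so the index may occupy one or two bits and the position of the following data in the bitstream is unknown.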
Therefore, the fact that the number of bits used for signaling the motion vector predictors depends on the values taken by the motion vector predictors makes the method very vulnerable to transmission errors when the bitstream is transmitted to a decoder over a lossy communication network. Indeed, the method requires knowledge of the values of the motion vector predictors to parse the bitstream correctly at the decoder. In case of packet losses, when some motion vector residual values are lost, it is impossible for the decoder to determine how many bits were used to encode the index representing the motion vector predictor, and so it is impossible to parse the bitstream correctly. Such an error may propagate, causing the decoder to remain de-synchronized until a following synchronization image, encoded without prediction, is received by the decoder.
It would be desirable to at least be able to parse an encoded bitstream at a decoder even in case of packet losses, so that some re-synchronization or error concealment can be subsequently applied.
It was proposed, in the document JCTVC-C166r1, ‘TE11: Study on motion vector coding (experiment 3.3a and 3.3c)’ by K. Sato, published at the 3rd meeting of the Joint Collaborative Team on Video Coding (JCT-VC) held in Guangzhou, 7-15 Oct. 2010, to use only the spatial motion vector predictors coming from the same slice in the predictor set. This solution solves the problem of parsing at the decoder in case of slice losses. However, the coding efficiency is significantly decreased, since the temporal motion vector predictor is no longer used. Therefore, this solution is not satisfactory in terms of compression performance.
Document JCTVC-C257, ‘On motion vector competition’, by Yeping Su and Andrew Segall, published at the 3rd meeting of the Joint Collaborative Team on Video Coding (JCT-VC) held in Guangzhou, 7-15 Oct. 2010, proposes signaling separately whether the selected motion vector predictor is the temporal predictor, i.e. the motion vector of the co-located block, and, if the selected motion vector predictor is not the temporal predictor, using the motion vector predictor set reduction scheme described above to indicate the selected candidate. However, this proposal fails to ensure correct parsing at the decoder in some cases. Indeed, it assumes that the spatial motion vector predictors are necessarily known at the decoder. However, a motion vector of a neighbouring block of the block to encode may itself be predicted from a temporal co-located block which has been lost during transmission. In that case, the value of a motion vector of the set of predictors is unknown, and the parsing problem at the decoder occurs.
It is desirable to address one or more of the drawbacks in the related art. Further, it is desirable to provide a method allowing correct parsing at the decoder even in the case of a bitstream corrupted by transmission losses while keeping good compression efficiency.