Video data are generally the subject of source coding aimed at compressing them in order to limit the resources necessary for their transmission and/or storage. Numerous coding standards exist, such as H.264/AVC, H.265/HEVC and MPEG-2, which can be used for this purpose.
A video stream comprising a set of images is considered. In conventional coding schemes, the video stream images to be encoded are typically considered according to an encoding sequence, and each one is divided into sets of pixels, themselves also processed sequentially, for example beginning at the top left and finishing at the bottom right of each image.
Encoding an image from the stream is thus carried out by dividing a matrix of pixels corresponding to the image into several sets, for example blocks of fixed size 16×16, 32×32 or 64×64, and by coding these blocks of pixels according to a given processing sequence. Certain standards, such as H.264/AVC, provide for the possibility of breaking down blocks of size 16×16 (then called macro-blocks) into sub-blocks, for example of size 8×8 or 4×4, in order to carry out the encoding processing with finer granularity.
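As a minimal sketch of the block division described above, the following Python generator enumerates fixed-size blocks in raster order (top left to bottom right). The function name `iter_blocks` and the cropping of edge blocks are illustrative assumptions, not part of any standard.

```python
def iter_blocks(width, height, block_size=16):
    """Yield (x, y, w, h) for each block, left to right then top to bottom."""
    for y in range(0, height, block_size):
        for x in range(0, width, block_size):
            # Assumption: blocks on the right/bottom edge are cropped to the image.
            w = min(block_size, width - x)
            h = min(block_size, height - y)
            yield x, y, w, h


# A 64x48 image split into 16x16 blocks gives 4 columns and 3 rows.
blocks = list(iter_blocks(64, 48, block_size=16))
print(len(blocks), blocks[0], blocks[-1])
```

A sub-block split, such as a 16×16 macro-block into 8×8 sub-blocks, could reuse the same loop over the macro-block's own extent.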
Existing video compression techniques can be divided into two broad categories: on the one hand the compression known as “Intra” compression, in which the compression processing is carried out on the pixels of a single image or video frame, and on the other hand the compression known as “Inter” compression, in which the compression processing is carried out on several video images or frames. In Intra mode, the processing of a block (or set) of pixels typically comprises a prediction of the pixels of the block carried out using (previously coded) causal pixels present in the image currently being encoded (called “current image”), in which case the term “Intra prediction” is used. In the Inter mode, the processing of a block (or set) of pixels typically comprises a prediction of the pixels of the block carried out using pixels originating from previously encoded images, in which case the term “Inter prediction” or “motion compensation” is used.
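The difference between the two prediction types can be illustrated with a simplified sketch. The Intra predictor below is a plain DC (mean) predictor over the causal neighbours, and the Inter predictor copies a displaced block from a reference frame. Both function names, the mid-grey fallback value and the absence of bounds checking are assumptions made for illustration; actual codecs define many more prediction modes.

```python
def intra_dc_predict(frame, x, y, size):
    """Predict a block as the mean of the causal pixels just above and to the left."""
    neighbours = []
    if y > 0:
        neighbours += [frame[y - 1][x + i] for i in range(size)]
    if x > 0:
        neighbours += [frame[y + j][x - 1] for j in range(size)]
    # Assumption: fall back to mid-grey (128) when no causal neighbour exists.
    dc = round(sum(neighbours) / len(neighbours)) if neighbours else 128
    return [[dc] * size for _ in range(size)]


def inter_predict(reference, x, y, size, mv):
    """Predict a block by copying the block displaced by mv=(dx, dy) in a
    previously coded frame. Assumption: the displaced block lies inside the frame."""
    dx, dy = mv
    return [[reference[y + dy + j][x + dx + i] for i in range(size)]
            for j in range(size)]


current = [[10, 10, 10, 10],
           [10, 10, 10, 10],
           [20, 20, 50, 52],
           [20, 20, 54, 56]]
reference = [[10 * j + i for i in range(4)] for j in range(4)]

intra_pred = intra_dc_predict(current, 2, 2, 2)
inter_pred = inter_predict(reference, 2, 2, 2, mv=(-1, -1))
print(intra_pred, inter_pred)
```

The Intra predictor reads only pixels of the current image, whereas the Inter predictor reads only pixels of another, previously encoded image.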
These two types of coding are used in the existing video codecs (MPEG-2, H.264/AVC, HEVC) and are described for the HEVC codec in the article entitled “Overview of the High Efficiency Video Coding (HEVC) Standard”, by Gary J. Sullivan et al., IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, No. 12, December 2012.
Exploiting the spatial and/or temporal redundancies in this way makes it possible to avoid transmitting or storing the value of the pixels of each block (or set) of pixels, by representing at least some of the blocks with a residual of pixels representing the difference (or the distance) between the prediction values of the pixels of the block and the actual values of the pixels of the predicted block. The information from the residuals of pixels is present in the data generated by the encoder after a transform (for example of the DCT type) and quantization, which reduce the entropy of the data generated by the encoder.
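The residual path can be sketched as follows. For brevity this example omits the transform step (such as a DCT) and applies a uniform scalar quantizer directly to the pixel residual; the function names, the step size and the sample values are illustrative assumptions.

```python
def residual(actual, predicted):
    """Pixel-wise difference between the actual block and its prediction."""
    return [[a - p for a, p in zip(ra, rp)] for ra, rp in zip(actual, predicted)]


def quantize(block, step):
    """Uniform quantization: round each value to the nearest multiple of step
    (halves rounded away from zero) and return the integer level."""
    half = step // 2

    def q(v):
        return (v + half) // step if v >= 0 else -((-v + half) // step)

    return [[q(v) for v in row] for row in block]


def dequantize(levels, step):
    """Decoder-side reconstruction of the residual from its levels."""
    return [[lv * step for lv in row] for row in levels]


actual = [[52, 55], [61, 59]]
predicted = [[50, 50], [60, 60]]

res = residual(actual, predicted)
levels = quantize(res, step=4)
rec_res = dequantize(levels, step=4)
reconstructed = [[p + r for p, r in zip(rp, rr)]
                 for rp, rr in zip(predicted, rec_res)]
print(res, levels, reconstructed)
```

Only the quantized levels need to be entropy-coded. The reconstruction differs from the actual block, and that difference is the distortion introduced by quantization.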
It is desirable to reduce as far as possible the additional information generated by the prediction of the pixels and present in the encoder output in order to increase the efficiency of a coding/compression scheme at a given distortion level. Conversely, it may also be sought to reduce this additional information in order to increase the efficiency of a coding/compression scheme at a given encoder output rate.
A video encoder typically chooses a coding mode, which corresponds to a selection of the encoding parameters for a processed set of pixels. This decision can be made by optimizing a rate and distortion metric, the encoding parameters selected by the encoder being those that minimize a rate-distortion criterion. The choice of coding mode then has an impact on the performance of the encoder, both in terms of rate gain and visual quality.
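One common form of rate-distortion criterion is the Lagrangian cost J = D + λ·R, where D is the distortion, R the rate and λ a trade-off parameter. The sketch below assumes that form; the candidate modes and their distortion and rate figures are hypothetical values chosen for illustration.

```python
def choose_mode(candidates, lam):
    """Return the candidate with the lowest cost J = D + lam * R."""
    return min(candidates, key=lambda c: c["distortion"] + lam * c["rate"])


# Hypothetical per-mode measurements for one block.
candidates = [
    {"mode": "intra", "distortion": 120.0, "rate": 40},
    {"mode": "inter", "distortion": 150.0, "rate": 18},
    {"mode": "skip", "distortion": 400.0, "rate": 1},
]

for lam in (1, 5, 20):
    best = choose_mode(candidates, lam)
    print(lam, best["mode"])
```

With these figures, a small λ favours the low-distortion mode, while a larger λ shifts the decision towards cheaper modes at the cost of higher distortion.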
Choosing the wrong coding mode can result in artefacts which degrade the perceived visual quality. Methods of calculation based on rate-distortion optimization make it possible to reduce the encoder output rate, but sometimes at the expense of the visual rendering.
The distortion is in fact calculated using so-called “objective” metrics, such as the sum of absolute differences (SAD), or the mean square error (MSE), which prove to be very weakly correlated with the perceptual quality. In fact, certain video compression methods can improve the visual quality while they degrade the objective metrics.
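For reference, both objective metrics named above can be computed in a few lines. This is a minimal sketch over blocks stored as lists of rows; the function names and sample values are illustrative.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def mse(a, b):
    """Mean square error between two equally sized blocks."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


original = [[52, 55], [61, 59]]
decoded = [[54, 54], [60, 60]]
print(sad(original, decoded), mse(original, decoded))
```

Both metrics treat every pixel error identically, regardless of where it occurs or how visible it is, which is consistent with their weak correlation with perceived quality.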
Distortion metrics based on visual perception have been proposed as alternatives to the objective mathematical measurements. These metrics use a modelling of the known psycho-visual properties of the human visual system, and are called HVS (Human Visual System) metrics. The following may be mentioned as examples of HVS metrics: the JND (Just Noticeable Difference) metric described in the article “Sarnoff JND vision model”, J. Lubin et al., T1A1.5 Working Group Document, T1 Standards Committee, 1997, the DVQ (Digital Video Quality) metric described in the article “Digital video quality metric based on human vision”, Journal of electronic imaging, vol. 10, no. 1, January 2001, pp. 20-29, or also the VSSIM (Video Structural Similarity Index) metric described in the article “Video Quality Assessment Based on Structural Distortion Measurement”, Z. Wang et al., IEEE Signal Proc. Image Communication, vol. 19, no. 2, February 2004, pp. 121-132.
These methods for measuring the visual distortion (also called “subjective distortion”) have the drawback of being very complex, and are impractical to use within an encoder. For example, they require too much computing power to be implemented in a real-time video encoder. They are useful only after encoding, in order to estimate a posteriori the visual quality of a video that has been encoded/compressed using objective metrics.
Another subjective distortion metric, i.e. one based on visual perception, was proposed by A. Bhat et al. in the article “A new perceptual quality metric for compressed video”, with the ambition of integrating the use of the proposed perceptual quality metric into an algorithm for choosing the coding mode of a video codec of the H.264/AVC type. The metric proposed in the article is calculated as follows:
$$\mathrm{MOSp}_{\text{block}} = 1 - k_{\text{block}}\,(\mathrm{MSE}_{\text{block}}) \qquad (1)$$
where $\mathrm{MOSp}_{\text{block}}$ is the perceptual quality of the processed block, $k_{\text{block}}$ a constant calculated as a function of the presence of details in the block, and $\mathrm{MSE}_{\text{block}}$ the mean square error. This metric is based on the principle that the artefacts will have a tendency to be more visible in the zones without details, and is thus based solely on the local characteristics of the processed block.
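Equation (1) can be evaluated directly once the constant for the block is known. The sketch below takes that constant as an input, since the way it is derived from the block's details is not given here; the two constants used in the example are hypothetical values chosen only to illustrate the stated principle, with a larger constant for the block without details.

```python
def mosp_block(k_block, mse_block):
    """Perceptual quality of a block per equation (1): 1 - k_block * MSE_block."""
    return 1.0 - k_block * mse_block


# Hypothetical constants: larger for a flat block, smaller for a detailed one.
K_FLAT = 0.02
K_DETAILED = 0.005

same_mse = 10.0
flat_score = mosp_block(K_FLAT, same_mse)
detailed_score = mosp_block(K_DETAILED, same_mse)
print(flat_score, detailed_score)
```

For the same mean square error, the flat block receives the lower score. The score depends only on quantities measured within the block itself.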
A need therefore exists for an image encoding method that is improved by taking into account the motion in a set of images or a video stream to be encoded.