The present invention generally relates to video compression, and more particularly to a video coding method applied to a video sequence and provided for use in a video encoder comprising base layer coding means, provided for receiving said video sequence and generating therefrom base layer signals that correspond to video objects (VOs) contained in the video frames of said sequence and constitute a first bitstream suitable for transmission at a base layer bit rate to a video decoder, and enhancement layer coding means, provided for receiving said video sequence and a decoded version of said base layer signals and generating therefrom enhancement layer signals associated with corresponding base layer signals and suitable for transmission at an enhancement layer bit rate to said video decoder. More precisely, it relates to a method for coding the VOs of said sequence, comprising the steps of:
(1) segmenting the video sequence into said VOs;
(2) coding successive video object planes (VOPs) of each of said VOs, said coding step itself comprising sub-steps of coding the texture and the shape of said VOPs, said texture coding sub-step itself comprising a first coding operation without prediction for the VOPs called intracoded or I-VOPs, coded without any temporal reference to another VOP, a second coding operation with a unidirectional prediction for the VOPs called predictive or P-VOPs, coded using only a past or a future I- or P-VOP as a temporal reference, and a third coding operation with a bidirectional prediction for the VOPs called bidirectional predictive or B-VOPs, coded using both past and future I- or P-VOPs as temporal references.
The invention also relates to computer executable process steps stored on a computer readable medium and provided for carrying out such a coding method, to a corresponding computer program product, and to a video encoder carrying out said method.
In an encoder according to the MPEG-4 standard (said standard being described for instance in the document "Overview of the MPEG-4 Version 1 Standard", ISO/IEC JTC1/SC29/WG11 N1909, October 1997, Fribourg), three types of pictures are used: intra-coded (I) pictures, coded independently from other pictures, predictively-coded (P) pictures, predicted from a past reference picture (I or P) by motion compensated prediction, and bidirectionally predictively-coded (B) pictures, predicted from a past and a future reference picture (I or P). The I-pictures are the most important, since they are reference pictures and can provide access points (in the bitstream) where decoding can begin without any reference to previous pictures (in such pictures, only the spatial redundancy is eliminated). By reducing both spatial and temporal redundancy, P-pictures offer better compression than I-pictures, which reduce only the spatial redundancy. B-pictures offer the highest degree of compression.
In MPEG-4, several structures are used, for example the video objects (VOs), which are entities that a user is allowed to access and manipulate, and the video object planes (VOPs), which are instances of a video object at a given time. In an encoded bitstream, different types of VOPs can be found: intra coded VOPs, using only spatial redundancy (the most expensive in terms of bits), predictive coded VOPs, using motion estimation and compensation from a past reference VOP, and bidirectionally predictive coded VOPs, using motion estimation and compensation from past and future reference VOPs.
For P-VOPs and B-VOPs, only the difference between the current VOP and its reference VOP(s) is coded. Only P- and B-VOPs are concerned by the motion estimation, carried out according to the so-called "Block Matching Algorithm": for each macroblock of the current frame, the macroblock which matches the best in the reference VOP is sought in a predetermined search zone, and a motion vector MV is then calculated. The resemblance criterion is given by the Sum of Absolute Differences (SAD). For an N×N macroblock, SAD is expressed as:

\[ \mathrm{SAD} = \sum_{i=0}^{N \times N} \left| A(i) - B(i) \right| \]
Thus the chosen macroblock is the one corresponding to the smallest SAD among those calculated in the search zone. For said estimation, different modes exist, depending on the type of the frame:
(a) for P-VOP macroblocks, only the "forward mode" (use of a past reference I-VOP or P-VOP) is available;
(b) for B-VOP macroblocks, four modes are available for the macroblock estimation:
"forward mode" (as for P-VOPs);
"backward mode": as the forward mode, except that the reference is no longer a past one but a future P- or I-VOP;
"interpolated mode" or "bidirectional mode": it combines the forward and backward modes and uses a past and a future reference VOP;
"direct mode": each motion vector is calculated from the motion vector of the future reference VOP and from the temporal distance between the different VOPs.
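The block matching search described above can be illustrated with a minimal Python sketch. It is not the encoder's actual implementation: the function names (`sad`, `get_block`, `block_match`), the representation of a picture as a list of pixel rows, and the exhaustive square search zone are all assumptions made for illustration. It also assumes that A(i) and B(i) in the SAD expression denote the pixel values of the current macroblock and of the candidate macroblock in the reference VOP, which the text does not state explicitly.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def get_block(frame, top, left, n):
    """Extract the n x n block whose top-left corner is (top, left)."""
    return [row[left:left + n] for row in frame[top:top + n]]


def block_match(current, reference, top, left, n, search_range):
    """Exhaustive search over a square zone (assumed shape).

    Returns (motion_vector, best_sad), where motion_vector is the (dy, dx)
    displacement of the best-matching block in the reference picture.
    """
    height, width = len(reference), len(reference[0])
    target = get_block(current, top, left, n)
    best_mv, best_sad = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference picture.
            if y < 0 or x < 0 or y + n > height or x + n > width:
                continue
            candidate_sad = sad(target, get_block(reference, y, x, n))
            # Keep the candidate with the smallest SAD seen so far.
            if best_sad is None or candidate_sad < best_sad:
                best_mv, best_sad = (dy, dx), candidate_sad
    return best_mv, best_sad
```

For a P-VOP macroblock, `reference` would be the past I- or P-VOP (forward mode); for a B-VOP macroblock the same search could be run against the past and the future reference VOPs to evaluate the forward and backward modes.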
Within MPEG-4, an important functionality, scalability, is offered. Scalable coding, also known as "layered coding", generates a coded representation in a manner that enables a scalable decoding operation. Scalability is the property of a bitstream that allows decoding of appropriate subsets of data, leading to the generation of complete pictures of a resolution and/or quality commensurate with the proportion of the bitstream decoded. Such a functionality is useful in the numerous applications that require video sequences to be simultaneously available at a variety of resolutions and/or qualities and/or complexities. Indeed, if a bitstream is scalable, one user will access only a portion of it to provide basic video in accordance with his own decoder or display, or with the available bandwidth, while another one will use the full bitstream to produce a better video quality.
The advantage of scalability, which costs less in terms of coding process than the solution according to which several independent bitstreams are coded, is that it makes it possible to deliver a bitstream separable into at least two different bitstreams (and, among them, one with a higher bitrate than the others). Each type of scalability therefore involves more than one layer. In the case of temporal scalability, at least two layers consisting of a lower layer and a higher layer are considered. The lower layer is referred to as the base layer, encoded at a given frame rate, and the additional layer is called the enhancement layer, encoded to provide the information missing in the base layer (in order to form a video signal with a higher frame rate) and thus to provide a higher temporal resolution at the display side. A decoder may decode only the base layer, which corresponds to the minimum amount of data required to decode the video stream, or also decode the enhancement layer (in addition to the base layer), said enhancement layer corresponding to the additional data required to provide an enhanced video signal, and then output more frames per second if a higher temporal resolution is required.
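As a rough illustration of temporal scalability, the following Python sketch distributes the frames of a sequence between a base layer and an enhancement layer and shows which frames each kind of decoder would output. The function names and the regular "one frame out of `base_period`" split are assumptions chosen for simplicity; the text does not prescribe how frames are assigned to layers.

```python
def split_layers(frame_indices, base_period=2):
    """Assign frames to layers (assumed pattern: every base_period-th frame is base)."""
    base = [i for i in frame_indices if i % base_period == 0]
    enhancement = [i for i in frame_indices if i % base_period != 0]
    return base, enhancement


def displayed_frames(base, enhancement, decode_enhancement):
    """Frames output in display order by a decoder.

    A base-only decoder outputs the base layer frames; a decoder that also
    handles the enhancement layer outputs the frames of both layers.
    """
    if decode_enhancement:
        return sorted(base + enhancement)
    return list(base)
```

With `base_period=2`, a decoder limited to the base layer displays half of the frames, while a decoder that also decodes the enhancement layer recovers the full frame rate.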
As said above, the MPEG-4 video standard includes a predictive coding scheme. When a scene-cut occurs, it is therefore much more efficient to code the first VOP which immediately follows said scene-cut as an I-VOP, instead of trying to predict it from the preceding VOP, which is completely different from it. In the case of temporal scalability, the problem is more complex, since the scene-cut may occur between two VOPs of the enhancement layer and must still be handled in the base layer. If the first VOP is coded as an I-VOP on each layer, this leads to a waste of bits and to a loss of coding efficiency.
It is therefore an object of the invention to propose a coding method that reduces said loss of coding efficiency in scene-cut situations.
To this end, the invention relates to a coding method such as defined in the introductory part of the description and which is moreover characterized in that the temporal references of the enhancement layer VOPs are selected, when a scene cut occurs and said enhancement layer VOPs are located between the last base layer VOP of a scene and the first base layer VOP of the following scene, according to the following specific processing rules:
(A) VOPs located before the scene cut:
(a) no constraint is applied to the coding type;
(b) the use of the next VOP in display order of the base layer as a temporal reference is forbidden;
(B) the VOP located immediately after the scene cut:
(a) P coding type is enforced;
(b) the next VOP in display order of the base layer is used as a temporal reference;
(C) other VOPs located after the scene cut:
(a) no constraint is applied to the coding type;
(b) the use of the previous VOP in display order of the base layer as a temporal reference is forbidden.
The main advantage of this solution is that it makes it possible to encode only one intra VOP while avoiding inefficient inter-scene predictions.
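The processing rules (A), (B) and (C) can be summarized by a small Python sketch. This is only one possible reading of the rules, not the encoder's implementation: the `ReferenceConstraint` type, the `constraints_for` function and the convention of indexing enhancement layer VOPs in display order, with `first_index_after_cut` designating the first VOP that follows the scene cut, are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReferenceConstraint:
    forced_coding_type: Optional[str]   # "P" when enforced, None when unconstrained
    required_reference: Optional[str]   # base layer VOP that must be used, if any
    forbidden_reference: Optional[str]  # base layer VOP that must not be used, if any


def constraints_for(vop_index, first_index_after_cut):
    """Constraint for an enhancement layer VOP located between the last base
    layer VOP of a scene and the first base layer VOP of the following scene.

    Indices follow display order (assumed convention).
    """
    if vop_index < first_index_after_cut:
        # Rule (A): before the cut, any coding type, but the next base layer
        # VOP (which belongs to the new scene) may not serve as a reference.
        return ReferenceConstraint(None, None, "next_base_vop")
    if vop_index == first_index_after_cut:
        # Rule (B): immediately after the cut, P coding type is enforced,
        # predicted from the next base layer VOP.
        return ReferenceConstraint("P", "next_base_vop", None)
    # Rule (C): later VOPs of the new scene, any coding type, but the previous
    # base layer VOP (which belongs to the old scene) may not be a reference.
    return ReferenceConstraint(None, None, "previous_base_vop")
```

Under this reading, no enhancement layer VOP is ever predicted across the scene cut, and no enhancement layer VOP needs to be intra coded, since the first VOP of the new scene in the enhancement layer is predicted from the base layer VOP of that same scene.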
The invention also relates to computer executable process steps stored on a computer readable medium and provided for carrying out such a video coding method, and to a computer program product comprising a set of instructions, which, when loaded into an encoder as described, causes it to carry out the steps of this method. It also relates to a video encoder comprising base layer coding means, receiving a video sequence and generating therefrom base layer signals that correspond to video objects (VOs) contained in the video frames of said sequence and constitute a first bitstream suitable for transmission at a base layer bit rate to a video decoder, and enhancement layer coding means, receiving said video sequence and a decoded version of said base layer signals and generating therefrom enhancement layer signals associated with corresponding base layer signals and suitable for transmission at an enhancement layer bit rate to said video decoder, said video encoder comprising:
(1) means for segmenting the video sequence into said VOs;
(2) means for coding the texture and the shape of successive video object planes (VOPs), the texture coding means performing a first coding operation without prediction for the VOPs called intracoded or I-VOPs, coded without any temporal reference to another VOP, a second coding operation with a unidirectional prediction for the VOPs called predictive or P-VOPs, coded using only a past or a future I- or P-VOP as a temporal reference, and a third coding operation with a bidirectional prediction for the VOPs called bidirectional predictive or B-VOPs, coded using both past and future I- or P-VOPs as temporal references, characterized in that the temporal references of the enhancement layer VOPs are selected, when a scene cut occurs and said enhancement layer VOPs are located between the last base layer VOP of a scene and the first base layer VOP of the following scene, according to the following specific processing rules:
(A) VOPs located before the scene cut:
(a) no constraint is applied to the coding type;
(b) the use of the next VOP in display order of the base layer as a temporal reference is forbidden;
(B) the VOP located immediately after the scene cut:
(a) P coding type is enforced;
(b) the next VOP in display order of the base layer is used as a temporal reference;
(C) other VOPs located after the scene cut:
(a) no constraint is applied to the coding type;
(b) the use of the previous VOP in display order of the base layer as a temporal reference is forbidden.