1. Technical Field
The present invention deals with a method for scalable video coding.
Video coding is a complex procedure, composed of a chain of different operations: motion estimation, spatial transform, quantization, entropy coding. The first operation, motion estimation, plays a major role in the process, and its efficiency strongly affects the obtainable compression ratio. During this step, in fact, a prediction of the contents of a photogram is computed starting from the adjacent ones, exploiting the high similarity which usually characterises subsequent photograms.
2. Description of the Related Art
Hereinbelow, the term “reference photogram” means a photogram that has already been processed, so that it can be reconstructed by the decoder. The term “current photogram” means the photogram to be coded, namely the object of the processing. The reference photogram is modified in order to approximate the current photogram.
The similarity between subsequent photograms can be expressed through “distortions”. The coded flow is composed of the differences between the current photogram prediction and the current photogram itself, and of additional information which allows the decoder to obtain the same prediction and thereby achieve perfect reconstruction. In this way, the energy of the coded information is minimised, maximising the compression factor.
In traditional standards of the hybrid type (e.g. MPEG-2, H.264/AVC) the most used motion estimation technique is the so-called “block matching”: the whole current photogram is divided into small blocks of variable size, and each of them is associated, on the reference photogram, with the block having the most similar contents, which therefore minimises the difference energy. The two small blocks (one on the current photogram and one on the reference photogram) therefore presumably identify the same image portion; such image portion is often subjected to an offset when passing from one photogram to the following one, due either to a movement of the filmed objects or to the camera movement. It is therefore possible to associate every small block of the current image with a two-dimensional vector, which represents the offset to which such small block has been subjected with respect to the previous photogram. Such two-dimensional vector which identifies the offset is called “motion vector” (MV).
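The block search described above can be illustrated with a minimal sketch. It is not taken from any of the cited standards: the function names `block_energy` and `best_motion_vector`, the fixed square block size, the exhaustive search window and the use of the sum of squared differences as "difference energy" are all assumptions made for illustration. The returned vector follows the convention that the matching reference block lies at the current block position plus (dy, dx).

```python
def block_energy(cur, ref, cy, cx, ry, rx, size):
    """Sum of squared differences between the block of `cur` at (cy, cx)
    and the block of `ref` at (ry, rx). Hypothetical cost function."""
    return sum(
        (cur[cy + i][cx + j] - ref[ry + i][rx + j]) ** 2
        for i in range(size)
        for j in range(size)
    )


def best_motion_vector(cur, ref, cy, cx, size, search):
    """Exhaustive search over offsets in [-search, search] in both directions.
    Returns ((dy, dx), energy) for the reference block with the lowest energy."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            # Skip candidate blocks that fall outside the reference photogram.
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue
            energy = block_energy(cur, ref, cy, cx, ry, rx, size)
            if best is None or energy < best[1]:
                best = ((dy, dx), energy)
    return best


# Synthetic 8x8 reference with unique pixel values, and a current photogram
# whose content is the reference moved one row down and two columns left.
ref = [[8 * y + x for x in range(8)] for y in range(8)]
cur = [
    [ref[y - 1][x + 2] if 0 <= y - 1 < 8 and 0 <= x + 2 < 8 else 0
     for x in range(8)]
    for y in range(8)
]
mv, energy = best_motion_vector(cur, ref, cy=3, cx=2, size=2, search=2)
```

In a coder of this kind, `mv` would be written to the coded flow together with the block residual; the variable block sizes mentioned above are left out of the sketch.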
In the coded flow, consequently, reference photograms, MVs and differences between individual blocks and their predictions are inserted.
The use of block matching introduces some visual artefacts in decoded flows with a high compression ratio, but appears to be the most efficient method for computing the motion estimation in hybrid coders (namely in coders combining motion-compensated prediction and spatial compression).
With the advent of new video coding technologies, based on transforms different from the traditional DCT (Discrete Cosine Transform), such as the “wavelet” transform, an efficiency loss is observed when block matching is used as motion estimation technique. The wavelet transform, in fact, contrary to the DCT, which operates on blocks, is applied to the whole photogram, and the block matching technique therefore introduces discontinuities at the edges of the small blocks which, in the transformed domain, give rise to high-frequency components. Such components severely limit performance during the quantization step. Therefore, a need arises for a new type of motion representation.
A motion estimating device is known which is based on an alternative approach to block matching, the so-called “optical flow”, which computes the point-wise distortion of the reference photogram for determining a current photogram prediction without resorting to blocks. The optical flow technique is described for example in B. Horn, B. Schunck, “Determining optical flow”, Artificial Intelligence, vol. 17, pp. 185-203, 1981. The optical flow is computed by solving a system of linear equations, whose coefficients are obtained from space and time derivatives of the current photogram, namely from point-wise differences between adjacent pixels and/or between pixels subsequent in time. The solution is a set of two-dimensional vectors, one for every photogram pixel, called “motion field”.
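As a rough illustration of how a linear system built from space and time derivatives yields a motion vector, the sketch below solves the 2x2 least-squares system for a single window. This is a local, Lucas-Kanade-style solve that has been swapped in for brevity; it is not the global regularised system of Horn and Schunck, and it yields one vector per window rather than one per pixel. The function name `estimate_flow`, the central-difference derivatives and the synthetic quadratic test pattern are assumptions.

```python
def estimate_flow(frame0, frame1, centre, radius):
    """Solve sum(Ix*Ix)*u + sum(Ix*Iy)*v = -sum(Ix*It)
             sum(Ix*Iy)*u + sum(Iy*Iy)*v = -sum(Iy*It)
    over a square window; returns the displacement (u, v) in pixels."""
    cy, cx = centre
    a11 = a12 = a22 = b1 = b2 = 0.0
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            # Space derivatives: central differences between adjacent pixels.
            ix = (frame0[y][x + 1] - frame0[y][x - 1]) / 2.0
            iy = (frame0[y + 1][x] - frame0[y - 1][x]) / 2.0
            # Time derivative: difference between subsequent photograms.
            it = frame1[y][x] - frame0[y][x]
            a11 += ix * ix
            a12 += ix * iy
            a22 += iy * iy
            b1 -= ix * it
            b2 -= iy * it
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("gradients too weak or too uniform to solve for motion")
    u = (a22 * b1 - a12 * b2) / det
    v = (a11 * b2 - a12 * b1) / det
    return u, v


def pattern(x, y):
    """Smooth synthetic brightness pattern (an assumption for the demo)."""
    return x * x + x * y + 2.0 * y * y


# Second photogram: same pattern displaced by a sub-pixel amount.
true_u, true_v = 0.1, -0.05
frame0 = [[pattern(x - 3, y - 3) for x in range(7)] for y in range(7)]
frame1 = [[pattern(x - 3 - true_u, y - 3 - true_v) for x in range(7)]
          for y in range(7)]
u, v = estimate_flow(frame0, frame1, centre=(3, 3), radius=2)
```

On this symmetric quadratic pattern the solve recovers the imposed displacement up to rounding; on real photograms the linearisation only holds for small displacements, which is the limitation the multi-resolution schemes discussed later are meant to address.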
The motion field produced by the optical flow can be determined so that it is regular, or “smooth”, through the addition of regularization terms to the system of linear equations. The smooth field gives rise to residuals which do not show the typical discontinuities of block matching and are suitable for decomposition with the wavelet transform.
There are optical flow embodiments in which the motion fields are computed iteratively, so that each iteration determines a field which is inserted as a term in a sum of fields. The final sum is the motion field. The article by P. Giaccone, G. Jones, “Spatio-temporal approaches to the computation of optical flow”, Proceedings of the British Machine Vision Conference, 1997, describes for example the use of the optical flow technique with a particular solution for building the first motion field, in which the first motion estimation is based on the identification and tracking of some salient points.
It is known to apply multi-resolution motion estimation techniques through optical flow in video coding contexts, as described for example in P. Moulin, R. Krishnamurthy and J. Woods, “Multiscale Modeling and Estimation of Motion Fields for Video Coding”, IEEE Transactions on Image Processing, vol. 6, no. 12, pp. 1606-1620, December 1997.
There are in particular motion estimation embodiments through optical flow which use a “coarse-to-fine” procedure, namely a multi-resolution one. Such techniques provide for the construction of motion fields as a sum. Every term of such sum corresponds to a level of a pyramid containing different space resolutions. The purpose of these procedures is to overcome the difficulties encountered by optical flow algorithms in computing wide motion, namely in determining offsets which exceed a certain number of pixels.
In practice, such techniques operate as follows. The first term is composed of the motion field estimated for photograms at the lowest resolution level. The following terms are produced in the following way:
1. One goes up by one resolution level and photograms at such level are considered.
2. A motion field is created expanding through interpolation and scaling the previously-computed field.
3. The reference photogram is deformed with the field created thereby.
4. The motion field between the deformed photogram and the current photogram is computed. Such field is a term of the sum, and is added to what has already been computed.
5. If the maximum resolution level has not been reached, the process is repeated from step 1.
In such technique, therefore, the computed motion fields for lower levels are used as terms in the final field.
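The five steps above can be sketched on a one-dimensional signal. The sketch is heavily simplified and all names (`downsample`, `warp`, `small_search`, `coarse_to_fine`) are hypothetical: the motion field is reduced to a single global integer shift, the pyramid is built by pairwise averaging, and a brute-force search over shifts of at most one sample stands in for the optical flow solver at each level. Only the accumulation of per-level terms mirrors the procedure described.

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs of samples."""
    return [(signal[i] + signal[i + 1]) / 2.0
            for i in range(0, len(signal) - 1, 2)]


def warp(signal, shift):
    """Deform the signal by moving its content `shift` samples to the right,
    repeating the border sample where data is missing."""
    n = len(signal)
    return [signal[min(max(i - shift, 0), n - 1)] for i in range(n)]


def small_search(reference, current, radius=1):
    """Stand-in for the per-level estimator: the integer shift within
    [-radius, radius] that minimises the squared residual."""
    best_shift, best_cost = 0, None
    for s in range(-radius, radius + 1):
        cost = sum((a - b) ** 2 for a, b in zip(warp(reference, s), current))
        if best_cost is None or cost < best_cost:
            best_shift, best_cost = s, cost
    return best_shift


def coarse_to_fine(reference, current, levels):
    """Accumulate the motion as a sum of per-level terms."""
    ref_pyr, cur_pyr = [reference], [current]
    for _ in range(levels - 1):
        ref_pyr.append(downsample(ref_pyr[-1]))
        cur_pyr.append(downsample(cur_pyr[-1]))
    # First term: motion estimated at the lowest resolution level.
    total = small_search(ref_pyr[-1], cur_pyr[-1])
    for level in range(levels - 2, -1, -1):       # step 1: go up one level
        total *= 2                                # step 2: expand and scale
        deformed = warp(ref_pyr[level], total)    # step 3: deform reference
        total += small_search(deformed, cur_pyr[level])  # step 4: add new term
    return total                                  # step 5 is the loop itself


# Triangular bump moved 5 samples to the right, well beyond the +/-1 range
# that the per-level estimator can handle on its own.
reference = [max(0, 10 - abs(i - 30)) for i in range(64)]
current = [max(0, 10 - abs(i - 35)) for i in range(64)]
shift = coarse_to_fine(reference, current, levels=3)
```

With three levels, each limited to a correction of one sample, the accumulated sum reaches the full 5-sample offset, which illustrates why the terms from the lower levels are kept in the final field.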
The document by Eero P. Simoncelli: “Bayesian multiscale differential optical flow”, in Handbook of Computer Vision and Applications, eds. B. Jähne, H. Haussecker, and P. Geissler, Academic Press, 1999 describes for example a multi-resolution optical flow arrangement, comprising the use of an algorithm which is able to manage the uncertainty inherent in the motion estimation at multiple levels. The motion field at a certain level is modelled as the sum of the motion deriving from the lower levels and a stochastic component. Other documents related to multi-resolution optical flow motion estimation are U.S. Pat. No. 5,680,487 and U.S. Pat. No. 5,241,608.
The optical flow can be applied in a different environment with respect to traditional coding, in particular in a context of Scalable Video Coding (SVC), also called “level” coding. The objective of the SVC technique is to perform a single coding of a video flow, originating a bitstream from which it is possible to obtain flows with multiple qualities. In fact, from such bitstream it is possible to extract a new bitstream related to a video flow with the desired resolution (chosen from a set of possible resolutions), taking into account the space, time (in terms of “frame rate”) and quality (in terms of “bit rate”) dimensions. Arrangements using both hybrid technologies and wavelet-based approaches are known.
Scalable coding is important, for example, for transmitting on noisy channels: in fact, it is possible to protect the most important levels (the basic levels) by transmitting them on channels with better performance. Scalable coding is also very useful on channels with variable bit rate: when the band is reduced, the less important layers are not transmitted. Another useful application of scalability is progressive transmission, in which a user can view a video preview, coded only with the basic levels, for example in order to make a choice in a database; once the decision has been taken, the user will be able to receive the video at the best quality.
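The variable bit-rate case can be illustrated with a small sketch. The function `select_layers`, the per-layer rates and the greedy rule are hypothetical and are not taken from any SVC specification; the sketch only shows the idea that layers are ordered by importance and the less important ones are dropped when the band is reduced.

```python
def select_layers(layer_rates_kbps, available_kbps):
    """Return how many layers, counted from the basic layer upwards, fit in
    the available band. Each enhancement layer is assumed to be usable only
    if all the layers below it are also transmitted."""
    used = 0
    count = 0
    for rate in layer_rates_kbps:
        if used + rate > available_kbps:
            break  # this layer and all less important ones are not sent
        used += rate
        count += 1
    return count


# Hypothetical stream: one basic layer and two enhancement layers.
rates = [200, 300, 500]
full_band = select_layers(rates, 1000)    # band sufficient for every layer
reduced_band = select_layers(rates, 600)  # band reduced: top layer dropped
```

The same selection, applied with a band that only fits the basic layer, corresponds to the preview scenario described above.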
In scalable coders, in case of space scalability, the motion estimation step is highly complicated and performances are strongly affected by such step. The optimum prediction for lower resolutions, in fact, is not always given by scaling of motion vectors computed at full resolution, due to the loss of details and the appearance of aliasing. The optimum solution, for the motion field associated with each space resolution, cannot be derived from other resolutions. It is therefore impossible to determine the optimum motion for all resolutions by exclusively computing it in a limited set thereof. However, the motion representation inserted in the bitstream must be unique, to avoid an excessive occupation of bits dedicated to the motion field.
It is thereby necessary to find the best compromise able to optimise performances for all affected resolutions.
There are approaches, based on a motion estimation of the block matching type, which provide for the computation of the motion field for each of the provided resolutions, and represent the motion information univocally by inserting in the coded flow a compromise which is able to keep good performances for each scalability level. The following approaches differ depending on the computation modes of the compromise:
In EP0644695A2, starting from the estimation computed on a basic layer and on an enhancement layer, the final motion field computation is performed, by using weight functions guided by the estimation validity (in terms of residual energy) for each level.
D. Taubman, N. Mehrseresht, R. Leung, “SVC Technical Contribution: Overview of recent technology developments at UNSW”, ISO/IEC JTC1/SC29/WG11/M10868, 2004, describes an adaptive process which assigns a variable weight, depending on the estimation validity, to motion information coming from high-frequency bands computed during space filtering.
The Applicant has observed that, although the above documents describe motion estimation algorithms based on multi-resolution analysis with the use of a compromise for different scalability levels, motion estimation always occurs by applying block matching. Block matching, however, has difficulties if applied to scalable coding, since it is not known how to realise motion fields which are able to faithfully represent different space resolutions.