1. General Principle of Scalable Video Encoding
Today, many data transmission systems are heterogeneous in that they serve a plurality of clients having highly varied types of access to data. Thus, for example, the Internet is accessible both from a personal computer (PC) and from a mobile telephone. More generally, the network access bandwidth, the processing capacity of the client terminals and the size of their screens vary greatly from one user to another. Thus, for example, a first client can access the Internet from a powerful PC over a high-bit-rate ADSL (Asymmetric Digital Subscriber Line) connection, whereas a second client tries to access the same data at the same point in time from a PDA (Personal Digital Assistant) connected through a low-bit-rate modem.
To meet these different needs, scalable image encoding algorithms have been developed, providing adaptable quality and variable spatio-temporal resolution. According to these techniques, the encoder generates a compressed stream with a layered structure. For example, a first data layer conveys a 256-kbit/s stream that can be decoded by a PDA-type terminal, and a second, complementary data layer conveys an additional 256-kbit/s stream that can be decoded, as a complement to the first, by a more powerful PC-type terminal. In this example, the total bit rate needed to convey these two nested layers is 512 kbit/s.
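The nesting of the layers implies that a client must receive every layer up to its decoding level. The arithmetic of the example above can be sketched as follows; the layer names and figures are purely illustrative and are not drawn from any standard:

```python
def cumulative_rate(layers):
    """Bit rate a client must receive to decode up to each layer,
    since every enhancement layer is decoded on top of the ones below.
    `layers` maps layer name to its own bit-rate contribution (kbit/s)."""
    total = 0
    cumulative = {}
    for name, rate in layers.items():
        total += rate
        cumulative[name] = total
    return cumulative

# Two nested layers of 256 kbit/s each, as in the example above:
rates = cumulative_rate({"base": 256, "enhancement": 256})
# A PDA decoding only the base layer receives 256 kbit/s;
# a PC decoding both nested layers receives 512 kbit/s in total.
```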
Such encoding algorithms are thus very useful for all the applications for which the generation of a single compressed stream, organized in several layers of scalability, can serve several clients with different characteristics.
2. The SVC Encoder
The SVC standard, corresponding to amendment 3 of the H.264/MPEG-4 AVC standard, part 10, more particularly defines the structure of scalable video streams. Such a stream comprises a base layer, also called a base level, compatible with the H.264/MPEG-4 AVC standard, part 10, and one or more enhancement layers. The enhancement layers are encoded by prediction relative to a preceding layer (inter-layer prediction) and, classically, relative to other images of the sequence (intra prediction or classic temporal prediction). It may be recalled that, for inter-layer prediction, three types of prediction can be used: motion vector prediction, prediction of the residues derived from temporal prediction, and texture prediction.
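As a loose illustration of the first of these prediction types, inter-layer motion prediction can reuse a motion vector estimated at the lower resolution level, scaled by the spatial ratio between the two levels, so that only a small correction remains to be encoded at the higher level. The scaling-by-2 and the function names below are assumptions for the sketch, not the standard's actual syntax:

```python
def predict_mv(lower_mv, ratio=2):
    """Predict a higher-level motion vector by scaling the vector
    estimated at the lower resolution level (illustrative only)."""
    return (lower_mv[0] * ratio, lower_mv[1] * ratio)

def mv_correction(true_mv, lower_mv, ratio=2):
    """The small residual vector that would actually be encoded."""
    pred = predict_mv(lower_mv, ratio)
    return (true_mv[0] - pred[0], true_mv[1] - pred[1])

# A lower-level vector (3, -1) predicts (6, -2) at the higher level;
# if the true vector is (7, -2), only the correction (1, 0) is encoded.
```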
More specifically, FIG. 1 illustrates the structure of an SVC encoder of this kind, having three layers of different spatial resolutions (one base resolution level and two higher resolution levels).
The video input components 10 at the highest resolution level, and the video input components sub-sampled at least once by 2D spatial decimation (11) at the lower resolution levels, enter a module 12 implementing operations of temporal decomposition and motion estimation.
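The 2D spatial decimation of block 11 can be sketched minimally as keeping every other row and column; this is a deliberate simplification, since a real encoder would low-pass filter the image before decimating to avoid aliasing:

```python
def decimate_2d(image):
    """Sub-sample a 2D image (list of rows) by a factor of 2 in each
    dimension by keeping every other row and every other column.
    Minimal stand-in for spatial decimation; no anti-alias filtering."""
    return [row[::2] for row in image[::2]]

# A 4x4 frame becomes its 2x2 lower-resolution version:
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
half = decimate_2d(frame)
```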
This module 12 feeds the motion estimation and compensation modules 13 with motion information 14, and feeds the intra-prediction modules 16 with texture information 15.
The data output from the intra-prediction module 16 feeds a transformation and entropy encoding block 17. The data coming from this block 17 serves especially to perform a 2D spatial interpolation (18) from the lower resolution level. Finally, a multiplexing module 19 orders the different sub-streams generated into an overall compressed data stream 20.
In other words, the input sequence is sub-sampled at least once and the SVC encoder performs the following steps:
    the base level is encoded with a basic quality;
    the enhancement levels are encoded with a higher quality;
    the texture and motion information is refined;
    the difference between the different resolution levels is determined and this difference is encoded (entropy encoding).
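The last two steps above can be sketched as follows: the higher-resolution level is represented by its residual relative to the interpolated lower level, and it is this residual that would be handed to the entropy encoder. The nearest-neighbour interpolation and integer pixel values are simplifying assumptions, not the actual SVC operations:

```python
def upsample_2d(image):
    """Nearest-neighbour 2D interpolation by 2 in each dimension
    (a crude stand-in for the spatial interpolation of block 18)."""
    out = []
    for row in image:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

def layer_residual(high_res, low_res):
    """Residual of the high-resolution level against the interpolated
    lower level; this difference is what would be entropy encoded."""
    pred = upsample_2d(low_res)
    return [[h - p for h, p in zip(hr, pr)] for hr, pr in zip(high_res, pred)]
```

When the higher level matches its interpolated prediction exactly, the residual is all zeros, which an entropy coder represents very cheaply; the cost grows with the inter-layer difference.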
3. The Epitomes
In order to improve the compression of images or image sequences, Q. Wang, R. Hu and Z. Wang, in "Improving Intra Coding in H.264/AVC by Image Epitome" (Advances in Multimedia Information Processing), have proposed a novel intra-prediction technique for AVC encoders/decoders based on the use of epitomes, or jigsaws.
An epitome is a condensed, generally miniature version of an image containing the main texture and contour components of this image. The size of the epitome is generally reduced as compared with that of the original image, but the epitome always contains the constituent elements most relevant for rebuilding the image. As described in the above-mentioned document, the epitome can be built by using a maximum likelihood estimation (MLE) type of technique associated with an expectation-maximization (EM) type of algorithm. Once the epitome has been built for the image, it can be used to rebuild (synthesize) certain parts of the image.
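The synthesis step can be loosely illustrated as follows: each block of the image is rebuilt by copying the best-matching patch from the (much smaller) epitome. Matching by a sum of squared differences is an assumption made here for simplicity; the cited work uses an MLE/EM formulation rather than this direct search:

```python
def best_patch(block, epitome_patches):
    """Return the epitome patch closest to `block` under a sum of
    squared differences; patches and blocks are flat lists of pixels.
    Illustrative stand-in for the probabilistic matching of the paper."""
    def ssd(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(epitome_patches, key=lambda p: ssd(block, p))

# Rebuilding a block [1, 2] from three candidate epitome patches:
rebuilt = best_patch([1, 2], [[0, 0], [1, 3], [9, 9]])
```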
In particular, in the above-mentioned document, Q. Wang et al. have proposed generating the epitome iteratively. A pyramid of epitomes is thus built in which, classically, the epitome obtained from a lower-level image during one iteration serves, after interpolation, as the initialization for generating the epitome corresponding to the higher-level image during the next iteration. The epitome generated from a sub-sampled image is therefore used to generate the epitome associated with the image in its non-sub-sampled version.
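This pyramidal flow can be sketched as below. `refine_epitome` is a hypothetical placeholder for the EM-based optimization of Q. Wang et al. (here a no-op, so the flow stays runnable); only the structure of the iteration, interpolating each level's epitome to initialize the next, reflects the description above:

```python
def upsample_2x(epitome):
    """Nearest-neighbour interpolation by 2 in each dimension."""
    out = []
    for row in epitome:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def refine_epitome(init, image):
    """Placeholder for the EM refinement of the epitome against
    `image`; returned unchanged here (hypothetical stand-in)."""
    return init

def pyramid_epitome(images_low_to_high, base_epitome):
    """Iterate from the lowest-resolution image upward: the epitome of
    level i, after interpolation, initializes the epitome of level i+1."""
    epitome = base_epitome
    for image in images_low_to_high[1:]:
        epitome = refine_epitome(upsample_2x(epitome), image)
    return epitome
```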
Unfortunately, in this approach, the epitome associated with the image in its non-sub-sampled (high-resolution) version is built from a "low-resolution" epitome, corresponding to a "degraded" version of the image, by over-sampling this low-resolution epitome. The prediction of the image in its non-sub-sampled version is therefore not of high quality.
Furthermore, the epitomes obtained at different resolution levels can be very different. Thus, the epitome built directly at resolution level i+1 can be very different from an epitome built at resolution level i during a first iteration and over-sampled to resolution level i+1. The consistency of the information between resolution level i and a layer of a higher resolution level (i+1) is therefore not ensured, and it is not possible to accurately predict a layer of a higher resolution level (i+1) relative to a layer of a lower resolution level (i). The epitome associated with the image in its non-sub-sampled version according to this approach therefore cannot be used by a scalable video encoder/decoder such as SVC.