At present, many data transmission systems are heterogeneous in the sense that they serve a plurality of customers having many varied types of access to data. Thus, the worldwide Internet, for example, is accessible from a PC type terminal as well as from a radio-telephone. More generally, the bandwidth for access to the network, the processing capacities of the customer terminals and the size of their screens vary greatly from one user to another. Thus, a first customer may, for example, access the Internet from a powerful PC with an ADSL bit rate of 1024 kbits/s at his disposal while a second customer seeks to access the same data at the same time using a PDA (personal digital assistant) type terminal connected to a modem with a low bit rate.
These different users therefore need to be offered a data stream adapted to their requirements, which vary in terms of both bit rate and image resolution. This necessity applies more broadly to all applications accessible to customers having a wide variety of access and processing capacities, and especially to the following applications:
    VOD (“Video On Demand”), accessible to UMTS (“Universal Mobile Telecommunications System”) type radio-communications terminals, PCs or television terminals with ADSL access;
    session mobility (for example resumption on a PDA of a video session begun on a television set or, on a UMTS type mobile, of a session begun on GPRS (“General Packet Radio Service”));
    session continuity (in the context of sharing of the bandwidth with a new application);
    high-definition television, wherein a single video encoding must provide for service to customers having standard definition (SD) as well as those having high definition (HD);
    video-conferencing, wherein a single encoding must meet the needs of customers having UMTS access and Internet access;
    etc.
To meet these different requirements, scalable image-encoding algorithms have been developed, enabling adaptable quality and variable space-time resolution. The encoder generates a compressed stream with a hierarchical structure of layers in which each of the layers is embedded in a higher-level layer. For example, a first data layer conveys a 256 kbits/s stream which may be decoded by a PDA type terminal, and a second complementary data layer conveys an additional 256 kbits/s stream of higher resolution which can be decoded, complementarily to the first stream, by a more powerful PC type terminal. The bit rate needed to transport these two embedded layers is, in this example, 512 kbits/s.
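The arithmetic of the embedded-layer example above can be sketched as follows. This is only an illustration: the layer table is hypothetical, and it assumes that the second layer contributes the remaining 256 kbits/s implied by the 512 kbits/s total.

```python
from itertools import accumulate

# Hypothetical layer table built from the example figures in the text:
# (layer name, bit rate contributed by that layer alone, in kbits/s).
layers = [("base layer (PDA)", 256), ("complementary layer (PC)", 256)]

# Layers are embedded: decoding layer k requires layers 0..k, so the
# bit rate to transport is the running sum of the per-layer rates.
cumulative = list(accumulate(rate for _, rate in layers))

for (name, _), total in zip(layers, cumulative):
    print(f"decoding up to {name}: {total} kbits/s")
```

A terminal that decodes only the first layer needs the first cumulative value; a terminal that decodes both layers needs the last one.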
Certain of these scalable video-encoding algorithms are now being adopted by the MPEG (“Moving Picture Experts Group”) standard in the context of the MPEG-21 working group.
In particular, the model recently chosen by the MPEG-21 Working Group, the SVC (“Scalable Video Coding”) model, is called the SVM (“Scalable Video Model”) and relies on a scalable encoder built on AVC (“Advanced Video Coding”) type solutions. This model is described in detail in the document N6716 ISO/IEC JTC 1/SC 29/WG 11, entitled “Scalable Video Model 3.0”, Oct. 2004, Palma de Majorca, Spain. The MPEG-21 working group aims to propose a standard for the supply of scalable streams that are medium-grained in the space-time dimensions and in quality.
2.1 The MPEG-21 SVM Encoder
2.1.1 Main Characteristics of the Encoder
FIG. 1 illustrates the structure of such an encoder, having a pyramid structure. The video input components 10 undergo a dyadic sub-sampling operation (2D decimation by two referenced 11, 2D decimation by four referenced 12). Each of the sub-sampled streams then undergoes an MCTF (motion-compensated temporal filtering) type temporal decomposition 13. A low-resolution version of the video sequence is encoded 14 up to a given bit rate R_r0_max corresponding to the maximum decodable bit rate for the low spatial resolution r0 (this base level is AVC compatible).
The upper levels are then encoded 15, 16 by subtraction of the previous reconstructed and over-sampled level and by encoding the residues in the form of:
    a base level;
    possibly one or more enhancement levels obtained by multi-run encoding of bit planes (hereinafter called FGS for “fine-grain scalability”).
The prediction residue is encoded up to a bit rate R_ri_max which corresponds to the maximum bit rate decodable for the resolution ri.
More specifically, the MCTF filtering blocks 13 perform a temporal wavelet filtering, i.e. they realign the signals in the direction of the motion before wavelet filtering: they deliver information on motion 17, fed to a motion-encoding block 14-16, and textural information 18, fed to a prediction module 19. The predicted data output from the prediction module 19 serve for the performance of an interpolation 20 from the lower level. They are also fed to a space transformation and entropy encoding block 21 that works on refinement levels of the signal. A multiplexing module 22 orders the different sub-streams generated into a total compressed data stream.
FIG. 2 illustrates the results obtained by means of the scalable encoder of FIG. 1 in the form of bit-rate/distortion curves represented for different spatial resolutions (CIF/QCIF for “Common Intermediate Format/Quarter Common Intermediate Format”, where the CIF corresponds to a TV semi-format, and the QCIF to a TV quarter format) or different temporal resolutions (7.5-30 Hz, number of images per second). The y-axis shows the PSNR (“Peak Signal to Noise Ratio”) and the x-axis shows the bit rate expressed in kbits/s. Thus, the curve referenced 23 corresponds to a QCIF spatial resolution with a temporal resolution of 7.5 Hz, the curve referenced 24 corresponds to a QCIF resolution at 15 Hz, the curve referenced 25 to a CIF resolution at 15 Hz, and the curve referenced 26 to a CIF resolution at 30 Hz.
2.1.2 Generation of Information Layers at the Encoder
FIG. 3 illustrates the mechanism of prediction/extraction of the information implemented by the SVM encoder. A more detailed description is given here below of the prediction implemented when encoding. This prediction consists in encoding a layer with a given level n spatial resolution by prediction from data from layers with lower-level spatial resolution.
More specifically, FIG. 3 presents an example of the generation of two successive spatial resolution layers in the QCIF and CIF formats, respectively associated with the bit rate/distortion curves referenced 30 (QCIF format) and 31 (CIF format). Those skilled in the art will have no difficulty in extending this example to the more general case of n>2 spatial layers. As above, the x-axis represents the bit rate expressed in kbits/s and the y-axis represents the PSNR in dB.
For each spatial resolution layer, the encoder encodes the information in the form of two sub-streams: a base sub-stream (sub-layer) called BL (for “base layer”) and a gradual enhancement sub-stream or sub-layer called EL (for “enhancement layer”).
The QCIF format is first of all encoded over all the ranges of values of temporal frequencies and bit rates. There is a base level (BL) 301 and two possible enhancement levels (EL), FGS1 referenced 302 and FGS2 referenced 303 (FGS for “fine-grain scalable”). The enhancement layer EL therefore has the two runs FGS1 302 and FGS2 303. Intermediate refinement points may be obtained when decoding by cutting data packets between FGS1 and FGS2.
The QCIF format is encoded up to a maximum bit rate point 304 which is then used as a reference for prediction during the encoding of the CIF format. This point must be the best one that can be defined for generally optimum functioning of the system.
The CIF format is then encoded by using the highest point of the QCIF curve 304 (i.e. the maximum bit rate point of this curve) as the predictor. The CIF information is also encoded in two sub-streams: a base sub-stream (BL) and an enhancement sub-stream (EL), constituted by two runs (FGS1 and FGS2).
FIG. 3 shows that, starting from the maximum QCIF bit rate point 304 and by adding the base layer (BL) 311 of the CIF spatial resolution level, the CIF reference point 312 is reached. This point is not the minimum bit rate point 313 that can be attained at decoding. Starting from this reference point 312, the enhancement layers EL 314 (FGS1) and 315 (FGS2) enable access to other higher CIF bit rate points, up to a maximum CIF bit rate 316.
FIG. 4 summarizes the order in which the information is processed at the encoder for any two successive spatial layers of levels n−1 and n, where n is an integer. BL represents the base quality sub-layer and EL represents the enhancement quality sub-layer of a spatial resolution level. Hence, first of all, the level n−1 base sub-layer BL is encoded 41, then the enhancement sub-layer EL of the level n−1 is encoded 42, then the base sub-stream BL of the level n spatial resolution is encoded 43, and then the enhancement sub-stream EL of this level n is encoded 44. The same procedure is performed subsequently for the higher levels of spatial resolution.
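The encoder-side processing order described for FIG. 4 can be sketched as a simple loop. The function name and the tuple representation are hypothetical and are used only for illustration.

```python
def encoding_order(num_levels):
    """Return the sub-layer encoding order for spatial levels 0..num_levels-1.

    Each spatial level is fully encoded (base sub-layer BL, then
    enhancement sub-layer EL) before the next higher level is started.
    """
    order = []
    for level in range(num_levels):
        order.append(("BL", level))
        order.append(("EL", level))
    return order


# Two spatial levels, e.g. QCIF (level 0) and CIF (level 1).
print(encoding_order(2))
```

The enhancement sub-layer of a level is thus always available before the base sub-layer of the next level is encoded, which is what allows the maximum bit rate point of the lower level to serve as the predictor.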
2.2 The MPEG-21 SVM Extractor
The extractor, also called a quality adaptation module here below, is the tool which performs the extraction, for the decoder, of the portion of the total data stream generated by the encoder, which corresponds to a given space-time resolution level and a given bit rate.
2.2.1 General Working of a Scalable Stream Extractor
There are two types of scalable encoders:
    the non-predictive “naturally scalable” encoders (based for example on a wavelet transformation), which do not specify particular relationships between the decoding points, embedded in one another (this is the case for example with the video encoders proposed by the JPEG2000 standard);
    the predictive SVM type encoders, which need to build embedding paths. More specifically, to carry out a compressed stream extraction, the extractor of the SVM follows predefined paths, embedded in one another, as shown in FIG. 5.
In FIG. 5, the x-axis shows the temporal resolution expressed in Hz, the y-axis shows the bit rate (high H, low L) and the z-axis shows the spatial resolution (QCIF or CIF). The total data stream 50 generated by the encoder consists of a set of sub-streams represented in the form of cubes, each corresponding to a given space-time resolution and a given bit rate. Thus, to extract the highest bit rate from the QCIF spatial resolution level at 7.5 Hz, the extractor must follow the following extraction path: CIF 30H→CIF 15H→QCIF 15H→QCIF 7.5H (it will be noted that CIF 30H designates for example the stream in the CIF spatial resolution format for a temporal frequency of 30 Hz, with a high bit rate level H).
Similarly, to extract the lowest bit rate of the QCIF at 7.5 Hz, the extractor must follow the path CIF 30H→CIF 15H→CIF 15L→QCIF 15L→QCIF 7.5L.
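The two extraction paths quoted above can be written down as data and checked for embedding. This is a sketch under a stated assumption: it takes "embedded" to mean that every step lowers exactly one of the three coordinates (spatial resolution, temporal resolution, bit rate level) and raises none, which is an interpretation for illustration and not a definition given in the text. All names are hypothetical.

```python
# Ordering of the non-numeric coordinates (hypothetical encoding).
SPATIAL = {"QCIF": 0, "CIF": 1}
RATE = {"L": 0, "H": 1}


def is_embedded_path(path):
    """True if every step lowers exactly one coordinate and raises none."""
    for (s0, t0, r0), (s1, t1, r1) in zip(path, path[1:]):
        before = (SPATIAL[s0], t0, RATE[r0])
        after = (SPATIAL[s1], t1, RATE[r1])
        lowered = sum(a < b for a, b in zip(after, before))
        raised = sum(a > b for a, b in zip(after, before))
        if lowered != 1 or raised != 0:
            return False
    return True


# Path to the highest bit rate of QCIF at 7.5 Hz.
high_path = [("CIF", 30, "H"), ("CIF", 15, "H"),
             ("QCIF", 15, "H"), ("QCIF", 7.5, "H")]

# Path to the lowest bit rate of QCIF at 7.5 Hz.
low_path = [("CIF", 30, "H"), ("CIF", 15, "H"), ("CIF", 15, "L"),
            ("QCIF", 15, "L"), ("QCIF", 7.5, "L")]

print(is_embedded_path(high_path), is_embedded_path(low_path))
```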
2.2.2 Operation of the MPEG-21 SVM Extractor
The MPEG-21 SVM extractor works as follows. To decode a video stream at a given bit rate Rt and with a space-time resolution St-Tt, a sub-stream is extracted from the total stream as follows: the base quality layers of all the levels of spatial resolution (from the base level to the target spatial resolution level St) (BLn−1, BLn, . . . ) are extracted for a cost of Rmin, corresponding to the minimum decodable bit rate for the spatial resolution St. After extraction of the base quality sub-streams, the authorized bit rate becomes Rt=Rt−Rmin.
The extractor then goes through the temporal sub-bands of the lower spatial resolutions and extracts the different enhancement layers EL of each sub-band. It makes a loop on the temporal sub-bands of lower spatial resolution and then a loop on the enhancement layers of each temporal sub-band.
Let Rf be the bit rate necessary to extract a quality layer from a temporal sub-band. If the authorized bit rate Rt>Rf, the layer of the sub-band considered is extracted and the bit rate becomes Rt=Rt−Rf. If not, the layer of the sub-band considered is truncated and the extraction is terminated.
If all the layers of the temporal sub-bands of the lower spatial resolutions have been extracted, the extractor examines the sub-bands of the spatial resolution level St. The extractor makes a loop on the FGS quality layers and then on the temporal sub-bands. Rfs denotes the bit rate necessary to extract a quality q layer for all the temporal sub-bands. If the authorized bit rate Rt>Rfs, then the quality q layer of all the sub-bands is extracted and the bit rate becomes Rt=Rt−Rfs. If not, the quality q layer of all the sub-bands is truncated and the extraction is ended.
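The extraction procedure described above can be sketched as a greedy loop over a bit-rate budget. This is a minimal sketch, not the SVM reference software: the function name, the argument layout and the labels are hypothetical, and it assumes that a truncated layer keeps exactly the budget that remains.

```python
def extract(target_rate, r_min, lower_subbands, target_quality_rates):
    """Greedy extraction sketch; returns a list of (label, rate, truncated).

    lower_subbands: one list of EL layer rates (Rf) per temporal sub-band
        of the lower spatial resolutions.
    target_quality_rates: one rate (Rfs) per FGS quality layer q of the
        target level St, covering all temporal sub-bands of that level.
    """
    if target_rate < r_min:
        raise ValueError("target bit rate is below the minimum rate Rmin")

    # Base quality layers of all spatial levels up to St cost Rmin.
    taken = [("BL (all levels)", r_min, False)]
    budget = target_rate - r_min  # Rt = Rt - Rmin

    # Lower spatial resolutions: loop on sub-bands, then on EL layers.
    for band, layer_rates in enumerate(lower_subbands):
        for layer, rf in enumerate(layer_rates):
            label = f"lower sub-band {band}, EL {layer}"
            if budget > rf:
                taken.append((label, rf, False))
                budget -= rf  # Rt = Rt - Rf
            else:
                # Assumption: the truncated layer keeps the leftover budget.
                taken.append((label, budget, True))
                return taken

    # Target level St: loop on FGS quality layers over all sub-bands.
    for q, rfs in enumerate(target_quality_rates):
        label = f"target level, quality {q}"
        if budget > rfs:
            taken.append((label, rfs, False))
            budget -= rfs  # Rt = Rt - Rfs
        else:
            taken.append((label, budget, True))
            return taken

    return taken


# Hypothetical rates in kbits/s.
for item in extract(500, 100, [[50, 50], [60]], [200, 200]):
    print(item)
```

With these illustrative figures, the refinement layers of the target level are reached only after every enhancement layer of the lower levels has been taken in full.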
FIG. 6 shows the order of processing of the information by the extractor, or quality adaptation module. For extraction at a level n spatial resolution, the extractor first of all goes through all the base quality BL levels of all the spatial levels (QCIF, CIF, etc.) from level 0 to level n, then the enhancement quality layers EL from the lower spatial levels (EL 0) up to n (EL n).
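The extractor-side order described for FIG. 6 can be sketched in the same way. The function name and the tuple representation are hypothetical.

```python
def extraction_order(target_level):
    """Return the order in which the extractor visits sub-layers.

    All base quality sub-layers BL 0..n come first, followed by the
    enhancement quality sub-layers EL 0..n, where n is the target level.
    """
    base = [("BL", level) for level in range(target_level + 1)]
    enhancement = [("EL", level) for level in range(target_level + 1)]
    return base + enhancement


# Extraction at level 1 (e.g. CIF above QCIF).
print(extraction_order(1))
```

Unlike the encoder, which interleaves BL and EL level by level, the extractor groups all base sub-layers before any enhancement sub-layer.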
The extraction mechanism can also be illustrated by FIG. 3 described here above with reference to the prediction mechanism, using the bit rate/distortion curves 30 and 31. Here below, we consider the path followed by the extractor of the SVM MPEG-21 along these curves to generate different points of bit rates at decoding.
Thus, to generate a bit rate point in the QCIF format, the extractor first of all retrieves the base layer 301 from the QCIF level. From the QCIF minimum point 305, it is then possible to extract any bit rate point higher than the QCIF minimum point 305 and lower than the maximum bit rate point 304 (which is the one used for the prediction of the higher spatial resolution layer, in the CIF format). To do this, the enhancement layer or sub-stream (EL), constituted by the runs FGS1 302 and FGS2 303, is cut according to the allocated bit rate.
To generate a bit rate point in the CIF format, two approaches are possible depending on whether the required bit rate is greater than the bit rate of the reference point 312 or below this reference point.
If the target bit rate is below the bit rate of the CIF reference point 312, the extractor retrieves the base layers BL 301 and 311 of the two QCIF and CIF spatial levels, thus leading to the minimum CIF bit rate point 313. Depending on the remaining bit rate, the extractor truncates the enhancement layers EL 302 and 303 of the QCIF spatial resolution level.
If the requested bit rate is higher than the bit rate of the CIF reference point 312, the extractor retrieves the base layers BL 301 and 311 of the QCIF and CIF levels, the enhancement layer EL 302, 303 of the QCIF level and cuts the CIF enhancement layer 314, 315 according to the remaining bit rate.
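The two cases for a CIF bit rate point can be sketched as follows. The function, its parameters and the numeric rates are hypothetical. The reference numerals in the comments follow FIG. 3 as described in the text, with the reference point 312 taken as the QCIF maximum point 304 plus the CIF base layer 311.

```python
def cif_extraction(target, bl_qcif, el_qcif, bl_cif, el_cif):
    """Split a target CIF bit rate between the four sub-streams.

    bl_qcif: QCIF base layer 301; el_qcif: QCIF EL (302 + 303);
    bl_cif: CIF base layer 311;   el_cif: CIF EL (314 + 315).
    """
    minimum = bl_qcif + bl_cif        # minimum CIF point 313
    reference = minimum + el_qcif     # CIF reference point 312
    if target < minimum:
        raise ValueError("target is below the minimum CIF bit rate point")

    if target <= reference:
        # Below the reference point: only the QCIF EL is used, truncated.
        qcif_el_used, cif_el_used = target - minimum, 0
    else:
        # Above it: full QCIF EL, then the CIF EL cut to what remains.
        qcif_el_used = el_qcif
        cif_el_used = min(target - reference, el_cif)

    return {"BL QCIF": bl_qcif, "BL CIF": bl_cif,
            "EL QCIF": qcif_el_used, "EL CIF": cif_el_used}


# Hypothetical rates in kbits/s: reference point at 256, maximum at 512.
print(cif_extraction(220, 64, 64, 128, 256))
print(cif_extraction(400, 64, 64, 128, 256))
```

In the first call the CIF enhancement layer contributes nothing, since the target lies below the reference point; in the second, the QCIF enhancement layer is taken in full before any CIF refinement is added.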
3. Drawbacks of the Prior Art
The encoding/decoding techniques of the SVM model of the MPEG-21 working group have various drawbacks. The extraction mechanism associated with this technique has many flaws.
First of all, it can be seen that with the order of processing of information in the extractor (i.e. all the base layers BL of spatial levels, then the enhancement layers EL going from the spatial base level to the requested spatial levels), the extraction always follows the same path whatever the bit rate point requested when decoding. Now this path is not always the optimum path for each target bit rate point when decoding.
Furthermore, for each given level of spatial resolution from which a prediction has been made for the encoding of a higher level of spatial resolution, there is a maximum bit rate point which corresponds to the bit rate point used for the prediction. Now, this maximum bit rate point is not always the highest point that it is sought to attain for this level of spatial resolution. Indeed, the prediction point is chosen to minimize the residue of prediction during the encoding of the higher spatial level but does not correspond to a point of very high quality for the current spatial level. It is often desirable or necessary, especially for the low spatial resolutions, to have points available offering an image reconstruction quality higher than the one given by the prediction point.
Finally, one last drawback of the MPEG-21 SVM encoding technique is that, for extraction, at a level n of spatial resolution (in the CIF format for example), of points with a bit rate lower than the bit rate of the reference point of this level (the point referenced 312 for example in FIG. 3, i.e. the point obtained by decoding of the base layers BL of the spatial levels 0 to n and of all the refinement layers EL of the levels 0 to n−1), no piece of refinement information of the level n (i.e. no piece of information from the enhancement levels EL 314 and 315 of the CIF level in this example) is used.