1. Scalable Video Systems
At present, most video encoders generate a single compressed stream corresponding to an entire encoded sequence. Each customer wishing to exploit the compressed file for decoding and viewing must download or stream the entire compressed file for this purpose.
Now, in a heterogeneous system (for example the Internet), not all customers have the same type of access to data in terms of bandwidth and processing capacities, and the screens of the different customers may be very different. On an Internet-type network, for example, one customer could have a 1024-kb/s ADSL connection and a powerful microcomputer (PC), while another has only modem access or a PDA.
One obvious solution to this problem is to generate several compressed streams corresponding to different levels of bit-rate/resolution of the video sequence: this is known as simulcast. However, this solution is sub-optimal in terms of efficiency and assumes prior knowledge of the characteristics of all the potential customers.
More recently, video encoding algorithms known as scalable algorithms have emerged. These are algorithms with adaptable quality and variable space-time resolution, for which the encoder generates a stream compressed in several layers, each of these layers being nested into the higher-level layer. These algorithms are now being adopted as an amendment to the MPEG4-AVC standard (known as SVC here below in this document).
Such encoders are very useful for all applications in which the generation of a single compressed stream, organised in several layers of scalability, may serve several clients with different characteristics, for example:
        video on demand or VOD service (target terminals: UMTS, PC ADSL, TV ADSL etc);
        session mobility (resumption on a PDA of a video session begun on TV; or resumption on a UMTS mobile unit of a session begun on GPRS);
        session continuity (sharing of the bandwidth with a new application);
        high-definition TV (a single encoding to serve SD or standard definition and HD or high-definition customers);
        videoconferencing (only one encoding for UMTS/Internet customers);
        etc.
1.1 Grain of the Scalability
A scalable video stream can be considered to be a set of sub-streams represented by cubes 11 in a 3D space formed by the three dimensions of space 12, time 13, and quality or SNR 14 (S, T, Q), as illustrated schematically in FIG. 1.
The size of the increments in the different directions corresponds to the “grain” of the scalability: it may be fine, medium (10% of bit rate per increment of scalability: MGS) or coarse (25% bit rate per increment: CGS).
Here below, the CGS scalability shall be deemed to correspond to a “layered” system (as described in the document MPEG2005/M12043, April 2005. “ISO/MP4 File Format for Storage of Scalable Video”, Thomas Rathgen, Peter Amon and Andreas Hutter) and MGS shall be deemed to correspond to a “level-based” system as described in this same document.
1.2 Fine Scalability Mode (MGS)
A scalable bitstream can be organised to support fine scalability. Here below, reference will be made to MGS (medium grain scalability) in compliance with the rule adopted in the MPEG. From the bitstream, any consistent sub-stream can be extracted (including the base level) and decoded with the corresponding quality, i.e. any combination whatsoever of the resolution values supported (time, space or SNR) can be extracted. The MGS bit streams provide for the highest flexibility.
1.3 Layered Scalability Mode
As an alternative, a bitstream can be organised in layers. The term used then is CGS (coarse grain scalability) streams. A layer contains all the scalability levels needed to pass to the higher-quality layer. A layer must increase quality in at least one direction (time, space or SNR).
A CGS representation enables simple adaptation operations, especially at the level of the nodes on the network. The progression of the information in terms of quality, spatial resolution and time resolution is defined a priori according to the conditions dictated by an application or a user of the service.
2. MPEG-4 SVC
The JSVM MPEG is described in the document JSVM 2.0 referred to here above. The model chosen is based on a scalable encoder highly oriented toward AVC type solutions for which the schematic structure of a corresponding encoder is shown in FIG. 2. This is a pyramidal structure. The video input components 20 undergo dyadic sub-sampling (2D decimation by two referenced 21, 2D decimation by four referenced 22). Each of the sub-sampled streams then undergoes a temporal decomposition 23 of the MCTF (motion compensated temporal filtering) type. A low-resolution version of the video sequence is encoded 24 up to a given bit rate R_r0_max which corresponds to the maximum decodable bit rate for the low spatial resolution r0 (this base level is AVC compatible).
The higher levels are then encoded 25, 26 by subtraction of the previous rebuilt and over-sampled layer and encoding of the residues in the form:
        of a base level;
        possibly of one or more enhancement levels obtained by bit-plane multiple-pass encoding (here below called FGS or fine grain scalability).
The prediction residue is encoded up to a bit rate R_ri_max which corresponds to the maximum bit rate decodable for the resolution ri.
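The subtraction of the rebuilt, over-sampled lower layer can be illustrated with a minimal sketch on a one-dimensional signal. This is not the JSVM implementation: the function names are hypothetical, the averaging decimation and sample-repetition over-sampling are assumed stand-ins for the real filters, and no quantisation is applied, so the residue here is lossless.

```python
def decimate_by_two(signal):
    """Dyadic sub-sampling: average each pair of neighbouring samples."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def oversample_by_two(signal):
    """Over-sampling by sample repetition (assumed stand-in for an interpolation filter)."""
    return [value for value in signal for _ in range(2)]


def encode_two_levels(signal):
    """Return the low-resolution base layer and the residue of the higher level."""
    base = decimate_by_two(signal)
    prediction = oversample_by_two(base)
    residue = [s - p for s, p in zip(signal, prediction)]
    return base, residue


def decode_higher_level(base, residue):
    """Rebuild the higher level from the base layer plus the residue."""
    prediction = oversample_by_two(base)
    return [p + r for p, r in zip(prediction, residue)]


# Hypothetical input samples (even length is assumed).
samples = [10, 12, 20, 24, 30, 30, 8, 4]
base_layer, residue = encode_two_levels(samples)
rebuilt = decode_higher_level(base_layer, residue)
```

A client with low resolution needs only `base_layer`; a client with full resolution adds `residue` to the over-sampled base layer and recovers `samples`.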
More specifically, the MCTF filtering blocks 23 carry out a temporal wavelet filtering operation, i.e. they realign the signals along the direction of motion before wavelet filtering: they deliver information on motion 27 fed into the motion-encoding block 24-26, and texture information 28 fed into a prediction module 29. The pieces of predicted data, at the output of the prediction module 29, serve to carry out an interpolation 210 from the lower level. They are also fed into a spatial transform and entropy encoding block 211 that works on the refinement levels of the signal. A multiplexing module 212 arranges the different sub-streams generated into a single overall compressed data stream.
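As a rough illustration of temporal wavelet filtering, the sketch below applies one level of Haar lifting along the time axis. Two simplifications are assumed: the Haar kernel is chosen only for brevity (the text does not name the wavelet), and the motion realignment step is omitted entirely, so frames are filtered pixel by pixel at the same position. Function names are hypothetical.

```python
def temporal_haar_analysis(frames):
    """One level of Haar lifting along time, WITHOUT motion compensation."""
    low_band, high_band = [], []
    for even, odd in zip(frames[0::2], frames[1::2]):
        # Predict step: the odd frame is predicted from the even frame.
        high = [o - e for o, e in zip(odd, even)]
        # Update step: the even frame becomes the average of the pair.
        low = [e + h / 2 for e, h in zip(even, high)]
        high_band.append(high)
        low_band.append(low)
    return low_band, high_band


def temporal_haar_synthesis(low_band, high_band):
    """Invert the lifting steps in reverse order."""
    frames = []
    for low, high in zip(low_band, high_band):
        even = [l - h / 2 for l, h in zip(low, high)]
        odd = [e + h for e, h in zip(even, high)]
        frames.extend([even, odd])
    return frames


# Four hypothetical frames of two pixels each.
frames = [[10, 20], [12, 20], [30, 40], [30, 44]]
low_band, high_band = temporal_haar_analysis(frames)
reconstructed = temporal_haar_synthesis(low_band, high_band)
```

The low band holds half as many frames as the input, which is what gives temporal scalability: decoding `low_band` alone yields a half-frame-rate sequence, and adding `high_band` restores the full rate.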
This novel approach is capable of giving medium grain scalable streams in the time, space and quality dimensions.
The following are the main characteristics of this technique:
        pyramidal solution with dyadic sub-sampling of the input components;
        temporal decomposition of the MCTF (“Motion Compensated Temporal Filtering”) type at each level;
        encoding of the base level (AVC compatible);
        encoding of the higher levels by subtraction of the rebuilt previous level and encoding of the residues in the form:
            of a base level; and
            of one or more enhancement levels obtained by bit-plane multi-pass encoding (here below: FGS).
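The principle behind enhancement levels obtained by bit-plane encoding can be sketched as follows. This is only an illustration of why a bit-plane stream can be truncated at any plane: the actual FGS scheme also involves entropy coding, which is not modelled here. The function names and the residue values are hypothetical, and only non-negative integers are handled.

```python
def split_bit_planes(coefficients, num_planes):
    """Split non-negative integers into bit planes, most significant plane first."""
    return [
        [(c >> bit) & 1 for c in coefficients]
        for bit in range(num_planes - 1, -1, -1)
    ]


def rebuild_from_planes(planes, num_planes, count):
    """Rebuild `count` coefficients from the bit planes received so far."""
    values = [0] * count
    for index, plane in enumerate(planes):
        bit = num_planes - 1 - index
        values = [v | (b << bit) for v, b in zip(values, plane)]
    return values


# Hypothetical quantised residue magnitudes, each fitting in 4 bits.
residues = [13, 6, 9, 2]
planes = split_bit_planes(residues, num_planes=4)

# A truncated stream (two planes) gives a coarse approximation,
# the full stream gives the exact values.
coarse = rebuild_from_planes(planes[:2], num_planes=4, count=len(residues))
exact = rebuild_from_planes(planes, num_planes=4, count=len(residues))
```

Each additional plane received halves the maximum error on every coefficient, which is what makes the refinement gradual.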
In order to obtain bit rate adaptation, the pieces of information on texture are encoded by means of a gradual scheme at each level:
        encoding of a first minimum quality level (called a base layer);
        encoding of gradual refinement levels (called enhancement layers).
3. Architecture of the SVC Encoder
The SVC encoder is based on a two-layer system, like the AVC:
        The VCL (“Video Coding Layer”) manages the encoding of the video;
        The NAL (“Network Abstraction Layer”) provides an interface between the VCL and the exterior. In particular, this level organises the different NALU (NAL units or data units) given to it by the VCL, into AUs or access units.
            A NALU is an elementary unit containing the basic level of a space-time image containing all (in the current version) or a part (this is under discussion for future versions) of an FGS level of a space-time image;
            An AU is the set of the NALUs corresponding to an instant in time.
4. Signalling of the Modes of Scalability in the NAL SVC Layer
In order to be able to appropriately decode a NAL SVC, it should be possible to report its position in space (S,T,Q) as illustrated by FIG. 1.
Two signalling modes (corresponding substantially to the CGS and MGS modes) are currently under discussion in the JSVM: a fixed path signalling mode and a variable path signalling mode.
A simple example is given in FIG. 3 for a 2D (T, Q) context: a scalable video stream contains the nested representations of the sub-streams in CIF with temporal resolution of 15 Hz (31) and 30 Hz (32). The stream is formed by four NALUs A(0), B(2), C(1) and D(3). Using these four NALUs, it is possible to obtain the bit-rate/distortion points a, b, c, d. The priority of each NALU is given in brackets.
In this example, it can be seen that there are several “paths” 33, 34 and 35, of possible decoding between the points a and d: (a,b,c,d) but also (a,b,d) or (a,c,d).
It will be understood that certain applications could give preference to one path over the other.
For example, it may be judicious to use the path (a,c,d) to obtain a fluid 30 Hz video but rather (a,b,d) if it is sought to emphasize quality at 15 Hz.
This path is therefore dependent on the application and the encoding method used.
5.1 Fixed Path
This mode signals a unique path in the 3D space for the extraction of the scalable stream. It is well adapted to the CGS modes for certain applications. In the example of FIG. 3, a fixed path will be chosen, for example (a,c,d).
This path will always be used, whether the decoding is done at 30 Hz or 15 Hz. The major drawback that appears then is that the point b will not be decoded, even at 15 Hz, whereas it could be more worthwhile to decode it to have a better quality at 15 Hz.
In this context, a single indicator enables this fixed path to be encoded.
5.2 Variable Path
This mode proposes to leave the choice of the extraction path to the application (or to the network). To do this, an additional element is introduced. This additional element is called a priority indicator and will enable a resolution of the problem evoked further above when the decoding is done at 30 Hz.
The following two tables respectively show the priorities assigned to the NALUs A, B, C and D in the above example and the result of the filtering at 15 Hz and 30 Hz according to the priority chosen when decoding (the elements decoded/kept are the NALUs that appear in the table). It should be noted that it is possible at 30 Hz to define several different paths as a function of the priority index assigned to the four NALUs.
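| NAL | A | B | C | D |
|---|---|---|---|---|
| Priority | 0 | 2 | 1 | 3 |

| Filtering priority | 15 Hz | 30 Hz |
|---|---|---|
| 0 | A | A |
| 1 | A | A + C |
| 2 | A + B | A + B + C |
| 3 | A + B | A + B + C + D |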
The following are the filtering rules used:
        for a target temporal resolution T (30 or 15 Hz), keep all the NALs for which the temporal resolution is smaller than or equal to the requested resolution (T_NAL <= T_Target) and the priority is lower than or equal to the priority requested (priority_NAL <= target_priority).
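The filtering rule can be checked against the table with a short sketch. The priorities are those given for the NALUs A, B, C and D; the temporal resolution attached to each NALU (A and B at 15 Hz, C and D at 30 Hz) is an assumption inferred from the filtering table, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nalu:
    name: str
    temporal_hz: int  # temporal resolution of the NALU (inferred, see lead-in)
    priority: int     # priority indicator carried by the NALU


STREAM = [
    Nalu("A", 15, 0),
    Nalu("B", 15, 2),
    Nalu("C", 30, 1),
    Nalu("D", 30, 3),
]


def filter_nalus(stream, target_hz, target_priority):
    """Keep NALUs with T_NAL <= T_Target and priority_NAL <= target_priority."""
    return [
        nalu.name
        for nalu in stream
        if nalu.temporal_hz <= target_hz and nalu.priority <= target_priority
    ]


# Rebuild the filtering table: one row per priority, one column per target rate.
table = {
    priority: (filter_nalus(STREAM, 15, priority), filter_nalus(STREAM, 30, priority))
    for priority in range(4)
}
```

With these assumptions, `table[2]` is `(["A", "B"], ["A", "B", "C"])`, which matches the row for filtering priority 2.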
The method can be generalised to more than two dimensions: in the example illustrated in FIG. 4A, there are three dimensions of scalability:
        D=Dependency: QCIF/CIF (the dependency D is a superset of the spatial resolution S);
        T=Time: 15 Hz/30 Hz;
        Q=Quality: low/high complexity (“low/high”).
An example of association of priorities with the different NALUs is specified in the figure, and an associated filtering example is shown in the table here below.
| Priority_id | QCIF@ 15 Hz | QCIF@ 30 Hz | CIF@ 15 Hz High complexity/SNR | Low complexity/SNR | CIF@ 30 Hz |
|---|---|---|---|---|---|
| 0 | A | A | A | A | A |
| 1 | A + E | A + (B + E) | A + (C + E) | A + C | A + (B + C + E) |
| 2 | A + E | A + (B + E) + F | A + (C + E) + G | A + C | A + (B + C + E) + (D + F + G) |
| 3 | A + E | A + (B + E) + F | A + (C + E) + G | A + C | A + (B + C + E) + (D + F + G) + H |
In this context, a double indicator is necessary: the “priority field” contains the priority indicator and the “decodability info” contains information on space, time and quality resolution. These two indicators necessitate a representation on two bytes. Thus, throughout the rest of the document, a NAL will be identified by its four indices (P, D, T, Q) accessible for example in the file header where:
        P indicates the priority;
        D indicates the dependency (superset of the spatial resolution S);
        T indicates the time resolution;
        Q indicates the quality or the complexity.
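The same filtering idea extends to the four indices (P, D, T, Q). In the sketch below, the indices assigned to the NALUs A to H are not read from FIG. 4A: they are an assumption, chosen because they reproduce the filtering table above when each index is compared with its target. The “Low complexity/SNR” column is likewise assumed to mean CIF at 15 Hz with low quality. All names are hypothetical.

```python
from typing import NamedTuple


class Nalu(NamedTuple):
    name: str
    p: int  # priority
    d: int  # dependency: 0 = QCIF, 1 = CIF
    t: int  # time resolution: 0 = 15 Hz, 1 = 30 Hz
    q: int  # quality/complexity: 0 = low, 1 = high


# Indices inferred from the filtering table (assumption, not taken from FIG. 4A).
NALUS = [
    Nalu("A", 0, 0, 0, 0),
    Nalu("B", 1, 0, 1, 0),
    Nalu("C", 1, 1, 0, 0),
    Nalu("D", 2, 1, 1, 0),
    Nalu("E", 1, 0, 0, 1),
    Nalu("F", 2, 0, 1, 1),
    Nalu("G", 2, 1, 0, 1),
    Nalu("H", 3, 1, 1, 1),
]


def extract(nalus, p, d, t, q):
    """Keep every NALU whose four indices are all <= the requested targets."""
    return {n.name for n in nalus if n.p <= p and n.d <= d and n.t <= t and n.q <= q}


qcif_30hz_priority_0 = extract(NALUS, p=0, d=0, t=1, q=1)
qcif_30hz_priority_2 = extract(NALUS, p=2, d=0, t=1, q=1)
cif_15hz_low_priority_3 = extract(NALUS, p=3, d=1, t=0, q=0)
cif_30hz_priority_3 = extract(NALUS, p=3, d=1, t=1, q=1)
```

Under these assumptions, a request for QCIF at 30 Hz with priority 0 returns only the NALU A, although A alone carries 15 Hz content: the priority threshold removes B before the temporal target is taken into account.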
It will be noted that the cells in bold type in the above table show anomalies relative to the requested space-time resolution: in QCIF at 30 Hz, it would be desirable to benefit from the NALU B, and not only from the NALU A which is at 15 Hz.
The invention is aimed inter alia at overcoming this problem.
6. Drawbacks of the Prior Art
The fixed-path solution cannot be used to serve different applications simultaneously because there is only one relationship of dependency between the NALUs coming from a hierarchical encoder.
The multiple-path (variable-path) solution is more open than the fixed-path solution: it enables adaptation to different applications/resolutions in the network or at the decoder level. The increase in complexity remains limited, but the solution is still unsuited to adaptation of very low complexity, as performed on certain network routers for example.
The fixed-path approach does not enable adaptation over time to different conditions on the user side, network side or server side: a customer cannot choose at any instant to give preference to one axis (for example temporal fluidity) over another (for example SNR quality), because the choice is dictated by the ordering defined by the fixed adaptation path.
Neither of the two approaches (the fixed or variable approach) enables the management of a fall-back mode that would enable the user to reduce the quality or the resolution of the data received according to his own preferences.