According to the criteria schematically illustrated in FIG. 1, a system for encoding videos is referred to as scalable if, starting from an original sequence of pictures indicated with SO, it is capable of producing a “scalable” bitstream SBS, i.e. a bitstream that can be partially decoded so as to obtain video signals which, with respect to the video obtainable by decoding the bitstream entirely, have:
        a lower quality (quality scalability QS),
        a lower spatial resolution (spatial scalability SS), and/or
        a lower temporal resolution (temporal scalability TS).
An example of a video coding standard supporting temporal, spatial, and quality scalability is the technique known as Scalable Video Coding (SVC), which defines a set of scalable coding tools in an extension of the H.264/AVC video coding standard. See, for example, H. Schwarz, D. Marpe and T. Wiegand, “Overview of the Scalable Video Coding Extension of the H.264/AVC Standard”, IEEE Trans. on Circ. and Sys. for Video Tech., vol. 17, no. 9, pp. 1103-1120, September 2007.
The difference between a traditional video coding/decoding system and a scalable system is schematically illustrated in FIGS. 2a and 2b. In a traditional system (FIG. 2a), the original video signal IS is input into an encoder E, which outputs a compressed bitstream BS. The bitstream BS is then decoded in a decoder D so as to obtain an output video sequence OS corresponding to a single representation of the original video signal. The sequence OS has a given quality level, spatial resolution, and temporal resolution, determined by the coding parameters used by the encoder to generate the bitstream BS.
In a scalable encoding system, for example one according to the abovementioned SVC standard, as schematically illustrated in FIG. 2b, the scalable encoder SE produces a scalable bitstream SBS from which various sub-streams SST can be extracted by a unit referred to as an extractor EX. The extractor EX receives input parameters FRRQ specifying the desired quality and the desired spatial and temporal resolution. The extractor EX is capable of extracting from the scalable bitstream SBS sub-streams which, once decoded by a compatible decoder D, produce a representation OS of the original video signal having the desired parameters.
As shown, for example, in FIG. 3, a scalable bitstream typically includes a finite set of representations of the original video signal, coded in the form of a hierarchy of layers (i.e. Layer 0, Layer 1, Layer 2, etc.) with the aim of obtaining a greater coding efficiency with respect to that obtainable by coding the same representations separately through a traditional non-scalable coding system. The extractor EX thus allows the selection of the representation to be decoded among those present in the scalable bitstream.
The compressed data forming the scalable bitstream SBS is organized in a series of packets, each made up of a “header” (which contains syntax information) and a “payload” (which contains the actual compressed data). The extraction operation is performed by removing from the scalable bitstream the packets not required to obtain the desired representation. The removal occurs without decoding the payload, relying solely on the information contained in the header of each packet.
At a scene change in a video sequence, it is convenient for the coding system to interrupt the classic motion-compensated prediction scheme (typically the widely known I-B-B-P scheme) and to react dynamically by varying the coding mode of the individual pictures, selecting the type of coding (i.e. I, P or B) in an adaptive manner.
A classic approach provides that the P picture following the scene change be transformed into an I picture, while the interposed B pictures maintain the same coding type B. It is also possible to dynamically vary the Intra period, in other words the distance between two Intra pictures, and also to disable motion estimation at the moment of the scene change to reduce the computational complexity of the encoding.
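The classic approach described above can be illustrated with a minimal Python sketch. The function name `adapt_picture_types`, the representation of picture types as a list of one-letter strings in display order, and the assumption that the scene change position is already known are all hypothetical choices made for illustration only; they are not taken from any of the cited documents.

```python
from typing import List


def adapt_picture_types(types: List[str], scene_change_at: int) -> List[str]:
    """Return a copy of `types` (display order) adapted to a scene change.

    The first P picture at or after index `scene_change_at` is promoted
    to an I picture. B pictures are left unchanged. If an I picture is
    met before any P picture, nothing needs to be changed.
    """
    adapted = list(types)
    for i in range(scene_change_at, len(adapted)):
        if adapted[i] == "I":
            break  # an Intra picture already follows the scene change
        if adapted[i] == "P":
            adapted[i] = "I"  # promote the P picture to Intra
            break
    return adapted


# Classic I-B-B-P pattern with a scene change detected at picture 4.
pattern = list("IBBPBBPBBP")
print("".join(adapt_picture_types(pattern, 4)))  # IBBPBBIBBP
```

In this sketch the two B pictures at positions 4 and 5 keep their coding type, and only the P picture at position 6 becomes an I picture.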
When encoding a digital video, the pictures may be encoded in Intra or Inter mode (intra coded and inter coded respectively). In the Intra mode, the picture is encoded independently from the others, i.e. without using motion-compensated prediction. In the Inter mode, the picture is encoded through motion-compensated prediction, using other pictures of the video sequence as a time reference.
To obtain temporal scalability, the motion-compensated prediction structure is arranged as a hierarchy, as shown by way of example in FIG. 4. A time layer is assigned to each picture of the sequence. Four layers, indicated as L0 (base layer), L1, L2 and L3, are used in the example of FIG. 4. Motion-compensated prediction (indicated with P) is performed under the condition that each picture belonging to a time layer Ln may only use, as reference, pictures belonging to time layers Lm with m ≤ n. In this manner, each picture belonging to a generic time layer Ln may be decoded independently of the pictures belonging to higher time layers. Temporal scaling of the bitstream may thus be performed by simply eliminating, as required, the data packets corresponding to the higher time layers, given that the decoding of the pictures belonging to the remaining layers is not affected thereby. To allow the temporal scaling of the bitstream, the data packets corresponding to each picture contain, in the header, the time layer to which the picture belongs, so that the extractor EX knows which packets of the bitstream are to be discarded and which are to be kept.
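The layer constraint and the header-based extraction described above can be illustrated with a minimal Python sketch. It assumes a dyadic hierarchy (each additional time layer doubles the picture rate) in which each picture is predicted from its nearest neighbours in lower layers; this structure, as well as the names `Packet`, `temporal_layer`, `references` and `extract`, are assumptions made for illustration and are not taken from FIG. 4 or from the SVC syntax.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Packet:
    picture: int         # display-order index of the coded picture
    temporal_layer: int  # header field read by the extractor
    payload: bytes       # compressed data, never parsed during extraction


def temporal_layer(i: int, num_layers: int) -> int:
    """Time layer of picture i in an assumed dyadic hierarchy."""
    gop = 1 << (num_layers - 1)
    if i % gop == 0:
        return 0  # base layer L0
    trailing_zeros = (i & -i).bit_length() - 1
    return num_layers - 1 - trailing_zeros


def references(i: int, num_layers: int, num_pictures: int) -> List[int]:
    """Reference pictures of picture i (nearest pictures in lower layers)."""
    gop = 1 << (num_layers - 1)
    if i % gop == 0:
        return [i - gop] if i >= gop else []
    step = i & -i  # distance to the nearest lower-layer pictures
    return [r for r in (i - step, i + step) if 0 <= r < num_pictures]


def extract(packets: List[Packet], max_layer: int) -> List[Packet]:
    """Drop packets above max_layer, looking at the header field only."""
    return [p for p in packets if p.temporal_layer <= max_layer]


NUM_LAYERS, NUM_PICTURES = 4, 17
packets = [Packet(i, temporal_layer(i, NUM_LAYERS), b"")
           for i in range(NUM_PICTURES)]

# Every reference lies in a layer Lm with m <= n.
assert all(temporal_layer(r, NUM_LAYERS) <= p.temporal_layer
           for p in packets
           for r in references(p.picture, NUM_LAYERS, NUM_PICTURES))

kept = extract(packets, max_layer=1)
print([p.picture for p in kept])  # [0, 4, 8, 12, 16]
```

Because no kept picture references a discarded one, the sub-stream containing only layers L0 and L1 remains decodable, at one quarter of the original picture rate in this example.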
Documents such as U.S. Pat. Nos. 6,731,684, 6,307,886, 7,149,250, 7,295,612 or 6,914,937 describe various encoding systems that are not scalable from a temporal point of view and that generally follow the classic I-B-B-P scheme widely used for MPEG-2 video applications. Thus, these documents describe coding methods based on the picture type I, P or B. Lastly, U.S. Pat. No. 6,480,543 is directed to a method for detecting a scene change in a video sequence.