Conventional video coding standards (e.g. MPEG-1, H.261/263/264) incorporate motion estimation and motion compensation to remove temporal redundancies between video frames. These concepts are familiar to skilled readers with a basic understanding of video coding and will not be described in detail.
The working draft 1.0 of the scalable extension to H.264/AVC [1] currently enables coding of multiple scalable layers with different values of dependency identification (DependencyId). Accordingly, each layer is associated with a dependency identification and, for a coded video sequence, with a particular sequence parameter set (SPS). A coded video sequence consists of successive coded pictures from an instantaneous decoding refresh (IDR) picture up to, but not including, the next IDR picture. Any picture that follows an IDR picture in decoding order shall not use inter prediction references to pictures that precede the IDR picture in decoding order. The sequence parameter set includes, among other things, data that is used on the decoder side for proper decoding.
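The definition of a coded video sequence given above can be illustrated by a minimal sketch. The function name and the string labels for picture types are hypothetical and serve only this illustration; an actual decoder would identify IDR pictures from the NAL unit type rather than from a label.

```python
from typing import List


def split_coded_video_sequences(picture_types: List[str]) -> List[List[str]]:
    """Group pictures (given in decoding order) into coded video sequences.

    A new sequence starts at every picture labelled "IDR". The labels are
    illustrative placeholders, not syntax elements of the standard.
    """
    sequences: List[List[str]] = []
    for pic in picture_types:
        # Start a new sequence at each IDR picture. A stream that does not
        # begin with an IDR picture still gets an initial group.
        if pic == "IDR" or not sequences:
            sequences.append([])
        sequences[-1].append(pic)
    return sequences


pictures = ["IDR", "P", "B", "P", "IDR", "P", "P"]
sequences = split_coded_video_sequences(pictures)
```

In this example the second IDR picture belongs to the second sequence, which reflects that the range from one IDR picture to the next excludes the latter.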
There are two main disadvantages associated with the current coding methods according to the present state of the art. First, if a scalable presentation point with DependencyId equal to 7 is desired, and all the lower layers with DependencyId equal to 0 to 6 are required, then at least 8 sequence parameter sets have to be transmitted for presentation or decoding. However, if no SPS parameters other than the seq_parameter_id need to differ, which is possible if the spatial resolutions are equal for all layers, then these substantially identical SPSs are transmitted redundantly. Since SPSs are typically transmitted at the beginning of a session in a reliable, out-of-band way, reception acknowledgements are needed and retransmission may be used. Thus, an increased amount of data to be sent means a longer session setup delay, which degrades the end user experience.
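The redundancy described above may be made concrete with a small sketch. It models each SPS as a dictionary and counts how many SPSs differ from another one only in their identifier. The field names "width" and "height" are simplified, hypothetical stand-ins for the actual SPS syntax elements, and the function name is likewise an assumption of this illustration.

```python
from typing import Dict, List


def count_redundant_sps(sps_list: List[Dict[str, int]],
                        id_field: str = "seq_parameter_id") -> int:
    """Return how many SPSs duplicate another SPS apart from the identifier."""
    # Reduce each SPS to its parameters without the identifier; identical
    # payloads collapse to a single entry in the set.
    distinct_payloads = {
        tuple(sorted((key, value) for key, value in sps.items() if key != id_field))
        for sps in sps_list
    }
    return len(sps_list) - len(distinct_payloads)


# Eight layers (DependencyId 0 to 7) with equal spatial resolution.
# "width" and "height" are hypothetical, simplified fields.
layers = [
    {"seq_parameter_id": dep_id, "width": 352, "height": 288}
    for dep_id in range(8)
]
redundant = count_redundant_sps(layers)  # 7 of the 8 SPSs carry no new data
```

Under these assumptions, seven of the eight transmitted SPSs repeat the same parameters and contribute only to the session setup delay.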
A second disadvantage relates to flexibility and coding efficiency. The maximum number of initial SPSs is 32. If a scalable presentation point with DependencyId equal to 7 is desired, and all the lower layers with DependencyId equal to 0 to 6 are required, then on average the layer(s) of each value of DependencyId may have at most only 4 SPS variations, since the 32 SPSs are shared among 8 values of DependencyId. Therefore flexibility, and possibly also coding efficiency, is lower than if all 32 SPS variations could have been used. Updating an SPS during a video session could solve this problem. However, during a video transport session, SPS updating may easily cause problems because synchronization between the updated SPS and the NAL units referencing it may be lost. In addition, if the update is done in-band, e.g. transmitted using the Real-time Transport Protocol (RTP) together with the coded video slices, it may get lost.
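The figure of 4 SPS variations follows from simple arithmetic, sketched below. The constant and function names are hypothetical; the sketch assumes that the 32 available SPSs are divided evenly among the DependencyId values in use.

```python
MAX_SPS = 32  # maximum number of initial SPSs, as stated in the text


def sps_variations_per_layer(num_dependency_ids: int, max_sps: int = MAX_SPS) -> int:
    """Average upper bound on SPS variations per DependencyId value.

    Assumes an even split of the SPS budget across all DependencyId values.
    """
    if num_dependency_ids < 1:
        raise ValueError("at least one DependencyId value is required")
    return max_sps // num_dependency_ids


# DependencyId 0 to 7 gives eight values sharing the budget.
per_layer = sps_variations_per_layer(8)      # 4
single_layer = sps_variations_per_layer(1)   # 32
```

With a single layer the full budget of 32 variations remains available, which is the reference point for the loss of flexibility noted above.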