Video signals are coded, for example, to enable efficient transmission or storage of the video signals. Video coding standards define how such video signals are encoded and decoded.
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as the ISO/IEC MPEG-4 Advanced Video Coding (AVC) standard). Efforts are currently underway to develop further video coding standards. One such further standard under development is the scalable video coding (SVC) standard. Another further standard under development is the multi-view video coding (MVC) standard. Both the SVC and MVC standards are intended to add features to the H.264/AVC standard described above.
The latest draft of SVC, the Joint Draft 9.0, is available in JVT-V201, “Joint Draft 9 of SVC Amendment”, 22nd JVT meeting, Marrakech, Morocco, January 2007, available from http://ftp3.itu.ch/av-arch/jvt-site/2007_01_Marrakech/JVT-V201.zip.
The latest joint draft of MVC is available in JVT-V209, “Joint Draft 2.0 on Multiview Video Coding”, 22nd JVT meeting, Marrakech, Morocco, January 2007, available from http://ftp3.itu.ch/av-arch/jvt-site/2007_01_Marrakech/JVT-V209.zip.
Video coders/decoders are also known as codecs. In scalable codecs some elements or element groups of the video sequence can be removed without affecting the reconstruction of other parts of the video sequence. Scalable video coding is a desirable feature for many multimedia applications and services used in systems employing decoders with a wide range of processing power. Scalable bit streams may be used for example for rate adaptation of pre-coded unicast streams in a streaming server and for transmission of a single bit stream to terminals having different decoding or display capabilities and/or with different network conditions.
The earliest scalability introduced to video coding standards was temporal scalability with B pictures in MPEG-1 Visual. In the B picture concept, a B picture is bi-predicted from two pictures, one picture preceding the B picture and the other picture succeeding the B picture, both in display order. In addition, a B picture is a non-reference picture, i.e. it is not used for inter-picture prediction reference by other pictures. Consequently, the B pictures may be discarded to achieve temporal scalability with a lower frame rate. The same mechanism was retained in MPEG-2 Video, H.263 and MPEG-4 Visual.
In H.264/AVC, the concept of B pictures or B slices has been changed. The definition of B slice in H.264/AVC is a slice that may be decoded using inter prediction from previously-decoded reference pictures with at most two motion vectors and reference indices to predict the sample values of each block. In H.264/AVC the bi-directional prediction property and the non-reference picture property of the conventional B picture concept of the previous coding standards are no longer valid.
A block in a B slice may be predicted from two reference pictures in the same direction in display order, and a picture consisting of B slices may be referred to by other pictures for inter-picture prediction.
In H.264/AVC and its extensions SVC and MVC, temporal scalability may be achieved by using non-reference pictures and/or a hierarchical inter-picture prediction structure. Using only non-reference pictures, the H.264/AVC, SVC and MVC coding standards can achieve temporal scalability similar to that of conventional B pictures in MPEG-1/2/4, by discarding non-reference pictures. A hierarchical coding structure can achieve more flexible temporal scalability.
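As a minimal sketch of hierarchical temporal scalability, the following Python fragment assigns a temporal level to each picture of a dyadic hierarchical structure and extracts a lower frame rate subset by discarding the highest levels. The dyadic structure with a group size of 8, the use of the display-order index as the picture identifier, and the names temporal_level and extract_subset are assumptions made for illustration only; they are not taken from the standards discussed above.

```python
def temporal_level(index: int, gop_size: int = 8) -> int:
    """Temporal level of a picture in an assumed dyadic hierarchy.

    index is the display-order position; gop_size must be a power of two.
    Level 0 holds the key pictures; higher levels refine the frame rate.
    """
    if index % gop_size == 0:
        return 0
    level = gop_size.bit_length() - 1  # log2(gop_size) for a power of two
    while index % 2 == 0:
        index //= 2
        level -= 1
    return level


def extract_subset(indices, max_level, gop_size=8):
    """Keep only pictures whose temporal level does not exceed max_level."""
    return [i for i in indices if temporal_level(i, gop_size) <= max_level]


pictures = list(range(17))
print(extract_subset(pictures, 3))  # full frame rate
print(extract_subset(pictures, 1))  # quarter frame rate
print(extract_subset(pictures, 0))  # key pictures only
```

Under these assumptions, each level that is discarded halves the frame rate, and the pictures that remain never depend on a discarded picture, provided that each picture is predicted only from pictures of the same or a lower level.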
Scalability is typically implemented by grouping the image frames into a number of hierarchical layers. The base layer comprises only those image frames that are compulsory for decoding the video information at the receiving end. One or more enhancement layers may be determined above the base layer, each one of the layers improving the quality of the decoded video in comparison with a lower layer. However, a meaningful decoded representation can be produced only by decoding certain parts of a scalable bit stream.
In H.264/AVC and other similar coding schemes, decoded pictures used for predicting subsequent coded pictures and for future output are buffered in the decoded picture buffer (DPB). To efficiently utilize the buffer memory, the DPB management processes, including the storage process of decoded pictures into the DPB, the marking process of reference pictures, output and removal processes of decoded pictures from the DPB, may be specified.
The reference picture management process in H.264/AVC may be summarized as follows. The maximum number of reference pictures used for inter prediction, referred to as M, may be indicated in the active sequence parameter set. When a reference picture is decoded, it may be marked as “used for reference”. If the decoding of the reference picture causes more than M pictures to be marked as “used for reference”, at least one picture must be marked as “unused for reference”. The DPB removal process may then remove pictures marked as “unused for reference” from the DPB, provided that they are also not needed for output. Each short-term picture may be associated with a variable PicNum that is derived from the syntax element frame_num, and each long-term picture may be associated with a variable LongTermPicNum that is derived from the long_term_frame_idx, which is signalled by a memory management control operation (MMCO) command.
There may be two types of operation for reference picture marking: adaptive memory control and sliding window. The operation mode for reference picture marking may be selected on a per-picture basis.
The adaptive memory control method requires the presence of memory management control operation (MMCO) commands in the bitstream. The memory management control operations enable explicit signalling of which pictures are to be marked as “unused for reference”, assignment of long-term indices to short-term reference pictures, storage of the current picture as a long-term picture, conversion of a short-term picture into a long-term picture, and assignment of the maximum allowed long-term index for long-term pictures.
The sliding window control method uses a sliding window to store only the latest M pictures marked as “used for reference”. Thus, any short-term reference picture marked as “used for reference” that was decoded earlier and no longer falls within the window is marked as “unused for reference”. In other words, the sliding window operation mode results in a first-in-first-out buffering operation among short-term reference pictures.
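The first-in-first-out behaviour described above may be sketched in Python as follows. This is a simplified model under stated assumptions: only short-term reference frames are considered, long-term pictures and MMCO commands are ignored, and the class name SlidingWindowMarker and its members are hypothetical names introduced for illustration.

```python
from collections import deque


class SlidingWindowMarker:
    """First-in-first-out marking of short-term reference pictures."""

    def __init__(self, max_refs: int):
        self.max_refs = max(1, max_refs)  # M in the text; guard against a zero-sized window
        self.used = deque()               # pictures marked "used for reference"

    def mark(self, picture):
        """Mark picture as used for reference; return pictures that became unused."""
        unused = []
        # Make room first so that at most M pictures remain marked afterwards.
        while len(self.used) >= self.max_refs:
            unused.append(self.used.popleft())
        self.used.append(picture)
        return unused


marker = SlidingWindowMarker(max_refs=2)
for frame_num in range(4):
    dropped = marker.mark(frame_num)
    print(frame_num, list(marker.used), dropped)
```

With M equal to 2, the third reference picture causes the first one to be marked as “unused for reference”, and so on for each subsequent reference picture.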
The hypothetical reference decoder (HRD), specified in Annex C of H.264/AVC, is used to check bitstream and decoder conformances. The HRD contains a coded picture buffer (CPB), an instantaneous decoding process, a decoded picture buffer (DPB), and an output picture cropping block. The CPB and the instantaneous decoding process are specified similarly to any other video coding standard, and the output picture cropping block simply crops those samples from the decoded picture that are outside the signaled output picture extents. The DPB was introduced in H.264/AVC in order to control the required memory resources for decoding of conformant bitstreams. The DPB includes a unified decoded picture buffering process for reference pictures and output reordering. A decoded picture is removed from the DPB when it is no longer used as reference and no longer needed for output. The maximum size of the DPB that bitstreams are allowed to use is specified in the Level definitions (Annex A) of H.264/AVC.
There are two types of conformance for decoders: output timing conformance and output order conformance. For output timing conformance, a decoder must output pictures at identical times compared to the HRD. For output order conformance, only the correct order of the output pictures is taken into account. The output order DPB is assumed to contain a maximum allowed number of frame buffers. A frame is removed from the DPB when it is no longer used as reference and no longer needed for output. When the DPB becomes full, the earliest frame in output order is output until at least one frame buffer becomes unoccupied.
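The output order operation described above may be illustrated by the following Python sketch. It is a deliberately simplified model and not the Annex C process itself: reference marking is ignored, so a frame buffer is assumed to become free as soon as its frame is output, and every decoded frame is assumed to be stored before it can be output. The function name bumping_output and the example decoding order are hypothetical.

```python
def bumping_output(decoding_order, dpb_size):
    """Return the order in which frames leave a simplified output order DPB.

    decoding_order lists output-order indices in the order they are decoded.
    """
    waiting = []  # frames stored in the DPB and waiting for output
    output = []
    for frame in decoding_order:
        # When the DPB is full, output the earliest frame in output order
        # until a frame buffer becomes unoccupied.
        while len(waiting) >= dpb_size:
            earliest = min(waiting)
            waiting.remove(earliest)
            output.append(earliest)
        waiting.append(frame)
    output.extend(sorted(waiting))  # flush the remaining frames at the end
    return output


order = [0, 8, 4, 2, 1, 3, 6, 5, 7]  # assumed hierarchical decoding order
print(bumping_output(order, dpb_size=4))  # frames leave in increasing order
print(bumping_output(order, dpb_size=2))  # too few buffers: order is broken
```

In this model, a DPB that is too small for the amount of reordering in the bitstream forces frames to be output before earlier frames in output order have been decoded, which is why the number of frame buffers must match the bitstream being decoded.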
However, these memory control methods are problematic when one or more of the highest temporal layers are discarded. Removing the highest temporal layers creates gaps in frame_num in the bitstream. Where this occurs, the decoding process generates short-term “non-existing” pictures having the missing frame_num values. Such “non-existing” pictures are handled in the same way as normal short-term reference pictures in the sliding window reference picture marking process.
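The effect of such gaps may be sketched as follows. The Python fragment below lists the frame_num values that are skipped between two consecutive reference pictures, taking into account that frame_num wraps around, and then feeds the resulting “non-existing” pictures into a sliding window of M = 2 entries. It is a simplified reading of the gap handling and not the normative process; the function name missing_frame_nums, the wrap-around value of 16 and the tuple labels are assumptions made for illustration.

```python
from collections import deque


def missing_frame_nums(prev_ref_frame_num, frame_num, max_frame_num):
    """frame_num values skipped between two consecutive reference pictures."""
    if frame_num == prev_ref_frame_num:
        return []  # same frame_num: there is no gap to fill
    missing = []
    candidate = (prev_ref_frame_num + 1) % max_frame_num
    while candidate != frame_num:
        missing.append(candidate)
        candidate = (candidate + 1) % max_frame_num  # frame_num wraps around
    return missing


# Window of M = 2 short-term reference pictures holding frame_num 2 and 3.
window = deque([("real", 2), ("real", 3)], maxlen=2)
for n in missing_frame_nums(3, 6, 16):
    window.append(("non-existing", n))  # maxlen drops the oldest entry
print(list(window))
```

In this example the two “non-existing” pictures occupy the whole window, so the two genuinely decoded reference pictures are no longer available for reference when the picture with frame_num 6 is decoded.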
The amount of buffer memory required for decoding a subset of a temporal scalable bitstream may be less than that required for decoding the entire temporal scalable bitstream. However, in order to be certain of being able to decode any encoded bitstream, the coding schemes mentioned above define the memory and buffer requirements for the temporal scalable bitstream as a whole.
For example, in the H.264/AVC standard, the required decoded picture buffer (DPB) size for decoding the entire bitstream is specified by the syntax element max_dec_frame_buffering. Consequently, a decoder that only needs to decode a subset of a temporal scalable bitstream must nevertheless be equipped with the extra buffer memory required for the entire bitstream.
Furthermore even if the decoder is equipped with the buffering memory resources for the entire temporal scalable bitstream, it would be desirable that it could allocate exactly the amount of memory that is required for decoding the desired subset of the entire bitstream and use the saved memory resources for other applications.
There is another, similar problem. The maximum number of frames reordered for output is also typically signalled for the entire bitstream. For example, in the H.264/AVC standard the syntax element num_reorder_frames is used to set the maximum number of frames reordered for output. However, a subset of the bitstream may require fewer frames to be reordered for output. For example, for a subset bitstream comprising only key pictures (defined later), the maximum number of frames reordered for output is actually zero, as the decoding order is identical to the output order. In such a system, a decoder that decodes a subset of a temporal scalable bitstream would wait for extra pictures to be decoded before starting output, which would cause an initial playback delay greater than the delay actually required for the subset of the temporal scalable bitstream.
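The difference between the reordering needed by the entire bitstream and by a subset may be illustrated with a short Python sketch. For each frame, it counts the frames that precede it in decoding order but follow it in output order, and takes the maximum over all frames. The function name max_reorder_frames and the example decoding orders, which assume a dyadic hierarchy with a group size of 8, are hypothetical and serve only as an illustration.

```python
def max_reorder_frames(decoding_order):
    """Maximum number of frames that precede a frame in decoding order
    and follow it in output order.

    decoding_order lists output-order indices in the order they are decoded.
    """
    worst = 0
    for position, frame in enumerate(decoding_order):
        earlier = decoding_order[:position]
        follow_in_output = sum(1 for other in earlier if other > frame)
        worst = max(worst, follow_in_output)
    return worst


full_bitstream = [0, 8, 4, 2, 1, 3, 6, 5, 7]  # all temporal layers
two_layers = [0, 8, 4]                        # two lowest temporal layers
key_pictures = [0, 8]                         # key pictures only

print(max_reorder_frames(full_bitstream))
print(max_reorder_frames(two_layers))
print(max_reorder_frames(key_pictures))
```

Under these assumptions the entire bitstream requires three frames of reordering, whereas the subset containing only key pictures requires none, so a decoder applying the value signalled for the entire bitstream would delay output of the subset unnecessarily.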