This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
In multi-view video coding, video sequences output from different cameras, each corresponding to different views of a scene, are encoded into one bitstream. After decoding, to display a certain view, the decoded pictures belonging to that view are reconstructed and displayed. It is also possible for more than one view to be reconstructed and displayed.
Multiview video coding has a wide variety of applications, including free-viewpoint video/television, three-dimensional (3D) TV and surveillance applications. Currently, the Joint Video Team (JVT) of the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) Motion Picture Expert Group (MPEG) and International Telecommunication Union (ITU)-T Video Coding Expert Group is working to develop a multiview video coding (MVC) standard, which is becoming an extension of the ITU-T H.264 standard, also known as ISO/IEC MPEG-4 Part-10. These draft standards are referred to herein as MVC and AVC, respectively. The latest draft of the MVC standard is described in JVT-T208, “Joint Multiview Video Model (JMVM) 1.0”, 20th JVT meeting, Klagenfurt, Austria, July 2006, which can be found at ftp3.itu.ch/av-arch/jvt-site/2006_07_Klagenfurt/JVT-T208.zip and is incorporated herein by reference in its entirety.
In JMVM 1.0, for each group of pictures (GOP), pictures of any view are contiguous in decoding order. This is depicted in FIG. 1, where the horizontal direction denotes time (with each time instant being represented by Tm) and the vertical direction denotes view (with each view being represented by Sn). Pictures of each view are grouped into GOPs, e.g. pictures T1 to T8 in FIG. 1 for each view form a GOP. This decoding order arrangement is referred to as view-first coding. It should be noted that, for the pictures in one view and in one GOP, although their decoding order is continuous without any other pictures to be inserted between any two of the pictures, internally their decoding order may change.
It is also possible to have a different decoding order than that discussed for view-first coding. For example, pictures can be arranged such that pictures of any temporal location are contiguous in decoding order. This arrangement is shown in FIG. 2. This decoding order arrangement is referred to as time-first coding. It should also be noted that the decoding order of access units may not be identical to the temporal order.
A typical prediction structure (including both inter-picture prediction within each view and inter-view prediction) for multi-view video coding is shown in FIG. 2, where predictions are indicated by arrows, and the pointed-to object using the pointed-from object for prediction reference. Inter-picture prediction within one view is also referred to as temporal prediction, intra-view prediction, or, simply, inter prediction.
An Instantaneous Decoding Refresh (IDR) picture is an intra-coded picture that causes the decoding process to mark all reference pictures as “unused for reference” immediately after decoding the IDR picture. After the decoding of an IDR picture, all following coded pictures in decoding order can be decoded without inter prediction from any picture decoded prior to the IDR picture.
In AVC and MVC, coding parameters that remain unchanged through a coded video sequence are included in a sequence parameter set. In addition to parameters that are essential to the decoding process, the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that are important for buffering, picture output timing, rendering, and resource reservation. There are two structures specified to carry sequence parameter sets: the sequence parameter set NAL unit containing all the data for AVC pictures in the sequence, and the sequence parameter set extension for MVC. A picture parameter set contains parameters that are likely to be unchanged in several coded pictures. Frequently changing picture-level data is repeated in each slice header, and picture parameter sets carry the remaining picture-level parameters. H.264/AVC syntax allows many instances of sequence and picture parameter sets, and each instance is identified with a unique identifier. Each slice header includes the identifier of the picture parameter set that is active for the decoding of the picture that contains the slice, and each picture parameter set contains the identifier of the active sequence parameter set. Consequently, the transmission of picture and sequence parameter sets does not have to be accurately synchronized with the transmission of slices. Instead, it is sufficient that the active sequence and picture parameter sets be received at any moment before they are referenced, which allows for transmission of parameter sets using a more reliable transmission mechanism compared to the protocols used for the slice data. For example, parameter sets can be included as a MIME parameter in the session description for H.264/AVC Real-time Transport Protocol (RTP) sessions. It is recommended to use an out-of-band reliable transmission mechanism whenever it is possible in the application in use.
If parameter sets are transmitted in-band, they can be repeated to improve error robustness.
As discussed herein, an anchor picture is a coded picture in which all slices reference only slices with the same temporal index, i.e., only slices in other views and not slices in earlier pictures of the current view. An anchor picture is signaled by setting an anchor_pic_flag to 1. After decoding the anchor picture, all subsequent coded pictures in display order are capable of being decoded without inter-prediction from any picture decoded prior to the anchor picture. If a picture in one view is an anchor picture, then all pictures with the same temporal index in other views are also anchor pictures. Consequently, the decoding of any view can be initiated from a temporal index that corresponds to anchor pictures.
Picture output timing, such as output timestamping, is not included in the integral part of AVC or MVC bitstreams. However, a value of picture order count (POC) is derived for each picture and is non-decreasing with increasing picture position in output order relative to the previous IDR picture or a picture containing a memory management control operation marking all pictures as “unused for reference.” POC therefore indicates the output order of pictures. It is also used in the decoding process for implicit scaling of motion vectors in the direct modes of bi-predictive slices, for implicitly derived weights in weighted prediction, and for reference picture list initialization of B slices. Furthermore, POC is also used in the verification of output order conformance.
Values of POC can be coded with one of the three modes signaled in the active sequence parameter set. In the first mode, the selected number of least significant bits of the POC value is included in each slice header. In the second mode, the relative increments of POC as a function of the picture position in decoding order in the coded video sequence are coded in the sequence parameter set. In addition, deviations from the POC value derived from the sequence parameter set may be indicated in slice headers. In the third mode, the value of POC is derived from the decoding order by assuming that the decoding and output order are identical. In addition, only one non-reference picture can occur consecutively when the third mode is used.
nal_ref_idc is a 2-bit syntax element in the NAL unit header. The value of nal_ref_idc indicates the relevance of the NAL unit for reconstruction of sample values. Non-zero values of nal_ref_idc must be used for coded slice and slice data partition NAL units of reference pictures, as well as for parameter set NAL units. The value of nal_ref_idc must be equal to 0 for slices and slice data partitions of non-reference pictures and for NAL units that do not affect the reconstruction of sample values, such as supplemental enhancement information NAL units. In the H.264/AVC high-level design, external specifications (i.e. any system or specification using or referring to H.264/AVC) were permitted to specify an interpretation of the non-zero values of nal_ref_idc. For example, the RTP payload format for H.264/AVC, Request for Comments (RFC) 3984 (which can be found at www.ietf.org/rfc/rfc3984.txt and is incorporated herein by reference) specified strong recommendations on the use of nal_ref_idc. In other words, some systems have established practices to set and interpret the non-zero nal_ref_idc values. For example, an RTP mixer might set nal_ref_idc according to the NAL unit type, e.g. nal_ref_idc is set to 3 for IDR NAL units. As MVC is a backward-compatible extension of the H.264/AVC standard, it is desirable that existing H.264/AVC-aware system elements also be capable of handling MVC streams. It is therefore undesirable for the semantics of a particular non-zero value of nal_ref_idc to be specified differently in the MVC specification compared to any other non-zero value of nal_ref_idc.
Decoded pictures used for predicting subsequent coded pictures and for future output are buffered in a decoded picture buffer (DPB). To efficiently utilize the buffer memory, the DPB management processes, including the storage process of decoded pictures into the DPB, the marking process of reference pictures, output and removal processes of decoded pictures from the DPB, should be specified.
The process for reference picture marking in AVC is generally as follows. The maximum number of reference pictures used for inter prediction, referred to as M, is indicated in the active sequence parameter set. When a reference picture is decoded, it is marked as “used for reference.” If the decoding of the reference picture causes more than M pictures to be marked as “used for reference,” then at least one picture must be marked as “unused for reference.” The DPB removal process would then remove pictures marked as “unused for reference” from the DPB if they are not needed for output as well.
There are two types of operations for the reference picture marking: adaptive memory control and sliding window. The operation mode for reference picture marking is selected on a picture basis. The adaptive memory control requires the presence of memory management control operation (MMCO) commands in the bitstream. The memory management control operations enable the explicit signaling of which pictures are marked as “unused for reference,” the assigning of long-term indices to short-term reference pictures, the storage of the current picture as a long-term picture, the changing of a short-term picture to a long-term picture, and the assigning of the maximum allowed long-term index (MaxLongTermFrameIdx) for long-term pictures. If the sliding window operation mode is in use and there are M pictures marked as “used for reference,” then the short-term reference picture that was the first decoded picture among those short-term reference pictures that are marked as “used for reference” is marked as “unused for reference.” In other words, the sliding window operation mode results in a first-in/first-out buffering operation among short-term reference pictures.
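As a rough illustration of the sliding window mode described above, the following Python sketch models the short-term reference pictures as a first-in/first-out queue. The class and method names are hypothetical, and long-term pictures, fields, and MMCO commands are deliberately left out, so this is a simplified model under those assumptions rather than the normative AVC process.

```python
from collections import deque


class SlidingWindowDpb:
    """Toy model of sliding-window marking for short-term reference frames.

    Hypothetical helper for illustration only; not the normative AVC process.
    """

    def __init__(self, max_refs):
        self.max_refs = max_refs      # M from the active sequence parameter set
        self.short_term = deque()     # oldest decoded picture sits on the left

    def mark_new_reference(self, pic_id):
        """Mark pic_id as "used for reference".

        Returns the pictures that had to be marked "unused for reference"
        to make room for it.
        """
        unmarked = []
        # If M pictures are already marked, the first-decoded one is dropped.
        while len(self.short_term) >= self.max_refs:
            unmarked.append(self.short_term.popleft())
        self.short_term.append(pic_id)
        return unmarked
```

With `max_refs` set to 2, marking pictures "a", "b" and "c" in that order leaves "b" and "c" marked and reports "a" as newly unmarked.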
Each short-term picture is associated with a variable PicNum that is derived from the frame_num syntax element. Each long-term picture is associated with a variable LongTermPicNum that is derived from the long_term_frame_idx syntax element, which is signaled by an MMCO command. PicNum is derived from the FrameNumWrap variable, depending on whether a frame or a field is coded or decoded. For frames, where PicNum equals FrameNumWrap, FrameNumWrap is derived from FrameNum, and FrameNum is derived directly from frame_num. For example, in AVC frame coding, FrameNum is assigned the same value as frame_num, and FrameNumWrap is defined as follows:
if( FrameNum > frame_num )
    FrameNumWrap = FrameNum - MaxFrameNum
else
    FrameNumWrap = FrameNum
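The wrap rule above can be sketched in Python as follows. The function name and argument names are hypothetical. The sketch assumes that FrameNum belongs to a previously decoded short-term reference frame and that frame_num belongs to the current picture, which is one reading of the pseudocode rather than something the text states explicitly.

```python
def frame_num_wrap(ref_frame_num, cur_frame_num, max_frame_num):
    """Return FrameNumWrap for a short-term reference frame.

    ref_frame_num: FrameNum of the reference frame (assumed meaning).
    cur_frame_num: frame_num of the current picture (assumed meaning).
    max_frame_num: MaxFrameNum, the modulus of frame_num.
    """
    if ref_frame_num > cur_frame_num:
        # The counter wrapped between the reference and the current picture,
        # so the reference is shifted below zero to keep the ordering.
        return ref_frame_num - max_frame_num
    return ref_frame_num
```

For example, with MaxFrameNum equal to 16, a reference frame with FrameNum 14 seen from a current picture with frame_num 2 gets FrameNumWrap equal to -2, so it still orders before a reference frame with FrameNum 1.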
LongTermPicNum is derived from the long-term frame index (LongTermFrameIdx) assigned for the picture. For frames, LongTermPicNum equals LongTermFrameIdx. frame_num is a syntax element in each slice header. The value of frame_num for a frame or a complementary field pair essentially increments by one, in modulo arithmetic, relative to the frame_num of the previous reference frame or reference complementary field pair. In IDR pictures, the value of frame_num is zero. For pictures containing a memory management control operation marking all pictures as “unused for reference,” the value of frame_num is considered to be zero after the decoding of the picture.
The MMCO commands use PicNum and LongTermPicNum for indicating the target picture for the command as follows. To mark a short-term picture as “unused for reference,” the PicNum difference between the current picture p and the destination picture r is signaled in the MMCO command. To mark a long-term picture as “unused for reference,” the LongTermPicNum of the to-be-removed picture r is signaled in the MMCO command. To store the current picture p as a long-term picture, a long_term_frame_idx is signaled with the MMCO command. This index is assigned to the newly stored long-term picture as the value of LongTermPicNum. To change a picture r from being a short-term picture to a long-term picture, a PicNum difference between current picture p and picture r is signaled in the MMCO command, the long_term_frame_idx is signaled in the MMCO command, and the index is assigned to this long-term picture.
When multiple reference pictures could be used, each reference picture must be identified. In AVC, the identification of a reference picture used for a coded block is as follows. First, all the reference pictures stored in the DPB for prediction reference of future pictures are marked as either “used for short-term reference” (short-term pictures) or “used for long-term reference” (long-term pictures). When decoding a coded slice, a reference picture list is constructed. If the coded slice is a bi-predicted slice, then a second reference picture list is also constructed. A reference picture used for a coded block is then identified by the index of the used reference picture in the reference picture list. The index is coded in the bitstream when more than one reference picture may be used.
The reference picture list construction process is as follows. For simplicity, it is assumed that only one reference picture list is needed. First, an initial reference picture list is constructed including all of the short-term and long-term pictures. Reference picture list reordering (RPLR) is then performed when the slice header contains RPLR commands. The RPLR process may reorder the reference pictures into a different order than the order in the initial list. Lastly, the final list is constructed by keeping only a number of pictures in the beginning of the possibly reordered list, with the number being indicated by another syntax element in the slice header or the picture parameter set referred to by the slice.
During the initialization process, all of the short-term and long-term pictures are considered as candidates for reference picture lists for the current picture. Regardless of whether the current picture is a B or P picture, long-term pictures are placed after the short-term pictures in RefPicList0 (and RefPicList1 available for B slices). For P pictures, the initial reference picture list for RefPicList0 contains all short-term reference pictures ordered in descending order of PicNum. For B pictures, those reference pictures obtained from all short-term pictures are ordered by a rule related to the current POC number and the POC number of the reference picture. For RefPicList0, reference pictures with smaller POC (compared to the current POC) are considered first and inserted into RefPicList0 in descending order of POC. Then pictures with larger POC are appended in ascending order of POC. For RefPicList1 (if available), reference pictures with larger POC (compared to the current POC) are considered first and inserted into RefPicList1 in ascending order of POC. Pictures with smaller POC are then appended in descending order of POC. After considering all the short-term reference pictures, the long-term reference pictures are appended in ascending order of LongTermPicNum, both for P and B pictures.
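The initialization rules above can be sketched in Python as follows. The `RefPic` record and the function names are hypothetical, and the sketch covers frames only; it is meant as an illustration of the ordering rules, not as a conformant implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RefPic:
    """Hypothetical reference picture record used only for this sketch."""
    poc: int
    pic_num: int = 0
    long_term_pic_num: Optional[int] = None  # None means short-term


def _split(refs):
    """Separate short-term pictures from long-term ones (sorted ascending)."""
    short = [r for r in refs if r.long_term_pic_num is None]
    long_term = sorted(
        (r for r in refs if r.long_term_pic_num is not None),
        key=lambda r: r.long_term_pic_num,
    )
    return short, long_term


def init_list_p(refs):
    """Initial RefPicList0 for a P picture: descending PicNum, then long-term."""
    short, long_term = _split(refs)
    return sorted(short, key=lambda r: r.pic_num, reverse=True) + long_term


def init_lists_b(refs, cur_poc):
    """Initial (RefPicList0, RefPicList1) for a B picture."""
    short, long_term = _split(refs)
    # Pictures before the current one in output order, nearest first.
    before = sorted((r for r in short if r.poc < cur_poc),
                    key=lambda r: r.poc, reverse=True)
    # Pictures after the current one in output order, nearest first.
    after = sorted((r for r in short if r.poc > cur_poc), key=lambda r: r.poc)
    return before + after + long_term, after + before + long_term
```

For a current picture with POC 6 and short-term references at POC 0, 4 and 8, the sketch yields the POC order 4, 0, 8 for RefPicList0 and 8, 4, 0 for RefPicList1, with any long-term pictures appended to both.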
The reordering process is invoked by consecutive RPLR commands, which include four types. The first type is a command to specify a short-term picture with smaller PicNum (compared to a temporally predicted PicNum) to be moved. The second type is a command to specify a short-term picture with larger PicNum to be moved. The third type is a command to specify a long-term picture with a certain LongTermPicNum to be moved. The fourth type is a command to specify the end of the RPLR loop. If the current picture is bi-predicted, then there are two loops: one for a forward reference list and the other for a backward reference list.
The predicted PicNum, called picNumLXPred, is initialized as the PicNum of the current coded picture. This is set to the PicNum of the just-moved picture after each reordering process for a short-term picture. The difference between the PicNum of the current picture being reordered and picNumLXPred is to be signaled in the RPLR command. The picture indicated to be reordered is moved to the beginning of the reference picture list. After the reordering process is completed, a whole reference picture list is to be truncated based on the active reference picture list size, which is num_ref_idx_lX_active_minus1+1 (X equal to 0 or 1 corresponding to RefPicList0 and RefPicList1, respectively).
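A simplified Python sketch of the short-term reordering loop described above is given below. The function name and the command encoding (a sign and an absolute difference) are hypothetical. The sketch assumes that successive commands fill list positions 0, 1, 2 and so on, which is one reading of "moved to the beginning", and it omits PicNum wrap-around, long-term commands, and the minus-1 offset used when the difference is actually coded.

```python
def reorder_short_term(initial, cur_pic_num, commands, num_active):
    """Apply simplified RPLR commands to a list of PicNum values.

    initial:     initial reference picture list, as PicNum values.
    cur_pic_num: PicNum of the current picture.
    commands:    list of (sign, abs_diff) pairs; sign is -1 for a picture
                 with smaller PicNum than the prediction, +1 for larger.
    num_active:  active list size (num_ref_idx_lX_active_minus1 + 1).
    """
    ref_list = list(initial)
    pred = cur_pic_num          # picNumLXPred starts at the current PicNum
    insert_at = 0               # next position to fill (assumed behaviour)
    for sign, abs_diff in commands:
        target = pred + sign * abs_diff
        ref_list.remove(target)             # ValueError if picture is absent
        ref_list.insert(insert_at, target)
        insert_at += 1
        pred = target           # prediction follows the just-moved picture
    return ref_list[:num_active]            # truncate to the active size
```

Starting from the initial list 4, 3, 2, 1 with a current PicNum of 5, the commands (-1, 3) and (-1, 1) select PicNum 2 and then PicNum 1, giving 2, 1, 4 after truncation to three entries.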
The hypothetical reference decoder (HRD), specified in Annex C of the H.264/AVC standard, is used to check bitstream and decoder conformance. The HRD contains a coded picture buffer (CPB), an instantaneous decoding process, a decoded picture buffer (DPB), and an output picture cropping block. The CPB and the instantaneous decoding process are specified similarly to any other video coding standard, and the output picture cropping block simply crops those samples from the decoded picture that are outside of the signaled output picture extents. The DPB was introduced in H.264/AVC in order to control the required memory resources for decoding of conformant bitstreams.
There are two reasons to buffer decoded pictures: for references in inter prediction and for reordering decoded pictures into output order. As the H.264/AVC standard provides a great deal of flexibility for both reference picture marking and output reordering, separate buffers for reference picture buffering and output picture buffering could be a waste of memory resources. Therefore, the DPB includes a unified decoded picture buffering process for reference pictures and output reordering. A decoded picture is removed from the DPB when it is no longer used as reference and no longer needed for output. The maximum size of the DPB that bitstreams are allowed to use is specified in the Level definitions (Annex A) of the H.264/AVC standard.
There are two types of conformance for decoders: output timing conformance and output order conformance. For output timing conformance, a decoder must output pictures at identical times compared to the HRD. For output order conformance, only the correct order of output pictures is taken into account. The output order DPB is assumed to contain a maximum allowed number of frame buffers. A frame is removed from the DPB when it is no longer used as reference and no longer needed for output. When the DPB becomes full, the earliest frame in output order is output until at least one frame buffer becomes unoccupied.
Temporal scalability is realized by the hierarchical B picture GOP structure using only AVC tools. A typical temporal scalability GOP usually includes a key picture which is coded as an I or P frame, and other pictures which are coded as B pictures. Those B pictures are coded hierarchically based on the POC. The coding of a GOP needs only the key pictures of the previous GOP besides those pictures in the GOP. The relative POC number (POC minus the previous anchor picture POC) is referred to as POCIdInGOP in implementation. Every POCIdInGOP can be written in the form POCIdInGOP = 2^x * y (wherein y is an odd number). Pictures with the same value of x belong to the same temporal level, which is noted as L - x (where L = log2(GOP_length)). Only pictures with the highest temporal level L are not stored as reference pictures. Normally, pictures in a temporal level can only use pictures in lower temporal levels as references to support temporal scalability, i.e. higher temporal level pictures can be dropped without affecting the decoding of the lower temporal level pictures. Similarly, the same hierarchical structure can be applied in the view dimension for view scalability.
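The temporal level rule above can be checked with a short Python sketch. The function name is hypothetical, and the sketch assumes that GOP_length is a power of two and that POCIdInGOP runs from 1 to GOP_length, with the key picture at POCIdInGOP equal to GOP_length.

```python
def temporal_level(poc_id_in_gop, gop_length):
    """Temporal level L - x, where POCIdInGOP = 2**x * y with y odd."""
    if gop_length < 1 or gop_length & (gop_length - 1):
        raise ValueError("gop_length must be a power of two")
    if not 1 <= poc_id_in_gop <= gop_length:
        raise ValueError("poc_id_in_gop must be in 1..gop_length")
    level_count = gop_length.bit_length() - 1          # L = log2(GOP_length)
    # x is the number of trailing zero bits of POCIdInGOP.
    x = (poc_id_in_gop & -poc_id_in_gop).bit_length() - 1
    return level_count - x
```

For a GOP of length 8, POCIdInGOP values 1 through 8 map to temporal levels 3, 2, 3, 1, 3, 2, 3, 0, so the odd positions form the highest level and the key picture forms level 0.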
In the current JMVM, frame_num is separately coded and signaled for each view, i.e. the value of frame_num is incremented relative to the previous reference frame or reference complementary field pair within the same view as the current picture. Furthermore, pictures in all views share the same DPB buffer. In order to globally handle the reference picture list construction and the reference picture management, FrameNum and POC generation are redefined as follows:
FrameNum = frame_num * (1 + num_views_minus_1) + view_id
PicOrderCnt( ) = PicOrderCnt( ) * (1 + num_views_minus_1) + view_id
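As a small worked example of the redefinition above, the following Python sketch computes the redefined FrameNum and POC values. The function names and the two-view, three-picture sequence are hypothetical and chosen only for illustration.

```python
def jmvm_frame_num(frame_num, view_id, num_views_minus_1):
    """Redefined FrameNum: frame_num scaled by the view count, plus view_id."""
    return frame_num * (1 + num_views_minus_1) + view_id


def jmvm_pic_order_cnt(pic_order_cnt, view_id, num_views_minus_1):
    """Redefined POC: the per-view POC scaled by the view count, plus view_id."""
    return pic_order_cnt * (1 + num_views_minus_1) + view_id


# Two views (num_views_minus_1 = 1) decoded view-first: all of view 0,
# then all of view 1, each with frame_num 0, 1, 2.
decoding_order = [(view_id, frame_num)
                  for view_id in (0, 1)
                  for frame_num in (0, 1, 2)]
frame_nums = [jmvm_frame_num(fn, v, 1) for v, fn in decoding_order]
```

Here `frame_nums` comes out as 0, 2, 4, 1, 3, 5, which shows that under view-first coding the redefined FrameNum values interleave across views and do not increase monotonically in decoding order.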
JMVM basically follows the same reference picture marking as that used for AVC. The only difference is that, in JMVM, FrameNum is redefined, so that FrameNumWrap is also redefined as follows:
if( FrameNum > frame_num * (1 + num_views_minus_1) + view_id )
    FrameNumWrap = FrameNum - MaxFrameNum * (1 + num_views_minus_1) + view_id
else
    FrameNumWrap = FrameNum
In the current JMVM standard, inter-view reference pictures are implicitly specified in the SPS (Sequence Parameter Set) extension, wherein the active number of inter-view reference lists and the view id of those pictures are specified. This information is shared by all pictures referring to the same SPS. The reference picture list construction process first performs reference picture list initialization, reordering and truncation in the same way as in AVC, but taking into account all of the reference pictures stored in the DPB. The pictures with view ids specified in the SPS and within the same temporal axis (i.e. having the same capture/output time) are then appended to the reference list in the order in which they are listed in the SPS.
Unfortunately, the above JMVM designs lead to a number of problems. First, it is sometimes desirable that switching of decoded (by a decoder), transmitted (by a sender) or forwarded (by a media gateway or MANE) views could occur at a time index other than one that corresponds to anchor pictures. For example, a base view can be compressed for highest coding efficiency (temporal prediction is heavily used) and anchor pictures are coded infrequently. Consequently, anchor pictures for other views also occur infrequently, as they are synchronized across all views. The current JMVM syntax does not include signaling of a picture from which decoding of a certain view can be started (unless all views of that time index contain an anchor picture).
Second, the allowed reference views for inter-view prediction are specified for each view (and separately for anchor and non-anchor pictures). However, depending on the similarity between a picture being coded and a potential picture in the same temporal axis and in a potential reference view, inter-view prediction may or may not be performed in the encoder. The current JMVM standard uses nal_ref_idc to indicate whether a picture is used for intra-view or inter-view prediction, but it cannot separately indicate if a picture is used for intra-view prediction and/or inter-view prediction. In addition, according to JMVM 1.0, for the AVC compatible view, nal_ref_idc must be set to a non-zero value even if the picture is not used for temporal prediction, when it is used only for inter-view prediction reference. Consequently, if only that view is decoded and output, additional DPB size is needed for storage of such pictures, even though such pictures could be output as soon as they are decoded.
Third, it is noted that the reference picture marking process specified in JMVM 1.0 is basically identical to the AVC process, except for the redefinition of FrameNum, FrameNumWrap and consequently PicNum. Therefore, a number of special problems arise. For example, this process cannot efficiently handle the management of decoded pictures that are required to be buffered for inter-view prediction, particularly when those pictures are not used for temporal prediction reference. The reason is that the DPB management process specified in the AVC standard was intended for single-view coding. In single-view coding such as in the AVC standard, decoded pictures that need to be buffered for temporal prediction reference or future output can be removed from the buffer when they are no longer needed for temporal prediction reference and future output. To enable the removal of a reference picture as soon as it is no longer needed for temporal prediction reference and future output, the reference picture marking process is specified such that it can be known immediately after a reference picture is no longer needed for temporal prediction reference. However, for pictures used for inter-view prediction reference, there is no way to know immediately when a picture is no longer needed for inter-view prediction reference. Consequently, pictures for inter-view prediction reference may be unnecessarily buffered in the DPB, which reduces the efficiency of the buffer memory usage.
In another example, given the way to recalculate the PicNum, if the sliding window operation mode is in use and the number of short-term and long-term pictures is equal to the maximum, the short-term reference picture that has the smallest FrameNumWrap is marked as “unused for reference.” However, because the FrameNum order in the current JMVM does not follow the decoding order, this picture is not necessarily the earliest coded picture, and the sliding window reference picture marking therefore does not operate optimally in the current JMVM. Still further, because PicNum is derived from the redefined and scaled FrameNumWrap, the difference between the PicNum values of two coded pictures is scaled on average. For example, assume that there are two pictures in the same view having frame_num equal to 3 and 5, respectively. When there is only one view, i.e. the bitstream is an AVC stream, then the difference of the two PicNum values would be 2. When coding the picture having frame_num equal to 5, if an MMCO command is needed to mark the picture having PicNum equal to 3 as “unused for reference”, then the difference of the two values minus 1 is equal to 1, which is to be signalled in the MMCO. This value needs 3 bits. However, if there are 256 views, then the difference of the two PicNum values minus 1 would become 511. In this case, 19 bits are required for signalling of the value. Consequently, MMCO commands are much less efficiently coded. Typically, the increased number of bits is equal to 2*log2(number of views) for an MMCO command of the current JMVM compared to single-view coding of H.264/AVC.
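The bit counts in the example above can be reproduced with a short Python sketch. It assumes that the signalled difference minus 1 is coded with an unsigned Exp-Golomb code, which is consistent with the 3-bit and 19-bit figures given in the text. The function names are hypothetical.

```python
def ue_bits(code_num):
    """Length in bits of the unsigned Exp-Golomb codeword for code_num."""
    # A codeword has k leading zeros followed by the (k + 1)-bit binary
    # representation of code_num + 1.
    return 2 * (code_num + 1).bit_length() - 1


def mmco_diff_minus1(cur_frame_num, ref_frame_num, num_views):
    """PicNum difference minus 1 for two frames in the same view.

    Both pictures share the same view_id, so it cancels out and only the
    scaling by the number of views remains.
    """
    return (cur_frame_num - ref_frame_num) * num_views - 1


single_view_bits = ue_bits(mmco_diff_minus1(5, 3, 1))     # value 1
multi_view_bits = ue_bits(mmco_diff_minus1(5, 3, 256))    # value 511
```

Under these assumptions `single_view_bits` is 3 and `multi_view_bits` is 19, a growth of 16 bits, which matches 2*log2(256).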
A fourth set of problems surrounds the reference picture list construction process specified in JMVM 1.0. The reference picture list initialization process considers reference pictures from all views before the reordering process. However, because the pictures from other views used for inter-view prediction are appended to the list after truncating the list, reference pictures from other views do not appear in the reference picture list after reordering and truncation anyway. Therefore, consideration of those pictures in the initialization process is not needed. Furthermore, illegal reference pictures (pictures that have a different view_id than the current picture and are not temporally aligned with the current picture) and repeated inter-view reference pictures may appear in the finally constructed reference picture list.
The reference picture list construction process operates as listed in the following steps: (1) All of the reference pictures are included in the initial list regardless of their view_id and whether they are temporally aligned with the current picture. In other words, the initial reference picture list may contain illegal reference pictures (pictures that have a different view_id than the current picture and are not temporally aligned with the current picture). However, in view-first coding, the beginning of the initial list contains reference pictures from the same view as the current picture. (2) Both intra-view reference pictures and inter-view pictures may be reordered. After reordering, the beginning of the list may still contain illegal reference pictures. (3) The list is truncated, but the truncated list may still contain illegal reference pictures. (4) The inter-view reference pictures are appended to the list in the order they appear in the MVC extension of SPS.
Additionally, the reference picture list reordering process specified in JMVM 1.0 does not allow for the reordering of inter-view frames, which are always placed at the end of the list in the order in which they appear in the MVC extension of SPS. This reduces the flexibility of reference picture list construction, which results in reduced compression efficiency when the default order of inter-view reference frames is not optimal or when certain inter-view reference frames are more likely to be used for prediction than certain intra-view reference frames. Still further, similar to MMCO commands, because PicNum is derived from the redefined and scaled FrameNumWrap, longer VLC codewords are required for coding of RPLR commands involving the signaling of a difference between PicNum values minus 1, compared to the single-view coding of the H.264/AVC standard.