This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC). In addition, there are currently efforts underway with regards to the development of new video coding standards. One such standard under development is the SVC standard, which will become the scalable extension to H.264/AVC. Another standard under development is the multi-view coding standard (MVC), which is also an extension of H.264/AVC. Yet another such effort involves the development of China video coding standards.
A draft of the SVC standard is described in JVT-U202, “Joint Draft 8 with proposed changes”, 21st JVT meeting, HangZhou, China, October 2006, available at http://ftp3.itu.ch/av-arch/jvt-site/2006_10_Hangzhou/JVT-U202.zip. A draft of the MVC standard is described in JVT-U209, “Joint Draft 1.0 on Multiview Video Coding”, 21st JVT meeting, HangZhou, China, October 2006, available at ftp3.itu.ch/av-arch/jvt-site/2006_10_Hangzhou/JVT-U209.zip.
Scalable media is typically ordered into hierarchical layers of data, where a video signal can be encoded into a base layer and one or more enhancement layers. A base layer can contain an individual representation of a coded media stream such as a video sequence. Enhancement layers can contain refinement data relative to previous layers in the layer hierarchy. The quality of the decoded media stream progressively improves as enhancement layers are added to the base layer. An enhancement layer enhances the temporal resolution (i.e., the frame rate), the spatial resolution, and/or simply the quality of the video content represented by another layer or part thereof. Each layer, together with all of its dependent layers, is one representation of the video signal at a certain spatial resolution, temporal resolution and/or quality level. Therefore, the term “scalable layer representation” is used herein to describe a scalable layer together with all of its dependent layers. The portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at a certain fidelity.
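By way of illustration only, the notion of a scalable layer representation can be sketched as a transitive closure over layer dependencies. The following is a minimal sketch, not an extraction procedure from any standard; the `Layer` type, the layer names, and the example hierarchy are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    depends_on: tuple = ()  # names of the layers this layer directly depends on


def layer_representation(target, layers):
    """Return the names of the target layer and of all layers it depends on."""
    by_name = {layer.name: layer for layer in layers}
    needed, pending = set(), [target]
    while pending:
        name = pending.pop()
        if name not in needed:
            needed.add(name)
            pending.extend(by_name[name].depends_on)
    return needed


# Hypothetical hierarchy: one base layer and three enhancement layers.
LAYERS = [
    Layer("base"),
    Layer("spatial", depends_on=("base",)),
    Layer("quality", depends_on=("spatial",)),
    Layer("temporal", depends_on=("base",)),
]

print(sorted(layer_representation("quality", LAYERS)))
```

In this sketch, only the data belonging to the returned layers would need to be extracted from the bitstream in order to decode the chosen representation.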
The concept of a video coding layer (VCL) and a network abstraction layer (NAL) is inherited from advanced video coding (AVC). The VCL contains the signal processing functionality of the codec, e.g., mechanisms such as transform, quantization, motion-compensated prediction, loop filter, and inter-layer prediction. A coded picture of a base or enhancement layer consists of one or more slices. The NAL encapsulates each slice generated by the VCL into one or more NAL units.
Each SVC layer is formed by NAL units, representing the coded video bits of the layer. A Real Time Transport Protocol (RTP) stream carrying only one layer would carry NAL units belonging to that layer only. An RTP stream carrying a complete scalable video bit stream would carry NAL units of a base layer and one or more enhancement layers. SVC specifies the decoding order of these NAL units.
In some cases, data in an enhancement layer can be truncated after a certain location, or at arbitrary positions, where each truncation position may include additional data representing increasingly enhanced visual quality. In cases where the truncation points are closely spaced, the scalability is said to be “fine-grained,” hence the term “fine grained (granular) scalability” (FGS). In contrast to FGS, the scalability provided by those enhancement layers that can only be truncated at certain coarse positions is referred to as “coarse-grained (granularity) scalability” (CGS). In addition, the draft SVC coding standard noted above can also support what is conventionally referred to as “medium grained (granular) scalability” (MGS). According to MGS, quality enhancement pictures are coded similarly to CGS scalable layer pictures, but can be indicated by high-level syntax elements as is similarly done with FGS layer pictures. It may be noted that enhancement layers can collectively include CGS, MGS, and FGS quality (SNR) scalability and spatial scalability.
According to H.264/AVC, an access unit comprises one primary coded picture. In some systems, detection of access unit boundaries can be simplified by inserting an access unit delimiter NAL unit into the bitstream. In SVC, an access unit may comprise multiple primary coded pictures, but at most one picture for each unique combination of dependency_id, temporal_id, and quality_id. A coded picture as described herein can refer to all of the NAL units within an access unit having particular values of dependency_id and quality_id. It is noted that the terms to be used in SVC can change. Therefore, what may be referred to as a coded picture herein may be subsequently referenced by another term, such as a layer representation.
SVC uses a mechanism similar to that used in H.264/AVC to provide hierarchical temporal scalability. In SVC, a certain set of reference and non-reference pictures can be dropped from a coded bitstream without affecting the decoding of the remaining bitstream. Hierarchical temporal scalability requires multiple reference pictures for motion compensation, i.e., there is a reference picture buffer containing multiple decoded pictures from which an encoder can select a reference picture for inter prediction. In H.264/AVC, a feature called sub-sequences enables hierarchical temporal scalability, where each enhancement layer contains sub-sequences and each sub-sequence contains a number of reference and/or non-reference pictures. A sub-sequence also comprises a number of inter-dependent pictures that can be disposed of without any disturbance to any other sub-sequence in any lower sub-sequence layer. The sub-sequence layers are hierarchically arranged based on their dependency on each other and are equivalent to temporal levels in SVC. Therefore, when a sub-sequence in the highest sub-sequence layer is disposed of, the remaining bitstream remains valid. In H.264/AVC, signaling of temporal scalability information is effectuated by using sub-sequence-related supplemental enhancement information (SEI) messages. In SVC, the temporal level hierarchy is indicated in the header of NAL units.
In addition, SVC uses an inter-layer prediction mechanism, whereby certain information can be predicted from layers other than a currently reconstructed layer or a next lower layer. Information that could be inter-layer predicted includes intra texture, motion, and residual data. Inter-layer motion prediction also includes the prediction of a block coding mode, header information, etc., where motion information from a lower layer may be used for predicting a higher layer. It is also possible to use intra coding in SVC, i.e., a prediction from surrounding macroblocks (MBs) or from co-located MBs of lower layers. Such prediction techniques do not employ motion information and hence, are referred to as intra prediction techniques. Furthermore, residual data from lower layers can also be employed for predicting the current layer.
When compared to previous video compression standards, SVC's spatial scalability has been generalized to enable the base layer to be a cropped and zoomed version of the enhancement layer. Moreover, quantization and entropy coding modules have also been adjusted to provide FGS capability. The coding mode is referred to as progressive refinement, where successive refinements of transform coefficients are encoded by repeatedly decreasing the quantization step size and applying a “cyclical” entropy coding akin to sub-bitplane coding.
SVC also specifies a concept of single-loop decoding. Single-loop decoding can be enabled by using a constrained intra texture prediction mode, where an inter-layer intra texture prediction can be applied to MBs for which the corresponding block of the base layer is located inside intra-MBs. At the same time, those intra-MBs in the base layer use constrained intra prediction. Therefore, in single-loop decoding, the decoder needs to perform motion compensation and full picture reconstruction only for the scalable layer desired for playback (i.e., the desired layer), thereby reducing decoding complexity. All layers other than the desired layer do not need to be fully decoded because all or part of the data of the MBs not used for inter-layer prediction (be it inter-layer intra texture prediction, inter-layer motion prediction or inter-layer residual prediction) is not needed for reconstruction of the desired layer.
A single decoding loop is generally necessary for the decoding of most pictures, while a second decoding loop is applied to reconstruct the base representations. It should be noted that no FGS or MGS enhancement of an access unit is used in the reconstruction of a base representation of the access unit. The base representations are needed for prediction reference but not for output or display, and are reconstructed only for “key pictures.” A base representation is typically used for inter prediction of the base representation of the next key picture. Periodical use of base representations in inter prediction stops potential drift and its temporal propagation caused by those FGS or MGS enhancement layer NAL units that have been truncated or lost in the transmission path from the encoder to the decoder.
The scalability structure in the SVC draft noted above is characterized by three syntax elements: temporal_id; dependency_id; and quality_id. The syntax element, temporal_id, is used to indicate the temporal scalability hierarchy or, indirectly, the frame rate. A scalable layer representation comprising pictures of a smaller maximum temporal_id value has a smaller frame rate than a scalable layer representation comprising pictures of a greater maximum temporal_id. A given temporal layer typically depends on the lower temporal layers (e.g., the temporal layers with smaller temporal_id values) but does not generally depend on any higher temporal layer.
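As a hedged illustration of how temporal_id relates to frame rate, the following sketch drops all NAL units above a chosen temporal_id. The `NalUnit` type, the dyadic assignment of temporal_id values, and the 16-picture stream are assumptions made for the example and are not taken from the SVC draft.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnit:
    access_unit: int
    temporal_id: int


def dyadic_temporal_id(au_index, levels=4):
    """Assumed dyadic hierarchy: every 2**(levels-1)-th picture has temporal_id 0."""
    pos = au_index % (1 << (levels - 1))
    if pos == 0:
        return 0
    tid = levels - 1
    while pos % 2 == 0:
        pos //= 2
        tid -= 1
    return tid


def drop_above(nal_units, max_temporal_id):
    """Keep only the NAL units whose temporal_id does not exceed the target."""
    return [n for n in nal_units if n.temporal_id <= max_temporal_id]


STREAM = [NalUnit(i, dyadic_temporal_id(i)) for i in range(16)]

half_rate = drop_above(STREAM, max_temporal_id=2)
lowest_rate = drop_above(STREAM, max_temporal_id=0)
print(len(STREAM), len(half_rate), len(lowest_rate))
```

Because lower temporal layers do not depend on higher ones in this model, each reduced stream remains decodable, only at a lower frame rate.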
The syntax element, dependency_id, can be used to indicate the CGS inter-layer coding dependency hierarchy (which includes both SNR and spatial scalability). At any temporal level location, a picture of a smaller dependency_id value may be used for inter-layer prediction for coding of a picture with a larger dependency_id value.
The syntax element, quality_id, can be used to indicate the quality level hierarchy of a FGS or MGS layer. At any temporal location, and with an identical dependency_id value, a picture with a quality_id equal to QL uses the picture with a quality_id equal to QL-1 for inter-layer prediction. A coded slice with a quality_id larger than zero may be coded as either a truncatable FGS slice or a non-truncatable MGS slice.
For simplicity, all of the data units (e.g., NAL units in the SVC context) in one access unit having an identical or matching dependency_id value are referred to as a dependency unit, where a temporal level hierarchy can be indicated in the header of a NAL unit.
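The grouping of NAL units into dependency units can be sketched as follows. This is a simplified model under stated assumptions: the `NalUnit` type carries only the identifiers discussed above, and the example values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnit:
    access_unit: int
    dependency_id: int
    quality_id: int
    temporal_id: int


def dependency_units(nal_units):
    """Group NAL units by (access unit, dependency_id)."""
    groups = defaultdict(list)
    for nal in nal_units:
        groups[(nal.access_unit, nal.dependency_id)].append(nal)
    return dict(groups)


# Hypothetical access unit 0 with two dependency layers; layer 1 has one quality refinement.
NALS = [
    NalUnit(0, dependency_id=0, quality_id=0, temporal_id=0),
    NalUnit(0, dependency_id=1, quality_id=0, temporal_id=0),
    NalUnit(0, dependency_id=1, quality_id=1, temporal_id=0),
]

UNITS = dependency_units(NALS)
print(sorted(UNITS))
```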
One characteristic feature of SVC is that the FGS NAL units can be freely dropped or truncated and MGS NAL units can be freely dropped without affecting the conformance of the bitstream. However, when that FGS or MGS data has been used as an inter prediction reference during encoding, dropping or truncating the data would result in a mismatch between the reconstructed signal at the decoder side and the reconstructed signal at the encoder side. This mismatch can be referred to as drift, as noted above.
To control drift due to the dropping or truncating of FGS or MGS data, SVC can, in a certain dependency unit, store a base representation (by decoding only the CGS picture with quality_id equal to zero and all the depended-on lower layer data) in a decoded picture buffer. When encoding a subsequent dependency unit with the same dependency_id value, all of the NAL units, including FGS or MGS NAL units, use the base representation as an inter prediction reference. Consequently, all drift due to the dropping or truncating of FGS or MGS NAL units in an earlier access unit is stopped at this access unit. For other dependency units with the same value of dependency_id, all of the NAL units use the enhanced representations (decoded from NAL units with the greatest value of quality_id and the depended-on lower layer data) as an inter prediction reference. Such a technique can result in a high coding efficiency.
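The effect of key pictures on drift can be illustrated with a toy scalar model. This is a sketch under strong assumptions and is not the SVC decoding process: each "picture" is a single integer, the coarse residual stands in for quality_id equal to zero data (assumed always received), the refinement stands in for FGS or MGS data, and the function name, quantization step, and source values are all hypothetical.

```python
def simulate_drift(sources, key_period, lost_enhancements, step=4):
    """Return per-picture |encoder reconstruction - decoder reconstruction|."""
    enc_base = dec_base = 0  # base representation of the latest key picture
    enc_ref = dec_ref = 0    # enhanced representation of the previous picture
    mismatch = []
    for k, source in enumerate(sources):
        is_key = k % key_period == 0
        # Encoder: key pictures predict from the stored base representation,
        # other pictures predict from the enhanced representation.
        enc_pred = enc_base if is_key else enc_ref
        diff = source - enc_pred
        base_residual = (diff // step) * step  # coarse data, always received
        refinement = diff - base_residual      # FGS/MGS-like data, may be lost
        enc_recon = enc_pred + base_residual + refinement
        # Decoder: applies the same residuals, minus any lost refinement.
        dec_pred = dec_base if is_key else dec_ref
        received = 0 if k in lost_enhancements else refinement
        dec_recon = dec_pred + base_residual + received
        if is_key:
            enc_base = enc_pred + base_residual
            dec_base = dec_pred + base_residual
        enc_ref, dec_ref = enc_recon, dec_recon
        mismatch.append(abs(enc_recon - dec_recon))
    return mismatch


SOURCES = [3, 13, 23, 33, 43, 53]
with_key = simulate_drift(SOURCES, key_period=4, lost_enhancements={1})
without_key = simulate_drift(SOURCES, key_period=len(SOURCES), lost_enhancements={1})
print(with_key, without_key)
```

In this model, losing the refinement of picture 1 causes a mismatch that persists until the next key picture (picture 4) when key pictures recur, and persists to the end of the sequence when they do not.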
According to the SVC draft described in the JVT-U202 reference noted above, each NAL unit includes, in the NAL unit header, a syntax element referred to as use_base_prediction_flag. When the use_base_prediction_flag value equals one, it specifies that decoding of the NAL unit uses the base representations of the reference pictures during the inter prediction process. The syntax element, store_base_rep_flag, specifies whether (when equal to one) or not (when equal to zero) the base representation of the current picture is stored, in addition to the enhanced representation, for future pictures to use for inter prediction.
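The interaction of the two flags can be sketched with a simplified buffer model. The `DecodedPictureBuffer` class and its methods are hypothetical and serve only to illustrate the flag semantics described above; representations are modeled as opaque strings.

```python
from dataclasses import dataclass, field


@dataclass
class DecodedPictureBuffer:
    base: dict = field(default_factory=dict)      # access unit -> base representation
    enhanced: dict = field(default_factory=dict)  # access unit -> enhanced representation

    def store(self, access_unit, enhanced_rep, base_rep, store_base_rep_flag):
        """Always keep the enhanced representation; keep the base one only if flagged."""
        self.enhanced[access_unit] = enhanced_rep
        if store_base_rep_flag == 1:
            self.base[access_unit] = base_rep

    def reference(self, access_unit, use_base_prediction_flag):
        """Select the representation used as the inter prediction reference."""
        if use_base_prediction_flag == 1:
            return self.base[access_unit]
        return self.enhanced[access_unit]


dpb = DecodedPictureBuffer()
dpb.store(0, enhanced_rep="E0", base_rep="B0", store_base_rep_flag=1)  # key picture
dpb.store(1, enhanced_rep="E1", base_rep="B1", store_base_rep_flag=0)  # non-key picture
print(dpb.reference(0, use_base_prediction_flag=1), dpb.reference(0, use_base_prediction_flag=0))
```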
In conversational video communications systems, such as video telephony, there is usually a feedback channel from a receiver to a sender. The feedback channel can be utilized for, among other things, recovering from transmission errors. Interactive error control messages from the receiver to the sender can be categorized as intra update requests, loss indications, and positive acknowledgements of correctly received and decoded data. The encoder can respond to such messages by intra coding or performing encoding using only those reference pictures that are correct in content. The encoder can also further improve compression efficiency and completeness of error correction if it tracks the spatial propagation of the indicated errors at the decoder side. Moreover the encoder can recover those areas that are damaged by spatial error propagation and use any undamaged areas as references for inter prediction.
Various literature and standards regarding interactive error control for low-latency video communication have been provided, where both ITU-T H.323/H.324-based video conferencing systems and RTP-based conferencing systems are considered.
ITU-T Recommendation H.245 is a control protocol for ITU-T H.323/324 video conferencing systems. Among other things, it specifies commands and indications used in a feedback channel from a receiver to a sender. A command according to H.245 can be a message that requires action but no explicit response. Alternatively, an indication according to H.245 can contain information, which does not require an action or response thereto. H.245 specifies messages for H.261, H.263, MPEG-1 video, and MPEG-2 video. In addition, the use of H.264/AVC in H.323/324 video conferencing systems is specified in ITU-T Recommendation H.241.
RTP can be used for transmitting continuous media data, such as coded audio and video streams in Internet Protocol (IP)-based networks. The Real-time Transport Control Protocol (RTCP) is a companion of RTP, i.e., RTCP can always be used to complement RTP when the network and application infrastructure allow it. RTP and RTCP are generally conveyed over the User Datagram Protocol (UDP), which in turn, is conveyed over IP. There are two versions of IP, i.e., IPv4 and IPv6, where one difference between the two versions has to do with the number of addressable endpoints.
RTCP can be used to monitor the quality of service provided by a network and to convey information about the participants in an on-going session. RTP and RTCP are designed for sessions that range from one-to-one communication to large multicast groups of thousands of endpoints. In order to control the total bitrate caused by RTCP packets in a multiparty session, the transmission interval of RTCP packets transmitted by a single endpoint is relative to the number of participants in the session. Each media coding format has a specific RTP payload format, which specifies how the media data is structured in the payload of an RTP packet.
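As a rough illustration of how the RTCP transmission interval scales with session size, the following sketch divides a fixed RTCP bandwidth budget among the participants. It is a simplification loosely modeled on the interval computation of the RTP specification and omits, among other things, the sender/receiver bandwidth split and interval randomization; the function name, parameters, and example numbers are assumptions.

```python
def rtcp_interval(participants, avg_packet_bytes, rtcp_bytes_per_second, min_interval=5.0):
    """Seconds between RTCP packets from one endpoint (simplified model).

    The session-wide RTCP budget is shared by all participants, so the
    per-endpoint interval grows linearly with the number of participants.
    """
    shared = participants * avg_packet_bytes / rtcp_bytes_per_second
    return max(min_interval, shared)


# Hypothetical session: 400 bytes/s RTCP budget, 100-byte average RTCP packet.
small_session = rtcp_interval(10, avg_packet_bytes=100, rtcp_bytes_per_second=400)
large_session = rtcp_interval(1000, avg_packet_bytes=100, rtcp_bytes_per_second=400)
print(small_session, large_session)
```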
A number of profiles have been specified for RTP, each of which specifies extensions or modifications to RTP that are specific to a particular family of applications. A popular profile is called the RTP profile for audio and video conferences with minimal control (RTP/AVP). The specification provides the semantics of generic fields in an RTP header for use in audio and video conferences. The specification also specifies the RTP payload format for certain audio and video codecs.
Another RTP profile is known as the audio-visual profile with feedback (RTP/AVPF). The RTP/AVPF allows terminals to send feedback faster than RTCP originally allowed and can therefore be used to convey messages for interactive error repair. If the number of participants in a session is smaller than a certain threshold, the Immediate Feedback mode of RTP/AVPF can be used. The Immediate Feedback mode allows each participant to report a feedback event almost immediately. The early RTCP mode of RTP/AVPF is applied when the number of participants is such that the Immediate Feedback mode cannot be used. In that mode, faster feedback than plain RTCP is enabled, but without the near-immediate feedback of the Immediate Feedback mode.
A simple method of recovery from transmission errors is to request a far-end encoder to encode erroneous areas in intra coding mode. In addition to recovery from transmission errors, a fast update picture command can be issued by a multipoint conference control unit (MCU) when there is a need to switch from one video originator to another during centralized multipoint conferencing. H.245 provides three video fast update commands: fast update commands for a picture; fast update commands for a group of blocks (GOB) of H.261 and H.263; and fast update commands for a number of MBs in raster scan order. These fast update commands are generally referred to as videoFastUpdatePicture, videoFastUpdateGOB, and videoFastUpdateMB, respectively.
The fast update commands require the encoder to update an indicated picture area, which in practice is interpreted as intra coding, although the encoder response to the fast update commands is not specified explicitly in H.245. In contrast to H.245, H.241 only allows for the fast update picture command for H.264/AVC, and specifies two alternative procedures to respond to a received fast update picture command. In a first procedure, an Instantaneous Decoding Refresh (IDR) picture and any referred parameter sets can be transmitted. In a second procedure, the picture area is updated gradually with intra coding, e.g., over a number of consecutive pictures, a recovery point SEI message is sent to indicate when the entire picture area is correct in content, and any referred parameter sets are also transmitted. This gradual recovery procedure can be used in error-prone transmission environments in which an IDR picture would be likely to experience transmission errors due to its large size relative to a typical inter picture. The codec control messages for RTP/AVPF include a full intra request command, which is equivalent to the video fast update picture command of H.245.
Intra coding resulting from the fast update commands reduces compression efficiency as compared to inter coding. In order to improve the compression efficiency, an encoder can choose a reference picture for inter prediction that is known to be correct and available based on feedback from the far-end decoder. This technique is often referred to as NEWPRED as described in “Study on Adaptive Reference Picture Selection Coding Scheme For the NEWPRED-Receiver-Oriented Mobile Visual Communication System” to Kimata et al. This technique requires that the video coding scheme allows the use of multiple reference pictures. Hence, H.263 Annex N, Annex U, or H.264/AVC, for example, can be used. In accordance with NEWPRED, two types of feedback messages can be utilized: negative acknowledgements (NACKs) for indicating that a certain packet or a certain picture or certain areas of a particular picture were not received correctly; and positive acknowledgements (ACKs) for indicating which pictures or parts of pictures are either correctly received or correct in content. A picture or a part thereof is correct in content if the coded data is correctly received and all of the data used for prediction is correct.
When NACKs are in use, an encoder conventionally uses any available reference picture for inter prediction, except for those pictures that are known to be erroneous based on the received NACK messages. Because end-to-end delay may be greater than the interval between two encoded pictures, the encoder may not know that some of the recently encoded reference pictures are not received correctly at the time of the encoding of a new picture. Thus, the NACK mode of NEWPRED stops error propagation in approximately one round-trip time period, similar to the fast update requests. When ACKs are in use, the encoder typically uses only those reference pictures for inter prediction that are known to be correct in content based on the received ACK messages.
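The difference between the two NEWPRED modes can be sketched as a reference selection rule. This is a minimal sketch under stated assumptions: pictures are identified by integer picture numbers, the most recent usable picture is preferred, and pictures predicted from a NACKed picture are not tracked; the function name and signature are hypothetical.

```python
def select_reference(candidates, mode, acked=frozenset(), nacked=frozenset()):
    """Return the most recent usable reference picture number, or None for intra coding."""
    if mode == "NACK":
        # Use anything not known to be erroneous.
        usable = [p for p in candidates if p not in nacked]
    elif mode == "ACK":
        # Use only pictures known to be correct in content.
        usable = [p for p in candidates if p in acked]
    else:
        raise ValueError("mode must be 'NACK' or 'ACK'")
    return max(usable) if usable else None


BUFFERED = [5, 6, 7, 8]  # hypothetical picture numbers held as reference candidates
print(select_reference(BUFFERED, "NACK", nacked={7, 8}))
print(select_reference(BUFFERED, "ACK", acked={5}))
```

In the NACK mode, a picture whose loss has not yet been reported may still be selected, which reflects the round-trip delay limitation described above; in the ACK mode, the encoder falls back to intra coding when no acknowledged picture is available.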
Various mechanisms exist for conveying reference picture selection messages. The syntax for the messages can be specified within the control protocol in use. Alternatively, the control protocol can provide a generic framing mechanism to convey reference picture selection messages that are specified external to the control protocols. In accordance with control protocol-specified messages, H.245 includes loss commands that are specific to an entire picture (i.e., the lostPicture command). H.245 also includes loss commands that are specific to a number of MBs in raster scan order (i.e., the videoBadMBs command for use with H.263 Annex N). Lastly, H.245 includes loss commands that can explicitly indicate the picture in which a loss occurred (i.e., the lostPartialPicture command for use with H.263 Annex U).
The far-end video encoder must take corrective action, such as intra coding or selection of a correct reference picture, as a response to a received loss command. The recovery reference picture command of H.245 requires the far-end encoder to use only the indicated pictures for prediction. In other words, it is similar to the ACK message described above. It should be noted that RTP/AVPF includes a generic NACK message, which is able to indicate the loss of one or more RTP packets, a picture loss indication, and a slice loss indication. The picture loss indication is equivalent to the lostPicture command of H.245, and the slice loss indication is equivalent to the lostPartialPicture command of H.245.
As noted above, the payload syntax for back-channel messages can be specific to a codec or the syntax may be generic to any codec and the semantics of the generic syntax are specified for each codec separately. Examples of codec-specific back-channel syntax can include the messages specified in H.263 Annex N and Annex U and the NEWPRED upstream message of MPEG-4 Visual described above. Alternatively, ITU-T Recommendation H.271 specifies a generic back-channel message syntax for use with any video codec. Six messages are specified in H.271 including: an indication that one or more pictures are decoded without detected errors; an indication that one or more pictures are entirely or partially lost; and an indication that all or certain data partitions of a set of coding blocks of one picture are lost. In addition, the following messages are also specified in H.271: a cyclical redundancy check (CRC) value for one parameter set; a CRC value for all parameter sets of a certain type; and a reset request indicating that the far-end encoder should completely refresh the transmitted bitstream as if no prior video data had been received.
The semantics used for identifying a picture, the size of the coding block in terms of samples, and the definition of parameter sets are specific to the coding format. Therefore, H.271 specifies the semantics of the generic message syntax for H.261, H.263, and H.264/AVC. The back-channel messages specified in H.263 Annex N and Annex U as well as in H.271 can be conveyed in a separate logical channel on top of H.245. Similarly, RTP/AVPF includes a reference picture selection indication that carries a back-channel message according to a video coding standard, where the codec control messages extension of RTP/AVPF includes a video back-channel message that carries messages according to H.271.
Error tracking refers to a determination of whether a picture or part of a picture is correct in content. Alternatively, error tracking can refer to whether or not a picture or part of a picture is somehow not in accordance with associated information regarding data loss, corruption in transmission, and/or a coding prediction relationship. The coding prediction relationship includes the conventional inter prediction (i.e., motion compensation prediction), the conventional in-picture prediction (i.e., intra picture sample or coefficient prediction, motion vector prediction, and loop filtering), and inter-layer prediction in SVC context. Error tracking can either be performed by the encoder or the decoder.
For example, if a frame n is damaged and a corresponding back-channel feedback message arrives at the encoder when it is time to encode frame n+d, the encoder reconstructs the location of the damaged areas in frames n to n+d−1 in the decoder. The reconstruction can be based on the motion vectors in frames n+1 to n+d−1. Therefore, the encoder can avoid using any of the damaged areas in frames n to n+d−1 for inter prediction. An example of an error tracking algorithm is provided in H.263.
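Encoder-side error tracking of this kind can be sketched at block granularity. The sketch assumes that each block references at most one whole block of the immediately preceding frame, which ignores sub-block motion, multiple references, and in-picture prediction; the function name and the example motion fields are hypothetical and do not reproduce the algorithm of H.263.

```python
def track_errors(damaged_in_n, motion_fields):
    """Propagate damaged block indices from frame n through the following frames.

    motion_fields holds one dict per frame n+1 .. n+d-1, mapping each block
    index to the block it references in the previous frame (None = intra).
    Returns a list of damaged block sets for frames n .. n+d-1.
    """
    damaged = [set(damaged_in_n)]
    for refs in motion_fields:
        previous = damaged[-1]
        damaged.append(
            {blk for blk, ref in refs.items() if ref is not None and ref in previous}
        )
    return damaged


# Hypothetical 4-block frames: block 1 of frame n is damaged.
MOTION = [
    {0: 0, 1: 1, 2: 1, 3: 3},     # frame n+1: blocks 1 and 2 reference damaged block 1
    {0: 0, 1: None, 2: 2, 3: 2},  # frame n+2: block 1 is intra coded and thus clean
]
DAMAGE = track_errors({1}, MOTION)
print(DAMAGE)
```

With such a map, an encoder could avoid the damaged blocks of frames n to n+d−1 when selecting inter prediction references for frame n+d.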
Error tracking can be further refined if the feedback messages contain information regarding which error concealment method the decoder used or which error concealment method has been pre-determined in a system. In response to receiving a feedback message concerning frame n, the encoder must reconstruct the decoding process exactly for frames n to n+d−1 so that the reference pictures at the encoder match the reference pictures in the decoder accurately. Support for joint error concealment and error tracking is included in H.245. To be more precise, the “not-decoded MBs indication” of H.245 can signal which MBs were received erroneously and treated as not coded. In other words, the message indicates that a copy of the co-located MBs in the previous frame was used for error concealment. However, due to the computational requirements and complexity of error tracking associated with known error concealment, there are no relevant mechanisms other than the not-decoded MBs indication of H.245 available. Moreover, the not-decoded MBs indication is not widely used.
Use of a feedback message in association with pictures that are stored as reference pictures but not output, similar to the base representations for FGS and MGS, has been described in U.S. patent application Ser. No. 09/935,119 and U.S. patent application Ser. No. 11/369,321.
However, problems exist with the above-described conventional systems and methods. The problem is illustrated with the following example, where the example assumes a video communication system with live encoding and a feedback channel from the far-end decoder to the encoder. The following two access units in the bitstream can be considered:

    Q1,n . . . Q1,n+m
    Q0,n . . . Q0,n+m

where the notation is as follows:

    Q0,n: coded picture with quality_id equal to zero of access unit n
    Q1,n: coded picture with quality_id equal to one of access unit n
    Q0,n+m: coded picture with quality_id equal to zero of access unit n+m
    Q1,n+m: coded picture with quality_id equal to one of access unit n+m
Access unit n is a key picture, i.e., the encoder sets the value of the store_base_rep_flag equal to 1. It can be assumed that all the quality layers of access unit n are successfully decoded and the far-end decoder sends a feedback message indicating the successful decoding to the encoder. The feedback message is received before encoding the next “key” access unit (n+m). When encoding the access unit n+m, the encoder can set a use_base_prediction_flag. The use_base_prediction_flag can be set to zero for (Q0,n+m) and (Q1,n+m), such that both of the coded pictures are predicted from (Q1,n) instead of (Q0,n) for improved coding efficiency. At the same time a store_base_rep_flag can be set to one for both (Q1,n+m) and (Q0,n+m), such that the base representation is stored for future pictures' inter prediction.
Therefore, a problem exists in that (Q1,n) may be lost during transmission. Alternatively, a media-aware network element (MANE) or the sender may adapt it by discarding some or all of the data of (Q1,n). That is, a detection of whether the access unit n is correctly decoded in its entirety is needed to create a valid feedback message from the far-end decoder to the encoder. However, according to the SVC draft in JVT-U202, the far-end decoder has no way to determine whether (Q1,n) was originally present in the bitstream or whether (Q1,n) originally contained more data. This is because the bitstream may be valid regardless of the presence of the FGS or MGS picture (Q1,n). Furthermore, when (Q1,n) contains FGS slices, there is no way to determine whether the NAL units have been truncated.
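The ambiguity can be made concrete with a small sketch. NAL units are modeled as hypothetical (access unit, quality_id) pairs, and `forward` stands in for any MANE or sender adaptation that discards quality layers; neither is taken from the SVC draft.

```python
def forward(nal_units, max_quality_id):
    """Hypothetical adaptation step: discard NAL units above a quality level."""
    return [nal for nal in nal_units if nal[1] <= max_quality_id]


# Case A: the encoder produced Q0,n and Q1,n, and Q1,n was discarded in transit.
sent_a = [("n", 0), ("n", 1)]
received_a = forward(sent_a, max_quality_id=0)

# Case B: the encoder produced only Q0,n, and nothing was discarded.
sent_b = [("n", 0)]
received_b = forward(sent_b, max_quality_id=1)

# Both cases deliver the same valid data to the far-end decoder.
print(received_a == received_b)
```

Since the decoder input is identical in both cases, the received data alone does not indicate whether access unit n was decoded in its entirety as the encoder produced it.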