Recent developments in video and audio coding have produced effective layered representations. A layered representation is one in which the original signal is represented at more than one fidelity level using a corresponding number of bitstreams. One example of a layered representation is scalable coding. In scalable coding, such as that used in ITU-T Recommendation H.264 Annex G (Scalable Video Coding, SVC), incorporated herein by reference in its entirety, a first fidelity point is obtained by encoding the source using standard H.264 techniques (Advanced Video Coding, AVC). An additional fidelity point can be obtained by encoding the resulting coding error (the difference between the original signal and the decoded version of the first fidelity point) and transmitting it in its own bitstream. This pyramidal construction is quite common (e.g., it was used in MPEG-2 and MPEG-4 Video). The first (lowest) fidelity level bitstream is referred to as the base layer, and the bitstreams providing the additional fidelity points are referred to as enhancement layers. The fidelity enhancement can be in any fidelity dimension. For example, for video it can be temporal (frame rate), quality (SNR), or spatial (picture size). For audio, it can be temporal (samples per second), quality (SNR), or additional channels. Note that the various layer bitstreams can be transmitted separately or, typically, can be transmitted multiplexed in a single bitstream with appropriate information that allows the direct extraction of the sub-bitstreams corresponding to the individual layers.
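The pyramidal construction can be illustrated with a minimal sketch. It is not the H.264 SVC algorithm: a uniform quantizer stands in for the lossy codec, and the names `quantize`, `encode_layers`, `decode_layers`, and the step sizes are assumptions chosen only for illustration. The base layer is a coarse version of the source, and the enhancement layer encodes the coding error of the base layer.

```python
def quantize(samples, step):
    """Round each sample to the nearest multiple of `step` (stand-in for a lossy codec)."""
    return [step * round(s / step) for s in samples]


def encode_layers(samples, base_step=16, enh_step=4):
    """Produce a base layer and one enhancement layer (hypothetical step sizes)."""
    # Base layer: coarse representation of the source.
    base = quantize(samples, base_step)
    # Enhancement layer: the coding error of the base layer, coded more finely.
    residual = [s - b for s, b in zip(samples, base)]
    enhancement = quantize(residual, enh_step)
    return base, enhancement


def decode_layers(base, enhancement=None):
    """Decode the base layer alone, or base plus enhancement when both are available."""
    if enhancement is None:
        return list(base)
    return [b + e for b, e in zip(base, enhancement)]


source = [3, 18, 41, 77, 120]
base, enhancement = encode_layers(source)
low_fidelity = decode_layers(base)
high_fidelity = decode_layers(base, enhancement)

# Worst-case reconstruction error at each fidelity point.
err_low = max(abs(s - d) for s, d in zip(source, low_fidelity))
err_high = max(abs(s - d) for s, d in zip(source, high_fidelity))
```

A receiver that has only the base layer decodes `low_fidelity`; a receiver that also has the enhancement layer obtains `high_fidelity`, whose error is bounded by half the finer step size.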
Another example of a layered representation is multiple description coding. Here the construction is not pyramidal: each layer is independently decodable and provides a representation at a basic fidelity; if more than one layer is available to the decoder, however, then it is possible to provide a decoded representation of the original signal at a higher level of fidelity. One example would be transmitting the odd and even pictures of a video signal as two separate bitstreams. Each bitstream alone offers a first level of fidelity, whereas any information received from other bitstreams can be used to enhance this first level of fidelity. If all streams are received, then there is a complete representation of the original at the maximum level of quality afforded by the particular representation.
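The odd/even example can be sketched as follows. This is an illustration only: pictures are represented by plain labels rather than coded data, and the function names `split_descriptions` and `reconstruct` are hypothetical.

```python
def split_descriptions(pictures):
    """Split a picture sequence into two descriptions: even- and odd-indexed pictures."""
    return pictures[0::2], pictures[1::2]


def reconstruct(even=None, odd=None):
    """Rebuild the sequence from whichever descriptions were received."""
    if even is None and odd is None:
        raise ValueError("at least one description is required")
    if even is not None and odd is not None:
        # Both descriptions received: re-interleave to obtain the full sequence.
        merged = []
        for i in range(len(even) + len(odd)):
            description = even if i % 2 == 0 else odd
            merged.append(description[i // 2])
        return merged
    # Only one description received: half the frame rate, but still decodable.
    return list(even if even is not None else odd)


pictures = ["p0", "p1", "p2", "p3", "p4"]
even, odd = split_descriptions(pictures)
full_rate = reconstruct(even, odd)
half_rate = reconstruct(even=even)
```

Either description alone yields a usable half-rate sequence, and receiving both restores the original sequence.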
Yet another example of a layered representation, and an extreme one, is simulcasting. In this case, two or more independent representations of the original signal are encoded and transmitted in their own streams. This is often used, for example, to transmit Standard Definition TV material and High Definition TV material. Simulcasting can be regarded as a special case of scalable coding in which no inter-layer prediction is used.
When layered representations of audio or video signals are transmitted over packet-based networks, there are advantages when each layer (or group of layers) is transmitted over its own connection, or session. In this way, a receiver that wishes to decode only the base quality needs to receive only the session carrying the base layer, and is not burdened by the additional bit rate required to receive the additional layers. Layered multicast is a well-known application that uses this architecture. Here the source multicasts the content's layers over multiple multicast channels, and receivers “subscribe” only to the layer channels they wish to receive.
Transmission of video and audio in IP-based networks typically uses the Real-time Transport Protocol (RTP) as the transport protocol. RTP typically operates over UDP, and provides a number of features needed for transmitting real-time content, such as payload type identification, sequence numbering, time stamping, and delivery monitoring. Each source transmitting over an RTP session is identified by a unique SSRC (Synchronization Source). The packet sequence number and timestamp of an RTP packet are associated with that particular SSRC.
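The fields mentioned above are carried in the 12-byte fixed RTP header. The following sketch reads them from a raw packet; the names `RtpHeader` and `parse_rtp_header` are hypothetical, and the sketch ignores CSRC lists, header extensions, and padding.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the fixed 12-byte RTP header (CSRC list and extensions are not handled)."""
    if len(packet) < 12:
        raise ValueError("an RTP packet must contain at least a 12-byte fixed header")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=byte0 >> 6,            # top two bits of the first byte
        marker=bool(byte1 & 0x80),     # top bit of the second byte
        payload_type=byte1 & 0x7F,     # low seven bits of the second byte
        sequence_number=seq,
        timestamp=timestamp,
        ssrc=ssrc,
    )


# Example packet: version 2, marker set, dynamic payload type 96 (an arbitrary choice).
example = struct.pack("!BBHII", 0x80, 0x80 | 96, 1234, 90000, 0xDEADBEEF) + b"payload"
header = parse_rtp_header(example)
```

Because the sequence number and timestamp are scoped to one SSRC, a receiver would keep separate ordering state for each `ssrc` value it observes.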
In general, the transmission order of media packets in an RTP stream follows the intended decoding order. In some applications, however, it is desirable to be able to modify the transmission order, “interleaving” the packets. One example is rate shaping, where a transmitter changes the order of transmission of packets in order to better utilize a given fixed available bitrate while at the same time minimizing the buffering that must be used at a receiver prior to commencing playback (ensuring uninterrupted playback after that time).
When interleaving is used, the sequence number present in RTP packets no longer corresponds to the decoding order. At the same time, if, for example, video is coded with bi-directional prediction (e.g., MPEG-2 B pictures or H.264 B pictures), then the decoding order is no longer identical to the ordering implied by the RTP timestamp of each packet. For example, in MPEG-2 coding with a pattern of I1 B2 B3 P4, the picture P4 has a later RTP timestamp than B2 but has to be decoded prior to decoding B2 (or B3). For most, if not all, codecs, it is impossible to correctly recover the decoding order when interleaving is used with bi-directional prediction, unless one examines the contents of the media packets.
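The I1 B2 B3 P4 example can be made concrete with a small sketch. The timestamp values assume a 90 kHz clock and 30 pictures per second, and the `decode_index` field is a hypothetical stand-in for information that the RTP header itself does not carry.

```python
# Each picture carries an RTP timestamp (presentation time) and,
# for illustration only, its position in decoding order.
pictures = [
    {"name": "I1", "timestamp": 0,    "decode_index": 0},
    {"name": "B2", "timestamp": 3000, "decode_index": 2},
    {"name": "B3", "timestamp": 6000, "decode_index": 3},
    {"name": "P4", "timestamp": 9000, "decode_index": 1},
]

# Ordering implied by the RTP timestamps (display order).
by_timestamp = [p["name"] for p in sorted(pictures, key=lambda p: p["timestamp"])]

# Ordering required by the decoder: P4 must precede the B pictures that reference it.
by_decoding = [p["name"] for p in sorted(pictures, key=lambda p: p["decode_index"])]

# The two orderings differ, so timestamps alone cannot recover the decoding order.
orders_differ = by_timestamp != by_decoding
```

If packets are also interleaved, the sequence number stops tracking `decode_index` as well, which leaves the receiver with no header field from which to recover the decoding order.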
A solution to this problem is RFC 3984, which uses the concept of the “Decoding Order Number” (DON), a specific field in packet headers or a derived variable that indicates the proper decoding order of H.264 “frames”, called Network Abstraction Layer (NAL) units. RFC 3984 describes how NAL units are transported in RTP packets, including mechanisms for recovering the decoding order. These mechanisms are only used in the “interleaved” packetization mode of RFC 3984.
The decoding order recovery problem is also present in the transmission of layered media. In this case, the problem is not the recovery of the decoding order within one RTP stream, but rather the recovery of the decoding order considering packets across all layer streams. It is noted that the problem exists regardless of whether one or more of the sessions are using interleaving, i.e., it is present even if all individual layer packets are transmitted in decoding order within their respective RTP sessions.
One technique that has been proposed for recovering the decoding order in layered transmission of audio and video is used in multimedia multicast distribution using layered audio and video compression, and uses the concepts of a Layer Sequence Number (SEQ) and Cross Layer Sequence Number (XSEQ). The SEQ operates in the same way as the sequence number used in RTP, i.e., it is a sequential numbering scheme within the packet stream of a particular layer. The XSEQ, however, is a numbering scheme that runs sequentially according to decoding order and spans all layers. The combination of SEQ and XSEQ allows a receiver to recover the decoding order even in the presence of packet errors. Although the technique is described within the context of scalable coding using H.263 Annex O (SNR and spatial scalability), it can be applied to any layered coding scheme.
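A receiver using such numbering might operate along the following lines. This is a minimal sketch under stated assumptions, not the published technique itself: the names `LayerPacket`, `recover_decoding_order`, and `detect_losses` are hypothetical, sequence number wrap-around is ignored, and packets within a layer are assumed to be numbered in decoding order.

```python
from typing import NamedTuple


class LayerPacket(NamedTuple):
    layer: int    # which layer stream the packet arrived on
    seq: int      # per-layer sequence number (SEQ)
    xseq: int     # cross-layer sequence number in decoding order (XSEQ)
    payload: str


def recover_decoding_order(received):
    """Merge packets from all layer streams into decoding order using XSEQ."""
    return sorted(received, key=lambda p: p.xseq)


def detect_losses(ordered):
    """Return the missing XSEQ values and the number of lost packets per layer."""
    # A gap in XSEQ shows that some packet is missing from the merged stream.
    missing_xseq = []
    for prev, cur in zip(ordered, ordered[1:]):
        missing_xseq.extend(range(prev.xseq + 1, cur.xseq))
    # A gap in SEQ within a layer shows which layer stream lost packets.
    lost_per_layer = {}
    last_seq = {}
    for p in ordered:
        if p.layer in last_seq:
            gap = p.seq - last_seq[p.layer] - 1
            if gap > 0:
                lost_per_layer[p.layer] = lost_per_layer.get(p.layer, 0) + gap
        last_seq[p.layer] = p.seq
    return missing_xseq, lost_per_layer


# Two layers; the enhancement packet with SEQ 1 / XSEQ 3 was lost in transit,
# and the remaining packets arrived out of order across the two sessions.
received = [
    LayerPacket(layer=1, seq=0, xseq=1, payload="E0"),
    LayerPacket(layer=0, seq=0, xseq=0, payload="B0"),
    LayerPacket(layer=0, seq=1, xseq=2, payload="B1"),
    LayerPacket(layer=1, seq=2, xseq=5, payload="E2"),
    LayerPacket(layer=0, seq=2, xseq=4, payload="B2"),
]
ordered = recover_decoding_order(received)
missing_xseq, lost_per_layer = detect_losses(ordered)
```

In this example the XSEQ gap shows that one packet is missing from the merged stream, and the SEQ gap attributes the loss to layer 1, so the base layer packets can still be decoded in order.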
Another technique proposed for addressing the recovery of decoding order across layers is described in Internet-Draft draft-ietf-avt-rtp-svc-08 (Feb. 25, 2008), incorporated herein by reference in its entirety, and referred to in the following as ID-SVC. This technique is concerned with the definition of an RTP payload format for H.264 SVC, the scalable extension to H.264. The process is based on “Cross-Layer Decoding Order Numbers” (CL-DON). CL-DONs extend the concept of DON found in RFC 3984 such that DON values indicated in the base layer (H.264 compliant by design) are interpreted to be cross-layer. Furthermore, for enhancement layer packets, a DONC field is present in the packet header to indicate the cross-layer DON. CL-DONs are similar to the XSEQ numbers discussed above.
A limitation present in ID-SVC is that the CL-DON technique cannot be applied when the base RTP session (the one carrying the base layer) uses the single NAL unit mode or non-interleaved mode of RFC 3984, as there is no provision for placing DONs in the packet headers in these modes. A second limitation is that the CL-DON technique cannot be used when fragmentation is employed in the base layer using the non-interleaved mode (fragmentation unit type A or FU-A packets, in RFC 3984 terminology) as again there is no provision for carrying the DON field in that mode.