This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Advanced Video Coding (AVC), also known as H.264/AVC, is a video coding standard developed by the Joint Video Team (JVT) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). AVC includes the concepts of a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The VCL contains the signal processing functionality of the codec, including mechanisms such as transform, quantization, motion-compensated prediction, and loop filters. A coded picture consists of one or more slices. The NAL encapsulates each slice generated by the VCL into one or more NAL units. A NAL unit comprises a NAL unit header and a NAL unit payload. The NAL unit header contains, among other things, the NAL unit type, which indicates whether the NAL unit contains a coded slice, a coded slice data partition, a sequence or picture parameter set, and so on. A NAL unit stream is simply a concatenation of a number of NAL units. An encoded bitstream according to H.264/AVC or its extensions, e.g. SVC, is either a NAL unit stream or a byte stream formed by prefixing a start code to each NAL unit in a NAL unit stream.
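As a concrete illustration of the byte-stream formation described above, the following sketch prefixes each NAL unit with the three-byte Annex B start code 0x000001 and reads the five-bit NAL unit type from the first header byte; the function names are illustrative:

```python
def nal_stream_to_byte_stream(nal_units):
    """Form an Annex B byte stream by prefixing each NAL unit
    with the three-byte start code 0x000001."""
    start_code = b"\x00\x00\x01"
    return b"".join(start_code + nalu for nalu in nal_units)

def nal_unit_type(nalu):
    """The NAL unit type occupies the five least significant bits
    of the first NAL unit header byte in H.264/AVC."""
    return nalu[0] & 0x1F
```

Annex B of H.264/AVC also permits an additional leading zero byte before the start code; the three-byte form is used here for brevity.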
Scalable Video Coding (SVC) provides scalable video bitstreams. A scalable video bitstream contains a non-scalable base layer and one or more enhancement layers. An enhancement layer may enhance the temporal resolution (i.e. the frame rate), the spatial resolution, or the quality of the video content represented by the lower layer or part thereof. In the SVC extension of AVC, the VCL and NAL concepts were inherited.
Multi-view Video Coding (MVC) is another extension of AVC. An MVC encoder takes input video sequences (called different views) of the same scene captured from multiple cameras and outputs a single bitstream containing all the coded views. MVC also inherited the VCL and NAL concepts.
Real-time Transport Protocol (RTP) is widely used for real-time transport of timed media such as audio and video. In RTP transport, media data is encapsulated into multiple RTP packets. An RTP payload format for RTP transport of AVC video is specified in IETF Request for Comments (RFC) 3984, which is available from www.rfc-editor.org/rfc/rfc3984.txt and the contents of which are incorporated herein by reference. For AVC video transport using RTP, each RTP packet contains one or more NAL units.
IETF RFC 3984 specifies several packetization modes, one of which is an interleaved mode. If the interleaved packetization mode is in use, NAL units from more than one access unit can be packetized into one RTP packet. RFC 3984 also specifies the concept of a decoding order number (DON), which indicates the decoding order of the NAL units conveyed in an RTP stream.
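The decoding-order comparison enabled by DON can be sketched with the don_diff function that RFC 3984 defines over two 16-bit DON values with wraparound; the following is a minimal transcription of those rules (variable names are illustrative):

```python
def don_diff(don_m, don_n):
    """don_diff per RFC 3984: positive when the NAL unit carrying don_n
    follows the NAL unit carrying don_m in decoding order, negative when
    it precedes it, with 16-bit (modulo 65536) wraparound."""
    if don_m == don_n:
        return 0
    if don_m < don_n:
        d = don_n - don_m
        return d if d < 32768 else -(don_m + 65536 - don_n)
    d = don_m - don_n
    return 65536 - don_m + don_n if d >= 32768 else -d
```

For example, don_diff(65535, 0) is 1, reflecting that DON 0 follows DON 65535 in decoding order across the wraparound boundary.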
In the SVC RTP payload format draft, the Internet-Draft draft-wenger-avt-rtp-svc-03 (available from http://www.tools.ietf.org/html/draft-wenger-avt-rtp-svc-03), a new NAL unit type, referred to as a payload content scalability information (PACSI) NAL unit, is specified. The PACSI NAL unit, if present, is the first NAL unit in an aggregation packet, and it is not present in other types of packets. The PACSI NAL unit indicates scalability characteristics that are common to all of the remaining NAL units in the payload, thus making it easier for a media-aware network element (MANE) to decide whether to forward/process/discard the aggregation packet. Senders may create PACSI NAL units. Receivers may ignore PACSI NAL units or use them as hints to enable efficient processing of aggregation packets. When the first aggregation unit of an aggregation packet contains a PACSI NAL unit, there is at least one additional aggregation unit present in the same packet. The RTP header fields are set according to the remaining NAL units in the aggregation packet. When a PACSI NAL unit is included in a multi-time aggregation packet, the decoding order number for the PACSI NAL unit is set to indicate either that the PACSI NAL unit is the first NAL unit in decoding order among the NAL units in the aggregation packet, or that it has a decoding order number identical to that of the first NAL unit in decoding order among the remaining NAL units in the aggregation packet.
Decisions as to which NAL units should be transmitted and/or processed are generally required for several different purposes. For example, in multipoint real-time communication systems, e.g., multiparty video conferencing, the sender(s) may not know the capabilities of all receivers, e.g., when the number of receivers is large or when receivers can join the multipoint session without notification to the sender(s). If possible, the senders should not be limited according to the capabilities of the weakest receiver, as that limits the quality of experience that can be provided to other receivers. Consequently, it would be beneficial if a middlebox, such as a multipoint control unit (MCU) in multimedia conferencing, could efficiently adjust the forwarded streams according to the receiver capabilities.
Another situation in which such decisions should be made involves when a file is played back in a device or with software that is capable of decoding only a subset of the stream, such as the H.264/AVC-compliant base layer of an SVC bitstream or the base view of an MVC bitstream. Only that subset of the NAL units therefore needs to be processed. The video data to be played back by the media player may be in a format according to a file format container or in the format of an RTP stream. In either case, easy access to all the information needed to decide which NAL units are to be processed by the media player is desirable.
The SVC file format draft standard, referred to as MPEG document N8663, supports aggregation of multiple NAL units into one aggregator NAL unit. This is expected to be supported in the future MVC file format as well. Aggregator NAL units can aggregate NAL units both by inclusion, i.e., NAL units contained within them (within the size indicated by their length), and by reference, i.e., NAL units that follow them (within the area indicated by the additional bytes field within them). When the stream is scanned by an AVC file reader, only the included NAL units are seen as “within” the aggregator. This permits, for example, an AVC file reader to skip a whole set of unneeded SVC or MVC NAL units. SVC NAL units refer to the SVC-specific NAL units for which the NAL unit type values are reserved by the AVC specification; MVC NAL units refer to the MVC-specific NAL units for which the NAL unit type values are likewise reserved. Similarly, if AVC NAL units are aggregated by reference, the AVC reader will not skip them, and they remain in-stream for that reader. This aggregation mechanism adds complexity to accessing the information needed to decide which NAL units are to be processed by a media player.
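The reader behavior described above can be sketched as follows, assuming length-prefixed NAL units within a file-format sample and using the value 30 as the aggregator NAL unit type (that value, like the function name, is an assumption for illustration; skipping the aggregator's full length skips the NAL units it aggregates by inclusion, while NAL units aggregated by reference follow the aggregator in-stream and are read normally):

```python
def read_sample_nal_units(sample, length_size=4, aggregator_types=frozenset({30})):
    """Walk length-prefixed NAL units in a file-format sample the way an
    AVC reader would: skipping an aggregator NAL unit as a whole also
    skips every NAL unit aggregated by inclusion, while NAL units
    aggregated by reference remain in-stream and are returned."""
    out, pos = [], 0
    while pos < len(sample):
        size = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        nalu = sample[pos:pos + size]
        pos += size
        if nalu[0] & 0x1F in aggregator_types:
            continue  # skip the aggregator and everything included in it
        out.append(nalu)
    return out
```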
Yet another situation in which such decisions should be made involves when an end-user receiving a scalable or multi-view stream decides to switch the layers or views, respectively, that he or she wants to decode and render. A corresponding request is transmitted via the Session Initiation Protocol (SIP) or the Real-Time Streaming Protocol (RTSP), for example. In response, the recipient of the request, such as a server or a middlebox, is supposed to select the layers or views that are forwarded. Due to inter-layer and inter-view prediction, immediate changes in the transmitted layers or views may not be desirable because (1) the resulting streams may not be standard-compliant, as some inter-layer and inter-view references may not be present in the decoder; (2) some of the transmitted data may not be decodable and hence not useful to the receivers; and (3) the non-decodable data wastes bitrate in the channel and may cause congestion and packet loss as well as increased transmission delay. The transmitter should therefore act on the request starting from the next possible layer-switch or view-switch position.
Additionally, it is noted that redundant pictures provide a mechanism for a system to recover from transmission errors when the corresponding primary coded pictures are damaged. The transmission of redundant pictures is unnecessary, however, if the redundant pictures themselves cannot be correctly decoded, the corresponding primary coded pictures are correctly decodable, or the decoding of redundant pictures is not supported in the receiver. A sender or a middlebox may therefore omit the transmission of redundant pictures or parts thereof in several cases. A first such case involves when the reference pictures for redundant pictures are not correctly decoded. This can be concluded, e.g., from generic NACK feedback or slice loss indication feedback of the RTP Audio-Visual Profile with Feedback (RTP/AVPF). A second case is when a redundant picture is not integral when it arrives at a middlebox, i.e., a slice of a redundant picture is lost in the channel between a sender and a middlebox. This can be concluded in the middlebox, e.g., based on the RTP sequence numbers of incoming packets and the content of the RTP packets preceding and following the lost one. A third case is when a reliable communication protocol is used for transmission, when there is sufficient time for selective retransmission of damaged primary coded pictures, or when network conditions are detected to be loss-free. A fourth such case is when a receiver signals that redundant pictures are not supported, either implicitly via the supported profiles or explicitly with the redundant-pic-cap MIME/SDP parameter, for example.
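The four cases above can be collapsed into a hypothetical middlebox policy; the function and parameter names below are illustrative only and are not part of any specification:

```python
def may_omit_redundant_picture(refs_corrupted, picture_integral,
                               reliable_transport, receiver_supports_redundant):
    """Return True when a sender or middlebox may omit a redundant picture:
    (1) its reference pictures were not correctly decoded,
    (2) the redundant picture arrived incomplete at the middlebox,
    (3) transport is reliable / loss-free (or retransmission suffices), or
    (4) the receiver signals no redundant-picture support
        (e.g., via the redundant-pic-cap MIME/SDP parameter)."""
    return (refs_corrupted
            or not picture_integral
            or reliable_transport
            or not receiver_supports_redundant)
```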
Still another situation in which such decisions should be made involves when bitrate adaptation is required to trim the transmitted bitrate according to the throughput of a bottleneck link, for congestion avoidance, or for adjustment of network or client buffers. In this case, the sender or the middlebox should make a sophisticated decision as to which NAL units are not transmitted. One function of media-aware gateways or RTP mixers (which may be multipoint conference units, gateways between circuit-switched and packet-switched video telephony, PoC servers, IP encapsulators in a DVB-H system, or set-top boxes that forward broadcast transmissions locally to a home wireless network, for example) is to control the bitrate of the forwarded stream according to prevailing downlink network conditions. It is desirable to control the forwarded data rate without extensive processing of the incoming data, i.e., by simply dropping packets or easily identified parts of packets.
When the non-interleaved and interleaved packetization modes of the H.264/AVC and SVC RTP payload formats are used, some of the common characteristics of the NAL units contained in a packet can only be identified when each contained NAL unit is examined. The examination may require partial decoding of the NAL unit. For example, the sub-sequence information SEI message must be decoded in order to find temporal level switching points, and the slice header must be decoded to determine whether a coded slice belongs to a primary coded picture or a redundant coded picture.
Middleboxes should usually drop entire pictures or picture sequences so that the resulting stream remains valid. The interleaved packetization mode of the H.264/AVC RTP payload specification allows encapsulation of practically any NAL units of any access units into the same RTP payload (called an aggregation packet). In particular, it is not required to encapsulate entire coded pictures in one RTP payload; rather, the NAL units of a coded picture can be split into multiple RTP packets. While this liberty is helpful for many applications, it causes the following complications in middlebox operation. First, given an aggregation packet, it is not known to which pictures its NAL units belong before the header of each NAL unit contained in the aggregation packet is parsed. Thus, when the interleaved packetization mode is applied, each aggregation unit header and NAL unit header should be parsed to map the NAL units to the correct pictures. When redundant pictures are present, parsing of slice headers is further required. Second, it may not be possible to identify a characteristic of a NAL unit without the presence of some other NAL units of the same access unit. For example, in order to find out whether a coded slice is part of an access unit that can be randomly accessed, the recovery point SEI message for the access unit must first be received and decoded.
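The per-NAL-unit parsing burden can be illustrated even for the simplest aggregation packet of RFC 3984, the single-time aggregation packet without DON (STAP-A, NAL unit type 24); the interleaved packet types additionally carry DON fields that would also need to be read. This simplified sketch walks the aggregation units, each a 16-bit NAL unit size followed by the NAL unit itself, and must touch every NAL unit header just to learn the contained types:

```python
def parse_stap_payload(payload):
    """Walk the aggregation units of an RFC 3984 STAP-A payload: after the
    one-byte payload header, each aggregation unit is a 16-bit NALU size
    followed by the NAL unit. Returns (nal_unit_type, nal_unit) pairs,
    showing that every contained header must be examined individually."""
    pos, nal_units = 1, []  # skip the STAP-A payload header byte
    while pos + 2 <= len(payload):
        size = int.from_bytes(payload[pos:pos + 2], "big")
        pos += 2
        nalu = payload[pos:pos + size]
        pos += size
        nal_units.append((nalu[0] & 0x1F, nalu))
    return nal_units
```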
Therefore, there is a need to provide easily accessible information in transport packets or file format aggregation NAL units, based on which a network middlebox or a media player can decide which coded data units are to be transmitted and/or processed. U.S. patent application Ser. No. 11/622,430, filed Jan. 11, 2007 and incorporated herein by reference, discloses an indirect aggregator NAL unit for the SVC file format and the RTP payload format to indicate the scalability characteristics of certain NAL units following the indirect aggregator NAL unit. However, characteristics beyond scalability information for SVC were not considered, including whether the coded data units contained in the transport packet are (1) parts of redundant pictures, (2) parts of temporal layer switching points, (3) parts of view random access points, (4) parts of random access points that are not instantaneous decoding refresh (IDR) pictures, and (5) parts of pictures of a certain view identified by a view identifier.