In video streaming and video-on-demand services, network conditions are dynamic, so the end-to-end transmission characteristics between the server and the client may change frequently. For example, the available transmission bitrate may drop. To maintain the continuity of the streaming session and to maximize the Quality of Service, the server should adapt the transmitted stream to the changing transmission conditions. This process is called stream adaptation.
Stream adaptation is either multi-encoding based or transcoding based. In multi-encoding based stream adaptation, the server stores the same video content as a plurality of encoded streams of different forms or with different parameters, and transmission can be switched from one encoded stream to another. In transcoding based stream adaptation, the server contains a transcoder that transcodes a stream into different forms or with different parameters.
To enable switching from one bitstream to another, the switched-to bitstream must contain switching points, so that the client-side decoder can still obtain decoded pictures of acceptable quality after switching. Switching points can be random access points or non-random access points. SP/SI pictures can be used for stream switching at non-random access points, whereas random access points are natural switching points.
Random access refers to the ability of a decoder to start decoding a stream at a point other than the beginning of the stream and to recover an exact or approximate representation of the decoded pictures. Thus, a random access point is a switching point from which decoding of any following coded picture can be initiated.
A random access point and a recovery point characterize a random access operation. All decoded pictures located at or after a recovery point in output order are correct or approximately correct in content. If the random access point is the same as the recovery point, the random access operation is Instantaneous Decoding Refresh (IDR); otherwise, it is Gradual Decoding Refresh (GDR). IDR points in a video stream can be used for fast forward and random access, as well as for error resilience and recovery. IDR is also used in bitrate adaptation by stream switching, especially on the server side.
IDR pictures are coded without reference to any other picture, and no picture that follows an IDR picture in decoding order is coded with reference to any picture that precedes the IDR picture in decoding order. GDR, in contrast, can be implemented using the isolated regions technique described later in this document. The picture at a GDR random access point is called a GDR picture, and the period from the GDR picture to the recovery point, inclusive, is called the GDR period.
Random access points make seek operations possible in locally stored video streams. In video-on-demand or streaming, servers can respond to seek requests by transmitting data starting from the random access point closest to the requested destination of the seek operation. Switching between coded streams of different bitrates is commonly used in unicast streaming over the Internet to match the transmitted bitrate to the expected network throughput and to avoid network congestion. Switching to another stream is possible at a random access point. Furthermore, random access points enable tuning in to a broadcast or multicast. In addition, a random access point can be coded in response to a scene cut in the source sequence or to an intra picture update request.
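As a rough illustration of the seek behavior, the sketch below picks the random access point nearest to a requested sample from a strictly increasing list of random access sample numbers, such as the entries of the Sync Sample Box described below. The function name and the tie-breaking rule (prefer the earlier point) are assumptions for illustration; treating a missing table as "every sample is a random access point" follows the Sync Sample Box semantics.

```python
import bisect
from typing import Optional, Sequence


def closest_random_access_point(target: int,
                                sync_samples: Optional[Sequence[int]]) -> int:
    """Pick the random access sample closest to `target` (hypothetical helper).

    `sync_samples` is a strictly increasing list of random access sample
    numbers; None means no table exists, so every sample is a random access
    point. Ties are broken toward the earlier sample (an assumption).
    """
    if sync_samples is None:
        return target
    if not sync_samples:
        raise ValueError("the stream has no random access points")
    i = bisect.bisect_left(sync_samples, target)
    candidates = []
    if i < len(sync_samples):
        candidates.append(sync_samples[i])      # first point at or after target
    if i > 0:
        candidates.append(sync_samples[i - 1])  # last point before target
    return min(candidates, key=lambda s: (abs(s - target), s))


print(closest_random_access_point(40, [1, 31, 61, 91]))  # prints 31
```

A server that must never skip requested content would instead always take the last point at or before the target; the nearest-point rule simply mirrors the wording above.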
File Format
MPEG-4 Part 12 specifies the ISO (International Organization for Standardization) base media file format. It is designed to contain timed media information for a presentation in a flexible, extensible format that facilitates interchange, management, editing, and presentation of the media. This presentation may be ‘local’ to the system containing the presentation, or may be carried out via a network or other stream delivery mechanism. The file structure is object-oriented: a file can be decomposed into constituent objects, and the structure of each object can be inferred directly from its type. The file format is designed to be independent of any particular network protocol while enabling efficient support for network protocols in general. The ISO base media file format is used as the basis for the MP4 file format (MPEG-4 Part 14) and the AVC (Advanced Video Coding) file format (MPEG-4 Part 15). The AVC file format specifies how AVC content is stored in an ISO base media file format. It is normally used in the context of a specification derived from the ISO base media file format that permits the use of AVC video, such as the MP4 file format.
In the current design of AVC file format, the switching pictures formed by SP/SI pictures are stored in switching tracks, which are tracks separate from the track that is being switched from and the track being switched to. Switching tracks can be identified by the existence of a specific required track reference in that track. A switching picture is an alternative to the sample in the destination track that has exactly the same decoding time.
Each IDR random access point corresponds to a sync sample indicated in the Sync Sample Box. The design of Sync Sample Box is specified in the ISO base media file format as follows:
Definition
Box Type: 'stss'
Container: Sample Table Box ('stbl')
Mandatory: No
Quantity: Zero or one
This box provides a compact marking of the random access points within the stream. The table is arranged in strictly increasing order of sample number. If the sync sample box is not present, every sample is a random access point.
Syntax
aligned(8) class SyncSampleBox extends FullBox('stss', version = 0, 0) {
   unsigned int(32) entry_count;
   int i;
   for (i=0; i < entry_count; i++) {
      unsigned int(32) sample_number;
   }
}
Semantics
version is an integer that specifies the version of the box.
entry_count is an integer that gives the number of entries in the following table. If entry_count is zero, there are no random access points within the stream and the following table is empty.
sample_number gives the numbers of the samples that are random access points in the stream.
Isolated Regions
The isolated regions technique provides an elegant solution for many applications, such as GDR (JVT-C074), error resiliency and recovery (JVT-C073), region-of-interest coding and prioritization, picture-in-picture functionality, and coding of masked video scene transitions (JVT-C075). With GDR based on isolated regions, media channel switching for receivers, bitstream switching for the server, and allowing newcomers to join multicast streaming become as easy as with instantaneous random access, but with a smoother bitrate.
An isolated region can contain any macroblocks of a picture, and a picture can contain zero, one, or more non-overlapping isolated regions. A leftover region is the area of the picture that is not covered by any isolated region. When an isolated region is coded, all predictive coding within the same coded or decoded picture, herein referred to as in-picture prediction, is disabled across its boundaries. A leftover region may be predicted from the isolated regions of the same picture.
A coded isolated region can be decoded without the presence of any other isolated or leftover region of the same coded picture. It may be necessary to decode all isolated regions of a picture before the leftover region. An isolated region contains at least one slice.
Pictures whose isolated regions are predicted from each other are grouped into an isolated-region picture group. An isolated region can be coupled with a corresponding isolated region in each earlier picture within the same isolated-region picture group, and it can be inter-predicted from those corresponding isolated regions. However, inter prediction of an isolated region from other isolated regions is disallowed. In contrast, a leftover region may be inter-predicted from any isolated region. The shape, location, and size of coupled isolated regions may evolve from picture to picture in an isolated-region picture group.
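To make the evolving-region idea concrete, here is a hypothetical GDR pattern: a single isolated region that starts at the left edge of the GDR picture and grows by a fixed number of macroblock columns per picture until it covers the whole picture, which is then the recovery point. The class, its parameters, and the assumption that the GDR picture's isolated region is intra coded are illustrative only. The sketch checks just the inter-prediction rule stated above, and only against the immediately preceding picture of the group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GdrPattern:
    """Hypothetical left-to-right refresh: one isolated region that starts at
    the left edge and grows by `step` macroblock columns per picture."""
    mb_cols: int  # picture width in macroblock columns
    step: int     # columns added to the isolated region per picture

    def region_width(self, k: int) -> int:
        """Isolated-region width in picture k (k = 0 is the GDR picture)."""
        return min(self.mb_cols, (k + 1) * self.step)

    def in_region(self, k: int, col: int) -> bool:
        return 0 <= col < self.region_width(k)

    def recovery_index(self) -> int:
        """First picture whose isolated region covers the whole picture."""
        return -(-self.mb_cols // self.step) - 1  # ceil(mb_cols / step) - 1

    def inter_ref_allowed(self, k: int, col: int, ref_cols: range) -> bool:
        """Can the block at `col` in picture k be predicted from columns
        `ref_cols` of picture k - 1 of the same group?"""
        if not self.in_region(k, col):
            # Leftover region: this sketch places no restriction on it.
            return True
        if k == 0:
            # Assumed intra coded, because decoding starts at the GDR picture.
            return False
        # Isolated region: only the coupled region of the earlier picture.
        return all(self.in_region(k - 1, c) for c in ref_cols)


p = GdrPattern(mb_cols=22, step=4)
print(p.recovery_index())  # prints 5
```

With 22 columns refreshed 4 at a time, the isolated region covers the picture after six pictures (indices 0 to 5), so the GDR period spans those six pictures.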
Coding of isolated regions can be realized in the AVC codec by applying slice groups, and each GDR random access point is characterized by a recovery point Supplemental Enhancement Information (SEI) message. Coding of isolated regions can also be realized in the AVC codec or other standard codecs without slice groups, though the coding efficiency may be lower than with slice groups.
SP/SI Pictures
The AVC coding standard supports SP/SI pictures. In stream switching involving only P-slices, the decoder does not have the correct decoded reference frames required for image reconstruction. Inserting I-slices at regular intervals in the coded sequence to create switching points solves this problem. However, an I-slice is likely to contain much more coded data than a P-slice, so the coded bitrate peaks at each switching point. SP-slices and SI-slices are designed to support switching without the increased bitrate penalty of I-slices.
An SP/SI picture is encoded in such a way that another SP/SI picture using different reference pictures can have exactly the same reconstructed picture. SP/SI pictures can be applied for bitstream switching, splicing, random access, fast forward, fast backward, and error resilience/recovery. For example, assume that there are two bitstreams, bs1 and bs2, of different bitrates, originating from the same video sequence. In bs1, an SP picture (s1) is coded, and another SP picture (s2) is coded at the same location in bs2. In bs1, an additional SP picture (s12) is coded having exactly the same reconstructed picture as s2. s12 and s2 use different reference pictures (from bs1 and bs2, respectively). Thus, switching from bs1 to bs2 can be carried out by transmitting s12 instead of s1 at the switching location. Since s12 has exactly the same reconstruction as s2, reconstructed pictures after switching are error-free. The SP picture s12 is called a switching picture, and it is stored in the switching track in the AVC file format.
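The bs1/bs2 example can be expressed as a transmission schedule. In this minimal sketch, pictures are plain string labels and the helper name is hypothetical. It only shows which pictures the server sends, not how SP pictures are encoded, and it assumes, as the example implies, that bs2 pictures after s2 do not reference bs2 pictures before s2.

```python
from typing import List


def transmit_sequence(bs1: List[str], bs2: List[str], s12: str,
                      switch_at: int) -> List[str]:
    """Pictures sent when switching from bs1 to bs2 at index `switch_at`.

    bs1[switch_at] is s1 and bs2[switch_at] is s2. The server sends s12
    (predicted from bs1 pictures, reconstructing exactly like s2) in place of
    s1, then continues with the pictures of bs2 that follow s2.
    """
    if len(bs1) != len(bs2):
        raise ValueError("bs1 and bs2 must cover the same pictures")
    if not 0 <= switch_at < len(bs1):
        raise IndexError("switching location out of range")
    return bs1[:switch_at] + [s12] + bs2[switch_at + 1:]


bs1 = ["a0", "a1", "s1", "a3", "a4"]
bs2 = ["b0", "b1", "s2", "b3", "b4"]
print(transmit_sequence(bs1, bs2, "s12", 2))  # ['a0', 'a1', 's12', 'b3', 'b4']
```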
Streaming System
As mentioned earlier, in multi-encoding based stream adaptation, the server stores the same video content in a plurality of encoded streams, but only one of the encoded streams is selected for transmission. FIG. 1 depicts a transmitting system 10, which includes a server 20 capable of receiving a plurality of streams from a transcoder, a multi-stream generator, or a storage device 12. As shown, the streaming server 20 comprises a stream selector 22 to select one of the encoded streams 1 to n. The selected encoded stream is divided into packets by a packetizer 24 and coded by a channel coder 26 for transmission. To maintain the continuity of the streaming session and to maximize the Quality of Service, the server generally selects the best possible encoded stream for transmission. When the transmission conditions change, the server may have to increase or reduce the bitrate, for example. Accordingly, the stream selector switches streams by selecting a different encoded stream at a switching point. At the client side, the decoder simply decodes whatever data it receives. Basically, a streaming client device 40 comprises a channel decoder 42, a de-packetizer 44, and a decoder 46 that provides decoded video signals to a display 48, as shown in FIG. 2. In a streaming system that supports client-driven stream adaptation, however, the streaming client device can send a request signal to the server to request switching of the stream. Such a streaming system is shown in FIG. 3, which shows the connection between a streaming server 20 and a streaming client 40 through a network 60.
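The behavior of the stream selector 22 could be sketched as follows. The selection policy (take the highest-bitrate stream that does not exceed a throughput estimate, fall back to the lowest-bitrate stream, and switch only where the target stream has a switching point) and all names are assumptions; the text states only that the server selects the best possible stream and switches at switching points.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class EncodedStream:
    bitrate_kbps: int
    # Picture indices at which this stream can be switched to (hypothetical).
    switching_points: Set[int] = field(default_factory=set)


def best_stream(streams: List[EncodedStream], throughput_kbps: float) -> int:
    """Highest-bitrate stream not above the throughput estimate; if none
    fits, the lowest-bitrate stream."""
    fitting = [i for i, s in enumerate(streams)
               if s.bitrate_kbps <= throughput_kbps]
    if not fitting:
        return min(range(len(streams)), key=lambda i: streams[i].bitrate_kbps)
    return max(fitting, key=lambda i: streams[i].bitrate_kbps)


def select_stream(streams: List[EncodedStream], current: int, picture: int,
                  throughput_kbps: float) -> int:
    """Stream to send `picture` from: switch only at a switching point of
    the target stream, otherwise stay on the current stream."""
    target = best_stream(streams, throughput_kbps)
    if target != current and picture in streams[target].switching_points:
        return target
    return current
```

Deferring the switch until the target stream offers a switching point is what keeps the client-side decoder, which just decodes whatever it receives, from losing its reference pictures.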
Instantaneous/Gradual Decoding Refresh
As mentioned earlier, a random access point is any picture from which decoding can be initiated. When decoding starts at such an access point, all decoded pictures at, or subsequent to, a recovery point are correct or approximately correct in content. It should be noted that the phrase “correct in content” as used in this disclosure means that the decoded slice or picture is exactly the same as when decoding is started from the beginning of the stream, and the phrase “approximately correct in content” means that the decoded slice or picture is approximately the same as when decoding is started from the beginning of the bitstream. As shown in FIG. 4a, the recovery point is the same as the switching point, and the pictures that are correct or approximately correct in content start at the switching point. Such a random access operation is referred to as Instantaneous Decoding Refresh (IDR). IDR random access points contain only I slices or SI slices.
In contrast, a Gradual Decoding Refresh (GDR) random access point can contain any kind of slices (I, P, SI, SP). As shown in FIG. 4b, however, the content in the picture is correct or approximately correct starting from a picture following the switching point in the output order. The pictures between the recovery point and the switching point may be visually annoying or otherwise unacceptable for viewing.
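The IDR/GDR distinction of FIGS. 4a and 4b, and the pictures a client may choose not to display after a GDR switch, can be summarized in a few lines. Positions are assumed to be output-order indices, the function names are hypothetical, and suppressing the pictures before the recovery point is an application choice rather than something the text prescribes.

```python
from typing import List, Tuple


def classify_random_access(rap: int, recovery: int) -> str:
    """IDR if the recovery point coincides with the random access point,
    GDR otherwise (positions in output order)."""
    if recovery < rap:
        raise ValueError("recovery point cannot precede the random access point")
    return "IDR" if recovery == rap else "GDR"


def split_after_access(decoded: List[int],
                       recovery: int) -> Tuple[List[int], List[int]]:
    """Split decoded output positions into pictures that may be incorrect
    (before the recovery point) and pictures that are correct or
    approximately correct in content."""
    unreliable = [p for p in decoded if p < recovery]
    reliable = [p for p in decoded if p >= recovery]
    return unreliable, reliable


print(classify_random_access(10, 14))                  # prints GDR
print(split_after_access([10, 11, 12, 13, 14, 15], 14))  # prints ([10, 11, 12, 13], [14, 15])
```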
Currently, there is no efficient method for signaling GDR switching points in a file format. One example is the AVC file format, where such signaling is important for a server file that contains streaming content with GDR-based video coding to support stream switching. For AVC content stored in the AVC file format, a GDR switching point can only be identified by the presence of a recovery point SEI message in an access unit, as specified in the AVC standard. This method requires checking every AVC access unit for a recovery point SEI message.
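The cost difference can be sketched as follows: IDR switching points are read directly from the Sync Sample Box, whereas GDR switching points require parsing the NAL units of every sample and looking for an SEI message with payloadType 6 (recovery point). The sketch assumes 4-byte NAL unit length fields (lengthSizeMinusOne equal to 3), boxes with 32-bit sizes, and well-formed input; all function names are hypothetical.

```python
import struct
from typing import Iterator, List, Tuple

SEI_NAL_UNIT_TYPE = 6       # nal_unit_type of an SEI NAL unit in H.264/AVC
RECOVERY_POINT_PAYLOAD = 6  # payloadType of the recovery point SEI message


def parse_stss(box: bytes) -> List[int]:
    """Return the sample_number entries of a complete 'stss' box.
    Assumes a 32-bit box size (no largesize) and a version 0 FullBox."""
    _size, box_type = struct.unpack(">I4s", box[:8])
    if box_type != b"stss":
        raise ValueError("not a Sync Sample Box")
    # box[8] is the version, box[9:12] the flags; entry_count follows.
    (entry_count,) = struct.unpack(">I", box[12:16])
    return list(struct.unpack(f">{entry_count}I", box[16:16 + 4 * entry_count]))


def nal_units(sample: bytes, length_size: int = 4) -> Iterator[bytes]:
    """Split an AVC file-format sample into its length-prefixed NAL units."""
    pos = 0
    while pos + length_size <= len(sample):
        n = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        yield sample[pos:pos + n]
        pos += n


def to_rbsp(data: bytes) -> bytes:
    """Drop emulation_prevention_three_byte (0x03 after two zero bytes)."""
    out = bytearray()
    zeros = 0
    for b in data:
        if zeros >= 2 and b == 0x03:
            zeros = 0
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def read_ff_coded(data: bytes, pos: int) -> Tuple[int, int]:
    """Read an SEI payloadType/payloadSize value (0xFF bytes, then a last byte)."""
    value = 0
    while pos < len(data) and data[pos] == 0xFF:
        value += 255
        pos += 1
    if pos >= len(data):
        raise ValueError("truncated SEI message")
    return value + data[pos], pos + 1


def has_recovery_point_sei(sample: bytes, length_size: int = 4) -> bool:
    for nal in nal_units(sample, length_size):
        if not nal or (nal[0] & 0x1F) != SEI_NAL_UNIT_TYPE:
            continue
        data = to_rbsp(nal[1:])  # skip the one-byte NAL unit header
        pos = 0
        # One SEI NAL unit may hold several messages; the last byte is
        # assumed to be the rbsp_trailing_bits byte (0x80).
        while pos + 1 < len(data):
            payload_type, pos = read_ff_coded(data, pos)
            payload_size, pos = read_ff_coded(data, pos)
            if payload_type == RECOVERY_POINT_PAYLOAD:
                return True
            pos += payload_size
    return False


def gdr_sample_numbers(samples: List[bytes], length_size: int = 4) -> List[int]:
    """1-based sample numbers whose access units carry a recovery point SEI."""
    return [i for i, s in enumerate(samples, start=1)
            if has_recovery_point_sei(s, length_size)]
```

`parse_stss` reads one compact table, while `gdr_sample_numbers` must open and parse every sample in the track, which is the inefficiency described above.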