There is an increased interest in Hypertext Transfer Protocol (HTTP) streaming of media, in particular video. This has evolved beyond simple progressive download to offer two new features: adaptivity and live content. This is achieved by partitioning the content into multiple segments, or files, each corresponding to a small interval of content, for example 10 seconds. The client is provided with a manifest file, also known as a Media Presentation Description (MPD), which lists the different segments and where to fetch them, and the client fetches them one by one. The split into different segments/files that are fetched via a standard web protocol like HTTP is also said to be cache-friendly, or Content Distribution Network (CDN) friendly, since it does not require any state in the server or cache, in contrast to streaming servers based on protocols like Real Time Streaming Protocol (RTSP).
3GPP has recently standardized a solution for HTTP streaming called Adaptive HTTP Streaming (AHS) in Release 9 of PSS (Packet-switched Streaming Service). An extended version is called 3GP-DASH (Dynamic Adaptive Streaming over HTTP) and is currently being specified in Release 10. The Moving Picture Experts Group (MPEG) is currently standardizing Dynamic Adaptive Streaming over HTTP (DASH) based on 3GPP AHS.
The AHS and DASH solutions, as well as other solutions for HTTP streaming, use two different types of files that are fetched by the client from the server. The first type is a manifest describing the session, and in particular the different variants of the content that are available. In AHS and DASH the manifest is an MPD file, which provides information about the different periods and about the segments of the different representations of the media inside each period. The second type is the media itself, which is contained in media files. In AHS and DASH these are based on the ISO (International Organization for Standardization) file format, and consist of initial segments and media segments.
In order to navigate quickly within media content, it is common to allow trick modes and alternate playout rates such as fast forward or rewind, i.e., to play a representation of the media stream at a higher speed or even backwards. This way a user can visually search through the stream and start normal playback at a desired position.
The simplest method of “fast forward” is to play back a stream faster than its original rate. This method has the drawback of requiring a lot of processing power, as well as an increased download rate if the content is on a remote server. For example, to be able to fast forward at 10× speed, ten times the decoding complexity would be needed. FIG. 1 shows fast forward where the speed is two times the normal speed.
A simplified method of fast forward is to play back only the I-frames or, more precisely, the Random Access Points (RAPs), i.e. the key frames. This reduces the complexity considerably, depending on the distance between the I-frames. By decoding only every second I-frame, an even faster trick mode is possible, but it is not easy to make a trick mode corresponding to a fractional image distance if the video is not encoded using temporal levels. FIG. 2 shows the jumping between I-frames to enable fast forward.
Another drawback of this solution is its high overhead (bandwidth demand). The complete stream (all frames) must be sent to the receiver, which filters and discards the “unwanted” frames (the majority of frames).
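The key-frame selection described above can be sketched as follows. This is a minimal illustration, not a player implementation: the function name `key_frame_positions`, the `step` parameter and the 12-frame I-frame distance are assumptions made for the example. If each selected key frame is shown for one normal frame interval (also an assumption), the resulting speed-up is roughly the I-frame distance multiplied by `step`, which is why only coarse, integer-multiple speeds are reachable this way.

```python
def key_frame_positions(frame_types, step=1):
    """Return the positions of every `step`-th key frame (I-frame).

    frame_types: sequence of picture-type letters in presentation order.
    step:        1 keeps every I-frame, 2 keeps every second one, and so on.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    positions = [i for i, kind in enumerate(frame_types) if kind == "I"]
    return positions[::step]


# Hypothetical stream: one I-frame every 12 frames, three such groups.
frames = list("IBBPBBPBBPBB" * 3)

every_key_frame = key_frame_positions(frames, step=1)
every_second_key_frame = key_frame_positions(frames, step=2)
print(every_key_frame)
print(every_second_key_frame)
```

Note that the receiver still has to obtain the whole stream before it can discard the frames that are not selected, which is the bandwidth overhead mentioned above.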
The frames, also referred to as samples, can be divided into temporal levels. A sample within one temporal level can depend only on samples within the same temporal level or on samples within lower temporal levels. An example of a video codec that supports temporal levels is H.264 (MPEG-4 AVC), which is the state of the art among video coding standards. It is a hybrid codec which takes advantage of eliminating redundancy between pictures (using B and P pictures) in addition to redundancy within pictures.
H.264 supports several ways of restricting dependencies between pictures such that subsets of independent frames can be extracted from the bitstream and decoded without using any of the remaining pictures of the stream. One can for instance extract I frames, which in the case of a fixed Group of Pictures (GOP) structure will appear regularly in the media stream. Other options are to exploit hierarchical B and/or P pictures to extract a temporal level of the media stream.
An example of a temporal scalability coding structure is shown in FIG. 3. In FIG. 3 the pictures (I, P and B) are indexed with their temporal level 0, 1, 2 or 3. It is shown that samples within one temporal level depend only on samples within the same temporal level or on samples within lower temporal levels.
Accordingly, different fast forward (ff) speeds (times normal playout) can be obtained as follows:
ff×8 is achieved by using pictures of level 0,
ff×4 is achieved by using pictures of levels 0 and 1,
ff×2 is achieved by using pictures of levels 0, 1 and 2,
normal playback corresponds to all levels 0, 1, 2 and 3.
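The relation between the highest temporal level used and the fast forward speed listed above can be written as a small calculation. The sketch assumes a dyadic hierarchy in which each additional temporal level doubles the number of pictures, which is consistent with the four speeds listed for FIG. 3; the function name `fast_forward_speed` is hypothetical.

```python
def fast_forward_speed(highest_level_used: int, top_level: int = 3) -> int:
    """Speed-up factor when only temporal levels 0..highest_level_used are played.

    Assumes each temporal level doubles the picture count (dyadic hierarchy).
    """
    if not 0 <= highest_level_used <= top_level:
        raise ValueError("temporal level out of range")
    return 2 ** (top_level - highest_level_used)


# Reproduce the list above for the four levels of FIG. 3.
for level in range(4):
    print(f"levels 0..{level}: x{fast_forward_speed(level)}")
```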
3GPP and MPEG base their HTTP Streaming delivery formats on the 3GP file format and the MP4 file format, respectively, which in turn are based on the ISO base media file format.
The file structure of a 3GP or MP4 file is object-oriented and a file is formed by a series of objects called boxes. The structure of a box is inferred by its type. Some boxes only contain other boxes, whereas most boxes contain data. All data of a file is contained in boxes.
A file can be divided into an initial movie metadata part, contained in a movie box of type ‘moov’, and a number of incremental movie fragments, contained in movie-fragment boxes of type ‘moof’. Each movie fragment extends the movie (multimedia presentation) in time. The movie box and the movie fragment boxes are meta-data boxes containing the information needed by a client to decode and render the media presentation. The actual media data is stored in media-data boxes of type ‘mdat’. All these boxes (‘moov’, ‘moof’, and ‘mdat’) are top-level boxes, i.e. contained by the file only and not by any other boxes.
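As an illustration of the box structure, the following sketch walks the top-level boxes of a byte buffer. It relies on the box header layout of the ISO base media file format: a 32-bit big-endian size followed by a four-character type, where a size of 1 signals a 64-bit size after the type and a size of 0 means the box extends to the end of the file. The names `iter_top_level_boxes` and `make_box`, and the tiny sample buffer, are made up for the example; the boxes in the sample carry dummy payloads, not valid media.

```python
import struct


def iter_top_level_boxes(data: bytes):
    """Yield (box_type, offset, size) for each top-level box in `data`."""
    offset = 0
    while offset + 8 <= len(data):
        size, raw_type = struct.unpack_from(">I4s", data, offset)
        header_len = 8
        if size == 1:
            # A 64-bit "largesize" field follows the type field.
            if offset + 16 > len(data):
                raise ValueError("truncated largesize field")
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header_len = 16
        elif size == 0:
            # The box extends to the end of the file.
            size = len(data) - offset
        if size < header_len:
            raise ValueError("invalid box size")
        yield raw_type.decode("latin-1"), offset, size
        offset += size


def make_box(box_type: bytes, payload: bytes = b"") -> bytes:
    """Build a box with a 32-bit size field (helper for the example only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Dummy file: a movie box, one movie fragment and its media data.
sample = make_box(b"moov", b"\x00" * 4) + make_box(b"moof") + make_box(b"mdat", b"abc")
boxes = list(iter_top_level_boxes(sample))
print(boxes)
```

Only the top level is scanned; a real parser would recurse into container boxes such as ‘moov’ and ‘moof’ to reach the metadata they hold.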
For 3GP-DASH and MPEG DASH, segmented versions of 3GP and MP4 files are used. There are two main types of segments:
Initialization segment: contains a movie box (‘moov’) but no movie fragments.
Media segment: contains one or more movie fragments (‘moof’) and corresponding media data in media-data box(es) (‘mdat’) but no movie box.
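The two segment types can be told apart from the top-level box types alone. The sketch below applies the rule exactly as stated above and nothing more; `classify_segment` is a hypothetical helper, and other boxes that a real segment may carry are simply ignored.

```python
def classify_segment(box_types):
    """Classify a segment from the list of its top-level box types."""
    has_moov = "moov" in box_types
    has_moof = "moof" in box_types
    has_mdat = "mdat" in box_types
    if has_moov and not has_moof:
        return "initialization"
    if has_moof and has_mdat and not has_moov:
        return "media"
    return "unknown"


print(classify_segment(["moov"]))
print(classify_segment(["moof", "mdat", "moof", "mdat"]))
print(classify_segment(["moov", "moof", "mdat"]))
```

The third call corresponds to a complete, unsegmented file with both a movie box and movie fragments, which matches neither segment type.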
As described above for HTTP Streaming, a client first needs an MPD, which includes pointers to relevant initialization and media segments. HTTP streaming is then initialized by a client by downloading an initialization segment (or several, in case parallel representations are used for e.g. audio and video). After that the client continues the HTTP streaming session by downloading media segments as described in the MPD.
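The order of requests in such a session can be sketched as follows. A real MPD is an XML document; here it is replaced by a simplified, hypothetical dictionary with one entry per representation, and the relative URLs are invented for the example. The function only computes the order in which a client might issue its HTTP GET requests: all initialization segments first, then the media segments, interleaved so that parallel representations (e.g. audio and video) progress together.

```python
def download_order(manifest):
    """Return segment URLs in the order a client would request them."""
    representations = manifest["representations"]
    # Initialization segments come first, one per representation.
    urls = [rep["initialization"] for rep in representations]
    # Then media segments, index by index across the representations.
    longest = max((len(rep["media"]) for rep in representations), default=0)
    for index in range(longest):
        for rep in representations:
            if index < len(rep["media"]):
                urls.append(rep["media"][index])
    return urls


# Hypothetical, already-parsed manifest with parallel video and audio.
manifest = {
    "representations": [
        {"initialization": "video/init.mp4",
         "media": ["video/seg1.m4s", "video/seg2.m4s"]},
        {"initialization": "audio/init.mp4",
         "media": ["audio/seg1.m4s", "audio/seg2.m4s"]},
    ]
}

order = download_order(manifest)
for url in order:
    print(url)
```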
Pictures are stored as access units in the file format. By default they appear in decoding order in a bitstream. FIG. 4 shows an example where access units are stored in their default bitstream order in a movie fragment. There are three temporal levels (0, 1 and 2).
FIG. 4 shows access units with different temporal levels in a fragment in decoding order: I(0), P(4), B(2), B(1), B(3), P(8), B(6), B(5), B(7), etc. The numbers in parentheses denote presentation order, i.e. the order the frames are rendered on the screen: I(0), B(1), B(2), B(3), P(4), B(5), B(6), B(7), P(8), etc. I(0), P(4), P(8), . . . , P(36) are in temporal level 0; B(2), B(6), . . . , B(34) are in temporal level 1; B(1), B(3), B(5), B(7), . . . , B(33), B(35) are in temporal level 2.
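The relation between decoding order, presentation order and temporal level for the first nine access units of FIG. 4 can be checked with a few lines of code. The tuples below are transcribed from the description above; the variable names and the `label` helper are chosen for the example.

```python
# (picture type, presentation index, temporal level), in decoding order.
decoding_order = [
    ("I", 0, 0), ("P", 4, 0), ("B", 2, 1), ("B", 1, 2), ("B", 3, 2),
    ("P", 8, 0), ("B", 6, 1), ("B", 5, 2), ("B", 7, 2),
]


def label(access_unit):
    """Format an access unit as in the text, e.g. 'P(4)'."""
    kind, presentation_index, _level = access_unit
    return f"{kind}({presentation_index})"


# Presentation order: sort by the number in parentheses.
presentation_order = [label(au) for au in sorted(decoding_order, key=lambda au: au[1])]

# Group the access units by temporal level, keeping decoding order.
by_level = {}
for au in decoding_order:
    by_level.setdefault(au[2], []).append(label(au))

print(presentation_order)
print(by_level)
```

The grouping shows that the access units of one temporal level are scattered through the fragment when it is stored in decoding order.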
If a client wants to download only the samples belonging to a certain temporal level, it needs to issue several HTTP GET byte-range requests in order to avoid downloading more data than necessary. For instance, if the client wants to download temporal level 0, it needs to download access units I(0), P(4), P(8), etc., corresponding to fast forward ×4. For ff×2, temporal levels 0 and 1 would be needed, i.e. access units I(0), P(4), B(2), P(8), B(6), etc.
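To make the cost of this concrete, the sketch below computes the byte ranges a client would have to request for a given highest temporal level. The access-unit sizes are invented for the example and the offsets simply start at 0; in practice the client would first need to learn the real offsets and sizes from the file metadata. Adjacent access units are merged into one range, and each range is formatted as an HTTP `Range` header value; `byte_ranges` and `range_header` are hypothetical helper names.

```python
def byte_ranges(units, max_level):
    """Merge wanted access units into inclusive (first_byte, last_byte) ranges.

    units: list of (offset, size, temporal_level) in file order.
    """
    ranges = []
    for offset, size, level in units:
        if level > max_level:
            continue
        first, last = offset, offset + size - 1
        if ranges and ranges[-1][1] + 1 == first:
            # Contiguous with the previous wanted unit: extend that range.
            ranges[-1] = (ranges[-1][0], last)
        else:
            ranges.append((first, last))
    return ranges


def range_header(byte_range):
    """Format one range as the value of an HTTP Range header."""
    return f"bytes={byte_range[0]}-{byte_range[1]}"


# Hypothetical sizes (bytes) and levels for I(0), P(4), B(2), B(1), B(3),
# P(8), B(6), B(5), B(7), stored in decoding order.
sizes_and_levels = [(5000, 0), (2000, 0), (1000, 1), (500, 2), (500, 2),
                    (2000, 0), (1000, 1), (500, 2), (500, 2)]
units, offset = [], 0
for size, level in sizes_and_levels:
    units.append((offset, size, level))
    offset += size

level0_headers = [range_header(r) for r in byte_ranges(units, max_level=0)]
level01_headers = [range_header(r) for r in byte_ranges(units, max_level=1)]
print(level0_headers)
print(level01_headers)
```

Even for these nine access units, each reduced-level playout needs two separate requests, and the number of requests grows with the length of the fragment.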