Video coding is a way of transforming a series of video images into a compact digitized bit-stream so that the video images can be transmitted or stored. An encoding device is used to code the video images, with an associated decoding device being available to reconstruct the bit-stream for display and viewing. A general aim is to form the bit-stream so as to be of smaller size than the original video information. This advantageously reduces the capacity required of a transfer network, or storage device, to transmit or store the bit-stream code. To be transmitted, a video bit-stream is generally encapsulated according to a transmission protocol that typically adds headers and check bits. Video streaming mechanisms are widely deployed and used over the Internet network and mobile networks to stream audio/video media over HTTP (HyperText Transfer Protocol) such as 3GPP′s Adaptive HTTP Streaming (AHS), Microsoft's Smooth Streaming or Apple's HTTP live streaming for instance.
Recently, the Moving Picture Experts Group (MPEG) published a new standard to unify and supersede existing streaming solutions over HTTP. This new standard, called “Dynamic adaptive streaming over HTTP (DASH)”, is intended to support a media-streaming model over HTTP based on standard web servers, in which intelligence (i.e. selection of media data to stream and dynamic adaptation of the bit-streams to user choices, network conditions, and client capabilities) relies exclusively on client choices and devices.
In this model, a media presentation is organized in data segments and in a manifest called “Media Presentation Description (MPD)” which represents the organization of timed media data to be presented. In particular, a manifest comprises resource identifiers to use for downloading data segments and provides the context to select and combine those data segments to obtain a valid media presentation. Resource identifiers are typically HTTP-URLs (Uniform Resource Locator), possibly combined with byte ranges. Based on a manifest, a client device determines at any time which media segments are to be downloaded from a media data server according to its needs, its capabilities (e.g. supported codecs, display size, frame rate, level of quality, etc.), and depending on network conditions (e.g. available bandwidth).
It is to be noted that there exist alternative protocols to HTTP, for example the Real-time Transport Protocol (RTP).
In addition, video resolution is continuously increasing, going from standard definition (SD) to high definition (HD), and to ultra-high definition (e.g. 4K2K or 8K4K, that is to say video comprising images of 4,096×2,400 pixels or 7,680×4,320 pixels). However, not all receiving and video decoding devices have resources (e.g. network access bandwidth or CPU (Central Processing Unit)) to access video in full resolution, in particular when video is of ultra-high definition, and not all users need to access such video. In such a context, it is particularly advantageous to provide the ability of accessing only some Regions-of-Interest (ROls) that is to say to access only some spatial sub-parts of a whole video sequence.
A known mechanism to access spatial sub-parts of frames belonging to a video consists in organizing each frame of the video as an arrangement of independently decodable spatial areas generally referred to as tiles. Some video formats such as SVC (Scalable Video Coding) or HEVC (High Efficiency Video Coding) provide support for tile definition. A user-defined ROI may cover one or several contiguous tiles.
Accordingly, for streaming user-selected ROls according to HTTP protocol, it is important to provide encapsulation of timed media data of an encoded video bit-stream in a way that enables spatial access to one or more tiles and that enables combination of accessed tiles.
It is to be recalled that encoded video bit-streams are generally constructed as a set of contiguous temporal samples that correspond to complete frames, the temporal samples being organized as a function of the decoding order. File formats are used to encapsulate and describe such encoded bit-streams.
For the sake of illustration, the International Standard Organization Base Media File Format (ISO BMFF) is a well-known flexible and extensible format that describes encoded timed media data bit-streams either for local storage or transmission via a network or via another bit-stream delivery mechanism. This file format is object-oriented. It is composed of building blocks called boxes that are sequentially or hierarchically organized and that define parameters of the encoded timed media data bit-stream such as timing and structure parameters. According to this file format, the timed media data bit-stream is contained in a data structure referred to as mdat box that is defined in another data structure referred to as track box. The track represents a timed sequence of samples where a sample corresponds to all the data associated with a single timestamp that is to say all the data associated with a single frame or all the data associated with several frames sharing the same timestamp.
For scalable video such as video of the SVC format, the layered media data organization can be efficiently represented by using multiple dependent tracks, each track representing the video at a particular level of scalability. In order to avoid data duplication between tracks, extractors can be used. According to a standard file format, an extractor is a data structure directly included in a bit-stream that enables efficient extraction of network abstraction layer (NAL) units from other bit-streams. For instance, the bit-stream of an enhancement layer track may comprise extractors that reference NAL units from a base layer track. Then later on, when such enhancement layer track is extracted from the file format, extractors must be replaced by the data that they are referencing.
Several strategies can be adopted when using ISO BMFF embedding these mechanisms to describe sub-information and to ease access to this sub-information or to efficiently organize bit-streams into multiple segments.
For example, in the article entitled “Implications of the ISO Base Media File Format on Adaptive HTTP Streaming of H.264/SVC”, the authors, Kofler et al., present three different strategies for organizing a scalable video bit-stream (H264/SVC) for HTTP streaming considering possibilities as well as limitations of the ISO BMFF:
a) a single file containing a particular file header comprising a file type box “ftyp” and a movie box “moov” containing all ISO BMFF metadata (including track definitions), the single file also comprising a single mdat box containing the whole encoded bit-stream. This organization is suitable for local storage but is not adapted to HTTP streaming where a client may only need a part of the whole bit-stream;
b) a single file containing multiple moof/mdat boxes suitable for fragmentation. This format allows for progressive download. The moof box is equivalent to the moov box at fragment level. According to this scheme, using a fragmented media file, the scalable bit-stream is split into multiple dependent tracks representing the video at different scalability levels. Extractors are used to reference NAL units from other tracks. In case a track per tile is used, all addressable tracks have to be prepared in advance and tracks cannot be selected independently. If several tiles are to be displayed, several bit-streams must be decoded and the base layer is decoded several times;
c) multiple segments files, each file being accessible by its own URL and being downloadable independently. Each segment typically consists of a segment type box (styp), which acts as a kind of file header, an optional segment index box (sidx) and one or multiple fragments. Again, each fragment consists of a moof and a mdat box. According to this scheme, using a fragmented media file, each track is stored in its own segment with the associated bit-stream related to one level of scalability. If necessary, extractors are used to reference required bit-stream from dependent tracks. Such a coding scheme is particularly suitable for streaming tracks independently. It is well adapted to the DASH standard but it is not suitable for tile streaming since several bit-streams are to be decoded and thus, one decoder per track is required. Moreover, there is a potential duplication of the base layer's bit-stream when selecting more than one tile.
When applied to spatial tiles, none of these strategies allows efficient access to specific tiles in the context of HTTP streaming. Indeed with existing file format definition, it would still be necessary to access a multiple number of non-continuous byte ranges in an encoded bit-stream or it would result in bit-stream duplication in order to display spatial tiles of several frames corresponding to a given time interval.
To solve these issues, there is provided an efficient data organization and track description scheme suitable for handling spatial tiles in multi-layer video streams, which ensures, whatever track combination is selected by a client application, that the result of the ISO BMFF parsing always leads to a valid video elementary bit-stream for the video decoder, requiring a low description overhead.