Video coding is a way of transforming a series of video images into a compact digitized bit-stream so that the video images can be transmitted or stored. An encoding device is used to code the video images, with an associated decoding device being available to reconstruct the bit-stream for display and viewing. A general aim is to form the bit-stream so as to be of smaller size than the original video information. This advantageously reduces the capacity required of a transfer network, or storage device, to transmit or store the bit-stream code. To be transmitted, a video bit-stream is generally encapsulated according to a transmission protocol that typically adds headers and check bits. Video streaming mechanisms are widely deployed and used over the Internet and mobile networks to stream audio/video media over HTTP (HyperText Transfer Protocol), for instance 3GPP's Adaptive HTTP Streaming (AHS), Microsoft's Smooth Streaming, or Apple's HTTP Live Streaming.
Recently, the Moving Picture Experts Group (MPEG) published a new standard to unify and supersede existing streaming solutions over HTTP. This new standard, called “Dynamic adaptive streaming over HTTP (DASH)”, is intended to support a media-streaming model over HTTP based on standard web servers, in which intelligence (i.e. selection of media data to stream and dynamic adaptation of the bit-streams to user choices, network conditions, and client capabilities) relies exclusively on client choices and devices.
In this model, a media presentation is organized in data segments and in a manifest called “Media Presentation Description (MPD)” which represents the organization of timed media data to be presented. In particular, a manifest comprises resource identifiers to use for downloading data segments and provides the context to select and combine those data segments to obtain a valid media presentation. Resource identifiers are typically HTTP-URLs (Uniform Resource Locator), possibly combined with byte ranges. Based on a manifest, a client device determines at any time which media segments are to be downloaded from a media data server according to its needs, its capabilities (e.g. supported codecs, display size, frame rate, level of quality, etc.), and depending on network conditions (e.g. available bandwidth).
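For the sake of illustration, the client-side selection described above can be sketched as follows. This is a minimal sketch and not the selection logic mandated by any standard: the `Representation` class, its fields, and the `select_representation` function are hypothetical names introduced here, and the policy (highest bit rate that fits the measured bandwidth, falling back to the lowest one) is only one plausible choice.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Representation:
    """Hypothetical, simplified view of one alternative listed in a manifest."""
    rep_id: str
    bandwidth_bps: int  # bit rate announced for this alternative
    width: int
    height: int
    codec: str


def select_representation(reps: Iterable[Representation],
                          available_bps: int,
                          max_width: int,
                          max_height: int,
                          supported_codecs: Set[str]) -> Optional[Representation]:
    """Pick an alternative matching client capabilities and network conditions."""
    # Keep only what the device can decode and display.
    candidates = [r for r in reps
                  if r.codec in supported_codecs
                  and r.width <= max_width
                  and r.height <= max_height]
    if not candidates:
        return None
    # Prefer the best quality that fits the measured bandwidth.
    affordable = [r for r in candidates if r.bandwidth_bps <= available_bps]
    if affordable:
        return max(affordable, key=lambda r: r.bandwidth_bps)
    # Otherwise fall back to the least demanding alternative.
    return min(candidates, key=lambda r: r.bandwidth_bps)


manifest = [
    Representation("sd", 1_000_000, 720, 576, "hvc1"),
    Representation("hd", 4_000_000, 1920, 1080, "hvc1"),
    Representation("uhd", 16_000_000, 3840, 2160, "hvc1"),
]
choice = select_representation(manifest, 5_000_000, 1920, 1080, {"hvc1"})
```

In this sketch the decision is re-evaluated by the client whenever its bandwidth estimate changes, which reflects the model in which the intelligence resides in the client rather than in the server.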
It is to be noted that there exist alternative protocols to HTTP, for example the Real-time Transport Protocol (RTP).
In addition, video resolution is continuously increasing, going from standard definition (SD) to high definition (HD), and to ultra-high definition (e.g. 4K2K or 8K4K, that is to say video comprising images of 4,096×2,400 pixels or 7,680×4,320 pixels). However, not all receiving and video decoding devices have the resources (e.g. network access bandwidth or CPU (Central Processing Unit)) to access video in full resolution, in particular when video is of ultra-high definition, and not all users need to access such video. In such a context, it is particularly advantageous to provide the ability to access and extract only some parts of the video bit-stream, that is to say, for example, to access only some scalability layers, views, or spatial sub-parts of a whole video sequence.
A known mechanism to access scalability layers, views, or spatial sub-parts of frames belonging to a video consists in organizing each frame of the video as an arrangement of layers, potentially with coding dependencies. Some video formats such as HEVC (High Efficiency Video Coding) provide support for temporal, SNR (quality), and spatial scalability layers, for multiple views and/or for tile encoding. For example, a user-defined region of interest (ROI) may cover one or several contiguous tiles. In the case of multi-view, a user may prefer stereo to single view. In the case of scalability, the appropriate layer can be selected depending on the user's device in terms of screen size or processing power, for example.
To make possible the selection, extraction, and transmission of only the relevant parts of the video bit-stream (i.e. a sub-bit-stream), the organization of the video bit-stream (and more generally the organization of media data that may comprise video but also audio, metadata, subtitles, and the like) has to be exposed to media players. This organization is expressed as a list of operation points.
An operation point, also referred to as an operating point, represents a portion or a bit-stream subset of a Layered HEVC bit-stream which can be obtained by extracting a bit-stream portion consisting of all the data needed to decode this particular bit-stream subset and that can be decoded independently of other operation points. As a consequence, an operation point is a set of output layers associated with a range of temporal identifiers having values varying from zero to a selected maximum value, inclusive. For the sake of illustration, two temporal identifier values (0 and 1) corresponding to frame-rates of 30 Hz and 60 Hz are illustrated in FIGS. 5a and 5b. 
FIG. 5, comprising FIGS. 5a and 5b, illustrates examples of a layer configuration where a non-output layer is involved.
More precisely, FIG. 5a illustrates an example of the relation between several representations of a video sequence. These representations comprise representations having different temporal resolutions (i.e. frame rate of 30 Hz and of 60 Hz) and for each of the temporal resolutions, the representations comprise different views (i.e. left, right, and common).
As represented, the common view is directly derivable from the left and right views and the common view with a frame rate of 30 Hz is directly derivable from the common view with a frame rate of 60 Hz.
The representations also comprise non-scalable representations of the full views according to each of the temporal resolutions.
As illustrated in FIG. 5b, the representations illustrated in FIG. 5a can be encoded according to three layers corresponding to the three possible views (i.e. left, right, and common) and according to the two frame rates (i.e. 30 Hz and 60 Hz).
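For the sake of illustration, an operation point for such a three-layer configuration can be modelled as a set of output layers, the layers needed to decode them, and a maximum temporal identifier. The following is a simplified sketch only: the class and function names are hypothetical, the layer numbering (0 for common, 1 for left, 2 for right) is an assumption, and an actual HEVC sub-bit-stream extraction process involves further rules (e.g. for parameter sets) that are ignored here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class NalUnit:
    """Hypothetical, reduced view of a NAL unit header."""
    layer_id: int
    temporal_id: int


@dataclass(frozen=True)
class OperationPoint:
    output_layers: FrozenSet[int]    # layers actually displayed
    required_layers: FrozenSet[int]  # output layers plus their dependencies
    max_temporal_id: int             # temporal identifiers 0..max are kept


def extract_sub_bitstream(nal_units: List[NalUnit],
                          op: OperationPoint) -> List[NalUnit]:
    """Keep the NAL units needed to decode the given operation point."""
    return [n for n in nal_units
            if n.layer_id in op.required_layers
            and n.temporal_id <= op.max_temporal_id]


COMMON, LEFT, RIGHT = 0, 1, 2  # assumed layer numbering

# One NAL unit per layer and per temporal identifier (0: 30 Hz, 1: 60 Hz).
stream = [NalUnit(layer, tid) for tid in (0, 1) for layer in (COMMON, LEFT, RIGHT)]

# Left view at 30 Hz: the common layer is needed for decoding but not output.
left_30hz = OperationPoint(frozenset({LEFT}), frozenset({COMMON, LEFT}), 0)
# Stereo at 60 Hz: both views are output, all temporal identifiers are kept.
stereo_60hz = OperationPoint(frozenset({LEFT, RIGHT}),
                             frozenset({COMMON, LEFT, RIGHT}), 1)

left_units = extract_sub_bitstream(stream, left_30hz)
stereo_units = extract_sub_bitstream(stream, stereo_60hz)
```

The first operation point shows the role of a non-output layer: the common layer is extracted because the left view depends on it, although it is not displayed.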
Accordingly, for streaming video sequences or user-selected data according to the HTTP protocol, it is important to provide encapsulation of timed media data of an encoded video bit-stream in a way that enables access to the selected data and their coding dependencies, so as to transmit the minimum amount of data enabling the reconstruction, decoding, and display of the user-selected data.
A typical usage of adaptive HTTP streaming is bit-stream splicing.
FIG. 6, comprising FIGS. 6a, 6b, and 6c, illustrates an example of a bit-stream splicing application.
As illustrated in FIG. 6a, bit-stream splicing may consist in switching from a low resolution bit-stream (SD) to a high resolution bit-stream (HD). According to another example illustrated in FIG. 6b, bit-stream splicing may consist in switching from a live bit-stream to an on-demand bit-stream for replay.
In such cases, the spliced bit-stream denoted 603 results from the combination of the two alternative bit-streams denoted 601 and 602 having their organization described in their respective initialization segments (i.e. ‘moov’ and ‘trak’ boxes when encapsulated in accordance with the International Organization for Standardization Base Media File Format).
In the case according to which the two streams 601 and 602 have different operation points, the spliced bit-stream 603 should contain the concatenation of the two different operation point lists. This may arise, for example, when storing spliced HEVC bit-streams with different VPS (video parameter set).
However, it is not possible to dynamically associate the concatenation of two different operation point lists with a spliced bit-stream according to the current encapsulation format of L-HEVC. A similar limitation exists with video bit-streams having their layer organization, their scalability type or their profile, tier or level (actually any parameter in the operation point description) varying along time: the descriptive metadata (for example the hierarchy of ‘trak’ file format boxes) cannot be dynamically updated.
Such problems may be encountered, for example, when streaming a long-running fragmented MP4 file with changes in layer configuration. In such a case, a content producer defines two scalable layers for two classes of devices (e.g. SD, HD). If, after a period of time, a new class of device (e.g. UHD) is available, it should be possible to reuse the two tracks and to add an extra layer. In a configuration where movie fragments are used, the operation of removing fragments should not lead to information loss. If the layer configuration changes during streaming, this should be captured.
Another example is directed to the concatenation of files obeying the same profile constraints. Such a concatenation may follow different strategies:
a) samples in tracks containing the base layers from both files could simply be concatenated, leading to multiple VPS/SPS/PPS in different sample entries or in larger hvcC NALU arrays;
b) samples from non-base layers could be concatenated by inspecting tracks one by one and concatenating them with samples from tracks corresponding to layers with similar constraints, if any;
c) samples from the non-base layer tracks of the second file could be added to new sets of tracks, shifted in time to maintain synchronization with the concatenated base track.
The latter approach is complex and might not be preferable. In such a scenario, it might be useful to allow for track-layer configuration changes.
These limitations result from the fact that the current descriptor for operation points is declared as one single instance for the whole MP4 file.
It is to be recalled that encoded video bit-streams are organized into NAL (Network Abstraction Layer) units which are generally constructed as a set of contiguous temporal samples that correspond to complete frames, the temporal samples being organized as a function of the decoding order. File formats are used to encapsulate and describe such encoded bit-streams.
For the sake of illustration, the International Organization for Standardization Base Media File Format (ISO BMFF) is a well-known flexible and extensible format that describes encoded timed media data bit-streams either for local storage or transmission via a network or via another bit-stream delivery mechanism. This file format is object-oriented. It is composed of building blocks called boxes that are sequentially or hierarchically organized and that define parameters of the encoded timed media data bit-stream such as timing and structure parameters. According to this file format, the timed media data bit-stream is contained in a data structure referred to as the mdat box and is described by another data structure referred to as the track box. The track represents a timed sequence of samples where a sample corresponds to all the data associated with a single timestamp, that is to say, all the data associated with a single frame or all the data associated with several frames sharing the same timestamp.
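For the sake of illustration, the sequential and hierarchical organization of boxes can be walked with a few lines of code. The sketch below relies on the generic box header (a 32-bit big-endian size followed by a four-character type, a size of 1 announcing a 64-bit size and a size of 0 meaning that the box runs to the end of the data). The function names `make_box` and `iter_boxes` are hypothetical, and the payloads in the example are placeholders rather than valid box contents.

```python
import struct
from typing import Iterator, Optional, Tuple


def make_box(box_type: bytes, payload: bytes = b"") -> bytes:
    """Build a box with a 32-bit size field (helper for the example only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def iter_boxes(data: bytes, offset: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, payload_start, box_end) for each box in data[offset:end]."""
    if end is None:
        end = len(data)
    while offset + 8 <= end:
        size, raw_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:  # a 64-bit size follows the type field
            if offset + 16 > end:
                raise ValueError("truncated 64-bit size field")
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:  # box runs to the end of the enclosing data
            size = end - offset
        if size < header or offset + size > end:
            raise ValueError(f"malformed box at offset {offset}")
        yield raw_type.decode("latin-1"), offset + header, offset + size
        offset += size


# Placeholder file: a file type box, a movie box holding one track box,
# and a media data box.
movie = make_box(b"moov", make_box(b"trak"))
data = make_box(b"ftyp", b"isom") + movie + make_box(b"mdat", b"\x00" * 4)

top_level = list(iter_boxes(data))
moov_start, moov_end = next((s, e) for t, s, e in top_level if t == "moov")
nested = [t for t, _, _ in iter_boxes(data, moov_start, moov_end)]
```

Calling `iter_boxes` again on the payload range of a container box, as done for the movie box here, is what exposes the hierarchical part of the organization.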
For scalable video such as video of the multi-layer HEVC format, the layered media data organization can be efficiently represented by using multiple dependent tracks, each track representing the video at a particular level of scalability. In order to avoid data duplication between tracks, extractors can be used. According to a standard file format, an extractor is a specific kind of network abstraction layer (NAL) data structure directly included in a bit-stream that enables efficient extraction of other NAL units from other bit-streams. For instance, the bit-stream of an enhancement layer track may comprise extractors that reference NAL units from a base layer track. Later on, when such an enhancement layer track is extracted from the file format, the extractors must be replaced by the data that they are referencing.
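For the sake of illustration, the replacement of extractors by the data they reference can be sketched as follows. This is a deliberately simplified model and not the extractor syntax of the file format: the `Extractor` class, its field names, and the assumption that an extractor always points to the sample having the same index in the referenced track are introduced here for the example only.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Extractor:
    """Hypothetical, simplified extractor: a byte range in another track."""
    ref_track: str    # name of the referenced track
    data_offset: int  # first byte to copy within the referenced sample
    data_length: int  # number of bytes to copy


# A sample is modelled as a list of raw NAL unit payloads and extractors.
Sample = List[Union[bytes, Extractor]]


def resolve_sample(sample_index: int,
                   track: List[Sample],
                   referenced: Dict[str, List[bytes]]) -> bytes:
    """Rebuild one sample, replacing each extractor by the referenced bytes."""
    out = bytearray()
    for item in track[sample_index]:
        if isinstance(item, Extractor):
            source = referenced[item.ref_track][sample_index]
            chunk = source[item.data_offset:item.data_offset + item.data_length]
            if len(chunk) != item.data_length:
                raise ValueError("extractor points outside the referenced sample")
            out += chunk
        else:
            out += item
    return bytes(out)


base_samples = {"base": [b"BASE0", b"BASE1"]}
enhancement: List[Sample] = [
    [Extractor("base", 0, 5), b"ENH0"],
    [Extractor("base", 0, 5), b"ENH1"],
]
rebuilt = [resolve_sample(i, enhancement, base_samples) for i in range(2)]
```

In this model the base layer bytes are stored once, in the base track, and are only copied into the enhancement layer data when the latter is rebuilt for decoding.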
Several strategies can be adopted when using ISO BMFF embedding these mechanisms to describe sub-information and to ease access to this sub-information or to efficiently organize bit-streams into multiple segments.
For example, in the article entitled “Implications of the ISO Base Media File Format on Adaptive HTTP Streaming of H.264/SVC”, the authors, Kofler et al., present three different strategies for organizing a scalable video bit-stream (H.264/SVC) for HTTP streaming considering possibilities as well as limitations of the ISO BMFF:
a) a single file containing a particular file header comprising a file type box “ftyp” and a movie box “moov” containing all ISO BMFF metadata (including track definitions), the single file also comprising a single mdat box containing the whole encoded bit-stream. This organization is suitable for local storage but is not adapted to HTTP streaming where a client may only need a part of the whole bit-stream. Such an organization is preferably used for a file used as an initialization file when the bit-stream is fragmented into multiple segments. This initialization file is followed by one other single file whose organization is defined in b), this initialization file gathering information about all the segments;
b) a single file containing multiple moof/mdat boxes suitable for fragmentation, each moof/mdat pair relating to one of the multiple segments of the bit-stream. This format allows for progressive download. In more detail, the moof box is equivalent to the moov box at fragment level. According to this scheme, using a fragmented media file, the scalable bit-stream can be split into multiple dependent tracks representing the video at different scalability levels. Extractors are specific NAL units used to reference NAL units from other track(s). In case a track per tile is used, all addressable tracks have to be prepared in advance and tracks cannot be selected independently. If several tiles are to be displayed, several bit-streams must be decoded and the base layer is decoded several times. The last organization described in c) is particularly suitable for selecting each track independently;
c) multiple segment files, each file being accessible by its own URL and being downloadable independently. Each file is related to one fragment and the multiple segment files are preferably preceded by a dedicated initialization file. Each segment typically consists of a segment type box (styp), which acts as a kind of file header, an optional segment index box (sidx) and one or multiple fragments. Again, each fragment consists of a moof and an mdat box. According to this scheme, using a fragmented media file, each track is stored in its own segment with the associated bit-stream related to one level of scalability. If necessary, extractors are used to reference the required bit-stream from dependent tracks. Such a coding scheme is particularly suitable for streaming tracks independently. It is well adapted to the DASH standard but it is not suitable for tile streaming since several bit-streams are to be decoded and thus, one decoder per track is required. Moreover, there is a potential duplication of the base layer's bit-stream when selecting more than one tile.
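For the sake of illustration, the segment layout described in c) can be checked from the sequence of top-level box types of a segment file. The sketch below is an assumption-laden simplification: the function name `count_fragments` is hypothetical, both the styp and sidx boxes are treated as optional, and each fragment is assumed to be exactly one moof box followed by one mdat box, whereas real segments may contain further boxes.

```python
from typing import List


def count_fragments(box_types: List[str]) -> int:
    """Return the number of moof/mdat fragments in a simplified media segment."""
    i = 0
    # Optional segment type box acting as a kind of file header.
    if i < len(box_types) and box_types[i] == "styp":
        i += 1
    # Optional segment index box.
    if i < len(box_types) and box_types[i] == "sidx":
        i += 1
    fragments = 0
    while i < len(box_types):
        # Each fragment is one moof box immediately followed by one mdat box.
        if box_types[i:i + 2] != ["moof", "mdat"]:
            raise ValueError(f"unexpected box sequence at position {i}")
        fragments += 1
        i += 2
    if fragments == 0:
        raise ValueError("a media segment needs at least one fragment")
    return fragments


segment = ["styp", "sidx", "moof", "mdat", "moof", "mdat"]
n_fragments = count_fragments(segment)
```

An initialization file organized as in a), which starts with ftyp and moov boxes, is rejected by this check, which reflects the separation between the initialization file and the media segments that follow it.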
The definition of the above-mentioned boxes, as well as the definition of the sub-boxes included in those boxes, made with reference to the document known as “Draft text of ISO/IEC DIS 14496-15 4th edition, ISO/IEC JTC1/SC29/WG11, W15182, April 2015, Geneva, Switzerland” (named “w15182” below), may lead to a complex and less efficient organization of the ISO BMFF metadata.
Moreover, the tile tracks are not properly defined for Layered HEVC, limiting their usage.
To solve these issues and, in particular, to make it possible to dynamically set descriptors for operation points, there is provided an efficient data organization and track description scheme suitable especially for handling spatial tiles, scalable layers and multiple views in Layered HEVC for multi-layer video streams. This ensures that the result of the ISO BMFF parsing is more efficient and adapted to Layered HEVC.