Video coding is a way of transforming a series of video images into a compact digitized bit-stream so that the video images can be transmitted or stored. An encoding device is used to code the video images, and an associated decoding device reconstructs the video images from the bit-stream for display and viewing. A general aim is to make the bit-stream smaller than the original video information. This advantageously reduces the capacity required of a transfer network, or of a storage device, to transmit or store the bit-stream. To be transmitted, a video bit-stream is generally encapsulated according to a transmission protocol that typically adds headers and check bits.
Streaming media data over a communication network typically means that the data representing a media presentation are provided by a host computer, referred to as a server, to a playback device, referred to as a client device, over the communication network. The client device is generally a media playback computer implemented as any of a variety of conventional computing devices, such as a desktop Personal Computer (PC), a tablet PC, a notebook or portable computer, a cellular telephone, a wireless handheld device, a personal digital assistant (PDA), a gaming console, etc. The client device typically renders a streamed content as it is received from the host (rather than waiting for an entire file to be delivered).
A media presentation generally comprises several media components such as audio, video, text, and/or subtitles that can be sent from a server to a client device to be played jointly by the client device. Those media components are downloaded by the client device from the server. A common practice is to provide access to several versions of the same media component so that the client device can select one version as a function of its characteristics (e.g. resolution, computing power, and bandwidth).
Recently, the Moving Picture Experts Group (MPEG) published a new standard to unify and supersede existing streaming solutions over HTTP (HyperText Transfer Protocol). This new standard, called “Dynamic Adaptive Streaming over HTTP” (DASH), is intended to support a media-streaming model over HTTP based on standard web servers, in which the intelligence (i.e. the selection of the media data to stream and the dynamic adaptation of the bit-streams to user choices, network conditions, and client capabilities) resides exclusively in the client device.
In this model, a media presentation is organized in data segments and in a manifest called a “Media Presentation Description” (MPD) that represents the organization of the timed media data to be presented. In particular, the manifest comprises resource identifiers to use for downloading the data segments and provides the context to select and combine those data segments to obtain a valid media presentation. Resource identifiers are typically HTTP URLs (Uniform Resource Locators), possibly combined with byte ranges. Based on the manifest, a client device determines at any time which media segments are to be downloaded from a media data server according to its needs, its capabilities (e.g. supported codecs, display size, frame rate, level of quality, etc.), and the network conditions (e.g. available bandwidth). In the context of the DASH standard, this manifest conforms to the Extensible Markup Language (XML) standard.
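As a minimal illustration of how a resource identifier combined with a byte range is turned into a request, the following Python sketch (using a hypothetical URL) builds an HTTP GET carrying a `Range` header, as a DASH client does when a manifest maps a media segment to a byte range of a larger file:

```python
import urllib.request

def range_request(url, first_byte, last_byte):
    """Build an HTTP GET restricted to an inclusive byte range, as used
    when a manifest maps a media segment to a byte range of a resource."""
    req = urllib.request.Request(url)
    req.add_header("Range", f"bytes={first_byte}-{last_byte}")
    return req

# Request the first kilobyte of a (hypothetical) media file:
req = range_request("http://example.com/media.mp4", 0, 1023)
# req.get_header("Range") == "bytes=0-1023"
```

A server that supports range requests then answers with a 206 (Partial Content) response containing only the requested segment bytes.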
Before a client device requests media data, it receives an MPD file so as to obtain a description of each accessible media segment and thus request only the required media segments. In other words, by analyzing a received MPD file, a client device can obtain items of information about the accessible media segments of a media presentation, comprising, in particular, the addresses (e.g. HTTP addresses) of the segments. It can therefore decide which media segments are to be downloaded (via HTTP requests), obtain these media segments, and play them after reception and decoding.
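This analysis step can be sketched as follows. The manifest below is a deliberately simplified, hypothetical example: a real DASH MPD uses the `urn:mpeg:dash` schema namespace and a richer Period/AdaptationSet/Representation hierarchy, but the principle of extracting segment addresses is the same:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified manifest for illustration only.
MPD_EXAMPLE = """
<MPD>
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="video-hd" bandwidth="2000000">
        <SegmentList>
          <SegmentURL media="http://example.com/seg1.m4s"/>
          <SegmentURL media="http://example.com/seg2.m4s"/>
        </SegmentList>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>
"""

def segment_urls(mpd_text):
    """Return the media segment URLs declared in the manifest."""
    root = ET.fromstring(mpd_text)
    return [el.get("media") for el in root.iter("SegmentURL")]

urls = segment_urls(MPD_EXAMPLE)
# urls == ["http://example.com/seg1.m4s", "http://example.com/seg2.m4s"]
```

The client would then issue one HTTP GET per selected URL and feed the received segments to its decoder.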
In addition, the DASH standard proposes to split each media component into media sub-components corresponding to small periods of time. This time decomposition is described in the MPD file. Accordingly, the MPD file provides links between HTTP addresses (or URLs) and compact descriptions of each media segment over small periods of time, allowing a client device to download the desired media segments of the media presentation over the desired periods of time.
Video resolution is continuously increasing, going from standard definition (SD) to high definition (HD), and on to ultra-high definition (e.g. 4K2K or 8K4K). However, not all receiving and video-decoding devices have the resources (e.g. network access bandwidth or CPU (Central Processing Unit) power) to access video at full resolution, and not all users need to access such video. It is therefore particularly advantageous to provide the ability to access only certain Regions of Interest (ROIs), that is to say, only certain spatial sub-parts of a whole video sequence.
A known mechanism to access spatial sub-parts of frames belonging to a video consists in organizing each frame of the video as an arrangement of independently decodable spatial areas generally referred to as tiles. Some video formats such as SVC (Scalable Video Coding) or HEVC (High Efficiency Video Coding) provide support for tile definition. A user-defined ROI may cover one or several contiguous tiles.
Accordingly, for streaming user-selected ROIs according to HTTP protocol, it is important to provide encapsulation of timed media data of an encoded video bit-stream in a way that enables spatial access to one or more tiles and that enables combination of accessed tiles.
It is to be recalled that encoded video bit-streams are generally constructed as a set of contiguous temporal samples that correspond to complete frames, the temporal samples being organized as a function of the decoding order. File formats are used to encapsulate and describe such encoded bit-streams.
For the sake of illustration, the ISO Base Media File Format (ISO BMFF), standardized by the International Organization for Standardization, is a well-known flexible and extensible format that describes encoded timed media data bit-streams either for local storage or for transmission via a network or another bit-stream delivery mechanism. This file format is object-oriented: it is composed of building blocks called boxes that are sequentially or hierarchically organized and that define parameters of the encoded timed media data bit-stream, such as timing and structure parameters.
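Each box starts with a 32-bit big-endian size followed by a four-character type. A minimal sketch of scanning a sequence of top-level boxes (ignoring, for brevity, the 64-bit `largesize` and size-zero special cases that the full specification defines) could look like this:

```python
import struct

def iter_boxes(data, offset=0, end=None):
    """Yield (type, payload_offset, payload_size) for each ISO BMFF box
    found sequentially in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, btype = struct.unpack_from(">I4s", data, offset)
        # size == 1 (64-bit largesize follows) and size == 0 (box extends
        # to end of file) are omitted here for brevity.
        yield btype.decode("ascii"), offset + 8, size - 8
        offset += size

# A hand-built two-box example: an empty 'free' box followed by an
# 'mdat' box with an 8-byte payload.
sample = (struct.pack(">I4s", 8, b"free")
          + struct.pack(">I4s", 16, b"mdat") + b"\x00" * 8)
boxes = [(t, s) for t, _, s in iter_boxes(sample)]
# boxes == [("free", 0), ("mdat", 8)]
```

Container boxes (e.g. `moov`) hold further boxes in their payload, which is what makes the format hierarchical as well as sequential.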
A solution for describing tiles in the ISO BMFF standard consists in encapsulating each tile into a particular track and in using the track's transformation matrix to signal tile positions. A natural approach using the DASH standard would consist in describing each track in the manifest as independent media content. However, since the current MPD definition does not allow tiled timed media data to be described, there is no way to signal in the MPD that each track is a sub-part of the same video.
Therefore, in practice, a client device would have to download a first initialization segment (in addition to the manifest) in order to be in a position to determine that each video component described in the MPD is a sub-part of a tiled video (via the track and matrix definitions, e.g. in the boxes known as moov/trak/tkhd). Next, the client device would have to download, at a minimum, the beginning of the first media data segment of each video component to retrieve the association between tile locations and video components (e.g. via the boxes known as moof/traf/tfhd). Downloading this initialization information leads to delays and additional HTTP round trips.
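The kind of digging this forces on the client can be sketched as a depth-first search along a box path such as moov/trak/tkhd. The nesting below is a hand-built toy structure, not a real initialization segment, and the sketch again ignores the 64-bit size variants:

```python
import struct

def find_box(data, path, offset=0, end=None):
    """Depth-first search for a box along a '/'-separated path such as
    'moov/trak/tkhd'; returns (payload_offset, payload_size) or None."""
    end = len(data) if end is None else end
    head, _, rest = path.partition("/")
    while offset + 8 <= end:
        size, btype = struct.unpack_from(">I4s", data, offset)
        if btype.decode("ascii") == head:
            if not rest:
                return offset + 8, size - 8
            # Recurse into the container box's payload.
            return find_box(data, rest, offset + 8, offset + size)
        offset += size
    return None

def box(btype, payload=b""):
    """Serialize one box: 32-bit size, 4-char type, payload."""
    return struct.pack(">I4s", 8 + len(payload), btype) + payload

# Toy nesting: a moov containing a trak containing a 2-byte tkhd payload.
data = box(b"moov", box(b"trak", box(b"tkhd", b"\x01\x02")))
pos = find_box(data, "moov/trak/tkhd")
# pos == (24, 2)
```

In a real client, the payload located this way would then be parsed further (version, flags, transformation matrix) before any tile layout is known, which is precisely the extra round trip the passage above criticizes.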
FIG. 1 illustrates schematically the use of tiles for streaming regions of interest of video sequences.
As illustrated, multiple resolution layers are computed from a high spatial resolution input video 100 comprising a set of images 105-1 to 105-n, and each layer is divided into tiles, each tile being encoded independently. As in a conventional video stream, a base-layer tile shows the whole video scene. When a user wants to zoom into the video, tiles from the higher resolution layers are retrieved to provide higher quality details. Therefore, a client device needs to decode and synchronize multiple tiles to render a particular region of interest.
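Determining which tiles a given region of interest requires is simple geometry. The sketch below assumes, for illustration, a uniform grid tiling of the frame (real tilings need not be uniform):

```python
def tiles_for_roi(roi, grid_cols, grid_rows, frame_w, frame_h):
    """Return (col, row) indices of the tiles that a rectangular ROI
    (x, y, w, h) overlaps, assuming a uniform grid tiling."""
    x, y, w, h = roi
    tile_w, tile_h = frame_w / grid_cols, frame_h / grid_rows
    first_col, last_col = int(x // tile_w), int((x + w - 1) // tile_w)
    first_row, last_row = int(y // tile_h), int((y + h - 1) // tile_h)
    return [(c, r) for r in range(first_row, last_row + 1)
                   for c in range(first_col, last_col + 1)]

# An ROI straddling the two top tiles of a 2x2 grid on a 1920x1080 frame:
tiles = tiles_for_roi((800, 100, 400, 200), 2, 2, 1920, 1080)
# tiles == [(0, 0), (1, 0)]
```

The client must then fetch, decode, and synchronize each listed tile for every displayed frame, which is the cost the alternative overlapping scheme below avoids.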
Alternatively, an overlapping tiling scheme can be used so that only one tile is needed to satisfy any region of interest. To handle different display sizes and network conditions, each tile is encoded at different spatial and quality resolutions.
An example of a manifest file corresponding to input video 100 is given in the Appendix (Extract of code 1). According to this example, each image of high spatial resolution input video 100 comprises four segments arranged in a 2×2 matrix. The address of each segment and the position of the corresponding segment in the image are provided within the manifest.
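Since the appendix itself is not reproduced here, the following hypothetical sketch (invented URLs and positions, flattened into a Python structure rather than XML) shows the kind of address-plus-position association such a manifest carries, and how a client could use it:

```python
# Hypothetical flattened view of a manifest for a 2x2 tile arrangement
# on a 1920x1080 frame: each tile has an address and its position.
TILE_MANIFEST = [
    {"url": "http://example.com/tile_0_0.m4s", "x": 0,   "y": 0,   "w": 960, "h": 540},
    {"url": "http://example.com/tile_1_0.m4s", "x": 960, "y": 0,   "w": 960, "h": 540},
    {"url": "http://example.com/tile_0_1.m4s", "x": 0,   "y": 540, "w": 960, "h": 540},
    {"url": "http://example.com/tile_1_1.m4s", "x": 960, "y": 540, "w": 960, "h": 540},
]

def urls_covering(roi):
    """URLs of the tiles whose area intersects the ROI rectangle (x, y, w, h)."""
    x, y, w, h = roi
    return [t["url"] for t in TILE_MANIFEST
            if t["x"] < x + w and x < t["x"] + t["w"]
            and t["y"] < y + h and y < t["y"] + t["h"]]

# An ROI straddling the two top tiles yields their two URLs.
top_urls = urls_covering((800, 100, 400, 200))
```

Because both the address and the position come from the manifest, the client can issue plain HTTP GET requests for exactly the tiles it needs, with no server-side interpretation.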
US patent application US20100299630 discloses a system for visualizing regions of interest in panoramic images. However, only the cases of pre-generated regions of interest (at the server end) and of cropped images (at the client device end) are considered. It does not disclose any dynamic streaming of a user-selected region of interest.
In the article entitled “An interactive region-of-interest video streaming system for online lecture viewing”, published at the Packet Video Conference 2010, the authors mention the use of tiles for streaming regions of interest. A manifest is used to provide identifier and location items of information for the tiles (actually H.264 slices). However, even if the tiling configuration of each resolution layer is described in the manifest file, such a description does not provide a URL per tile. Furthermore, it requires some intelligence at the server end to interpret the specific HTTP queries sent by the client to stream the selected tiles. Indeed, from a base URL and the tile items of information provided by the proprietary manifest (tile position and identifier), a client device can build an HTTP GET query of the form GET xxx?id=val to access a particular tile, identified by the value of the identifier attribute read from the manifest. However, such a type of URL requires processing tasks at the server end to retrieve the file, and the byte range in the file, to be sent to the client device to fulfill its request. Moreover, it does not allow tile composition and/or exclusion items of information to be signaled in the manifest.
According to patent application WO2012168365, a manifest file describes one or more spatial segment streams with their location information (URLs), and a client device has the possibility of selecting one or more spatial areas. The manifest file also describes the relationships between spatial segments, in particular to match a spatial area across resolution levels. However, a synchronization engine is required at the client end to provide the ability to stream and display more than one tile at a time. Such a synchronization engine, when using DASH, requires timed segments in the manifest and a reordering of the frames in the client device. The decoded spatial segment frames are stitched together for display as the selected region of interest.
To solve these issues, there is provided an efficient partition or tile description scheme for the manifest, which ensures that, whatever track combination is selected by a client application, the result of the ISO BMFF parsing always leads to a valid video elementary bit-stream for the video decoder.