A tiled video such as a video mosaic is an example of the combined presentation of multiple video streams of visually related or unrelated video content on one or more display devices. Examples of such video mosaics include TV channel mosaics comprising multiple TV channels in a single mosaic view for fast channel selection, and security camera mosaics comprising multiple security video feeds in a single mosaic for a compact overview. Personalization of a video mosaic is often desired when different persons require different video mosaics, e.g. a personalized TV channel mosaic wherein each user may have his own preferred set of TV channels, a personalized interactive electronic program guide (EPG) wherein each user is able to compose a video mosaic associated with TV programs indicated by the EPG, or a personalized security camera mosaic wherein each security officer may have his own set of security feeds. The personalization may vary over time, e.g. as user TV channel preferences change, as TV channel viewing rates fluctuate in case the video mosaic shows the currently most watched TV channels, or as other security video feeds become relevant for a security officer when he changes location. Additionally and/or alternatively, video mosaics may be interactive, i.e. configured to be responsive to user inputs. For example, the TV may switch to a particular channel when the user selects a specific tile from a TV channel mosaic.
WO2008/088772 describes a conventional process for generating a video mosaic. This process includes selecting different videos and a server application processing the selected videos such that a video stream representing the video mosaic can be transmitted to a client device. The video processing may include decoding the videos, spatially combining and stitching video frames of the selected videos in the decoded domain, and re-encoding the stitched frames into a single video stream. This process requires substantial resources in terms of decoding/encoding and caching. Further, the double encoding process, firstly at the video source and secondly at the server, results in quality degradation of the original source videos.
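The decoded-domain stitching step of such a conventional process can be sketched as follows. This is an illustration only: the function name, the 2x2 grid and the list-of-rows frame representation are assumptions, and a real server would operate on decoded YUV frames before re-encoding.

```python
def stitch_mosaic(frames, grid=(2, 2)):
    """Spatially combine decoded frames into one mosaic frame.

    Each frame is a list of rows (each row a list of pixel values),
    standing in for a decoded video frame; the result is the larger
    mosaic frame that would subsequently be re-encoded into a single
    video stream.
    """
    rows, cols = grid
    h, w = len(frames[0]), len(frames[0][0])
    mosaic = [[0] * (cols * w) for _ in range(rows * h)]
    for i, frame in enumerate(frames):
        r, c = divmod(i, cols)  # tile position of this video in the grid
        for y in range(h):
            mosaic[r * h + y][c * w:(c + 1) * w] = frame[y]
    return mosaic

# Four dummy 2x2 "decoded" frames, one per selected video
frames = [[[v, v], [v, v]] for v in (1, 2, 3, 4)]
mosaic = stitch_mosaic(frames)  # a single 4x4 mosaic frame
```

Note that this stitching must be repeated for every frame of every personalized mosaic, which is why the approach is resource-intensive on the server.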
The article by Sanchez et al, “Low Complexity cloud-video-mixing using HEVC”, 11th annual IEEE CCNC—Multimedia networking, services and applications 2014, pp. 214-218, describes a system for creating a video mosaic for video conferencing and surveillance applications. The article describes a video-mixer solution that is based on the standard-compliant HEVC video compression standard. Different HEVC video streams associated with different video content are combined in the network by rewriting metadata associated with NAL units in these video streams. A server thus rewrites incoming NAL units comprising encoded video content of a video stream and combines/interlaces those into an outgoing stream of NAL units representing a tiled HEVC video stream, wherein each HEVC tile represents a subregion of the image region of a video mosaic. The output of the video mixer can be decoded by a standard-conformant HEVC decoder module, provided that special constraints are put on the encoder module. Hence, Sanchez describes a solution for combining the video content in the encoded domain, so that the need for resource-intensive processes including decoding, stitching in the decoded domain and re-encoding is eliminated or at least substantially reduced.
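The header-rewriting principle can be illustrated with a deliberately simplified model. The `NALUnit` structure and the CTU-address offsets below are illustrative assumptions; an actual mixer parses and rewrites real HEVC slice segment headers rather than the opaque payloads used here.

```python
from collections import namedtuple

# Highly simplified stand-in for an HEVC NAL unit: only the slice segment
# address (the CTU at which the slice starts) and an opaque payload are
# modelled; the encoded pixel data is never touched.
NALUnit = namedtuple("NALUnit", ["slice_address", "payload"])

def mix_tile_streams(streams, tile_offsets):
    """Combine per-frame NAL units of several single-tile HEVC streams.

    `streams` maps a stream id to its list of NAL units for one frame;
    `tile_offsets` maps the same ids to the CTU address of the tile
    position that stream should occupy in the mosaic. Only the slice
    segment address is rewritten; no decoding or re-encoding occurs.
    """
    out = []
    for stream_id, units in streams.items():
        offset = tile_offsets[stream_id]
        for unit in units:
            out.append(NALUnit(unit.slice_address + offset, unit.payload))
    # Emit in decoding order of the mosaic (ascending slice address).
    return sorted(out, key=lambda u: u.slice_address)

streams = {"camA": [NALUnit(0, b"a0")], "camB": [NALUnit(0, b"b0")]}
mixed = mix_tile_streams(streams, {"camA": 0, "camB": 64})
```

The key point the sketch captures is that each incoming stream keeps its encoded payload unchanged while being assigned a distinct tile position in the outgoing mosaic stream.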
A problem with the solution proposed by Sanchez is that the created video mosaic requires dedicated processes on the server, so the required server processing capacity scales only linearly, i.e. poorly, with the number of users. This is a major scalability issue when offering such services at large scale. Further, the client-server signaling protocol introduces a delay, as it takes time to send a request for a specific mosaic and then, in response to the request, compose that video mosaic and transmit it to the client. Additionally, the server forms both a single point of failure for all streams delivered by that server and a single point of control, which poses a risk in terms of privacy and security. Finally, the system proposed by Sanchez et al does not allow for third-party content providers: all the content offered to the clients needs to be known by a central server responsible for combining the videos.
Transferring the video mixer functions of Sanchez to the client side may partly solve the above-mentioned problems. However, this would require the client to parse the HEVC encoded bitstream, to detect the relevant parameters and headers, and to rewrite the headers of the NAL units. Such capabilities require data storage and processing power that go beyond those of a commercial off-the-shelf standard-conformant HEVC decoder module.
Further, current HEVC technology does not offer functionality needed for selecting different HEVC tile streams associated with different tile positions and different content sources. For example, the ISO contribution ISO/IEC JTC1/SC29/WG11 MPEG2014/M33210 of March 2014 describes scenarios of how spatially related HEVC tiles can be signaled to a DASH client and how such an HEVC tile can be downloaded without the need to download all other tiles. This document describes a scenario wherein one video source is encoded into HEVC tiles that are stored as HEVC tile tracks in a single file (a single ISOBMFF data container produced by one encoding process) stored on a server. A manifest file (referred to in DASH as a media presentation description or MPD) describing the HEVC tiles in the data container can be used for selecting and playing out one of the stored HEVC tile tracks. Similarly, WO2014/057131 describes a process for selecting, on the basis of an MPD, a subset of HEVC tiles (a region of interest) from a set of HEVC tiles originating from one single video (i.e. HEVC tiles that are formed by encoding a single video source).
MITSUHIRO HIRABAYASHI ET AL: “Considerations on HEVC Tile Tracks in MPD for DASH SRD”, 108th MPEG MEETING; 31 Mar. 2014-4 Apr. 2014; VALENCIA; MOTION PICTURE EXPERT GROUP OR ISO/IEC JTC1/SC29/WG11, m33085, 29 Mar. 2014, describes ways of annotating HEVC Tile Tracks of an HEVC Stream with DASH SRD descriptors. Two use cases are described. One use case assumes all HEVC Tile Tracks and associated HEVC Base Tracks to be included in a single MP4 file. In this case it is suggested to map all HEVC Tile Tracks and the HEVC Base Track to SubRepresentations. The other use case assumes each of the HEVC Tile Tracks and the HEVC Base Track to be included in separate MP4 files. In this case it is suggested to map all HEVC Tile Track MP4 files and the HEVC Base Track MP4 files onto Representations within an AdaptationSet.
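For the second use case, with each tile track in its own MP4 file mapped to a Representation, an MPD fragment would take roughly the following shape. This is a hedged illustration only: the identifiers, bandwidth value and file name are invented, and real MPDs carry further mandatory attributes.

```xml
<!-- Illustrative fragment; identifiers and URLs are invented. -->
<AdaptationSet>
  <!-- SRD value: source_id, object_x, object_y, object_width,
       object_height, total_width, total_height -->
  <SupplementalProperty schemeIdUri="urn:mpeg:dash:srd:2014"
                        value="1,0,0,960,540,1920,1080"/>
  <Representation id="tile1" bandwidth="1000000">
    <BaseURL>tile1.mp4</BaseURL>
  </Representation>
</AdaptationSet>
```

The SRD descriptor ties the Representation to a spatial subregion of the full image region, which is what allows a DASH client to relate tile tracks to positions in a mosaic.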
It should be noted that, according to sections 2.3 and 2.3.1, all HEVC Tile Tracks describing tile videos relate to the same HEVC Stream, which implies they are the result of a single HEVC encoding process. This further implies that all these HEVC Tile Tracks relate to the same input (video) stream entering the HEVC encoder.
GB 2 513 139 A (CANON KK [JP]), 22 Oct. 2014, discloses a method for streaming video data using the DASH standard, each frame of the video being divided into n spatial tiles, n being an integer, in order to create n independent video sub-tracks. The method comprises: transmitting, by a server, a media presentation description (MPD) file to a client device, said description file including data about the spatial organization of the n video sub-tracks and at least n URLs respectively designating each video sub-track; selecting, by the client device, one or more URLs according to a Region Of Interest chosen by the client device or a client device's user; receiving from the client device, by the server, one or more request messages for requesting a resulting number of video sub-tracks, each request message comprising one of the URLs selected by the client device; and transmitting to the client device, by the server, video data corresponding to the requested video sub-tracks, in response to the request messages.
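The client-side selection step of such a method can be sketched as follows. The MPD is modelled as a plain dict and all URLs, coordinates and the function name are invented for illustration; a real client would parse the XML MPD and issue HTTP requests for the selected URLs.

```python
def select_subtrack_urls(mpd, roi):
    """Return the URLs of the video sub-tracks overlapping the ROI.

    `mpd` holds, per sub-track, its spatial position (x, y, w, h) and
    its URL, mirroring the spatial-organization data and the n URLs
    carried in the description file; `roi` is the (x, y, w, h) region
    chosen by the client device or its user.
    """
    rx, ry, rw, rh = roi
    urls = []
    for tile in mpd["tiles"]:
        tx, ty, tw, th = tile["pos"]
        # Standard axis-aligned rectangle overlap test
        overlaps = (tx < rx + rw and rx < tx + tw and
                    ty < ry + rh and ry < ty + th)
        if overlaps:
            urls.append(tile["url"])
    return urls

mpd = {"tiles": [
    {"pos": (0, 0, 960, 540),     "url": "http://server/tile1.mp4"},
    {"pos": (960, 0, 960, 540),   "url": "http://server/tile2.mp4"},
    {"pos": (0, 540, 960, 540),   "url": "http://server/tile3.mp4"},
    {"pos": (960, 540, 960, 540), "url": "http://server/tile4.mp4"},
]}
selected = select_subtrack_urls(mpd, (0, 0, 960, 540))  # top-left quadrant
```

Each selected URL would then be placed in its own request message to the server, which answers with the corresponding sub-track data.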
WO 2015/011109 A1 (CANON KK [JP]); CANON EUROP LTD (GB), 29 Jan. 2015 discloses encapsulating partitioned timed media data in a server, the partitioned timed media data comprising timed samples, each timed sample comprising a plurality of subsamples. After having selected at least one subsample from amongst the plurality of subsamples of one of the timed samples, one partition track comprising the selected subsample and one corresponding subsample of each of the other timed samples is created for each selected subsample. Next, at least one dependency box is created, each dependency box being related to a partition track and comprising at least one reference to one or more of the other created partition tracks, the at least one reference representing a decoding order dependency in relation to the one or more of the other partition tracks. Each of the partition tracks is independently encapsulated in at least one media file.
The above described processes and MPDs however do not allow a client device to flexibly and efficiently “compose” video mosaics on the basis of a large number of tile tracks associated with different tile positions and originating from different source video files (e.g. different ISOBMFF data containers produced by different encoding processes) that may be stored in different locations in the network.
Hence, there is a need in the art for improved methods, devices, systems and data structures that enable efficient selection and composition of a video mosaic on the basis of tile streams that are associated with different tile positions and that originate from different content sources. In particular, there is a need in the art for methods and systems that enable efficient and scalable solutions for composition of a video mosaic that can be delivered via a scalable transport scheme, e.g. multicast and/or CDNs, to a large number of client devices.