Advances in camera and image processing technology enable the recording and processing of content in ever higher resolutions and larger image formats. In order to handle and control the high bandwidth requirements of such formats and to allow the play-out of these formats on user equipment with limited display capabilities, the large field-of-view image region of the frames of such a high-resolution stream may be spatially divided into a grid of areas (which are usually referred to as tiles, segments or slices). The data in these tile areas of the image region may be encoded as separate streams so that they can be stored and distributed independently of each other. In this disclosure such a stream may be referred to as a tile stream, which may be used interchangeably with the term ‘tile video stream’. A tile stream may comprise a sequence of tile frames that can be played out by a client according to a particular content timeline, preferably a play-out timeline. The tile stream thus comprises frames that relate to a subregion (e.g. ‘the tile’ or ‘tile area’) of the image region, having a fixed (spatial) position (e.g. having fixed/static co-ordinates) within the image region. In other words, its ‘spatial position’ in the image region does not change over time. The (static) position of this subregion within the image region may be defined by so-called “tile position information”.
A user may select a region-of-interest (ROI) within the image region of frames of a high-resolution stream and a client may subsequently request the set of tile streams that are associated with the selected ROI. Upon reception of the requested tile streams, the client may decode the streams and stitch the decoded tile frames together so that a seamless image of the selected ROI can be displayed to the user. This process may be referred to as tiled streaming.
When moving the ROI within a large field-of-view image region (e.g. via a panning, zooming or tilting action by the user), different sets of tile streams need to be delivered to the client in order to render a seamless view of the newly selected ROIs to the user. This way, tiled streaming allows a user to interact with the content.
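The relation between a selected ROI and the set of tile streams that a client would need can be illustrated with a minimal sketch. It assumes a regular grid of equally sized tiles addressed by (column, row) indices; the names `Rect` and `tiles_for_roi` and the example dimensions are hypothetical and chosen for illustration only.

```python
from typing import NamedTuple, Set, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in pixel co-ordinates of the full image region."""
    x: int
    y: int
    w: int
    h: int


def tiles_for_roi(roi: Rect, tile_w: int, tile_h: int,
                  cols: int, rows: int) -> Set[Tuple[int, int]]:
    """Return the (col, row) index of every tile that overlaps the ROI.

    Assumes a regular grid of cols x rows tiles, each tile_w x tile_h pixels.
    Indices are clamped so that a ROI reaching past the image edge is tolerated.
    """
    first_col = max(roi.x // tile_w, 0)
    last_col = min((roi.x + roi.w - 1) // tile_w, cols - 1)
    first_row = max(roi.y // tile_h, 0)
    last_row = min((roi.y + roi.h - 1) // tile_h, rows - 1)
    return {(c, r)
            for c in range(first_col, last_col + 1)
            for r in range(first_row, last_row + 1)}


# Hypothetical example: a 3840x2160 image region split into a 4x4 grid
# of 960x540 tiles, with a 640x360 ROI that is panned to the right.
before = tiles_for_roi(Rect(800, 500, 640, 360), 960, 540, 4, 4)
after = tiles_for_roi(Rect(1500, 500, 640, 360), 960, 540, 4, 4)

to_request = after - before   # tile streams that must newly be delivered
to_release = before - after   # tile streams that are no longer needed
```

In this example the pan moves the ROI out of the first tile column and into the third, so the two tiles in column 2 have to be requested while the two tiles in column 0 can be released; the tiles in column 1 remain in use.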
Typically, tiles (i.e. tile streams) are delivered to a client using an adaptive streaming protocol. In these implementations a client (i.e. client device) may be provided with a so-called spatial manifest file that comprises one or more different tile representations of a source video, typically a large field-of-view area, high-resolution source video, wherein a tile representation may define a set of tile streams of a predetermined tile format in terms of e.g. resolution, tile size and/or tile position. The spatial manifest file may further comprise tile stream identifiers (e.g. URLs) for determining one or more delivery nodes in the network that are configured for delivering the tile streams to the client. On the basis of the spatial manifest file, the client can handle changes in the user-selected ROI and/or tile representation and the associated requests for tile streams from a server or a content delivery network (CDN). Advantageous implementations of CDNs that are configured for efficiently delivering content on the basis of tile streams and clients that are configured for receiving and processing tile streams are described in WO2012/168365.
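By way of illustration only, the information carried by such a spatial manifest file can be modelled as a small data structure. The sketch below is not the manifest format of WO2012/168365 or of any streaming standard; the class names, field names, URL pattern and the `cdn.example.com` host are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Position = Tuple[int, int]  # (col, row) of a tile within the grid


@dataclass(frozen=True)
class TileStream:
    position: Position  # static tile position information
    url: str            # tile stream identifier used to locate a delivery node


@dataclass(frozen=True)
class TileRepresentation:
    name: str                     # hypothetical label, e.g. "4x4"
    grid: Tuple[int, int]         # number of tile columns and rows
    tile_size: Tuple[int, int]    # tile width and height in pixels
    streams: Dict[Position, TileStream]


def build_representation(name: str, cols: int, rows: int,
                         tile_w: int, tile_h: int,
                         base_url: str) -> TileRepresentation:
    """Create one tile representation with one tile stream per grid cell."""
    streams = {
        (c, r): TileStream((c, r), f"{base_url}/{name}/tile_{c}_{r}.mp4")
        for c in range(cols) for r in range(rows)
    }
    return TileRepresentation(name, (cols, rows), (tile_w, tile_h), streams)


def urls_for(rep: TileRepresentation,
             positions: Iterable[Position]) -> List[str]:
    """Look up the stream URLs for the requested tile positions, in sorted order."""
    return [rep.streams[p].url for p in sorted(positions)]


# Hypothetical manifest content: one 4x4 representation of a 3840x2160 source.
rep = build_representation("4x4", 4, 4, 960, 540,
                           "https://cdn.example.com/match")
requests = urls_for(rep, {(2, 0), (1, 0)})
```

A client holding such a structure could map the tile positions that overlap the current ROI directly onto the identifiers it has to request, and could switch tile representation by selecting a different entry.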
In many user scenarios, however, it is not necessary, or even desired, that the content is continuously played out by the client as a set of tile streams. In the case of a soccer match, a user may only be interested in interacting with the media stream at certain points in time, e.g. when an off-screen foul is made but the camera operator decides to follow the ball instead. Moreover, in terms of bandwidth and processing load of the client, it is desired to reduce the number of streams that are needed for displaying a ROI of a large field-of-view image region, as a separate decoder instance has to be started for each tile stream.
In the article by Mavlankar et al., “Interactive region-of-interest video streaming system for online lecture viewing”, a tiled streaming system is described that comprises a tracking mode and a user control mode. The user control mode is a tile stream mode wherein the user may select a ROI that is associated with a set of tile streams. The tile streams are received and processed by the client such that a user-selected ROI is rendered. In the tracking mode, a ROI is determined by a tracking algorithm (a “virtual camera operator”) that can track an object in an image. A ROI stream (in the article referred to as a “tracking tile”) is generated by directly cropping the ROI from the (high-resolution wide-area format) source video. The tracking mode is thus a non-tiled mode wherein the ROI is streamed in a single stream to the client and there will be no switching between different tile streams for rendering a display of the ROI. A user can passively follow a moving ROI in the video (e.g. the position of the ball in a football match), without having to actively navigate around the large field-of-view image region himself.
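The cropping step of such a tracking mode can be sketched as follows. This is a simplified illustration and not the algorithm of the cited article: it assumes that a tracker already supplies the centre of the tracked object, it represents a frame as a plain list of pixel rows, and the function names `roi_around` and `crop` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]           # rows of pixel values
Roi = Tuple[int, int, int, int]   # (x, y, width, height)


def roi_around(cx: int, cy: int, roi_w: int, roi_h: int,
               frame_w: int, frame_h: int) -> Roi:
    """Centre a fixed-size ROI on (cx, cy), clamped to stay inside the frame."""
    x = min(max(cx - roi_w // 2, 0), frame_w - roi_w)
    y = min(max(cy - roi_h // 2, 0), frame_h - roi_h)
    return x, y, roi_w, roi_h


def crop(frame: Frame, roi: Roi) -> Frame:
    """Cut the ROI out of a full source frame."""
    x, y, w, h = roi
    return [row[x:x + w] for row in frame[y:y + h]]


# Hypothetical 8x6 source frame whose pixel value encodes its position.
frame = [[r * 8 + c for c in range(8)] for r in range(6)]

# The tracked object sits in the top-right corner, so the ROI is clamped.
roi = roi_around(cx=7, cy=0, roi_w=4, roi_h=2, frame_w=8, frame_h=6)
roi_frame = crop(frame, roi)
```

Repeating this per source frame yields a single sequence of ROI frames whose position within the image region changes over time, in contrast to a tile stream, whose position is fixed.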
The tracking mode concept as described in Mavlankar was developed within the context of online lecture viewing. However, it does not provide the ability to perform seamless switching between the ROI in the tracking mode and the ROI in the user control mode. Such switching functionality would be desired in more general content broadcast and streaming applications. For example, if a user is watching a soccer match in the tracking mode (e.g. watching the studio ‘cut’ edited by the director), the user may want to explore in a seamless fashion the area where a foul was made, without having to switch to a different mode, thereby interrupting the video, and without having to first look for the particular ROI in the large field-of-view ‘fingernail’ picture in the user control mode, as such action would seriously disrupt the continuous user experience. In addition, in certain implementations, it may not be desirable or feasible at all to always provide a ‘fingernail’ view of the source video's ‘full image view’, as suggested by Mavlankar. Even more so, a ‘fingernail’ view for ROI selection may be inadequate and cumbersome if the full image view is large compared to the (user-determined) ROI that is desired. This may be the case if the full image view is large and provides a lot of detail, and more in particular when, for instance, zooming in combination with panning is performed during ROI selection.
Hence, there is a need in the art for improved methods and systems that enable efficient streaming of a region-of-interest in frames with a large field-of-view area to a client. Further, there is a need in the art for enabling smooth or even seamless switching between the streaming of a region-of-interest on the basis of a single stream to a client (in a non-tiled mode) and the streaming of a region-of-interest on the basis of a set of tile streams to the client (in a tiled mode).