In certain video applications, such as applications related to Virtual Reality (VR) and 360 degree video, there may be a desire to enhance areas in the video frames of a media stream which are of interest to the viewer. Such an area may be referred to as a Region of Interest (ROI). Known techniques for determining a ROI in an image area are content-based: typically, the content in the video frames is analysed using e.g. object tracking, optical flow (motion in the video), face detection or car detection. Thereafter, a detected ROI may be processed in order to enhance the video quality within the ROI.
One approach to achieve such local quality enhancement in video images is a technique referred to as ROI-based video coding. ROI-based video coding may be combined with Scalable Video Coding (SVC), an extension of the video coding standard MPEG-4 part 10 AVC/H.264 that enables a multi-layer coding scheme. In its simplest usage, the original video is encoded in different dependent layers providing different quality levels and resolutions, e.g. a base layer and one or more enhancement layers, wherein the base layer provides the lowest quality and the enhancement layers only comprise residual information (i.e. the high-quality minus low-quality information) in encoded form, so that the base layer combined with an enhancement layer produces high quality video frames. Hence, if the user wants more quality, the SVC decoder has to decode the base layer plus the first enhancement layer, whereby the decoding of the enhancement layer depends on the base layer. Similarly, decoding the base layer in combination with the first and second enhancement layers will produce an even higher quality video. By higher quality video we mean either a higher spatial resolution, i.e. more pixels, or a higher signal-to-noise ratio (SNR), which reproduces the original source video with more fidelity. Both enhancements contribute to a higher video quality as perceived by the human eye.
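The layered reconstruction principle described above can be illustrated with a minimal sketch, which is not a real codec: the base layer is a coarsely quantised version of the source signal and the enhancement layer carries only the residual needed to restore fidelity. All function names and the quantiser value are illustrative assumptions.

```python
# Illustrative sketch of SNR-scalable (SVC-style) layering, not a real codec:
# the base layer is a coarse version of the source and the enhancement layer
# only comprises the residual (high quality minus low quality information).

def encode_layers(source, quantizer=10):
    """Split a 1-D 'frame' of pixel values into a base layer and a residual."""
    base = [(p // quantizer) * quantizer for p in source]   # lowest-quality layer
    residual = [p - b for p, b in zip(source, base)]        # enhancement data
    return base, residual

def decode(base, residual=None):
    """Decoding the enhancement layer depends on the base layer being available."""
    if residual is None:
        return list(base)                                   # base quality only
    return [b + r for b, r in zip(base, residual)]          # base + enhancement

source = [17, 42, 99, 123]
base, residual = encode_layers(source)
print(decode(base))            # coarse, base-quality frame
print(decode(base, residual))  # base plus enhancement restores the source
```

As in SVC, the residual is meaningless on its own; it only refines an already decoded base layer.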
In the article by Jung-Hwan Lee and C. Yoo, “Scalable roi algorithm for H.264/SVC-based video streaming,” IEEE Transactions on Consumer Electronics, vol. 57, no. 2, pp. 882-887, May 2011, a technique is described to enhance ROIs of a video stream by making use of the SVC video coding standard, wherein the base layer is used to encode the video in an acceptable quality. Additionally, an enhancement layer is produced comprising only a ROI that is encoded in a higher quality. As a result, the different enhancement layers only cover part of the full image area of the video frames of the base layer. The ROI enhancement layers are contained in concentric slices whose shapes are enabled by the Flexible Macroblock Ordering (FMO) feature as described in MPEG-4 part 10 AVC/H.264. In this prior art technique, the decoding of the enhancement layer depends on the availability of the base layer.
One problem associated with ROI-based video coding is that it relies on a-priori knowledge about the ROIs in which viewers are interested. When using high-density panorama-type or immersive-type video, the number of detected objects and associated ROIs may increase substantially and such a-priori knowledge may no longer be available. For example, different users may be interested in different ROIs: in video surveillance, a first police officer may want a high quality view of cars (in particular license plates) in an image, while a second police officer may be solely interested in a high quality view of the faces of pedestrians. In such a situation an enhancement layer may be generated comprising both ROIs, so that the first user will receive the information on the ROI of the second user and vice-versa, thereby causing a waste of bandwidth. Even though multiple layers may be produced on the basis of a number of user profiles, in the end such an approach does not provide a scalable solution: with hundreds of users, it is not computationally efficient to produce hundreds of enhancement layers. As a consequence, many areas that are initially identified as a ROI, encoded as part of an enhancement layer and transmitted to a client will in the end not be relevant for a user, thus causing a substantial waste of bandwidth. In other applications, the above-mentioned a-priori knowledge on ROIs simply does not exist. For example, in context-based applications such as gaze detection, or in user interface applications wherein a user selects one or more ROIs, there is no way to know at the encoding stage which parts of the image region of the video frames will be a ROI. In such applications, existing ROI-based video coding schemes cannot be used.
WO2014111423 describes a system for providing a video comprising a High Quality (HQ) ROI of an increased video quality, and proposes two basic solutions to achieve this. One solution is based on a scalable video codec such as the SVC extension to AVC. In this solution, encoded video streams are generated from a source video, each comprising a base layer covering the full image view of the source video and at least one enhancement layer comprising a portion (tile) of the full image view. Decoding of each video stream requires an independent decoding instance, and decoding of the enhancement layer of each video stream also requires the availability of the base layer covering the full image view. Formation of a decoded video stream comprising a HQ ROI includes selecting the enhancement layers that comprise one or more HQ tiles covering the ROI, individually decoding each selected enhancement layer in a separate decoding instance on the basis of the base layer in order to form a number of video frames each comprising one or more HQ tiles at different positions, and finally combining these video frames into a video frame comprising the HQ ROI.
In the alternative solution, which does not make use of a scalable video codec, a number of different elementary streams are generated, each comprising an encoded version of the video in which a different tile is encoded in high quality and the remaining tiles in low quality. Formation of a decoded video stream comprising a HQ ROI includes selecting the elementary streams comprising the one or more HQ tiles needed to cover the ROI, individually decoding each selected elementary stream in a separate decoding instance in order to form a number of video frames each comprising one or more HQ tiles at different positions, and finally combining these video frames into a video frame comprising the HQ ROI.
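The stream-selection step common to both solutions can be sketched as follows: one elementary stream (or enhancement layer) is associated with each tile of a grid, and the client picks every stream whose HQ tile overlaps the requested ROI. The tile grid, stream identifiers and rectangle convention (x, y, width, height) are illustrative assumptions, not details from WO2014111423.

```python
# Hypothetical sketch of tile/stream selection: the client must retrieve one
# stream per HQ tile that intersects the ROI, and each such stream later
# requires its own decoding instance.

def overlaps(a, b):
    """True if two (x, y, w, h) rectangles intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def select_streams(tile_grid, roi):
    """Return the ids of streams whose HQ tile intersects the ROI rectangle."""
    return [sid for sid, tile in tile_grid.items() if overlaps(tile, roi)]

# A 2x2 grid of 100x100 tiles over a 200x200 frame.
grid = {
    "stream_0": (0, 0, 100, 100),   "stream_1": (100, 0, 100, 100),
    "stream_2": (0, 100, 100, 100), "stream_3": (100, 100, 100, 100),
}
# A ROI straddling the centre touches all four tiles, so even this coarse grid
# already forces four parallel decoding instances.
print(select_streams(grid, (50, 50, 100, 100)))
```

With a finer grid, the same centred ROI intersects correspondingly more tiles, which is the scaling problem discussed below.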
In both disclosed solutions, the combining is performed by signalling that the HQ tiles should be overlaid (e.g. superimposed/placed on top of/placed in front of) on the portion of the base layer image covered by the HQ tiles during combination for display.
The proposed solutions require parallel decoding of media data in order to form multiple video frames, each comprising one (or more) HQ tiles, and subsequent combination of the multiple video frames into a video frame comprising a HQ ROI. As a result, the number of required parallel decoding processes/instances of independent video streams, which may or may not comprise enhancement layers, scales linearly with the number of HQ tiles needed to cover the selected ROI or ROIs. Therefore, when increasing the number of tiles and the number of simultaneously selectable ROIs, such a scheme would require a substantial number of decoding instances running in parallel, which constrains the number of simultaneously selectable ROIs and the granularity of the tile grid (e.g. the number of available tiles) to the capabilities of the device.
More particularly, in WO2014111423 the burden on the client increases linearly with the number of decoded tiles. This is problematic as a ROI enhancement application typically requires a fine selection of the region that needs to be enhanced in order to adapt to the shape of the content (e.g. a truck in video surveillance). Hence, in such an application a fine tiling grid of the original video is desired. As a consequence, it is very likely that the client has to separately retrieve and decode, for instance, nine or more elementary streams/enhancement layers in order to form one video frame comprising an enhanced ROI. Decoding that many elementary streams/enhancement layers is, however, computationally intensive and challenging for memory management, as a separate decoding pipeline is required for each elementary stream/enhancement layer. Moreover, when combining the decoded video frames into a video frame comprising a HQ ROI, a substantial amount of decoded media data is not used in the resulting video frame, thereby rendering the decoding process inefficient in terms of decoding resources.
In addition, the proposed ‘scalable video codec’ solution in WO2014111423, as described above, depends on the client device having a decoder that supports a scalable video codec.
Moreover, the alternative solution proposed by WO2014111423, which is based on a non-scalable codec, uses elementary streams that each contain, besides a high quality tile, low quality tiles as well. This introduces a significant redundancy in the video data to be retrieved and decoded, which scales linearly with the granularity of the grid of tiles.
Hence, from the above it follows that there is a need in the art for improved methods and systems that enable simple and efficient enhancement of one or more Regions of Interest in video frames of a video stream.