The present application is concerned with a video streaming concept suitable for composing a video stream out of a coded version of a video content.
There are a number of applications and use cases where a composited form of multiple videos is simultaneously transmitted to and displayed to a user. A first approach is to send all videos independently encoded, so that multiple decoders are used simultaneously and the composited video is displayed by arranging the videos once they are decoded. A problem with this approach is that many target devices incorporate only a single hardware video decoder. Examples of such devices are low-cost TV sets and set-top boxes (STBs) or battery-powered mobile devices.
In order to generate a single video bitstream from the multiple videos, a second approach is pixel-domain video processing (e.g. composing such as stitching, merging or mixing), where the different video bitstreams are transcoded into a single bitstream to be transmitted to the target device. Transcoding can be implemented using a cascaded video decoder and encoder, which entails decoding the incoming bitstreams, composing a new video from the decoded input videos in the pixel domain and encoding the new video into a single bitstream. This approach can also be referred to as a traditional full transcode, which includes processing in the uncompressed domain. However, full transcoding has a number of drawbacks. First, the repeated encoding of video information is likely to introduce signal quality degradation through additional coding artifacts. Second, and more importantly, full transcoding is computationally complex because every incoming video bitstream has to be decoded and the outgoing video bitstream has to be encoded anew. Therefore, a full transcode approach does not scale well.
Using High Efficiency Video Coding (HEVC) [1], a technique is introduced in [2] that allows for achieving video compositing in the compressed domain for single-layer video codecs. However, there are some applications where using scalable video coding might be advantageous. In [3], a technique is described that allows for video stitching in the compressed domain for scalably coded video, which can be used for applications such as multi-party video conferencing.
Problems incurred in video conferencing applications are described in the following.
In particular, FIG. 23 represents a typical video composition of a multi-party video conference. The pictures of the composed video, one of which is exemplarily shown in FIG. 23, are spatially stitched together. In the scenario of FIG. 23, the speaker is shown in a bigger picture area 900 of the output picture while the non-speakers are shown in smaller areas 902 and 904 of the output picture. FIG. 24 exemplarily shows how the composed video bitstream is obtained by video processing 906 on the basis of coded data streams representing the individual videos shown in the areas 900 to 904. In FIG. 24, data stream 908 shall denote the data stream having encoded thereinto the video shown at area 900, i.e. the video concerning the speaker, while data streams 910 and 912 shown in FIG. 24 have encoded thereinto the videos concerning the non-speakers presented in areas 902 and 904, respectively. In order to illustrate the way the videos are encoded into data streams 908 to 912, FIG. 24 illustrates pictures of these videos, namely pictures thereof belonging to two consecutive time instants t0 and t1 and belonging to two different layers L0 and L1, respectively. The arrows shown in FIG. 24 represent prediction dependencies between the pictures. As can be seen, temporal prediction (horizontal arrows) and inter-layer prediction (vertical arrows) are used for encoding the videos into data streams 908 to 912. Scalable video coding standards such as H.264/SVC have been used previously in video conferencing systems and have proven to be very valuable, and similar expectations are held for SHVC in this area.
In accordance with the technique outlined in [3], the video processing 906 may allow for generating a single scalable bitstream out of the multiple bitstreams 908 to 912 by stitching all input bitstreams 908 to 912 in the compressed domain. The resulting single scalable bitstream is shown in FIG. 24 at 914, also by way of illustrating a fraction consisting of four pictures thereof, namely pictures belonging to different pairs of time instant and layer. The technique applied by video processing 906 in accordance with [3] involves rewriting a couple of fields in the high level syntax, such as slice header and parameter sets, so that each picture in each layer from the different input streams 908 to 912 is combined into a single picture for each layer containing the data from all streams.
Alternatively, if not all input streams 908 to 912 have the same number of layers, as depicted in FIG. 25, where input bitstreams 910 and 912 are illustratively shown as being single-layered, the lower layers of the output bitstream 914, namely layer L0 in the case of FIG. 25, have no corresponding data from the latter data streams 910 and 912. Accordingly, these lower layers, i.e. L0 in FIG. 25, of the output data stream 914 will not be generated with data from all input bitstreams 908 to 912, but some dummy data will be added to the pictures of this layer L0 as needed, as shown by white boxes 916 in FIG. 25.
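The layer alignment described for FIG. 25 can be illustrated with a small sketch. It is only a toy model under stated assumptions, not the method of [3]: each input stream is represented by a list of layer labels, streams with fewer layers are assumed to contribute to the highest output layers, and the names `stitch_layers` and `DUMMY` are hypothetical.

```python
from typing import Dict, List

# Hypothetical marker standing in for the dummy data (white boxes 916).
DUMMY = "dummy"


def stitch_layers(streams: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Combine per-stream layer lists into one entry per output layer.

    `streams` maps a stream name to its layer labels, lowest layer first.
    A stream with fewer layers than the output is aligned to the top
    layers; the lower output layers receive DUMMY for that stream.
    """
    num_layers = max(len(layers) for layers in streams.values())
    output = []
    for out_layer in range(num_layers):
        picture = {}
        for name, layers in streams.items():
            # Number of lower output layers this stream cannot fill.
            offset = num_layers - len(layers)
            idx = out_layer - offset
            picture[name] = layers[idx] if idx >= 0 else DUMMY
        output.append(picture)
    return output


# Scenario loosely following FIG. 25: 908 has two layers, 910 and 912 one.
result = stitch_layers({
    "908": ["908/L0", "908/L1"],
    "910": ["910/L0"],
    "912": ["912/L0"],
})
for layer_index, picture in enumerate(result):
    print(f"output L{layer_index}: {picture}")
```

In this model, output layer L0 carries real data only from stream 908, while the regions belonging to streams 910 and 912 are filled with the dummy marker.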
For the method described in [3], whenever a layout change event occurs, e.g. during a speaker change, the sizes of the blocks in the picture change as a result and a significant bitrate peak occurs. More concretely, there is a need to send an Instantaneous Decoding Refresh (IDR) picture or I-frame to change the picture layout or size. On a layout change event, IDRs are used for the bitstreams that switch roles from speaker to non-speaker and vice versa, which results in a momentary significant bitrate increase. This is illustrated in FIG. 26, which shows the output bitstream 914, here exemplarily a fraction thereof encompassing four consecutive time instants t0 to t3. As shown in FIG. 26, temporal prediction is disabled, as indicated at 918, when such a layout change event occurs, which is time instant t2 in the case of FIG. 26, from which time onwards the speaker changes, i.e. the speaker video and one of the non-speaker videos change their position or area within the composed video pictures. However, disabling temporal prediction requires the transmission of relatively more intra data, which is coded independently of other pictures, and thus increases the amount of data transmitted at such points in time, which is a burden in many use cases, e.g. real-time communication.
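The bitrate peak can be made concrete with a back-of-the-envelope sketch. All numbers below are invented for illustration and are not taken from [3] or from any measurement; the function name `bits_per_instant` and the cost keys are hypothetical. The model assumes one speaker and several non-speakers, and that at the layout change both the new speaker and the former speaker are intra coded while the remaining non-speakers continue with temporal prediction.

```python
from typing import Dict, List


def bits_per_instant(num_instants: int, change_at: int,
                     num_non_speakers: int,
                     cost: Dict[str, int]) -> List[int]:
    """Return the total coded bits of the composed picture per time instant.

    `cost` holds assumed per-picture sizes in bits for the keys
    'speaker_inter', 'speaker_intra', 'non_speaker_inter' and
    'non_speaker_intra'.
    """
    totals = []
    for t in range(num_instants):
        if t == change_at:
            # Layout change: the two streams that swap roles are intra coded,
            # the other non-speakers keep using temporal prediction.
            total = (cost["speaker_intra"]
                     + cost["non_speaker_intra"]
                     + (num_non_speakers - 1) * cost["non_speaker_inter"])
        else:
            total = (cost["speaker_inter"]
                     + num_non_speakers * cost["non_speaker_inter"])
        totals.append(total)
    return totals


# Assumed sizes: intra pictures five times larger than inter pictures.
assumed_cost = {
    "speaker_inter": 40_000,
    "speaker_intra": 200_000,
    "non_speaker_inter": 10_000,
    "non_speaker_intra": 50_000,
}
totals = bits_per_instant(num_instants=4, change_at=2,
                          num_non_speakers=2, cost=assumed_cost)
print(totals)  # [60000, 60000, 260000, 60000]
```

Under these assumed sizes, the picture at the layout change (t2) is more than four times larger than its neighbours, which is the kind of momentary increase the paragraph above refers to.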
Thus, it is the object of the present invention to provide a concept for video streaming of a video stream composed of a coded version of a video content which is more efficient, such as more efficient in terms of the freedom to change the composition with no, or with smaller, penalties in terms of bitrate consumption.