There are a number of applications and use cases, such as video conferencing, video surveillance, medical applications, panorama streaming, ad-insertion, picture-in-picture display or video overlay, in which several dedicated video bitstreams are simultaneously decoded and displayed to a user in a composed form. An illustrative example of such applications is a traffic surveillance system with multiple video sources being presented to the user. A problem for such applications is that many devices incorporate only a single hardware video decoder or are otherwise limited in computational, power and/or other resources. Examples of such devices are set-top boxes (STBs), low-cost TV sets or battery-powered mobile devices.
To enable said applications and use cases on such devices, a single video bitstream incorporating the several dedicated video bitstreams has to be created upfront. In order to obtain such a single video bitstream, pixel-domain video processing (e.g. composing such as stitching, merging or mixing) is typically applied, where the different video bitstreams are transcoded into a single bitstream. Transcoding can be implemented using a cascaded video decoder and encoder, which entails decoding the incoming bitstreams, composing a new video from the decoded input videos in the pixel domain and encoding the new video into a single bitstream. This method can also be referred to as a traditional full transcode, which includes processing in the uncompressed domain. However, it has a number of drawbacks. First, the repeated encoding of video information is likely to introduce further signal quality degradation through coding artifacts. Second, and more importantly, a full transcode is computationally complex, because every incoming video bitstream has to be decoded and the outgoing video bitstream has to be encoded anew, and it therefore does not scale well.
Therefore, another approach has been presented in [1], where the video stitching is performed in the compressed domain. The main idea behind [1] is to set constraints at the encoders, e.g. disallowing some motion vectors as well as motion vector prediction at picture boundaries. These constraints allow for a low-complexity bitstream rewriting process that can be applied to the different bitstreams in order to generate a single bitstream that contains all the videos that are intended to be mixed. This stitching approach is computationally less complex than full transcoding and does not introduce signal quality degradation.
An illustrative example for such a system is shown in FIG. 23 for a video surveillance system using a cloud server infrastructure. As can be seen, multiple video bitstreams 900a-d are sent by different senders 902a-d and are stitched in a cloud mixer 904 to produce a single video bitstream 906.
A more detailed description of the techniques behind the applied stitching process can be found in [1].
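As a rough illustration of the kind of encoder-side constraint described for [1], the following Python sketch checks whether a motion vector keeps a block's reference area inside its own picture. It is not taken from [1]: the names `Block` and `mv_stays_inside`, the restriction to integer-sample motion vectors, and the omission of interpolation filter margins and of motion vector prediction are all simplifying assumptions made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical description of a prediction block, in luma samples."""
    x: int
    y: int
    width: int
    height: int


def mv_stays_inside(block: Block, mv_x: int, mv_y: int,
                    pic_width: int, pic_height: int) -> bool:
    """Return True if the referenced area lies fully inside the picture.

    Assumption: integer-sample motion vectors only. A real encoder would
    also have to account for fractional-sample interpolation margins.
    """
    ref_left = block.x + mv_x
    ref_top = block.y + mv_y
    ref_right = ref_left + block.width    # exclusive right edge
    ref_bottom = ref_top + block.height   # exclusive bottom edge
    return (ref_left >= 0 and ref_top >= 0
            and ref_right <= pic_width and ref_bottom <= pic_height)


if __name__ == "__main__":
    corner = Block(x=0, y=0, width=16, height=16)
    print(mv_stays_inside(corner, 4, 4, 64, 64))    # reference area inside
    print(mv_stays_inside(corner, -4, 0, 64, 64))   # crosses the left edge
```

An encoder following a rule of this kind would presumably discard candidate vectors for which the check fails, so that after stitching no block predicts from samples that belong to a neighbouring video.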
Compressed domain processing can be applied to many applications and use cases to allow for low-complexity video processing, saving battery life and/or implementation cost. However, the characteristics of each application pose individual problems for compressed domain video processing. Conversely, the characteristics and features of a video compression standard/scheme can be utilized to enable low-complexity compressed domain processing for new applications.
Problems that are not sufficiently addressed by the encoded domain stitching scheme of FIG. 23 occur, for example, if the way of composing the single video bitstream 906 out of the inbound video bitstreams 900a-d is to be subject to changes such as, for example, a rearrangement of inbound video bitstreams within the composed video bitstream 906, a spatial displacement of a certain inbound video bitstream within the picture area of the composed video bitstream 906, or the like. In all of these cases, the composition scheme of FIG. 23 does not work properly because temporal motion-compensated prediction ties the individual pictures of each of the inbound video bitstreams 900a to 900d to one another temporally. As a result, a rearrangement of an inbound video bitstream without a detour via the decoded/uncompressed domain is prohibited except at random access points of that inbound video bitstream, which are represented by intra pictures not using any temporal motion-compensated prediction, and such intra pictures lead to an undesirable momentary increase in bitrate and to bandwidth peaks. Thus, without any additional effort, the freedom to vary the composition of the output video bitstream 906 without leaving the compressed domain would be restricted to certain points in time, namely the random access points of an inbound video bitstream. A high frequency of such random access points within the inbound video bitstreams 900a-900d, however, involves a lower compression rate due to the lack of temporal predictors in intra-predicted pictures.
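The restriction described above can be sketched in a few lines of Python. The sketch assumes, for simplicity, that each picture of an inbound bitstream is labelled with a type string and that every picture labelled "I" is a usable random access point; the function name `next_switch_point` and the labels are hypothetical and serve illustration only.

```python
from typing import Optional, Sequence


def next_switch_point(picture_types: Sequence[str],
                      requested_index: int) -> Optional[int]:
    """Return the first picture index at or after `requested_index` at which
    the inbound bitstream could be rearranged in the compressed domain.

    Assumption: a picture labelled "I" uses no temporal prediction and
    acts as a random access point; every other label denotes a picture
    that depends on earlier pictures of the same bitstream.
    """
    for index in range(requested_index, len(picture_types)):
        if picture_types[index] == "I":
            return index
    return None  # no random access point left in the known pictures


if __name__ == "__main__":
    # Hypothetical stream with an intra picture every fourth picture.
    types = list("IPPPIPPP")
    # A change requested at picture 1 has to wait for the next "I" picture.
    print(next_switch_point(types, 1))
```

With a longer distance between intra pictures, the wait returned by such a function grows, whereas a shorter distance reduces the wait at the cost of compression efficiency, which is the trade-off the paragraph above points out.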