The present invention relates to a method for determining predictor blocks of a first high resolution layer image from blocks of a second low resolution layer image and to a spatially scalable video codec which is configured to transcode between the different resolution blocks.
As modern video broadcasting and teleconferencing systems rapidly expand to embrace all kinds of video-enabled appliances, ranging from low-cost mobile phones up to high-end HDTV telepresence terminals, the need for resolution-scalable video streaming arises. While high-performance HDTV video terminals are capable of real-time decoding and playback of a high resolution video stream, mobile devices are often limited in both display resolution and computing resources, which makes standard definition (SD) resolution optimal for such devices. One widely used solution to that problem is video transcoding. A dedicated transcoding server decodes the incoming high resolution video streams, rescales them to a lower resolution and then encodes the rescaled video sequences to produce the video streams sent to low performance clients. This kind of solution suffers severely from the high computational complexity of the transcoding process, especially when multiple video streams are processed. It requires expensive high-performance transcoding servers to be integrated into the broadcasting or teleconferencing system, thereby significantly increasing both system building and maintenance costs. An additional shortcoming of the transcoding solution is the accumulation of image quality degradation introduced by the lossy video coding algorithms. The lower resolution video stream is derived from the decoded high resolution stream (rather than from the original undistorted video sequence, which is not available on the transcoding server), which has already been distorted by lossy video coding artefacts, so the second encoding stage adds even more coding distortion.
A more elegant solution comes from scalable video codecs such as H.263+ and H.264/SVC. The encoding to several different resolutions is performed by the video encoder operating on the video streaming source device, so that the video stream sent to the video broadcasting or teleconferencing server already contains all required resolutions embedded as scalable layers. The server only needs to send to each video client the layer data most suitable for the client's performance and capabilities (or to broadcast the data of all layers so that each client extracts the most suitable layer itself). Since dispatching scalable layers is a much less computationally intensive task than transcoding multi-channel video streams, the costs of the broadcasting/teleconferencing server are reduced dramatically. Additionally, this solution provides a good means of protection against network packet loss. If the packets containing higher resolution layer data are lost during network transmission, the receiver can still decode and display the lower resolution image, thereby avoiding image freezing or corruption, which are common problems for many video codecs used in error-prone networks. Moreover, because the layers are of unequal importance (the base layer, for example, is more important than the enhancement layers), efficient unequal FEC techniques may be used to improve packet loss protection while keeping the FEC extra data overhead low.
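The layer dispatching performed by the server can be illustrated with a minimal sketch. The `Layer` type, the layer names and resolutions, and the `layers_to_send` helper are hypothetical and chosen for illustration only. The sketch assumes that a client needs the base layer plus every enhancement layer up to the largest resolution it can display, and it ignores bit rate and processing constraints.

```python
from typing import List, NamedTuple


class Layer(NamedTuple):
    """One spatial layer of a scalable stream (illustrative type)."""
    name: str
    width: int
    height: int


def layers_to_send(layers: List[Layer], max_width: int, max_height: int) -> List[Layer]:
    """Return the base layer plus every enhancement layer the client can display."""
    # Order layers from lowest to highest resolution.
    ordered = sorted(layers, key=lambda layer: layer.width * layer.height)
    selected = [ordered[0]]  # the base layer is always sent
    for layer in ordered[1:]:
        if layer.width > max_width or layer.height > max_height:
            break  # this layer and all higher ones exceed the client's display
        selected.append(layer)
    return selected


# Hypothetical three-layer stream.
stream = [
    Layer("base", 640, 480),
    Layer("enh1", 1280, 720),
    Layer("enh2", 1920, 1080),
]
mobile = layers_to_send(stream, 640, 480)
hdtv = layers_to_send(stream, 1920, 1080)
```

No decoding or re-encoding takes place in this selection step, which is why it is so much cheaper than transcoding.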
However, this solution still suffers from two serious drawbacks. First, it removes the overwhelming burden of transcoding from the server by distributing the multi-resolution encoding computations among the video source devices, which increases the utilization of computational and memory resources on those devices. Second, a significant deterioration of coding efficiency is observed, because several video sequences (representing the same moving picture at various resolutions) are encoded within a constrained bit rate budget which might otherwise be utilized more efficiently by the highest resolution video sequence alone.
To mitigate the aforementioned problems, modern scalable video codec standards introduce an inter-layer prediction coding mode. Each macro-block of a higher resolution layer frame can be predicted using the collocated macro-block of the up-scaled lower resolution layer frame rather than neighbouring macro-blocks of the same frame (as in intra-prediction) or the reference frame of the same layer (as in inter-prediction). Inter-layer prediction helps to alleviate both problems. First, it does not require a computationally intensive motion estimation procedure for the higher resolution layer, since the prediction macro-block is taken from the same position in the up-scaled lower resolution frame as the macro-block being encoded. Second, prediction of the higher resolution image from the lower resolution version of the same image helps to decrease the informational redundancy introduced by encoding several versions of the same image and thereby improves coding efficiency.
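The principle of inter-layer prediction can be sketched as follows. This is a simplified illustration, not the procedure of any particular standard: nearest-neighbour 2x up-scaling stands in for the real interpolation filter, frames are plain 2-D lists of sample values, and all function names are hypothetical.

```python
def upscale_2x(frame):
    """Nearest-neighbour 2x up-scaling (stand-in for a real up-sampling filter)."""
    out = []
    for row in frame:
        wide = [sample for sample in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def block(frame, top, left, size):
    """Extract the size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def inter_layer_residual(high_frame, low_recon, top, left, size):
    """Residual of a high-resolution block against the collocated up-scaled block."""
    # The predictor sits at the same position in the up-scaled frame,
    # so no motion search is needed.
    predictor = block(upscale_2x(low_recon), top, left, size)
    current = block(high_frame, top, left, size)
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(current, predictor)]


low_recon = [[10, 20],
             [30, 40]]
high_frame = [[10, 11, 20, 21],
              [12, 10, 19, 20],
              [30, 31, 40, 42],
              [29, 30, 41, 40]]
residual = inter_layer_residual(high_frame, low_recon, top=0, left=2, size=2)
```

Only the residual needs to be coded for the higher resolution layer. The closer the up-scaled predictor is to the block being encoded, the smaller that residual becomes.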
Still, inter-layer prediction fails to eliminate the coding efficiency deterioration problem completely. Even with inter-layer prediction, the coding efficiency of a multi-layer spatial scalability codec is up to 20% worse than that of a single-layer codec. Hence, improving the coding efficiency of multi-layer spatial scalability codecs is an important problem for making scalable codec based broadcasting/teleconferencing systems cost-effective.
Modern inter-layer prediction algorithms for spatial scalability video codecs should satisfy the following requirements:
The algorithm should minimize the prediction residual signal in order to provide better coding efficiency improvement.
The algorithm should minimize the computational complexity and memory requirements of the optimal prediction parameters search (if any).
The algorithm should provide means for a flexible quality-performance trade-off if a computationally intensive optimal prediction parameters search is involved.
The algorithm should lead to little increase in decoding complexity.
The algorithm should allow easy and seamless integration into the existing scalable video codec architecture and infrastructure.
The older spatial scalability enabled codec H.263+ (Annex O) as described by ITU-T Recommendation H.263: “Video coding for low bit rate communication” on pp. 102-114 and as depicted in FIG. 10 performs inter-layer prediction only by up-sampling the reconstructed samples of the lower resolution layer signal. The low resolution image 1204 is obtained by downscaling 1201 the high resolution image 1202. The low resolution image 1204 is encoded and reconstructed 1203 for obtaining the SVC base layer image 1206 in low resolution which is up-scaled 1205 to the up-scaled SVC base layer image 1208 in high resolution. The inter-layer spatial prediction 1207 is applied to that up-scaled SVC base layer image 1208 to reconstruct the original SVC spatial layer image 1202 in high resolution.
The current state-of-the-art scalable video codec standard H.264/SVC additionally specifies two further modes: an inter-layer motion prediction mode, where the motion parameters of the higher resolution layer (reference indexes, partitioning data, motion vectors) are predicted or derived using the up-scaled motion parameters of the lower resolution layer, and an inter-layer residual prediction mode, where the residual signal of the higher resolution layer is predicted using the up-scaled residual signal of the lower resolution layer, as described by T. Wiegand, G. Sullivan, J. Reichel, H. Schwarz, M. Wien, “Joint Draft ITU-T Rec. H.264|ISO/IEC 14496-10/Amd.3 Scalable video coding”, pp. 380-562.
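The idea behind inter-layer motion prediction can be illustrated with a minimal sketch. It is not the normative H.264/SVC derivation. It assumes an integer resolution ratio between the layers and motion vectors expressed in the same sample units in both layers, and the function names are hypothetical.

```python
def predict_motion_vector(base_mv, scale_x=2, scale_y=2):
    """Scale a lower-layer motion vector (dx, dy) to the higher-layer sample grid."""
    dx, dy = base_mv
    return (dx * scale_x, dy * scale_y)


def motion_vector_difference(enh_mv, base_mv, scale_x=2, scale_y=2):
    """Difference between the actual higher-layer vector and its inter-layer prediction."""
    pred_x, pred_y = predict_motion_vector(base_mv, scale_x, scale_y)
    return (enh_mv[0] - pred_x, enh_mv[1] - pred_y)


base_mv = (3, -2)   # motion vector found in the lower resolution layer
enh_mv = (7, -4)    # motion vector of the collocated higher resolution block
predicted = predict_motion_vector(base_mv)
difference = motion_vector_difference(enh_mv, base_mv)
```

When the prediction is good, the difference is small and cheap to code, or the predicted vector can be used directly without a separate motion search in the higher resolution layer.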
However, those inter-layer prediction modes predict the higher resolution signal from the up-scaled lower resolution signal inefficiently for regions containing distinct edges. Those edges become smeared after being downscaled in the lower resolution layer encoder, which uses the downscaled original higher resolution image as input data, and subsequently up-scaled back in the higher resolution layer encoder, which uses the up-scaled reconstructed lower resolution layer data as a predictor. Therefore, the inter-layer prediction generates a high energy residual signal for such regions, which deteriorates the coding efficiency either by degrading the quality in order to fit into the constrained bit rate or by increasing the bit rate if the quality is retained.
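The smearing effect can be reproduced on a one-dimensional signal. The sketch below is an illustration under simplifying assumptions: a two-sample averaging filter for downscaling, linear interpolation for up-scaling, and no quantization of the lower resolution layer. These filters are stand-ins and not those of H.263+ or H.264/SVC, and the function names are hypothetical.

```python
def downscale_2x(signal):
    """Average each pair of neighbouring samples (simple box filter)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def upscale_2x_linear(signal):
    """Linear interpolation to twice the length; the last sample is repeated."""
    out = []
    for i, sample in enumerate(signal):
        nxt = signal[i + 1] if i + 1 < len(signal) else sample
        out.append(sample)
        out.append((sample + nxt) / 2)
    return out


def residual_energy(original):
    """Sum of squared differences between a signal and its inter-layer predictor."""
    predictor = upscale_2x_linear(downscale_2x(original))
    return sum((o - p) ** 2 for o, p in zip(original, predictor))


flat = [100] * 8
ramp = [0, 10, 20, 30, 40, 50, 60, 70]        # smooth gradient
edge = [0, 0, 0, 100, 100, 100, 100, 100]     # distinct edge

energies = {
    "flat": residual_energy(flat),
    "ramp": residual_energy(ramp),
    "edge": residual_energy(edge),
}
```

With these inputs the edge is reconstructed as the sequence 0, 25, 50, 75, 100, that is, a ramp in place of a step. Its residual energy (3750) is far larger than that of the smooth ramp (200) or the flat region (0).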
Several efforts, described in the following, have been made to improve scalable coding. Selective usage of different up-sample filters adaptive to local image properties improves the inter-layer prediction efficiency, as described by C. A. Segall and S.-M. Lei: “Method and apparatus for adaptive up-scaling for spatially scalable coding”, U.S. Pat. No. 7,876,833. This method augments the current six-tap up-scaling filter specified by the H.264/SVC standard with a few computationally simpler filters which can improve prediction for smooth image areas. However, it does not improve edge crispness and therefore does not provide any coding efficiency improvement for areas containing distinct edges.
Adaptive smoothing of less important background regions of the image decreases the bit budget consumed by such regions, thereby saving bits for more important regions of interest, as described by D. Grois and O. Hadar, “Complexity-aware adaptive spatial pre-processing for ROI scalable video coding with dynamic transition region” in 18th IEEE International Conference on Image Processing, 2011. That method removes image details deemed unnecessary or unimportant, which is not always desirable, especially when image crispness should be retained. In addition, the method requires a pre-processing stage to recognize the ROI (region of interest) in the image, which usually involves complicated computer vision technologies and thereby significantly increases the computational complexity of the entire system.
Smoothing of the prediction signal in inter-layer residual prediction mode compensates for the restrictions imposed by the single-loop decoding approach and achieves better inter-layer prediction for that particular approach as described by W.-J. Han, “Smoothed reference prediction for single-loop decoding” in Joint Video Team 16th Meeting: Poznań, PL, 24-29 Jul. 2005, Document: JVT-P085. This method is only meaningful for the specific case of single-loop decoding approach and is of no practical use in the more general case.
Joint resolution enhancement and artifact reduction for MPEG-2 encoded video is applied to the decoded image for displaying a lower resolution image on a high-definition monitor, as described by Y. Yang and L. Boroczky, “Joint resolution enhancement and artifact reduction for MPEG-2 encoded digital video” in Proceedings of International Conference on Multimedia and Expo, 2003. In this method, sharpness enhancement is applied to the decoded image at the decoder, where the original image being predicted is not available. Therefore, such an approach cannot choose optimal sharpness enhancement parameters to achieve the best prediction efficiency and provides no improvement in coding efficiency at the encoder side.
Pre-processing and post-processing techniques including sharpness enhancement are applied to PEF (predictive error frame, i.e. residual frame produced by motion compensation) to improve the PEF coding efficiency in rate-scalable wavelet-based video codecs as described by E. Asbun, P. Salama and E. Delp, “Preprocessing and Postprocessing Techniques for Encoding Predictive Error Frames in Rate Scalable Video Codecs” in Proceedings of the 1999 International Workshop on Very Low Bitrate Video Coding. In this method, the sharpness enhancement is applied to the decoded PEF in the decoder rather than the prediction frame in the encoder, so it cannot be used to improve the efficiency of inter-layer prediction.