High-resolution displays are becoming increasingly popular. Televisions with 8K resolution (7680×4320 pixels) are already available for sale, and even a cellphone can have 4K Ultra-HD (high definition) resolution (3840×2160 pixels). Unfortunately, the availability of high-resolution media content has not kept pace with the increase in display resolution. For example, one popular video streaming service currently offers only 26 titles in 4K Ultra-HD resolution. In addition, transmitting 4K Ultra-HD video requires high communication bandwidth. Given the abundance of existing lower-resolution videos, as well as limited communication bandwidth, it would be desirable to up-sample these videos to a higher resolution at the display.
The default up-sampling on televisions is typically simple interpolation and filtering with added sharpening. Due to the simplicity of these methods, the visual quality of the output is generally not satisfactory. Super resolution (SR) can provide higher visual quality either by exploiting the non-local similarity of patches or by learning, from external datasets, a mapping from pixels of low-resolution videos to pixels of high-resolution videos. However, SR algorithms are computationally more expensive and slower than simple interpolation/filtering. For instance, state-of-the-art neural network based SR algorithms require powerful graphics processing units (GPUs), such as the NVIDIA Grid K2 8GB graphics card, which consumes around 225W, to achieve real-time performance. The speed and power consumption of these algorithms therefore limit their applicability to televisions and mobile screens.
There are two main forms of super-resolution algorithms: single frame and multiple frame. Typically, televisions use simple single-frame based up-samplers, including bicubic, sinc, Lanczos, Catmull-Rom, and Mitchell-Netravali. These up-samplers are generally based on simple splines, enabling real-time throughput. However, since these methods are not content adaptive, they may introduce unwanted video artifacts.
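As a rough illustration of why such up-samplers are fast but not content adaptive, the following Python sketch up-samples a one-dimensional signal with the Catmull-Rom kernel. It is a minimal sketch under stated assumptions, not the implementation used in any particular television: the function names, the clamped border handling, and the sample alignment (output sample j sits at input position j / factor) are all illustrative choices. The same four weights are applied regardless of what the signal contains.

```python
import math


def catmull_rom_weight(t: float) -> float:
    """Cubic convolution kernel with a = -0.5, i.e. the Catmull-Rom spline."""
    a = -0.5
    t = abs(t)
    if t <= 1.0:
        return (a + 2.0) * t**3 - (a + 3.0) * t**2 + 1.0
    if t < 2.0:
        return a * t**3 - 5.0 * a * t**2 + 8.0 * a * t - 4.0 * a
    return 0.0


def upsample_1d(samples, factor):
    """Up-sample a 1D signal by an integer factor using a fixed 4-tap kernel.

    Assumption: output sample j is located at input position j / factor,
    and out-of-range taps are clamped to the nearest border sample.
    """
    n = len(samples)
    out = []
    for j in range(n * factor):
        x = j / factor                  # output position in input coordinates
        base = math.floor(x)
        acc = 0.0
        for k in range(base - 1, base + 3):   # the four nearest input samples
            idx = min(max(k, 0), n - 1)       # clamp at the borders
            acc += samples[idx] * catmull_rom_weight(x - k)
        out.append(acc)
    return out


if __name__ == "__main__":
    print(upsample_1d([1.0, 2.0, 3.0, 4.0], 2))
```

A two-dimensional image can be handled by applying the same routine along rows and then along columns. Because the weights depend only on the fractional position and never on the pixel values, edges and textures are treated identically, which is one source of the artifacts mentioned above.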
More sophisticated super-resolution algorithms typically leverage machine learning techniques. Among them are sparse representation, kernel ridge regression (KRR), anchored neighbor regression (ANR), and in-place example regression. More recently, deep neural networks have been used to perform super-resolution (e.g., SRCNN). Such methods apply several layers of convolution and non-linear functions to map the low-resolution image to a higher resolution. They achieve state-of-the-art results, but at high computational cost. As an example, SRCNN uses filters of size 9×9×64, 64×32, and 32×5×5, which amounts to 8032 multiplications per pixel. Hence, it is significantly more complex than simple interpolation with one filter.
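The per-pixel figure can be checked with a few lines of Python. The sketch below assumes one reading of the filter sizes quoted above: a 9×9 layer from 1 to 64 channels, a 1×1 layer from 64 to 32 channels, and a 5×5 layer from 32 channels to 1, applied to a single (for example, luminance) channel. The layer tuples, the function names, and the 16-tap bicubic comparison are illustrative assumptions rather than figures taken from the text.

```python
# Each tuple is (kernel_height, kernel_width, in_channels, out_channels).
# Assumed reading of "9x9x64, 64x32, 32x5x5" for a single input channel.
SRCNN_LAYERS = [
    (9, 9, 1, 64),
    (1, 1, 64, 32),
    (5, 5, 32, 1),
]


def mults_per_pixel(layers):
    """Multiplications needed to produce one output pixel."""
    return sum(kh * kw * cin * cout for kh, kw, cin, cout in layers)


def mults_per_second(layers, width, height, fps):
    """Multiplications per second at a given output size and frame rate."""
    return mults_per_pixel(layers) * width * height * fps


if __name__ == "__main__":
    srcnn = mults_per_pixel(SRCNN_LAYERS)
    # Assumed baseline: one non-separable 4x4 bicubic kernel on one channel.
    bicubic = mults_per_pixel([(4, 4, 1, 1)])
    print(srcnn, bicubic, srcnn / bicubic)
```

Under these assumptions the three layers contribute 5184, 2048, and 800 multiplications, for a total of 8032 per pixel, against 16 for the assumed bicubic baseline. The `mults_per_second` helper scales linearly with pixel count, so moving from 1920×1080 to 3840×2160 output quadruples the workload at a fixed frame rate.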
Even if consumer devices can be accelerated with high-powered GPUs (e.g., the K2, which consumes 225W) to achieve real-time performance, these GPUs consume far too much power to be embedded in televisions and portable devices such as phones and tablets. Moreover, even with substantial computational resources, these super-resolution algorithms can only achieve real-time throughput on high-definition (HD) videos (1920×1080), and not on videos of 4K resolution and beyond. Complementary to learning approaches, there are algorithms that exploit the self-similarities of blocks within each image. However, they are much slower than SRCNN.
Previous multiple-frame based super-resolution algorithms have been largely based on the registration of neighboring frames. Many of these algorithms are iterative, including the Bayesian based approach and an L1-regularized total variation based approach. At the same time, there are non-iterative methods that avoid registration by using non-local means and 3D steering kernel regression. Deep neural networks can also be used, in the form of bidirectional recurrent convolutional networks and deep draft-ensemble learning. Unfortunately, these multiple-frame algorithms are generally too slow for real-time applications and are generally run offline. Other video coding techniques, for example, motion compensation and group-of-pictures (GOP) structure, are similarly problematic. Therefore, there is a need in the industry to address one or more of the above-mentioned shortcomings.