Field
Example embodiments relate to methods for enhancing lower-quality visual data using a trained hierarchical algorithm, and creating then outputting higher-quality visual data.
Background
Increase in Quality of Video and Display Technology
Developments in display technology have led to significant improvements in the resolution able to be displayed on display hardware, such as on televisions, on computer monitors and using video projectors. For example, television screens that are able to display “High Definition” or “HD” resolution content (typically having a resolution of 1920×1080 pixels) have been broadly adopted by consumers. More recently, television screens able to display Ultra High Definition or “Ultra HD” resolution content (typically having a resolution over 3840×2160 pixels) are starting to become more widespread.
In contrast, HD resolution video content is only now becoming commonplace and most legacy content is only available at either Digital Versatile Disc Video (or “DVD-Video”) resolution (typically having a resolution of 720×576 pixels or 720×480 pixels) or Standard Definition or “SD” resolution (where the video content only has a resolution of 640×480 pixels). Some broadcast channels are limited to SD resolutions. Video-streaming services can be restricted to operating at DVD-Video or SD resolutions, to reduce transmission problems where consumers have limitations on available transmission bandwidth or because of a lack of legacy content at higher resolutions.
As a result, there can be a lack of sufficiently high-resolution video content for display on HD and Ultra HD television screens, for both current video content as well as for legacy video content and video streaming services. Also, over time mobile devices such as mobile phones and tablet computers with increasingly larger and higher-resolution screens are being produced and adopted by users. Further, current video content, being output at HD resolutions, is already at a significantly lower resolution than can be displayed by the latest consumer displays operating at, for example, Ultra HD resolutions. To provide sufficiently immersive virtual reality (or “VR”) experiences, display technology needs to be sufficiently high resolution even for smaller screen sizes.
The user experience of having to display content that has significantly lower resolution than the user's default screen/display resolution is not optimal.
Growth in Data Transmission and Network Limitations
The amount of visual data being communicated over data networks such as the Internet has grown dramatically over time and there is increasing consumer demand for high-resolution, high quality, high fidelity visual data content, such as video streaming including, for example, video at HD and Ultra HD resolution. As a result, there are substantial challenges in meeting this growing consumer demand and high performance video compression is required to enable efficient use of existing network infrastructure and capacity.
Video data already makes up a significant fraction of all data traffic communicated over the Internet, and mobile video (i.e. video transmitted to and from mobile devices over wireless data networks such as UMTS/CDMA) is predicted to increase 13-fold between 2014 and 2019, accounting for 72 percent of total mobile data traffic by the end of that forecast period. As a result, there are substantial challenges in meeting this growing consumer demand and more efficient visual data transmission is required to enable efficient use of existing network infrastructure and capacity.
To stream video to consumers using available streaming data bandwidth, media content providers can down-sample or transcode the video content for transmission over a network at one or a variety of bitrates. The resolution of the video can then be appropriate for the bitrate available over each connection or to each device, and correspondingly the amount of data transferred over the network can be better matched to the available reliable data rates. For example, a significant proportion of current consumer Internet connections are not able to reliably support continuous streaming of video at an Ultra HD resolution, so video needs to be streamed at a lower quality or lower resolution to avoid buffering delays.
Further, where a consumer wishes to broadcast or transmit video content, the uplink speeds of consumer Internet connections are typically a fraction of the download speeds and thus only lower quality or lower resolution video can typically be transmitted. In addition, the data transfer speeds of typical consumer wireless networks are another potential bottleneck when streaming video data for video at resolutions higher than HD resolutions or virtual reality data and content to/from contemporary virtual reality devices. A problem with reducing the resolution of a video when transmitting it over a network is that the reduced resolution video may not be at the desired playback resolution, but in some cases there is either not sufficient bandwidth or the bandwidth available is not reliable during peak times for transmission of a video at a high resolution.
Alternatively, even without reducing the original video resolution, the original video may have a lower resolution than desired for playback and so may appear at a suboptimal quality when displayed on higher-resolution screens.
Video Compression Techniques
Existing commonly used video compression techniques, such as H.264 and VP8, as well as proposed techniques, such as H.265 (also known as HEVC) and VP9, all generally use similar approaches and families of compression techniques. These compression techniques make a trade-off between the quality and the bit-rate of video data streams when providing inter-frame and intra-frame compression, but the amount of compression possible is largely dependent on the image resolution of each frame and the complexity of the image sequences.
To illustrate the relationship between bitrate and resolution among other factors, it is possible to use an empirically-derived formula to show how the bitrate of a video encoded with, for example, the H.264 compression technique relates to the resolution of that video:
bitrate ∝ Q × w × h × f × m
where Q is the quality constant, w is the width of a video, h is the height of a video, f is the frame-rate of a video and m is the motion rank, where m∈{1, . . . , 4} and a higher m is used for fast-changing hard-to-predict content.
The above formula illustrates the direct relationship between the bitrate and the quality constant Q. A typical value, for example, that could be selected for Q would be 0.07 based on published empirical data, but a significant amount of research is directed to optimising a value for Q.
The above formula also illustrates the direct relationship between the bitrate and the complexity of the image sequences, i.e. variable m. The aforementioned existing video codecs focus on spatial and temporal compression techniques. The newer proposed video compression techniques, such as H.265 (also known as HEVC) and VP9, seek to improve upon the motion prediction and intra-frame compression of previous techniques, i.e. optimising a value for m.
The above formula further illustrates a direct relationship between the bitrate and the resolution of the video, i.e. variables w and h. In order to reduce the resolution of video, several techniques exist to downscale the resolution of video data to reduce the bitrate.
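By way of illustration only, the proportionality above can be evaluated numerically. The following Python sketch assumes a proportionality constant of 1 and treats the result as bits per second; both are assumptions made for the example, as is the function name estimate_bitrate. It uses the example value Q = 0.07 mentioned above.

```python
def estimate_bitrate(width, height, fps, motion_rank, quality=0.07):
    """Evaluate Q * w * h * f * m, assuming a proportionality constant of 1."""
    if motion_rank not in (1, 2, 3, 4):
        raise ValueError("motion rank m must be in {1, ..., 4}")
    return quality * width * height * fps * motion_rank


# Same frame rate and motion rank, with width and height both halved.
hd = estimate_bitrate(1920, 1080, fps=30, motion_rank=2)
reduced = estimate_bitrate(960, 540, fps=30, motion_rank=2)

print(f"1920x1080: {hd / 1e6:.2f} Mbit/s")
print(f"960x540:   {reduced / 1e6:.2f} Mbit/s")
print(f"ratio:     {hd / reduced:.1f}")
```

Under these assumptions, halving both the width and the height reduces the estimated bitrate by a factor of four, which is why downscaling is an attractive way to reduce bitrate.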
As a result of the disadvantages of current compression approaches, existing network infrastructure and video streaming mechanisms are becoming increasingly inadequate to deliver large volumes of high quality video content to meet ever-growing consumer demands for this type of content. This can be of particular relevance in some circumstances, for example in relation to live broadcasts, where bandwidth is often limited, and extensive processing and video compression cannot take place at the location of the live broadcast without a significant delay due to inadequate computing resources being available at the location.
Video Upscaling Techniques
To reproduce a video at a higher resolution than that at which it has been transmitted (e.g. by a streaming service or broadcaster) or provided (e.g. on DVD or via a video download provider), various “upscaling” techniques exist to increase the resolution of video data/signals, which enhance image quality when starting from a lower resolution image or video and which produce an image or video of a higher resolution.
Referring to FIG. 14, a conventional upscaling technique 1400 will now be described.
Received video data 1410 is provided into a decoder system and is, for example, a lower-resolution video encoded in a standard video format, such as an SD resolution video. This video format can be any of a variety of known video codecs, such as H.264 or VP8, but can be any video data that the system is able to decode into component frames of video.
The system then separates a first section of the video data 1410 into single frames at step 1420, i.e. into a sequence of images at the full SD resolution of the video data 1410. For some video codecs, this will involve “uncompressing” or restoring the video data as, for example, common video compression techniques remove redundant (non-changing) features from sequential frames.
An upscaling technique 1430 is then used on one or more of the frames or sections of frames, to increase the resolution of the areas upon which it is used. The higher resolution frames are then optionally processed at step 1440 into a format suitable for output as a video. The video, being composed of higher resolution frames, will be in the form of a higher resolution video 1450 than the original video file.
For example, a basic upscaling technique that makes little attempt to enhance the quality of the video is known as nearest-neighbour interpolation. This technique simply increases the resolution of received video data by representing an original pixel of the transmitted video as multiple pixels or a “block” of pixels. The resulting effect is that the video appears pixelated and in blocks.
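A minimal sketch of nearest-neighbour interpolation is given below, assuming a greyscale frame stored as a list of rows and an integer scale factor. The function name upscale_nearest is hypothetical and chosen for this example only.

```python
def upscale_nearest(image, factor):
    """Repeat every pixel `factor` times horizontally and vertically."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    upscaled = []
    for row in image:
        # Each original pixel becomes `factor` identical pixels in the row.
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # The widened row is then repeated `factor` times to form a block.
        upscaled.extend(list(wide_row) for _ in range(factor))
    return upscaled


low_res = [[10, 20],
           [30, 40]]
high_res = upscale_nearest(low_res, 2)
for row in high_res:
    print(row)
```

Each original pixel becomes a 2×2 block of identical values, which is the source of the pixelated, blocky appearance described above.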
Other less basic upscaling techniques use the existing video data to estimate unknown intermediate pixels between known pixel values in order to increase the resolution with a less noticeable loss in quality. These techniques are generally known by the term interpolation, and typically either take into account a weighted average of known pixels in the vicinity of each unknown intermediate pixel, or fit a curve or line to surrounding values and interpolate to the mid-point along the curve or line (e.g. bicubic or bilinear interpolation). Typically, such upscaling techniques determine values for the additional pixels required to create a higher resolution image by averaging neighbouring pixels, which creates a blurring effect or other visual artefacts such as “ringing” artefacts. Most upscaling techniques use interpolation-based techniques to produce higher-resolution versions of received video data. Various methods of interpolation are possible and well documented in the prior art in relation to video or image enhancement.
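Bilinear interpolation, one of the weighted-average techniques mentioned above, can be sketched as follows. The sketch assumes a greyscale frame stored as a list of rows and maps output coordinates onto the input grid so that the corner pixels coincide; that mapping is one of several conventions and is chosen here only for simplicity. The function name upscale_bilinear is hypothetical.

```python
def upscale_bilinear(image, out_height, out_width):
    """Resize a 2-D list of pixel values using bilinear interpolation."""
    in_height, in_width = len(image), len(image[0])
    result = []
    for y in range(out_height):
        # Position of this output row on the input grid (corners aligned).
        src_y = y * (in_height - 1) / (out_height - 1) if out_height > 1 else 0.0
        y0 = int(src_y)
        y1 = min(y0 + 1, in_height - 1)
        fy = src_y - y0
        row = []
        for x in range(out_width):
            src_x = x * (in_width - 1) / (out_width - 1) if out_width > 1 else 0.0
            x0 = int(src_x)
            x1 = min(x0 + 1, in_width - 1)
            fx = src_x - x0
            # Weighted average of the four known pixels around the unknown one.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        result.append(row)
    return result


low_res = [[0, 10],
           [20, 30]]
for row in upscale_bilinear(low_res, 3, 3):
    print(row)
```

The centre pixel of the 3×3 result is the plain average of the four input pixels, which illustrates how averaging neighbouring values softens sharp transitions.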
There are many problems with conventional upscaling techniques. Upscaling techniques that reduce jagged edges tend to introduce more blur to an up-scaled video, for example, while upscaling techniques that reduce “halos” or “ringing” artefacts tend to make an up-scaled video less sharp. Further, conventional upscaling techniques are not content-aware or adaptive. Fundamentally, conventional upscaling techniques are limited by the Nyquist-Shannon sampling theorem.
As a result of the disadvantages of current upscaling techniques, the quality of video data that has been “up-scaled” to a higher resolution than that at which it is stored or transmitted can be inadequate or non-optimal for its intended function.
Super Resolution Techniques for Enhancing Images
Super resolution techniques are techniques that can be described as recovering new high-resolution information that is not explicitly present in low-resolution images.
Super resolution techniques have been developed for many different applications, such as for satellite and for aerial imaging and medical image analysis for example. These applications start with low-resolution images where the higher-resolution image is not available or is possibly unknowable, and by using super resolution techniques it is possible to make substantial enhancements to the resolution of such low-resolution images.
Super resolution techniques allow for the creation of one or more high-resolution images, typically from one or more low-resolution images. Typically, super resolution is applied to a set or series of low-resolution images of the same scene and the technique attempts to reconstruct a higher-resolution image of the same scene from these images.
Super resolution techniques fall predominantly into one of two main fields: optical super resolution techniques and geometrical super resolution techniques. Optical super resolution techniques allow an image to exceed the diffraction limit originally placed on it, while geometrical super resolution techniques increase the resolution from digital imaging sensors. In the field of image resolution enhancement, geometrical super resolution seems to be the predominant technique.
Further, super resolution approaches are usually split into learning- or example-based approaches and interpolation-based (multi-frame) approaches. Example-based super resolution techniques are generally accepted to be a superior technique to enhance image quality.
One specific super resolution technique is termed multi-exposure image noise reduction. This technique takes the average of many exposures in order to remove unwanted noise from an image and increase the resolution.
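The averaging step of this technique can be sketched as follows, assuming perfectly aligned exposures of a static scene and zero-mean Gaussian noise. The scene, the noise level and the function names are illustrative assumptions only.

```python
import random


def average_exposures(exposures):
    """Average corresponding pixels across several aligned exposures."""
    count = len(exposures)
    return [sum(values) / count for values in zip(*exposures)]


def rms_error(signal, reference):
    """Root-mean-square difference between two equal-length pixel lists."""
    squared = sum((s - r) ** 2 for s, r in zip(signal, reference))
    return (squared / len(reference)) ** 0.5


rng = random.Random(0)  # fixed seed so that the example is repeatable
scene = [float(i % 50) for i in range(200)]  # hypothetical noise-free scene
exposures = [[p + rng.gauss(0, 10) for p in scene] for _ in range(64)]
averaged = average_exposures(exposures)

print("single exposure RMS error:", rms_error(exposures[0], scene))
print("averaged RMS error:       ", rms_error(averaged, scene))
```

For independent zero-mean noise, averaging N exposures is expected to reduce the noise standard deviation by a factor of roughly the square root of N.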
Another super resolution technique employed is sub-pixel image localisation, which involves calculating the ‘centre of gravity’ of the distribution of light over several adjacent pixels and correcting blurring accordingly. However, this technique relies on the assumption that all light in the image came from the same source, which is not always a correct assumption.
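The 'centre of gravity' calculation can be illustrated in one dimension as an intensity-weighted mean of pixel positions. The sketch below assumes a single light source spread over adjacent pixels, as the technique itself assumes; the function name centre_of_gravity is hypothetical.

```python
def centre_of_gravity(intensities):
    """Return the intensity-weighted mean position along a row of pixels."""
    total = sum(intensities)
    if total == 0:
        raise ValueError("no light recorded in these pixels")
    return sum(index * value for index, value in enumerate(intensities)) / total


# A symmetric spot is centred exactly on pixel 2.
print(centre_of_gravity([0, 1, 3, 1, 0]))
# An asymmetric spot is localised between pixels 2 and 3.
print(centre_of_gravity([0, 1, 3, 2, 0]))
```

The second result falls between two pixel positions, which is how the technique localises a feature more finely than the pixel grid.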
Machine Learning Techniques
Machine learning is the field of study in which a computer or computers learn to perform classes of tasks using feedback generated from the experience or data that the machine learning process gathers while performing those tasks.
Typically, machine learning can be broadly classed as supervised and unsupervised approaches, although there are particular approaches such as reinforcement learning and semi-supervised learning which have special rules, techniques and/or approaches.
Supervised machine learning is concerned with a computer learning one or more rules or functions to map between example inputs and desired outputs as predetermined by an operator or programmer, usually where a data set containing the inputs is labelled.
Unsupervised learning is concerned with determining a structure for input data, for example when performing pattern recognition, and typically uses unlabelled data sets.
Reinforcement learning is concerned with enabling a computer or computers to interact with a dynamic environment, for example when playing a game or driving a vehicle.
Various hybrids of these categories are possible, such as “semi-supervised” machine learning where a training data set has only been partially labelled.
For unsupervised machine learning, there is a range of possible applications such as, for example, the application of computer vision techniques to image processing or video enhancement. Unsupervised machine learning is typically applied to solve problems where an unknown data structure might be present in the data. As the data is unlabelled, the machine learning process is required to operate to identify implicit relationships between the data for example by deriving a clustering metric based on internally derived information. For example, an unsupervised learning technique can be used to reduce the dimensionality of a data set and attempt to identify and model relationships between clusters in the data set, and can for example generate measures of cluster membership or identify hubs or nodes in or between clusters (for example using a technique referred to as weighted correlation network analysis, which can be applied to high-dimensional data sets, or using k-means clustering to cluster data by a measure of the Euclidean distance between each datum).
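As an illustration of the k-means clustering mentioned above, the following sketch groups unlabelled two-dimensional points by Euclidean distance. The data points, the initial centroids and the fixed iteration count are assumptions chosen for the example.

```python
import math


def k_means(points, centroids, iterations=10):
    """Cluster points around centroids by repeated assign-and-update steps."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No labels are supplied; the grouping emerges only from the distances between the data points.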
Semi-supervised learning is typically applied to solve problems where there is a partially labelled data set, for example where only a subset of the data is labelled. Semi-supervised machine learning makes use of externally provided labels and objective functions as well as any implicit data relationships.
When initially configuring a machine learning system, particularly when using a supervised machine learning approach, the machine learning algorithm can be provided with some training data or a set of training examples, in which each example is typically a pair of an input signal/vector and a desired output value, label (or classification) or signal. The machine learning algorithm analyses the training data and produces a generalised function that can be used with unseen data sets to produce desired output values or signals for the unseen input vectors/signals. The user needs to decide what type of data is to be used as the training data, and to prepare a representative real-world set of data. The user must however take care to ensure that the training data contains enough information to accurately predict desired output values without providing too many features (which can result in too many dimensions being considered by the machine learning process during training, and could also mean that the machine learning process does not converge to good solutions for all or specific examples). The user must also determine the desired structure of the learned or generalised function, for example whether to use support vector machines or decision trees.
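The supervised workflow described above can be sketched with a deliberately simple learned function: a straight line fitted by least squares to pairs of inputs and desired outputs, then applied to an unseen input. The straight-line structure and the training pairs are assumptions for illustration; a practical system might use a different structure such as those named above.

```python
def fit_line(inputs, outputs):
    """Learn slope and intercept from training pairs by least squares."""
    count = len(inputs)
    mean_x = sum(inputs) / count
    mean_y = sum(outputs) / count
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(inputs, outputs))
    variance = sum((x - mean_x) ** 2 for x in inputs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training examples: each input is paired with a desired output.
training_inputs = [0, 1, 2, 3]
training_outputs = [1, 3, 5, 7]
slope, intercept = fit_line(training_inputs, training_outputs)

# The generalised function is then applied to an input not seen in training.
unseen_input = 10
print(slope * unseen_input + intercept)
```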
Unsupervised or semi-supervised machine learning approaches are sometimes used when labelled data is not readily available, or where the system generates new labelled data from unknown data given some initial seed labels.
Current training approaches for most machine learning algorithms can take significant periods of time, which delays the utility of machine learning approaches and also prevents the use of machine learning techniques in a wider field of potential application.
Machine Learning & Image Super Resolution
To improve the effectiveness of some super resolution techniques, it is possible to incorporate machine learning, otherwise termed a “learned approach”, into the image super resolution techniques described above.
For example, one machine learning approach that can be used for image enhancement, using dictionary representations for images, is a technique generally referred to as dictionary learning. This approach has shown effectiveness in low-level vision tasks like image restoration.
When using dictionary learning, the representation of a signal is given as a linear combination of functions drawn from a collection of atoms referred to as a dictionary. For example, a given signal y can be represented as:
y = α₁x₁ + α₂x₂ + . . . + αₙxₙ
where x₁, . . . , xₙ are the atoms of a dictionary of size n and α₁, . . . , αₙ are coefficients such that ∥α∥₀<λ, where λ is the sparsity constraint, for example where λ=3 no more than three coefficients can be non-zero. The atoms have the same dimensionality as the signal y so, while it is possible to have an atom xᵢ that is identical to y, a dictionary of simple atoms can usually be used to reconstruct a wide range of different signals.
In theory, at least k orthogonal atoms are required to fully reconstruct signals in k-dimensional space. In practice, however, improved results are achieved through using an over-complete dictionary where there are n>k atoms and these atoms do not have to be orthogonal to one another.
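A sparse representation over an over-complete dictionary can be sketched as follows. The sketch uses matching pursuit, a well-known greedy method that is not named in the text above and is substituted here purely for illustration. It assumes unit-norm atoms, and the parameter max_atoms plays the role of the sparsity constraint by capping the number of non-zero coefficients.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def matching_pursuit(signal, atoms, max_atoms):
    """Greedily approximate `signal` with at most `max_atoms` unit-norm atoms."""
    coefficients = [0.0] * len(atoms)
    residual = list(signal)
    for _ in range(max_atoms):
        scores = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        if abs(scores[best]) < 1e-12:
            break  # nothing left to explain
        coefficients[best] += scores[best]
        residual = [r - scores[best] * a
                    for r, a in zip(residual, atoms[best])]
    return coefficients, residual


# Over-complete dictionary: n = 3 atoms in a k = 2 dimensional space.
s = 2 ** -0.5
atoms = [(1.0, 0.0), (0.0, 1.0), (s, s)]
coefficients, residual = matching_pursuit((3.0, 3.0), atoms, max_atoms=1)
print(coefficients)
print(residual)
```

With only the two orthogonal atoms, this signal would need two non-zero coefficients; the additional diagonal atom allows it to be represented with one, which is the kind of improvement that an over-complete dictionary can offer.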
A complete dictionary means that the number of dictionary atoms is the same as the dimensionality of the image patches and that the dictionary atoms are linearly independent (for example, all orthogonal to each other) and so can represent the entire, or complete, dimensional space. Thus, where 16×16 atoms represent 16×16 image patches, the dictionary is complete if it has 16×16=256 atoms. If more atoms than this are present in the dictionary, then the dictionary becomes over-complete.
An example of an over-complete dictionary is shown in FIG. 1, where a 16×16 pixel patch is represented by a linear combination of 16×16 dictionary atoms 5 that is drawn from the collection of atoms that is the dictionary 1. It is noted that the atoms are not selected locally within the dictionary, but instead are chosen as the linear combination that best approximates the signal patch for a maximum number of atoms allowed and irrespective of their location within the dictionary. Without a constraint that the atoms must be orthogonal to one another, larger dictionaries than the signal space that the dictionary is intended to represent are created.
Over-complete dictionaries are used because they provide better reconstructions, but at the cost of needing to store and transmit all of the new dictionaries and representations created during the dictionary learning process. In comparison with a predetermined library of representations, a significantly increased amount of data is created as a result of dictionary learning because it generates a data set significantly larger than the basis set in a predetermined library of representations and the atoms are not all orthogonal to one another.
In dictionary learning, where sufficient representations are not available in an existing library of representations (or there is no library available), machine learning techniques are employed to tailor dictionary atoms such that they can adapt to the image features and obtain more accurate representations. Each new representation is then transferred along with the video data to enable the representation to be used when recreating the video for viewing.
The transform domain can be a dictionary of image atoms, which can be learnt through a training process known as dictionary learning that tries to discover the correspondence between low-resolution and high-resolution sections of images (or “patches”). Dictionary learning uses a set of linear representations to represent an image and, where an over-complete dictionary is used, a plurality of linear representations can be used to represent each image patch to increase the accuracy of the representation.
When using dictionary learning based super resolution techniques, there is a need for two dictionaries: one for the low-resolution image and a separate dictionary for the high-resolution image. To combine super resolution techniques with dictionary learning, reconstruction models are created to enhance the image based on mapping the coefficients of the low-resolution dictionary to coefficients in the high-resolution dictionary. Various papers describe this, including “On Single Image Scale-Up Using Sparse-Representations” by R. Zeyde et al and published in 2010, “Image super-resolution via sparse representation” by J. Yang and published in 2010, and “Coupled Dictionary Training for Image Super-Resolution” by J. Yang et al and published in 2012, which are incorporated by reference.
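The two-dictionary arrangement can be sketched with toy one-dimensional patches. The sketch assumes that the coefficients found against the low-resolution dictionary are reused unchanged with the high-resolution dictionary (an identity mapping) and that the low-resolution atoms are orthonormal, so that encoding reduces to inner products. Both dictionaries are invented for illustration and are not learnt from data.

```python
def encode(patch, low_atoms):
    """Coefficients of a low-resolution patch (orthonormal atoms assumed)."""
    return [sum(p * a for p, a in zip(patch, atom)) for atom in low_atoms]


def reconstruct(coefficients, high_atoms):
    """Linear combination of high-resolution atoms using the given coefficients."""
    size = len(high_atoms[0])
    return [sum(c * atom[i] for c, atom in zip(coefficients, high_atoms))
            for i in range(size)]


# Hypothetical paired dictionaries: atom i of one corresponds to atom i of the other.
low_atoms = [[1.0, 0.0],
             [0.0, 1.0]]
high_atoms = [[1.0, 0.75, 0.25, 0.0],
              [0.0, 0.25, 0.75, 1.0]]

low_patch = [2.0, 6.0]
coefficients = encode(low_patch, low_atoms)
high_patch = reconstruct(coefficients, high_atoms)
print(high_patch)
```

The need to hold a paired atom for each resolution is visible even in this toy case, since every low-resolution atom requires a corresponding high-resolution atom.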
A disadvantage of using dictionary learning based super resolution techniques on low-resolution images to attempt to recreate the high-resolution image is the need for two dictionaries, one for the low-resolution image and a separate dictionary for the high-resolution image. It is possible to have a single combined dictionary, but in essence there is always in practice an explicit modelling for each resolution to enable representations to be matched between the two resolutions of image.
A further disadvantage of using dictionary learning, however, especially when used with an over-complete dictionary, is the amount of data that needs to be transferred along with the low-resolution image in order to recreate a high-resolution image from the low-resolution image.
Another disadvantage of dictionary learning approaches is that these tend to use a local patch averaging approach in the final step of reconstruction of a higher-resolution image from a lower-resolution image, which can result in unintentional smoothing in the reconstructed image.
A further disadvantage of dictionary learning approaches is that they are very slow and can have high memory requirements, depending on the size of the dictionary.
Artefact Removal in Visual Data
Visual data artefacts and/or noise can often be introduced into visual data during processing, particularly during processing to compress visual data or during transmission of the visual data across a network. Such introduced artefacts can include blurring, pixelation, blocking, ringing, aliasing, missing data, and other marks, blemishes, defects, and abnormalities in the visual data. These artefacts in visual data can degrade the user experience when viewing the visual data. Furthermore, these artefacts in visual data can also reduce the effectiveness of visual data processing techniques, such as image super resolution, as well as other visual tasks such as image classification and segmentation, that use processed images as an input.
Lossy compression, in which visual data is encoded using inexact approximations, is a particularly common source of artefacts. Lossy compression is often required to reduce the size of a digital image or video in order to transmit it across a network without using an excessive amount of bandwidth. High visual data compression ratios can be achieved using lossy compression, but at the cost of a reduction in quality of the original visual data and the introduction of artefacts.
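The way in which inexact approximation introduces artefacts can be sketched with uniform quantisation. This is a simplification: practical codecs typically quantise transform coefficients, not raw pixel values, and the step size and pixel values below are assumptions for illustration.

```python
def quantise(pixels, step):
    """Map each pixel to the index of its nearest quantisation level."""
    return [round(pixel / step) for pixel in pixels]


def dequantise(levels, step):
    """Recover approximate pixel values from quantisation level indices."""
    return [level * step for level in levels]


original = [10, 13, 17, 22, 27, 33]  # a smooth ramp of pixel values
levels = quantise(original, step=8)
restored = dequantise(levels, step=8)

print(levels)
print(restored)
print([abs(o - r) for o, r in zip(original, restored)])
```

The smooth ramp is restored as a staircase of repeated values: fewer distinct values need to be stored or transmitted, but the original gradation cannot be recovered exactly.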
The transmission of visual data across a network can itself introduce artefacts to the visual data through transmission errors between nodes in the network.
Current methods of removing artefacts (herein referred to as visual data fidelity correction) generally only correct specific types of artefact. Examples of such techniques include deblocking oriented methods, such as Pointwise Shape-Adaptive Discrete Cosine Transform (hereafter referred to as Pointwise SA-DCT), which deal with blocking artefacts. Such techniques do not perform well, and can also introduce further artefacts as a side effect, such as the over smoothing of visual data.