Many techniques are known in the art to deal with compression and decompression of multidimensional signals or of signals evolving over time. This is the case of audio signals, video signals and other multidimensional signals like volumetric signals used in scientific and medical areas.
In order to achieve high compression ratios, those techniques exploit the spatial and time correlation inside the signal. For example, conventional methods identify a reference and try to determine the difference of the signal between a current location and the given reference. This is done both in the spatial domain, where the reference is a portion (e.g., a block, or “macro-block”) of an already received and decoded spatial plane, and in the time domain, where a single instance in time of the signal (e.g., a video frame in a sequence of frames) is taken as a reference for a certain duration. This is the case, for example, of MPEG (Moving Picture Experts Group)-family compression algorithms, where previously decoded macro-blocks are taken as reference in the spatial domain and I-frames and P-frames are used as reference for subsequent P-frames in the time domain.
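The temporal prediction described above can be sketched as follows. This is a minimal, illustrative full-search motion estimator operating on frames represented as plain 2D lists; all function names are hypothetical and do not come from any standard codec API:

```python
# Illustrative sketch of block-based motion estimation and residual
# computation, as used in MPEG-style temporal prediction.
# Frames are plain 2D lists of integers; all names are hypothetical.

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def extract_block(frame, top, left, size):
    """Cut a size x size block out of a frame."""
    return [row[left:left + size] for row in frame[top:top + size]]

def motion_estimate(reference, current, top, left, size, search):
    """Find the (dy, dx) offset into `reference` that best matches the
    block of `current` at (top, left), scanning a +/-search window."""
    cur = extract_block(current, top, left, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ty, tx = top + dy, left + dx
            if ty < 0 or tx < 0 or ty + size > len(reference) \
               or tx + size > len(reference[0]):
                continue
            cost = sad(extract_block(reference, ty, tx, size), cur)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2]

def residual(reference, current, top, left, size, dy, dx):
    """Difference between the current block and its motion-compensated
    reference block; this difference is what gets quantized and sent."""
    ref = extract_block(reference, top + dy, left + dx, size)
    cur = extract_block(current, top, left, size)
    return [[c - r for c, r in zip(cr, rr)] for cr, rr in zip(cur, ref)]
```

When the estimated offset tracks the motion exactly, the residual block is all zeros and compresses to almost nothing, which is precisely why encoders invest effort in the search.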
Known techniques exploit spatial correlation and time correlation in many ways, adopting several different techniques in order to identify, simplify, encode and transmit differences. In accordance with conventional methods, in order to leverage the spatial correlation of residuals within a respective block of picture elements, a domain transformation is performed (for example into a frequency domain) and then lossy deletion and quantization of the transformed information is performed, typically introducing some degree of block artifacts. In the time domain, instead, conventional methods transmit the quantized difference between the current sample and a motion-compensated reference sample. In order to maximize the similarity between samples, encoders try to estimate the modifications that have occurred over time relative to the reference signal. In conventional encoding methods (e.g., MPEG-family technologies, VP8, etc.), this is called motion estimation and compensation.
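The "transform then quantize" step can be illustrated with a small sketch. For brevity it uses a 1D orthonormal DCT-II rather than the 2D block transforms of real codecs, and all names are illustrative:

```python
# Sketch of the conventional "transform then quantize" step applied to
# a residual block (here 1D for brevity; real codecs apply 2D
# transforms to 8x8 or larger blocks). Names are illustrative only.
import math

def dct(block):
    """Orthonormal 1D DCT-II of a list of samples."""
    n = len(block)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, x in enumerate(block))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1 / n)
        s += sum(c * math.sqrt(2 / n) *
                 math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                 for k, c in enumerate(coeffs[1:], start=1))
        out.append(s)
    return out

def quantize(coeffs, step):
    """Lossy step: snap each coefficient to a multiple of `step`."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Recover approximate coefficients from integer levels."""
    return [level * step for level in levels]
```

Without quantization the transform round-trips exactly; once the coefficients are quantized, the reconstruction only approximates the input, and the approximation error lands in the pixel domain through the inverse transform.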
Today's CPUs (Central Processing Units) and GPUs (Graphics Processing Units) are typically very powerful; a single GPU can include several hundred computing cores to perform parallel processing of information. When using current-technology hardware, very large portions of an image can be stored in a processor cache for processing. The need to fragment images into a multitude of small blocks, a driving factor when JPEG and MPEG were created because processors of that era could only deal with very small chunks of video data at a time, and only sequentially, no longer applies to modern CPUs and GPUs. Thus, a large portion of the available processing power may go unused when implementing MPEG-like types of encoding/decoding, with blocking artifacts needlessly introduced into the signal.
Also, compared to what was typical when MPEG was developed, modern-day applications typically require much higher-definition video encoding and much higher overall playback quality. In high-definition (e.g., fullHD, UltraHD), high-quality videos (e.g., with relatively invisible artifacts with respect to the original signal), there is a much larger difference between areas with low detail (potentially even out of focus) and areas with very fine detail. This makes the use of frequency-domain transforms such as those used in JPEG-based and MPEG-based methods even more unsuitable for image processing and playback, since the range of relevant frequencies becomes much broader.
In addition, higher-resolution images include a higher amount of camera noise and/or film grain, i.e., very detailed high-frequency pixel transitions that require many bits to encode, but that can be quite irrelevant for viewing compared with similar high-frequency pixel transitions at the borders of objects.
Another aspect neglected in the known art, aside from a few attempts, is the quality scalability requirement. A scalable encoding method would encode a single version of the compressed signal and enable delivery at different levels of quality, bandwidth availability, and decoder complexity. Scalability has been taken into consideration in known methods like MPEG-SVC and JPEG2000, with relatively poor adoption so far due to computational complexity and, generally speaking, compression inefficiency relative to non-scalable techniques.
In the past, as a scalable alternative to JPEG/MPEG standards for encoding/decoding, so-called image Laplacian pyramids were used. For example, conventional Laplacian pyramid systems created lower-resolution images using Gaussian filters and then built a pyramid of the differences between each level and the image obtained by upsampling, with a rigidly programmed decoder, back from the lower-resolution level toward the original level. Use of conventional Laplacian pyramid encoding has since been abandoned due to its compression inefficiency.
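The Laplacian-pyramid scheme described above can be sketched minimally on a 1D signal, with crude 2-tap filters standing in for the Gaussian filters; all names and filter choices are illustrative:

```python
# Sketch of a Laplacian-pyramid encode/decode on a 1D signal: each
# level stores the difference between the signal and the upsampled
# lower-resolution version, so the decoder can reconstruct exactly.
# The simple averaging/repeating filters are illustrative stand-ins
# for the Gaussian filters of conventional pyramid systems.

def downsample(signal):
    """Halve resolution by averaging sample pairs."""
    return [(signal[i] + signal[i + 1]) / 2
            for i in range(0, len(signal) - 1, 2)]

def upsample(signal):
    """Double resolution by repeating each sample (nearest neighbour)."""
    return [s for x in signal for s in (x, x)]

def build_pyramid(signal, levels):
    """Return (coarsest signal, list of per-level difference signals)."""
    diffs = []
    for _ in range(levels):
        low = downsample(signal)
        pred = upsample(low)
        diffs.append([a - b for a, b in zip(signal, pred)])
        signal = low
    return signal, diffs

def reconstruct(coarse, diffs):
    """Invert build_pyramid: upsample and add back each difference."""
    signal = coarse
    for diff in reversed(diffs):
        pred = upsample(signal)
        signal = [p + d for p, d in zip(pred, diff)]
    return signal
```

The scheme is inherently scalable: a decoder that stops after the coarse level still obtains a valid low-resolution signal, while each additional difference level refines it toward the original.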
The domain transformations of residuals leveraged so far by state-of-the-art encoding methods (e.g., Fourier transforms, Discrete Cosine Transforms, Hadamard transforms, Wavelet transforms, etc.) suffer from a number of problems.
First of all, the very choice of transforming into a frequency domain makes them unsuitable to properly exploit the correlation across large portions of a signal (i.e., portions with a high number of samples for each of the dimensions), since real-world signals typically show limited amounts of periodicity. As a consequence, frequency-domain transforms are performed on blocks that are at the same time too big and too small: too big to be computationally simple, and too small to sufficiently exploit the correlation of a high-resolution signal. For instance, in order to exploit the correlation of a large enough set of samples while at the same time managing computational complexity, conventional image and video encoding techniques operate on blocks of 8×8, 16×16 or 32×32 elements: clearly too small to fully capture the correlation of image patterns in a high-definition image (e.g., with 8 million pixels), but large enough to absorb significant computational power.
Secondly, known methods leveraging frequency-domain transforms implicitly assume that humans are sensitive to harmonics (e.g., frequencies of color transitions) in a way that does not depend on the direction of the transition, whereas several studies have shown that humans recognize the sharpness of a transition much better than its precise direction/angle, especially when watching complex shapes.
Third, known lossy encoding techniques operate by quantizing the results of the transform, inevitably generating two problems: (1) block-based artifacts between one block and its neighbors, which must be corrected with relatively complex de-blocking image processing methods; and (2) the impossibility of easily controlling the maximum error in an encoded image, since actual pixel values are the result of an inverse transform of dequantized parameters, so that quantization errors in the quantized parameters of a block combine with one another in ways that are difficult to manage without multiple re-encodings and/or extremely complex quantization schemes. Avoiding block artifacts and guaranteeing maximum-error control are particularly important features, especially in applications such as medical imaging or professional image/video production.
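Point (2) can be made concrete with a small numeric sketch: quantizing pixels directly bounds each pixel error by half the quantization step, whereas quantizing orthonormal-transform coefficients lets the per-coefficient errors combine in the inverse transform and exceed that bound. The 2-point Haar transform and all numbers below are illustrative:

```python
# Numeric illustration of problem (2): quantizing in a transform
# domain lets errors from several coefficients combine in one pixel,
# whereas quantizing pixels directly bounds each error by half the
# step. Uses an orthonormal 2-point Haar transform; all numbers are
# illustrative.
import math

STEP = 2.0
R = math.sqrt(2)

def haar(pair):
    """Forward orthonormal 2-point Haar: (average, detail)."""
    a, b = pair
    return ((a + b) / R, (a - b) / R)

def ihaar(pair):
    """Inverse of the transform above."""
    s, d = pair
    return ((s + d) / R, (s - d) / R)

def q(x):
    """Mid-tread quantizer with step STEP."""
    return round(x / STEP) * STEP

pixels = (1.4, 0.0)

# Quantize directly in the pixel domain: error per pixel <= STEP / 2.
direct = tuple(q(p) for p in pixels)
direct_err = max(abs(a - b) for a, b in zip(pixels, direct))

# Quantize the transform coefficients instead: the two coefficient
# errors add up in the inverse transform and exceed STEP / 2 here.
coeffs = haar(pixels)
rec = ihaar(tuple(q(c) for c in coeffs))
transform_err = max(abs(a - b) for a, b in zip(pixels, rec))
```

In this example the direct quantization error stays within half the step, while the transform-domain reconstruction error exceeds it, which is why guaranteeing a maximum per-pixel error under transform coding requires re-encoding loops or elaborate quantization schemes.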