Traditionally, compression (specifically for image/video) has focused on achieving maximum compression rates by eliminating redundancy (approaching entropy) at great cost, through careful analysis of a limited set of input domains. Specifically, this often meant elaborate pattern matching or clever statistical parameter estimation for a known class of distributions (in the limit exemplified by the “Hutter prize” framework). Theoretical limitations have been well understood for some time now, implying the formal absence of a universal solution: mathematically speaking, Kolmogorov complexity is not a computable function, so no algorithm can find the minimal description of arbitrary input. Nonetheless, practical work has focused on a handful of industry-backed standards of increasing complexity or specialization, typically based on a sequential processing model (e.g. H.264/5).
Rapidly increasing image sensor resolutions (primarily in strictly spatial resolution, but also in bit depth and video frame rate) imply an exponential growth of the domain and variety of input material. Concurrently, rapidly dropping costs of sensor devices and supporting transmission and storage technologies have made these advances broadly available in consumer devices. Consequently, current standards fall well short of covering all of these emerging needs and application contexts, particularly in the combination of high throughput and low cost. Owing to the scalability that naturally follows from its dyadic hierarchical representation, the class of DWT-based subband approaches (the JPEG2000 standard most prominent among them) has emerged as a viable candidate for this challenge, utilized for still images as well as for video in the I-frame-only mode (i.e. independently coded frames with no temporal information considered). Several references articulate these issues clearly, e.g. D. S. Taubman and M. W. Marcellin, “JPEG2000 Image Compression Fundamentals, Standards and Practice”, Springer Science+Business Media, 2002. Efficient lifting methods of implementing the DWT have been disclosed in “Efficient wavelet-based compression of large images” (U.S. Pat. No. 6,546,143 B1) and elsewhere.
Another natural way to scale up with the rapidly growing high-throughput demand is to leverage massively parallel architectures, which are increasingly affordable as SW-programmable GPGPUs and are also implementable in HW-based FPGA and ASIC solutions. This requires identifying strong (coarse-grained) parallelism, i.e. increasing the proportion P of parallel operations in the model underlying Amdahl's Law. Research has identified the major obstacles to adapting JPEG2000 further toward such coarse-grained parallelism, i.e., to scaling up with increased resolutions, as disclosed in: J. Matela et al., “Efficient JPEG2000 EBCOT Context Modeling for Massively Parallel Architectures”, Data Compression Conference (DCC'11), pp. 423-432, Snowbird, USA, 2011; J. Matela, “GPU-Based DWT Acceleration for JPEG2000”, Annual Doctoral Workshop on Mathematical and Engineering Methods in Computer Science, pp. 136-143, 2009; J. Franco et al., “A Parallel Implementation of the 2D Wavelet Transform Using CUDA”, Univ. de Murcia, Spain. In this standard, ⅔ of the major components (i.e., the actual compression) are known to be computationally very intensive, with most processing time spent in the EBCOT context-based adaptive bit-plane arithmetic coder (context modeling and arithmetic coding have been profiled as accounting for 61% of processing time in some studies, and over 70% in others). The present invention defines a suitable, highly parallel replacement for the state-of-the-art EBCOT bit-plane coding method, enabling a reduction in processing time while still considering the key local data that are most amenable to good compression rates.
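By way of illustration only, the role of the proportion P in Amdahl's Law can be sketched as follows. The function names amdahl_speedup and amdahl_limit are hypothetical and chosen for this sketch; the formula itself is the standard statement of the law, speedup = 1 / ((1 - P) + P / N) for N parallel units.

```python
import math


def amdahl_speedup(p: float, n: int) -> float:
    """Speedup predicted by Amdahl's Law.

    p: fraction of the total work that can run in parallel (0..1).
    n: number of parallel processing units (>= 1).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The serial part (1 - p) is unaffected; the parallel part shrinks by n.
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Upper bound on speedup as the number of units grows without bound."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return math.inf if p == 1.0 else 1.0 / (1.0 - p)
```

Because the bound 1 / (1 - P) depends only on the serial remainder, adding processing units helps little unless P itself is increased, which is why the choice of a coder that exposes more parallel work matters more than the raw number of units.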
More specifically, this shifts the focus to a processing granularity that seeks an optimal balance between compact 2D locality (enabling higher degrees of parallelization and consequent speedup) and size coverage (retaining higher compression rates). It has long been understood in prior art dealing with concurrent programming that “many non-local calculations, virtually trivial on single-thread systems, like counting non-zero pixels in a 2D image, become hard to solve on the GPU, since its inherently parallel nature can only be utilized if the output of several parallel units is combined” (as disclosed in: G. Ziegler et al., “GPU PointList Generation using HistoPyramids”, Proc. VMV2006, Germany, 2006, pp. 137-144), and that resolving this issue will require new, explicitly parallel methods.
The present invention is premised on the fact that in DWT-based processing of UHD input the main redundancy (and consequent compression potential) comes from typical artifacts of H-bands on initial levels: large areas of contiguous zeros interspersed with sparse and highly redundant non-zero values (NZV). The postulated optimal balance between compact 2D locality and size coverage is achieved by hierarchically combining results from several parallel units in a manner that retains minimal encoding (ideally, a single bit) for large contiguous 2D zero areas. Prior art has identified the concept of a “reduction operator” (as disclosed in: G. Ziegler et al., “RealTime QuadTree Analysis using HistoPyramids”, Proc. of IS&T and SPIE Conf. on Electronic Imaging, 2007), used by independent threads, which repeatedly processes four input cells into one, starting at the resolution level of the original input image, and is ultimately geared toward generating quadtree data structures; specifically, the cited work described this basic concept in the context of HistoPyramids and QuadPyramids of a QuadPyramid Builder. However, the cited prior art was generated in the context of solving other problems, with a focus on access methods, traversability, etc. It did not explicitly focus on issues of data compression, nor can it be obviously extended in that direction; consequently, it does not recognize or separate the operations of coding and decoding.
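By way of illustration only, and not as a description of the claimed method or of the cited prior art, a four-into-one reduction over significance flags can be sketched as follows. The sketch assumes a square coefficient grid whose side is a power of two, and all names (reduce_level, build_pyramid, encode_significance) are hypothetical. It runs sequentially for clarity; each output cell of reduce_level depends only on its own 2x2 input group, so in a parallel setting each cell could be assigned to an independent thread.

```python
from typing import List

Grid = List[List[int]]


def reduce_level(flags: Grid) -> Grid:
    """Combine each 2x2 group of flags into one flag (1 if any child is 1)."""
    half = len(flags) // 2
    return [
        [
            flags[2 * r][2 * c] | flags[2 * r][2 * c + 1]
            | flags[2 * r + 1][2 * c] | flags[2 * r + 1][2 * c + 1]
            for c in range(half)
        ]
        for r in range(half)
    ]


def build_pyramid(coeffs: Grid) -> List[Grid]:
    """Return levels[0] (per-coefficient flags) up to levels[-1] (1x1 root)."""
    size = len(coeffs)
    if size == 0 or size & (size - 1) or any(len(row) != size for row in coeffs):
        raise ValueError("expected a square grid with a power-of-two side")
    levels = [[[1 if v != 0 else 0 for v in row] for row in coeffs]]
    while len(levels[-1]) > 1:
        levels.append(reduce_level(levels[-1]))
    return levels


def encode_significance(levels: List[Grid]) -> List[int]:
    """Emit one bit per visited node, descending only into non-zero nodes."""
    bits: List[int] = []

    def visit(level: int, r: int, c: int) -> None:
        flag = levels[level][r][c]
        bits.append(flag)
        # A zero flag covers its whole sub-area with this single bit.
        if flag and level > 0:
            for dr in (0, 1):
                for dc in (0, 1):
                    visit(level - 1, 2 * r + dr, 2 * c + dc)

    visit(len(levels) - 1, 0, 0)
    return bits
```

In this sketch an entirely zero grid is represented by a single 0 bit regardless of its size, while a grid with one isolated non-zero value costs only the bits along the path to that value plus its zero siblings.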
One aspect of the present invention allows for a moderate amount of lightweight and strategically focused adaptation in a way that does not compromise strong parallelism or add bulky side information.
As indicated above, most redundancy and savings come from relatively localized (spatio-temporally) contexts, which makes it possible to avoid combinatorial explosions in context modeling and estimation, to maintain causality, and to avoid transmitting possibly massive side information. This in turn enables specific benefits: a) real-time (including low-latency) processing; b) random access (temporal) and non-linear editing (NLE) capability for video material; c) random access (spatial), allowing region-of-interest (ROI) focus and flexibility.
For the given requirements, classical VLC coding (e.g. Elias beta/gamma, Golomb-Rice) appears preferable over Huffman-style recalculated/adaptive entropy (arithmetic, range) coding in terms of both lower computational complexity and lack of inherent sequential bias for the given problem. A limited number of predetermined distributions usually adequately model high-band coefficient magnitude values. Generally, principles of VLC are well-known to one skilled in the art, having been described in standard references, e.g., D. Salomon and P. Motta, “Handbook of Data Compression”, Springer, 2010.
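By way of illustration only, two of the classical variable-length codes named above can be sketched as follows. The function names are hypothetical, bits are shown as text strings for readability, and the Golomb-Rice sketch assumes one common convention (a unary quotient written as ones terminated by a zero, followed by k remainder bits); other conventions exist. Neither function carries state between symbols, which is the property that avoids the sequential dependency of adaptive coders.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]  # binary digits without the '0b' prefix
    # Prefix of (length - 1) zeros announces how many bits follow.
    return "0" * (len(binary) - 1) + binary


def rice_encode(n: int, k: int) -> str:
    """Golomb-Rice code of a non-negative integer with parameter k."""
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    quotient = n >> k
    remainder = format(n & ((1 << k) - 1), "0{}b".format(k)) if k else ""
    return "1" * quotient + "0" + remainder


def rice_decode(bits: str, k: int) -> int:
    """Inverse of rice_encode for a single codeword."""
    quotient = bits.index("0")  # number of leading ones
    remainder = int(bits[quotient + 1:quotient + 1 + k], 2) if k else 0
    return (quotient << k) | remainder
```

For example, under these conventions the value 9 with k = 2 is coded as the quotient 2 in unary ("110") followed by the two-bit remainder 1 ("01").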
Parameter estimation for the underlying distributions should combine locality for focus and globally observed correlation among frames/planes/bands of a single source. Further lossless compression gains are achieved by identifying band-specific local patterns of small non-zero coefficients and signs, and using appropriate entropy coding on such local vectors. An appropriately designed code stream and overall file format allow for optimal just-in-time decoding while providing the desired scalability.
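By way of illustration only, the local part of such parameter estimation can be sketched for a Golomb-Rice code: for one small block of coefficient magnitudes, every candidate parameter k is tried and the one giving the shortest total code length is kept. This exhaustive choice is an assumption of the sketch, not the estimator of the present invention, and it does not model the correlation among frames, planes, or bands; the names rice_length and best_rice_parameter and the bound max_k are hypothetical.

```python
from typing import Iterable, List


def rice_length(n: int, k: int) -> int:
    """Bits used by a Golomb-Rice codeword: unary quotient, stop bit, k bits."""
    return (n >> k) + 1 + k


def best_rice_parameter(magnitudes: Iterable[int], max_k: int = 15) -> int:
    """Return the k in 0..max_k that minimizes total coded length of a block.

    Ties are resolved in favor of the smallest k.
    """
    values: List[int] = list(magnitudes)
    if any(v < 0 for v in values):
        raise ValueError("magnitudes must be non-negative")
    return min(
        range(max_k + 1),
        key=lambda k: sum(rice_length(v, k) for v in values),
    )
```

Because each block is evaluated independently, blocks can be processed in parallel, and the only side information in this sketch is the single small integer k per block.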
Basic notions of concurrent (parallel) programming—including threads, synchronization and common memories—should be well known to one skilled in the art, and have been described in standard references, e.g., M. Raynal, “Concurrent Programming: Algorithms, Principles and Foundations”, Springer, 2013.
Finally, it should be pointed out that one of the multiple practical classifications of codecs defines the following three basic classes:
1. Acquisition
2. Intermediate
3. Distribution
The key distinguishing criterion here is the work balance between the two codec ends; in addition to the more traditional type 3 applications (compress once, decode many—e.g. distribution and broadcast of film material), there are many now that fall into types 2 (symmetric—e.g. digital film editing workflow) or 1 (inexpensive and ubiquitous coding, infrequent and more elaborate decoding—e.g. sparse representation, distributed multiple description coding, compressive sensing). Most practical flexibility comes from intermediate codecs, and the balance of compression and decompression efforts would put the present invention in that class.