Lossless data compression is pervasively used in data storage and communication systems to reduce cost and/or improve speed performance. A large number of lossless data compression algorithms exist today, spanning a wide spectrum of trade-offs between compression ratio and processing complexity. Higher processing complexity tends to result in lower compression/decompression throughput. The most well-known and widely used lossless compression algorithm is DEFLATE, which is used to generate/decompress GZIP, ZIP, and PNG files. Although DEFLATE achieves a relatively good compression ratio, implementing it on a central processing unit (CPU) suffers from low throughput, e.g., tens of MB/s for compression, which is inadequate for many real-life applications. As a result, a number of high-speed compression algorithms have been developed, most notably Snappy and LZ4. These algorithms can achieve 10× higher compression throughput on the CPU compared with DEFLATE, at the cost of a worse compression ratio. There have been prior efforts to speed up the DEFLATE algorithm by off-loading the processing to a dedicated hardware accelerator, e.g., an ASIC (application-specific integrated circuit) or FPGA (field-programmable gate array) chip that is connected to the CPU through an interface such as PCIe.
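The trade-off between compression ratio and throughput can be observed informally on a CPU. The following is a minimal sketch, assuming Python's standard zlib module (which implements DEFLATE) and a hypothetical repetitive input chosen only for illustration; the helper name measure is not part of any library, and the figures it prints depend entirely on the machine and the data.

```python
import time
import zlib


def measure(data: bytes, level: int):
    """Return (compression ratio, throughput in MB/s) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    # Guard against a zero reading from a coarse timer.
    throughput = len(data) / 1e6 / elapsed if elapsed > 0 else float("inf")
    return ratio, throughput


# Hypothetical input: repetitive text, used only for illustration.
sample = b"lossless compression trades ratio against speed. " * 20000

# Level 1 favors speed, level 9 favors compression ratio.
for level in (1, 6, 9):
    ratio, mbps = measure(sample, level)
    print(f"level {level}: ratio {ratio:.1f}, {mbps:.0f} MB/s")
```

Higher levels spend more processing effort searching for matches, which is the complexity-versus-ratio trade-off described above.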
Conventional practice off-loads the entire DEFLATE algorithm to the accelerator, which leads to two drawbacks. Firstly, the CPU has to send/receive the original raw data to/from the accelerator through an interface such as PCIe for compression/decompression. As a result, the achievable compression/decompression throughput is limited by the interface bandwidth, even if the accelerator itself could compress/decompress data at a much higher throughput. Secondly, DEFLATE compression/decompression consumes significant silicon resources on the accelerator, particularly on FPGA-based accelerators, leading to a higher implementation cost.
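The first drawback can be illustrated with a simple bottleneck model. This is a sketch under the simplifying assumption that all raw data crosses the interface exactly once and that transfer and processing are fully pipelined; the function name and the bandwidth figures are hypothetical and not taken from any particular device.

```python
def effective_throughput(interface_gb_per_s: float,
                         accelerator_gb_per_s: float) -> float:
    """With full off-load, the slower of the two stages sets the pace."""
    return min(interface_gb_per_s, accelerator_gb_per_s)


# Hypothetical figures in GB/s, for illustration only.
print(effective_throughput(interface_gb_per_s=4.0,
                           accelerator_gb_per_s=10.0))  # prints 4.0
```

In this example the accelerator could process 10 GB/s, yet the system delivers only the 4 GB/s that the interface can carry.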