1. Field of the Invention
This invention relates to systems and methods for lossless compression of data.
2. Background of the Invention
Huffman coding is an entropy encoding algorithm used for lossless data compression (see D. A. Huffman, “A method for the construction of minimum-redundancy codes” Proc. IRE 40 (9). pp. 1098-1101. 1952). Huffman coding refers to the use of a variable-length code table for encoding a source literal (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source literal. Huffman coding uses a specific method for choosing the representation for each literal, resulting in a prefix code (sometimes called “prefix-free codes”, that is, the bit string representing some particular literal is never a prefix of the bit string representing any other literal) that expresses the most common source literals using shorter strings of bits than are used for less common source literals. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source literals to unique strings of bits will produce a smaller average output size when the actual literal frequencies agree with those used to create the code. The pseudo code of the Huffman algorithm is described below.
Algorithm 1: Compute Binary Huffman Code
Input: Set of N pairs of (literal, count/probability).
Output: Set of N pairs of (literal, hufcode).
1- Sort the literal ensemble by decreasing count.
2- Select the 2 least probable literals, and assign each a bit 0 and 1, respectively.
3- Substitute the selected literals by a composite literal summing their counts, and re-order it.
4- While there remains more than one literal, do steps 5 through 6.
5- Select the two least probable literals, and append 0 and 1, respectively.
6- Substitute the selected literals by a composite literal summing their counts, and re-order it.
7- The hufcode of each literal is given by the concatenation of the aggregate bits.
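The merging procedure of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation of the disclosed system: the function name `huffman_codes`, the use of a binary heap in place of repeated re-sorting, and the tie-breaking counter are all assumptions made for the example. The sketch prepends each new bit, since bits assigned in later merges lie closer to the root of the tree and therefore come first in the final code.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {literal: bit string} for a dict of {literal: count}.

    Hypothetical helper for illustration; a heap replaces the
    "re-order" steps of Algorithm 1.
    """
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single literal still needs one bit to be representable.
        return {next(iter(freqs)): "0"}

    tie = count()  # unique counter so equal counts never compare the lists
    heap = [(c, next(tie), [lit]) for lit, c in freqs.items()]
    heapq.heapify(heap)
    codes = {lit: "" for lit in freqs}

    while len(heap) > 1:
        # Select the two least probable (possibly composite) literals.
        c0, _, lits0 = heapq.heappop(heap)
        c1, _, lits1 = heapq.heappop(heap)
        # Assign bit 0 to one group and bit 1 to the other. The bit is
        # prepended because later merges are nearer the root.
        for lit in lits0:
            codes[lit] = "0" + codes[lit]
        for lit in lits1:
            codes[lit] = "1" + codes[lit]
        # Substitute a composite literal whose count is the sum.
        heapq.heappush(heap, (c0 + c1, next(tie), lits0 + lits1))
    return codes
```

For example, `huffman_codes({"a": 5, "b": 2, "c": 1, "d": 1})` gives the most frequent literal a 1-bit code and the two rarest literals 3-bit codes. Which literal of a pair receives 0 and which receives 1 depends on the tie-breaking order chosen here.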
The Huffman tree built by the above algorithm introduces two critical inefficiencies into an implementation of the coding scheme. First, each node of the tree must store either references to its child nodes or the literal that it represents. This is expensive in memory usage, and if there is a high proportion of unique literals in the source data, then the total size of the code tree can account for a significant amount of the overall encoded data. Second, traversing the tree is computationally costly, since it requires the algorithm to jump randomly through the structure in memory as each bit in the encoded data is read in.
Canonical Huffman codes address these two issues by generating the codes in a clear, standardized format: all the codes for a given length are assigned their values sequentially. This means that, instead of storing the structure of the code tree for decompression, only the lengths of the codes are required, which reduces the size of the encoded data. Additionally, because the codes are sequential, the decoding algorithm can be dramatically simplified so that it is computationally efficient.
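To illustrate why the code lengths alone suffice for decoding, the following Python sketch decodes a bit string using only a table of lengths, without building a tree. It is a minimal sketch under stated assumptions: the name `decode_canonical` is hypothetical, bits are passed as a string of "0" and "1" characters for readability, codes of equal length are assumed to be assigned in sorted literal order, and shorter codes are assumed to receive numerically smaller values.

```python
def decode_canonical(bits, lengths):
    """Decode a '0'/'1' string given only {literal: code length}.

    Hypothetical helper for illustration. Assumes canonical codes
    ordered by (length, literal).
    """
    # Literals in canonical order: shorter codes first, ties by literal.
    symbols = [lit for lit, _ in
               sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))]
    max_len = max(lengths.values())
    # n_codes[k] = number of codes that are k bits long.
    n_codes = [0] * (max_len + 1)
    for n in lengths.values():
        n_codes[n] += 1

    out = []
    code = first = index = length = 0
    for bit in bits:
        code |= int(bit)
        length += 1
        n = n_codes[length]
        if code - first < n:
            # code falls inside the sequential run of this length.
            out.append(symbols[index + code - first])
            code = first = index = length = 0
            continue
        if length == max_len:
            raise ValueError("bit pattern matches no code")
        index += n                 # skip the literals of this length
        first = (first + n) << 1   # first code of the next length
        code <<= 1                 # make room for the next bit
    if length != 0:
        raise ValueError("input ends in the middle of a code")
    return out
```

With lengths `{"a": 1, "b": 2, "c": 3, "d": 3}` the implied codes are 0, 10, 110 and 111, so the bit string `"0101101110"` decodes to a, b, c, d, a. Each step uses only an addition, a comparison and a shift, which is the simplification the sequential assignment makes possible.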
Given a list of literals sorted by Huffman code length (denoted by L) in increasing order, the following pseudo code will print a canonical Huffman code book:
Algorithm 2: Compute Canonical Huffman Code through its Bit Length
Input: Sorted (S_i, L_i), i = 0, 1, 2, ..., N − 1
Output: Sorted (S_i, C_i), i = 0, 1, 2, ..., N − 1
Set C_0 = 0
For i = 0, 1, ..., N − 2, do:
    Set C_{i+1} = (C_i + 1) << (L_{i+1} − L_i)
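Algorithm 2 translates almost directly into code. The following Python sketch is illustrative only: the function name `canonical_codes` is hypothetical, and breaking ties between equal lengths by literal order is an assumption, since the algorithm above only requires that the input be sorted by length. Each code is the previous code plus one, shifted left by the increase in code length.

```python
def canonical_codes(lengths):
    """Return {literal: bit string} from {literal: code length}.

    Hypothetical helper for illustration; literals are sorted by
    (length, literal) before codes are assigned.
    """
    if not lengths:
        return {}
    ordered = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
    codes = {}
    code = 0                      # C_0 = 0
    prev_len = ordered[0][1]
    for i, (lit, length) in enumerate(ordered):
        if i > 0:
            # C_{i+1} = (C_i + 1) << (L_{i+1} - L_i)
            code = (code + 1) << (length - prev_len)
        # Emit the code as a zero-padded bit string of the right length.
        codes[lit] = format(code, "0{}b".format(length))
        prev_len = length
    return codes
```

For lengths `{"a": 1, "b": 2, "c": 3, "d": 3}` this yields the code book a = 0, b = 10, c = 110, d = 111.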
Software compression is very well understood, and the source code for programs such as GZIP is readily accessible in the public domain. Compression is relatively complex to implement in hardware for high-bandwidth applications, given that GZIP compression is based on the DEFLATE algorithm and Huffman coding (DEFLATE Compressed Data Format Specification version 1.3. IETF. May 1996. p. 1. sec. Abstract. RFC 1951. Retrieved 11 Nov. 2012). The idea behind compression is to replace data that appears later in a data stream with copy commands that refer to identical data appearing earlier in the data stream. As such, all such compression implementations require a search history buffer and some type of compare-length function to determine the longest length that can be used for the copy command. One method of searching the previous history for matches is to use a hash function, which maps a variable-length string to a fixed-size value that can be used to locate candidate matches for the data stream being compressed. The data is hashed and compared to generate a copy or literal command per the Lempel-Ziv algorithm or a comparable algorithm (Ziv, J.; Lempel, A. (1978). “Compression of individual sequences via variable-rate coding”. IEEE Transactions on Information Theory 24 (5): 530). Once the statistics for the data to be compressed are gathered, the copy and literal commands are Huffman encoded and sent out as the compressed output.
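The hash-based search for copy and literal commands can be sketched as follows. This Python sketch is a simplified illustration, not GZIP's actual matcher and not the disclosed hardware: the function name `lz_commands`, the 3-byte minimum match, the 4096-byte window and the use of a Python dictionary as the hash table are all assumptions. A dictionary keyed on the exact bytes never confuses two different strings, whereas a fixed-width hardware hash would also have to handle collisions.

```python
def lz_commands(data, min_match=3, window=4096):
    """Turn bytes into ("literal", byte) and ("copy", distance, length).

    Hypothetical helper for illustration. Only the most recent
    position of each min_match-byte string is remembered.
    """
    table = {}   # min_match-byte string -> most recent position
    out = []
    i = 0
    while i < len(data):
        key = data[i:i + min_match]
        cand = table.get(key) if len(key) == min_match else None
        length = 0
        if cand is not None and i - cand <= window:
            # Compare-length function: extend the match as far as it goes.
            while (i + length < len(data)
                   and data[cand + length] == data[i + length]):
                length += 1
        if length >= min_match:
            out.append(("copy", i - cand, length))
            step = length
        else:
            out.append(("literal", data[i]))
            step = 1
        # Record every position just consumed in the history table.
        for j in range(i, i + step):
            k = data[j:j + min_match]
            if len(k) == min_match:
                table[k] = j
        i += step
    return out
```

For the input `b"abcabcabc"` the sketch emits three literal commands for "a", "b" and "c", followed by a single copy command with distance 3 and length 6. The copy overlaps its own output, which Lempel-Ziv style decoders handle by copying one byte at a time.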
As outlined above, Huffman coding is not a simple algorithm to implement in hardware. The systems and methods disclosed herein provide an improved approach for implementing canonical Huffman coding using very large scale integration (VLSI) circuits.