1. Field of the Invention
The invention relates to decoding of variable-length codes (VLC), such as Huffman codes. More particularly, the invention relates to the use of multiple decoders to decode VLC codes.
2. Description of the Related Art
For electronic data storage and transmission, text or other strings of symbols are commonly represented by binary codes. One example of such a code is the ASCII representation of letters, numbers and other characters as binary codes that are commonly stored in 8-bit bytes. As another example, the four letters A, B, C and D may be represented with two-bit codes, such as:
Code 1
Symbol   Binary Code
A        00
B        01
C        10
D        11
Thus, the 8-symbol message “ABACBAAD” may be represented using Code 1 in a 16-bit string as follows:
Symbol:  A    B    A    C    B    A    A    D
Code 1:  00   01   00   10   01   00   00   11
One of the objectives of data compression is to represent data using as few bits as possible. One common method of reducing the number of bits is to use binary codes of different lengths to represent different symbols. Compression can be achieved if shorter codes are used for symbols that occur more often, and longer codes for symbols that occur less often. For example, the message “ABACBAAD” uses A more often than B, and B more often than C or D. Thus, the message can be compressed if we use a variable-length code, such as the following:
Code 2
Symbol   Binary Code
A        0
B        10
C        110
D        111
Using Code 2, the same 8-symbol message “ABACBAAD,” which was represented with 16 bits using Code 1, is represented with only 14 bits using Code 2 as follows:
Symbol:  A    B    A    C    B    A    A    D
Code 2:  0    10   0    110  10   0    0    111
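The comparison between Code 1 and Code 2 can be reproduced with a short sketch. The dictionary names `CODE_1` and `CODE_2` and the helper `encode` are chosen here for illustration only; the code tables themselves are the ones given above.

```python
# Fixed-length Code 1 and variable-length Code 2 from the tables above.
CODE_1 = {"A": "00", "B": "01", "C": "10", "D": "11"}
CODE_2 = {"A": "0", "B": "10", "C": "110", "D": "111"}


def encode(message, code):
    """Concatenate the binary code of every symbol in the message."""
    return "".join(code[symbol] for symbol in message)


message = "ABACBAAD"
bits_1 = encode(message, CODE_1)
bits_2 = encode(message, CODE_2)
print(bits_1, len(bits_1))  # 0001001001000011 16
print(bits_2, len(bits_2))  # 01001101000111 14
```

The saving comes entirely from the symbol frequencies: A appears four times and costs one bit under Code 2, which outweighs the extra bit spent on each of C and D.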
Of course, the variable-length codes should be carefully determined so that the possibility of ambiguous decode results is minimized. For example, Code 3, below, would produce ambiguous decode results. In particular, using Code 3, the three bits “010” would result in an ambiguous decode, as the decoder would have no way of determining whether the bits are a “010” representing the symbol D or a “0” followed by a “10” representing the symbols “AC”.
Code 3 (Invalid)
Symbol   Binary Code
A        0
B        01
C        10
D        010
Huffman Coding
The standard means of choosing variable-length codes “carefully,” to enable compression while preventing ambiguity, is called Huffman coding. A Huffman code assigns variable-length binary codes to an alphabet of symbols by arranging the symbols into a Huffman tree. For example, FIG. 1 illustrates a Huffman tree for the above-listed Code 2.
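The ambiguity that Huffman coding avoids can be checked mechanically. Because every symbol sits at a leaf of the tree, no Huffman code is the beginning (prefix) of another code. The following sketch tests that property for Code 2 and Code 3 and lists every way the bits “010” can be split under Code 3. The helper names `is_prefix_free` and `parses` are hypothetical and are not part of the described method.

```python
from itertools import permutations

CODE_2 = {"A": "0", "B": "10", "C": "110", "D": "111"}
CODE_3 = {"A": "0", "B": "01", "C": "10", "D": "010"}


def is_prefix_free(code):
    """Return True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(a.startswith(b) for a, b in permutations(words, 2))


def parses(bits, code):
    """List every way to split bits into a sequence of symbols."""
    if not bits:
        return [""]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            rest = parses(bits[len(word):], code)
            results += [symbol + tail for tail in rest]
    return results


print(is_prefix_free(CODE_2))  # True
print(is_prefix_free(CODE_3))  # False
print(parses("010", CODE_3))   # ['AC', 'BA', 'D']
```

Under Code 3 the bits “010” admit a third reading, “BA”, in addition to the two readings discussed above.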
To encode a symbol using a Huffman tree, for example the Huffman tree illustrated in FIG. 1, the tree is followed from the top down, recording the 0's and 1's labeling the branches taken, until the desired symbol is reached. For example, to encode the symbol “C”, the tree is followed to the right at the first branch (1), to the right at the second branch (1), and to the left at the third branch (0) to reach the symbol “C” in the tree, resulting in a recorded code of “110”.
Similarly, to decode a string of bits, the tree is followed from the top down along the branches specified by consecutive bits, until a symbol is reached. The reached symbol is recorded and the position on the tree is reset to the highest position, where the tree is subsequently followed from the top down to determine additional symbols. For example, to decode “10110”, the tree is followed to the right (1), then to the left (0), and a symbol “B” is reached. Returning to the top of the tree, the tree is followed to the right (1), then right again (1), then left (0), and a symbol “C” is reached. The bit stream “10110” is thus decoded to the symbols “BC”. One of skill in the art will recognize that encoding and decoding by the above described Huffman tree “walking” process is theoretical. In practice, various faster table-based techniques are used to encode and decode Huffman codes.
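The tree-walking decode described above can be sketched as follows. The nested-tuple layout of `TREE` is an assumption made for this example: it is derived from Code 2 (left branch 0, right branch 1) rather than copied from FIG. 1, and a practical decoder would more likely use a table-based technique.

```python
# Huffman tree for Code 2, written as nested (left, right) pairs.
# A string is a leaf. Index 0 is the left branch, index 1 the right branch.
TREE = ("A", ("B", ("C", "D")))


def decode(bits, tree):
    """Walk the tree from the root for each code and collect the symbols."""
    symbols = []
    node = tree
    for bit in bits:
        node = node[int(bit)]      # follow branch 0 (left) or 1 (right)
        if isinstance(node, str):  # reached a leaf: record it and reset
            symbols.append(node)
            node = tree
    if node is not tree:
        raise ValueError("bit string ends in the middle of a code")
    return "".join(symbols)


print(decode("10110", TREE))           # BC
print(decode("01001101000111", TREE))  # ABACBAAD
```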
Lempel-Ziv Coding
Lempel-Ziv (LZ) coding, in various forms such as LZ77, is often used in conjunction with Huffman Coding for data compression. Whereas Huffman coding seeks to reduce the number of bits consumed by individual symbols by assigning short codes to common symbols, Lempel-Ziv coding seeks to reduce the number of symbols stored by replacing repeated strings of symbols with special symbols (length/distance pairs) that describe the repetition. For example, to compress the string of 12 symbols provided below, Lempel-Ziv coding would first determine that there are no patterns repeated in the first four symbols and record the first 4 symbols, ABCD, literally.
Original symbols:  A B C D A B C B C D A B
LZ symbols:        A B C D
The next three symbols, ABC, have been seen before in the same order. So instead of presenting them literally, Lempel-Ziv coding can use a length/distance pair to describe how to copy this string from earlier data. Since the previous ABC string begins 4 symbols back, and is 3 symbols long, Lempel-Ziv coding may use the length/distance pair (l:3)/(d:4), indicating that the next three symbols in the stream are a copy of the three symbols that begin four symbols back from the present position.
Original symbols:  A B C D A B C B C D A B
LZ symbols:        A B C D l:3 d:4
Similarly, the last five symbols, BCDAB, have also occurred before, six symbols back in the original. They can be replaced with the length/distance pair (l:5)/(d:6):
Original symbols:  A B C D A B C B C D A B
LZ symbols:        A B C D l:3 d:4 l:5 d:6
Thus, in the above described example, Lempel-Ziv coding replaced the original 12 symbols with only 8, half of which are “literals,” and the rest of which compose length/distance pairs. Note that “d:4”, for example, is considered a single abstract symbol. However, the characters “d”, “:”, and “4” are not literally written in a compressed file. In one embodiment, the various abstract length and distance symbols are represented in binary form.
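To make the meaning of the length/distance pairs concrete, the following sketch expands the eight LZ symbols of the example back into the original twelve symbols. The function name `lz_expand` and the representation of each pair as a `(length, distance)` tuple are assumptions made for this illustration, so the eight abstract symbols appear here as six list entries.

```python
def lz_expand(tokens):
    """Expand literals and (length, distance) pairs into the original symbols."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            length, distance = token
            start = len(out) - distance
            if start < 0:
                raise ValueError("distance reaches before the start of the data")
            for i in range(length):
                # Copy one symbol at a time so that a copy may overlap
                # the symbols it is currently producing.
                out.append(out[start + i])
        else:
            out.append(token)
    return "".join(out)


tokens = ["A", "B", "C", "D", (3, 4), (5, 6)]
print(lz_expand(tokens))  # ABCDABCBCDAB
```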
Huffman-Coded LZ
Since Lempel-Ziv coding is an attempt at compression, the various abstract symbols it produces, including literals, length and distance symbols, should be represented in binary form as efficiently as possible. As mentioned previously, one method of representing symbols efficiently in binary form is Huffman coding. Several data compression algorithms include two main stages, namely, LZ coding and Huffman coding. An original string of data symbols is first replaced with a shorter string of LZ symbols, and then the LZ symbols are Huffman-coded into a binary representation.
In some compression algorithms, the possible length symbols are lumped together with the possible literals into a single symbol alphabet for Huffman coding, whereas the possible distance symbols are Huffman coded as a separate second alphabet. Thus, one Huffman tree defines the coding of literal and length symbols while a second Huffman tree defines the coding of distance symbols. Huffman-coded length symbols are directly distinguishable from coded literals because they each have distinct binary codes from the same Huffman tree. Furthermore, distance codes are distinguished from literal/length codes by context; that is, in typical use a distance code immediately follows a length code. Therefore, after a length symbol has been decoded, the decoder knows that the following Huffman code is a distance code and that the distance tree should be accessed to decode it.
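The context rule can be illustrated with two small tables. Both tables below are hypothetical toy codes invented for this sketch and are not taken from any real format. Note that the bit patterns “0” and “1” are valid distance codes even though they are only prefixes in the literal/length table, because the decoder always knows which table applies.

```python
# Hypothetical toy tables, invented for this sketch only.
LITERAL_LENGTH = {"00": "A", "01": "B", "100": "C", "101": "D",
                  "110": "l:3", "111": "l:5"}
DISTANCE = {"0": "d:4", "1": "d:6"}


def decode_symbols(bits):
    """Decode bits, switching to the distance table right after a length."""
    symbols = []
    table = LITERAL_LENGTH
    word = ""
    for bit in bits:
        word += bit
        if word in table:
            symbol = table[word]
            symbols.append(symbol)
            word = ""
            # Context rule: a distance code follows every length code.
            table = DISTANCE if symbol.startswith("l:") else LITERAL_LENGTH
    if word:
        raise ValueError("bit string ends in the middle of a code")
    return symbols


print(decode_symbols("000110010111001111"))
# ['A', 'B', 'C', 'D', 'l:3', 'd:4', 'l:5', 'd:6']
```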
In many applications, general-purpose literal/length and distance Huffman trees are defined, which are reasonably well-suited to many data-compression applications. Coding with such trees is referred to as “fixed” Huffman coding, because it uses fixed, predefined Huffman trees. Alternatively, Huffman trees may be defined for a particular file (such as Lempel-Ziv coded data) based on the actual frequencies of the various symbols in that file. This technique has been referred to as “dynamic” Huffman coding. If a particular length symbol is used very frequently in a certain LZ-coded file, then that symbol can be placed near the top of the dynamic literal/length Huffman tree, so that it receives a relatively short Huffman code. Likewise, if a particular distance symbol is never used, it can be omitted entirely from the dynamic distance Huffman tree, thus allowing other distance codes to be shorter.
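As an illustration of how symbol frequencies can drive a dynamic tree, the sketch below computes code lengths with the classic Huffman construction, which repeatedly merges the two least frequent subtrees. The text above does not prescribe a particular construction, so this choice, and the function name `huffman_code_lengths`, are assumptions made for the example. Symbols that never occur receive no code at all.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(data):
    """Return {symbol: code length} for a Huffman code built from data."""
    freq = Counter(data)
    if len(freq) == 1:  # a lone symbol still needs one bit
        return {symbol: 1 for symbol in freq}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(weight, next(tiebreak), {symbol: 0})
            for symbol, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


print(huffman_code_lengths("ABACBAAD"))  # {'A': 1, 'B': 2, 'C': 3, 'D': 3}
```

For the message “ABACBAAD” the resulting lengths of 1, 2, 3 and 3 bits match the code lengths of Code 2.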
Dynamic Huffman coding requires that the specially constructed Huffman trees used to code a certain file, referred to as dynamic trees, be stored in the compressed version of that file, or otherwise be made available to the decoder, along with the Huffman codes themselves. For example, dynamic trees may be transmitted in a data stream prior to the Huffman-coded data. In one embodiment, the beginning and end of each dynamic tree are indicated by predefined codes in the data stream. Alternatively, the number and length of dynamic trees may be provided in the data stream before the dynamic trees.
Extra Bits
One important consequence of the storage of dynamic Huffman trees is the use of “extra-bits” fields in Huffman-coded LZ. In practical compression algorithms, there are many possible length and distance codes. Various algorithms need symbols to represent lengths from 3 to 258 bytes, and distances from 1 to several million bytes. While it might, all else being equal, be most efficient to give each possible length and distance its own Huffman code, that efficiency is quickly lost when dynamic Huffman trees with thousands or millions of entries must be stored in the data stream.
One solution is to only provide Huffman codes for a relatively small number of length and distance ranges. The distance range of 1024-2047 bytes, for example, could be given a single Huffman code, and likewise the ranges of 2048-4095 bytes and of 4096-8191 bytes could each get a single Huffman code. By following such a geometric progression of range limits, the total number of Huffman codes required can be kept low enough that storing dynamic Huffman trees is not prohibitive.
In order to translate a Lempel-Ziv copy symbol in such a scheme, a length or distance Huffman code can be followed by several “extra bits” that provide the exact position within the coded range. For example, whenever a distance code representing the range 1024-2047 bytes appears, it would be followed by a 10-bit extra-bits field, where the 10 bits are interpreted as a binary integer between 0 and 1023. The binary integer may then be added to the beginning point of the range, e.g., 1024, to yield an exact distance. The number of extra bits, and their meaning, is understood in the context of the Huffman code that preceded them. For example, in one embodiment, a “1024-2047” distance Huffman code would be followed by exactly a 10-bit extra-bits field and a “2048-4095” distance code would be followed by exactly an 11-bit extra-bits field.
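The arithmetic can be sketched as follows. The table `DISTANCE_RANGES`, the function `exact_distance`, and the choice to read the extra bits most significant bit first are all assumptions made for this illustration; real formats define their own range tables and bit order. The 12-bit width for the 4096-8191 range follows from the size of that range (4096 values).

```python
# Hypothetical range table: distance symbol -> (start of range, extra-bit count).
DISTANCE_RANGES = {
    "1024-2047": (1024, 10),
    "2048-4095": (2048, 11),
    "4096-8191": (4096, 12),
}


def exact_distance(range_symbol, bits, pos):
    """Read the extra bits at bits[pos:] and return (distance, next position)."""
    base, n_extra = DISTANCE_RANGES[range_symbol]
    field = bits[pos:pos + n_extra]
    if len(field) < n_extra:
        raise ValueError("not enough extra bits")
    # Assumption: the field is a binary integer, most significant bit first.
    return base + int(field, 2), pos + n_extra


print(exact_distance("1024-2047", "0000000101", 0))   # (1029, 10)
print(exact_distance("2048-4095", "11111111111", 0))  # (4095, 11)
```

The returned position tells the decoder where the next Huffman code begins, which it can know only because the preceding distance code fixed the width of the extra-bits field.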
Extra-bits may also be used in dynamic Huffman coding situations, other than coding Lempel-Ziv length and distance symbols, to specify whatever parameters are required to fully interpret the symbol alphabet without resorting to prohibitively large Huffman trees. Extra bits typically appear after the Huffman code they modify so that their number and meaning are understood in the context of that code.
DEFLATE is a compressed data format that is described in the Network Working Group Request for Comments (RFC) 1951, authored by P. Deutsch of Aladdin Enterprises in May 1996, which is hereby incorporated by reference for all purposes. Compression algorithms using the DEFLATE compressed data format provide lossless data compression using a combination of the Lempel-Ziv algorithm and Huffman coding. More particularly, DEFLATE uses Lempel-Ziv coding to reduce the number of symbols in a given block of data and subsequently uses Huffman coding to reduce the number of bits consumed by individual symbols (as discussed above). DEFLATE is used in various compressed formats, including GZIP and PKZIP.
GZIP is a compressed archive format that is described in the Network Working Group RFC 1952, authored by P. Deutsch of Aladdin Enterprises in May 1996, which is hereby incorporated by reference for all purposes. GZIP uses DEFLATE for compression and additionally includes various header information with the DEFLATE compressed data.
RAR is a commercial compression and archiving software product authored by Eugene Roshal. Various compression applications have been created by business entities, such as RARLAB, rarsoft, and WinRAR, for example.
As noted above, there are many compression techniques that make use of Huffman coding and decoding. Accordingly, systems and methods for increasing the speed and efficiency of decoding Huffman codes are desired.
In operation, one or more Huffman trees indicating the coding of the various symbols in a data block are initially created for the data block using an encoder. The Huffman trees may then be stored in any type of memory or storage device for later use by a Huffman decoder. However, the Huffman trees for a coded data block must be placed in a location that is accessible by the Huffman decoder, typically remote to the encoder, before decoding of the data block may begin. The transfer of the Huffman trees from the storage device to the decoder memory often takes considerable time and delays the decode operation. Thus, a system and method of reducing the time required to store Huffman trees in a decoder memory is desired.
During a decode operation of a data stream having variable length coded data, such as a Huffman decode operation, a decoder determines the length of each variable length code as the variable length code is received by the decoder. The process of determining the length of a variable length code, however, consumes valuable time. Because a decoder can only decode a variable length code if at least the position of the first bit of the code is known, a decoder cannot decode a subsequent variable length code until the lengths of all previous variable length codes have been determined. Thus, the decode process is slowed by the necessity of determining code lengths of the variable length codes. Accordingly, a system and method for reducing the time required to determine a length of variable length codes in a data stream is desired.
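The serial dependency described above can be seen in a minimal sketch that locates the first bit of each code in a stream. The function name `code_start_positions` and the lookup table `CODE_2_TABLE` (Code 2, keyed by bit pattern) are chosen for this example only. The start of each code becomes known only after the length of the code before it has been found.

```python
# Code 2, keyed by bit pattern so that a code can be recognized by lookup.
CODE_2_TABLE = {"0": "A", "10": "B", "110": "C", "111": "D"}


def code_start_positions(bits, table):
    """Return the position of the first bit of each code in the stream."""
    longest = max(len(word) for word in table)
    starts = []
    pos = 0
    while pos < len(bits):
        starts.append(pos)
        for length in range(1, longest + 1):
            if bits[pos:pos + length] in table:
                pos += length  # the next code's start is known only now
                break
        else:
            raise ValueError(f"no valid code at bit {pos}")
    return starts


print(code_start_positions("01001101000111", CODE_2_TABLE))
# [0, 1, 3, 4, 7, 9, 10, 11]
```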