1. Field of the Invention
The present invention relates to a method for representing a sequence of octets as a sequence of bits and a method for packing the latter bit sequence into octets, and more particularly to adaptive and invertible or lossless digital data compression systems.
2. Description of the Prior Art
The following definitions of terms and conventions are used:
code: a sequence of bits of arbitrary length.
octet: 8 bits stored or transmitted as a unit.
string: a sequence of arbitrary octets.
block: a string of input octets.
compressed data set: a series of output blocks, corresponding to successive blocks of input data.
All codes of a given bit length have lexicographically consecutive values, in the same order as the symbols they represent; and
Shorter codes lexicographically precede longer codes.
1) Count the number of codes for each code length. Let bl_count[N] be the number of codes of length N, N >= 1.
2) Find the numerical value of the smallest code for each code length:
3) Assign numerical values to all codes, using consecutive values for all codes of the same length with the base values determined at step 2. Codes that are never used (which have a bit length of zero) must not be assigned a value.
a pair of Huffman code trees that describe the representation of the compressed data part; and
a compressed data part. (The Huffman trees themselves are compressed using Huffman encoding.)
literal octets (of strings that have not been detected as duplicated within the previous 32K input octets); and
pointers to duplicated strings, where a pointer is represented as a pair <length, backward distance>.
LZ77 Compression Method
Many data compression methods have been derived from improvements to a compression technique developed by Lempel and Ziv and published in 1977. In this method, a large input store is maintained. Input bytes are put in this store as they are received. The decompressor must keep an identical copy of this store as it expands data. As strings of bytes are input, the store is searched to locate previous occurrences. If no previous occurrence (longer than some minimum length) is found, the bytes themselves are used. If an identical previous sequence (longer than some minimum length) is found, the position and length of that occurrence are used instead. Positions and lengths are encoded in a way that distinguishes them from literal characters. The encoded stream is sent to the decoder.
The decoder interprets codes as it accepts them. As it decodes literal characters, it appends them to its output store. When it decodes a position and length, it refers to its output store to obtain the specific string to copy and append to the output store. Thus, the decompression process parallels the compression process and keeps an identical store that is used to refer to previous strings.
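The decoding loop described above can be sketched as follows. This is a minimal illustration only, not the format disclosed herein: the token type lz_token and the function lz_decode are hypothetical names, and the tokens are assumed to have already been separated into literals and <length, distance> pairs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token: either a literal octet or a copy from the output store. */
typedef struct {
    int is_match;            /* 0 = literal, 1 = <length, distance> copy */
    unsigned char literal;   /* valid when is_match == 0 */
    size_t length;           /* valid when is_match == 1 */
    size_t distance;         /* valid when is_match == 1; counted back from the end */
} lz_token;

/* Expands tokens into out[]. Returns the number of octets written,
   or (size_t)-1 if a token is invalid or out[] is too small. */
size_t lz_decode(const lz_token *tok, size_t ntok,
                 unsigned char *out, size_t cap)
{
    size_t n = 0;
    assert(tok != NULL && out != NULL);
    for (size_t i = 0; i < ntok; i++) {
        if (!tok[i].is_match) {
            if (n >= cap) return (size_t)-1;
            out[n++] = tok[i].literal;          /* append the literal */
        } else {
            if (tok[i].distance == 0 || tok[i].distance > n) return (size_t)-1;
            if (tok[i].length > cap - n) return (size_t)-1;
            /* Copy octet by octet so that a copy may overlap the octets
               it has just produced (distance < length). */
            for (size_t k = 0; k < tok[i].length; k++, n++)
                out[n] = out[n - tok[i].distance];
        }
    }
    return n;
}
```

For example, the literals 'a' and 'b' followed by a copy of length 4 at distance 2 expand to the six octets "ababab". The copy is done one octet at a time so that a match may run into the octets it is itself producing.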
Improvements to LZ77 Method
Extensive searching is required to locate the longest previous occurrences, and there is a practical limit to the size that the store can take. Additional compression techniques can also be applied to the data stream. Several variations that enhance LZ77 have been proposed. The improvement disclosed herein is mostly based on the `deflate` method. The `deflate` format and compression algorithm were developed by Jean-loup Gailly in 1992, and the block formats and Huffman codes were defined therein. The format is used extensively on the Internet and in commercial products (such as PKWARE's PKZIP).
The longest previous sequence that matches the current sequence is located using hash chains. Huffman coding is then applied to the encoded literal strings, displacements, and lengths.
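One way to organize such a search is sketched below. It is an illustration under stated assumptions, not the actual `deflate` implementation: the names hash_chains, hc_init, hc_insert and hc_longest_match, the hash function, and the constants WSIZE, HASH_SIZE and MIN_MATCH are all hypothetical. Each position is chained to the previous position whose first three octets hash to the same value, so only those candidates are compared.

```c
#include <assert.h>
#include <stddef.h>

#define WSIZE     32768u          /* assumed store size (32K octets) */
#define HASH_SIZE 4096u           /* assumed hash table size (power of two) */
#define MIN_MATCH 3u              /* assumed minimum match length */
#define NIL       ((size_t)-1)    /* marks the end of a chain */

typedef struct {
    size_t head[HASH_SIZE];       /* most recent position for each hash value */
    size_t prev[WSIZE];           /* previous position with the same hash */
} hash_chains;

/* Illustrative hash of the three octets starting at p. */
static unsigned hash3(const unsigned char *p)
{
    return (unsigned)((p[0] << 8) ^ (p[1] << 4) ^ p[2]) & (HASH_SIZE - 1);
}

void hc_init(hash_chains *hc)
{
    for (size_t i = 0; i < HASH_SIZE; i++) hc->head[i] = NIL;
}

/* Records position pos; buf[pos..pos+2] must be valid octets. */
void hc_insert(hash_chains *hc, const unsigned char *buf, size_t pos)
{
    unsigned h = hash3(buf + pos);
    hc->prev[pos % WSIZE] = hc->head[h];
    hc->head[h] = pos;
}

/* Searches the chain for the longest earlier match of buf[pos..end).
   Call this before inserting pos itself. Returns the match length
   (0 if shorter than MIN_MATCH) and stores the backward distance in *dist. */
size_t hc_longest_match(const hash_chains *hc, const unsigned char *buf,
                        size_t pos, size_t end, size_t *dist)
{
    size_t best = 0;
    assert(dist != NULL);
    if (end - pos < MIN_MATCH) return 0;
    for (size_t cand = hc->head[hash3(buf + pos)];
         cand != NIL && cand < pos && pos - cand < WSIZE;
         cand = hc->prev[cand % WSIZE]) {
        size_t len = 0;
        while (pos + len < end && buf[cand + len] == buf[pos + len]) len++;
        if (len > best) { best = len; *dist = pos - cand; }
    }
    return best >= MIN_MATCH ? best : 0;
}
```

A practical compressor would typically also bound the number of chain entries examined per position, trading some compression for speed; that limit is omitted here for brevity.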
Synopsis of Prefix and Huffman Coding
Prefix coding represents symbols from an a priori known alphabet by bit sequences (codes), one code for each symbol, in a manner such that different symbols may be represented by bit sequences of different lengths. A parser can always parse an encoded string unambiguously symbol-by-symbol.
A prefix code is defined in terms of a binary tree in which the two edges descending from each non-leaf node are labeled 0 and 1 and in which the leaf nodes correspond one-for-one with (are labeled with) the symbols of the alphabet; then the code for a symbol is the sequence of 0's and 1's on the edges leading from the root to the leaf labeled with that symbol. For example: ##STR1##
A parser can decode the next symbol from an encoded input stream by walking down the tree from the root, at each step choosing the edge corresponding to the next input bit.
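The tree walk can be sketched as follows. The node type and the function prefix_decode are hypothetical names introduced for illustration, and the input is assumed to be supplied as one bit per array element rather than packed into octets.

```c
#include <assert.h>
#include <stddef.h>

/* Node of a binary prefix-code tree. A leaf has NULL children and
   carries a symbol; an internal node has exactly two children. */
typedef struct node {
    const struct node *child[2];  /* child[0] follows bit 0, child[1] bit 1 */
    char symbol;                  /* meaningful only at a leaf */
} node;

/* Decodes bits[0..nbits) into out[]. Returns the number of symbols
   decoded, or (size_t)-1 if the input ends in the middle of a code
   or out[] is too small. */
size_t prefix_decode(const node *root, const unsigned char *bits,
                     size_t nbits, char *out, size_t cap)
{
    size_t n = 0;
    const node *cur = root;
    assert(root != NULL && root->child[0] != NULL && root->child[1] != NULL);
    for (size_t i = 0; i < nbits; i++) {
        cur = cur->child[bits[i] & 1];    /* follow the edge for this bit */
        if (cur->child[0] == NULL) {      /* reached a leaf */
            if (n >= cap) return (size_t)-1;
            out[n++] = cur->symbol;
            cur = root;                   /* restart for the next symbol */
        }
    }
    return cur == root ? n : (size_t)-1;
}
```

Because no code is a prefix of another, the walk reaches exactly one leaf per symbol and no separator between codes is needed.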
Given an alphabet with known symbol frequencies, the Huffman algorithm allows the construction of an optimal prefix code (one which represents strings with those symbol frequencies using the fewest bits of any possible prefix code for that alphabet). Such a code is called a Huffman code. (See the references for additional information on Huffman codes.)
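The Huffman codes used for each alphabet in the `deflate` format have two additional rules: all codes of a given bit length have lexicographically consecutive values, in the same order as the symbols they represent; and shorter codes lexicographically precede longer codes.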
Recoding the example above to follow these rules, and assuming that the order of the alphabet is ABCD, gives:
______________________________________
Symbol    Code
______________________________________
A         10
B         0
C         110
D         111
______________________________________
I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are lexicographically consecutive.
Given this rule, one can define the Huffman code for an alphabet just by giving the bit lengths of the codes for each symbol of the alphabet in order; this is sufficient to determine the actual codes. In our example, the code is completely defined by the sequence of bit lengths (2, 1, 3, 3). The following algorithm generates the codes as integers, intended to be read from most- to least-significant bit. The code lengths are initially in tree[I].Len; the codes are produced in tree[I].Code.
______________________________________
/* Step 2: smallest code value for each code length. */
code = 0;
bl_count[0] = 0;
for (bits = 1; bits <= MAX_BITS; bits++) {
    next_code[bits] = code = (code + bl_count[bits-1]) << 1;
}
______________________________________
______________________________________
/* Step 3: consecutive values within each length; skip unused symbols. */
for (n = 0; n <= max_code; n++) {
    len = tree[n].Len;
    if (len == 0) continue;
    tree[n].Code = next_code[len]++;
}
______________________________________
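For illustration, the fragments above can be assembled into one self-contained routine that also performs the counting of codes per length. This is a sketch under stated assumptions: the type ct_data, the function name gen_codes, and the value MAX_BITS = 15 are chosen here for the example and are not taken from the text above.

```c
#include <assert.h>

#define MAX_BITS 15   /* assumed maximum code length */

/* Hypothetical per-symbol record holding the Len and Code fields used above. */
typedef struct {
    unsigned Code;    /* output: the code, read from most- to least-significant bit */
    unsigned Len;     /* input: the code length in bits, 0 if the symbol is unused */
} ct_data;

/* Assigns codes to tree[0..max_code] from their bit lengths alone. */
void gen_codes(ct_data *tree, int max_code)
{
    unsigned bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1];
    unsigned code = 0;
    int bits, n;

    /* Count the number of codes for each code length. */
    for (n = 0; n <= max_code; n++) {
        assert(tree[n].Len <= MAX_BITS);
        bl_count[tree[n].Len]++;
    }

    /* Find the smallest code value for each code length. */
    bl_count[0] = 0;   /* unused symbols do not consume code space */
    for (bits = 1; bits <= MAX_BITS; bits++)
        next_code[bits] = code = (code + bl_count[bits - 1]) << 1;

    /* Assign consecutive values to all codes of the same length. */
    for (n = 0; n <= max_code; n++) {
        unsigned len = tree[n].Len;
        if (len == 0) continue;
        tree[n].Code = next_code[len]++;
    }
}
```

Applied to the bit lengths (2, 1, 3, 3) for the alphabet ABCD, the counts are one code of length 1, one of length 2, and two of length 3, so the base values are 0, 2 and 6 for lengths 1, 2 and 3. The routine therefore assigns A = 2 (binary 10), B = 0 (binary 0), C = 6 (binary 110) and D = 7 (binary 111), in agreement with the table above.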