Because the differences between data compression in data storage systems and in data communication systems are insignificant, only data storage systems are referred to here, particularly the data files stored in such systems. However, the techniques described for data storage systems can easily be extended to cover data communications systems and other applications as well. A file is assumed to be a sequential stream of bytes or characters, where a byte consists of some fixed number of bits (typically 8), and the compression system transforms this input byte stream into a "compressed" output stream of bytes from which the original file contents can be reconstructed by a decompression unit.
It is well-established that computer data files typically contain a significant amount of redundancy. Many techniques have been applied over the years to "compress" these files so that they will occupy less space on the disk or tape storage medium or so that they can be transmitted in less time over a communications channel such as a 1200 baud modem line. For example, there are several widely used commercial programs available for personal computers (e.g., ARC Software by Systems Enhancement Associates, Inc., Wayne, N.J. 1985) which perform the compression and decompression functions on files. It is not uncommon for such programs to reduce the size of a given file by a 2:1 ratio (or better), although the amount of reduction varies widely depending on the contents of the file.
There are many approaches in the prior art for compressing data. Some of these approaches make implicit assumptions about certain types of files or data within the files. For example, a bit image of a page produced using a scanner typically has most of its pixels blank, and this tendency can be exploited by a compression algorithm to greatly reduce the size of such files. Similarly, word processing files contain many ASCII characters which are easily compressed using knowledge of which characters (or words) occur most frequently in the language of interest (e.g., English). Other compression methods are independent of the file type and attempt to "adapt" themselves to the data. In general, type-specific compression techniques may provide higher compression performance than general-purpose algorithms on the files for which the techniques are optimized; however, they tend to have much lower compression performance if the file model is not correct. For instance, a compression method optimized for English text might work poorly on files containing French text.
Typically, a storage system does not "know" what type of data is stored within it. Thus, data-specific compression techniques are avoided, or they are only used as one of a set of possible techniques. For example, ARC uses many methods and picks the one that performs best for each file. However, this approach requires significant computational overhead compared to using a single compression method.
Another important aspect of any compression method is the speed at which a file can be processed. If the speed of compression (or decompression) is so low as to significantly degrade system performance, then the compression method is unacceptable even though it may achieve higher compression ratios than competing methods. For example, with streaming tape systems, if the file cannot be compressed fast enough to provide data at the required rate for the tape drive, the tape will fall out of streaming and the performance and/or capacity gains due to compression will be nullified.
One of the most common compression techniques is known as run-length encoding. This approach takes advantage of the fact that files often have repeated strings of the same byte (character), such as zero or the space character. Such strings are encoded using an "escape" character, followed by the repeat count, followed by the character to be repeated. All other characters which do not occur in runs are encoded by placing them as "plain text" into the output stream. The escape character is chosen to be a seldom used byte, and its occurrence in the input stream is encoded as a run of length one with the escape character itself as the character. Run-length encoding performs well on certain types of files, but can have poor compression ratios if the file does not have repeated characters (or if the escape character occurs frequently in the file). Thus, the selection of the escape character in general requires an extra pass on the data to find the least used byte, lowering the throughput of such a system.
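The escape-based scheme described above can be sketched as follows. This is only an illustrative sketch: the escape value `0x90`, the one-byte repeat count (runs capped at 255), and the `min_run` threshold are assumptions chosen for the example and do not come from the text.

```python
ESC = 0x90  # hypothetical escape byte; the text only asks for a seldom-used byte


def rle_encode(data: bytes, esc: int = ESC, min_run: int = 4) -> bytes:
    """Encode runs as (esc, count, byte); copy everything else as plain text."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        run = 1
        # Measure the run starting at i; the count must fit in one byte.
        while i + run < len(data) and data[i + run] == b and run < 255:
            run += 1
        if run >= min_run or b == esc:
            # Long runs, and every occurrence of the escape byte itself,
            # are written as a three-byte escape sequence.
            out += bytes([esc, run, b])
        else:
            # Short runs cost no more as plain text than as an escape sequence.
            out += bytes([b]) * run
        i += run
    return bytes(out)


def rle_decode(data: bytes, esc: int = ESC) -> bytes:
    """Invert rle_encode (assumes well-formed input)."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == esc:
            count, b = data[i + 1], data[i + 2]
            out += bytes([b]) * count
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

A lone escape byte in the input expands from one byte to three, which is why a frequently occurring escape byte hurts the compression ratio.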
A more sophisticated approach is known as Huffman encoding (see, Huffman, David A., "A Method for the Construction of Minimum Redundancy Codes" Proceedings of the IRE, pp. 1098-1110, September 1952). In this method, it is assumed that certain bytes occur more frequently in the file than others. For example, in English text the letter "t" or "T" is much more frequent than the letter "Q". Each byte is assigned a bit string, the length of which is inversely related to the relative frequency of that byte in the file. These bit strings are chosen to be uniquely decodeable if processed one bit at a time. Huffman derived an algorithm for optimally assigning the bit strings based on relative frequency statistics for the file.
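A minimal sketch of Huffman's construction is shown below, assuming byte frequencies are counted from the data itself. The function name `huffman_codes` and the representation of codes as strings of "0" and "1" are illustrative choices, not taken from the cited paper.

```python
import heapq
from collections import Counter


def huffman_codes(data: bytes) -> dict:
    """Return {byte value: bit string} built by repeatedly merging the two
    least frequent subtrees, as in Huffman's algorithm."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct byte still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, unique tiebreak, {symbol: code so far}).
    # The tiebreak keeps the dictionaries from ever being compared.
    heap = [(w, sym, {sym: ""}) for sym, w in freq.items()]
    heapq.heapify(heap)
    tiebreak = 256
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

For an input with byte counts 8, 4, 2 and 1, the resulting code lengths are 1, 2, 3 and 3 bits, so the more frequent a byte is, the shorter its code.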
The Huffman algorithm guarantees that asymptotically the compression achieved will approach the "entropy" of the file, which is precisely defined as:

$$H = -\sum_{i} p(i) \log_2 p(i)$$

where p(i) is the probability (relative frequency) of character i in the file and the sum is taken over all characters in the alphabet. The units of H are in bits, and it measures how many bits (on the average) are required to represent a character in the file. For example, if the entropy were 4.0 bits using an 8-bit byte, a Huffman compression system could achieve 2:1 compression on the file. The higher the entropy, the more "random" (and thus less compressible) is the data.
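As a rough illustration of the entropy measure, the following sketch estimates H from the byte frequencies of a buffer. The function name is illustrative, and p(i) is assumed to be the relative frequency of byte value i in the data.

```python
import math
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Zero-order entropy H in bits per byte, using relative frequencies."""
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Sixteen equally likely byte values give H = 4.0 bits, which corresponds
# to the 2:1 figure for 8-bit bytes mentioned in the text.
sample = bytes(range(16)) * 100
ratio = 8 / entropy_bits_per_byte(sample)
```

A buffer consisting of one repeated byte has an entropy of zero under this measure, and a buffer in which all 256 byte values are equally frequent has an entropy of 8 bits.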
Huffman encoding works very well on many types of files. However, assignment of bit strings to bytes presents many practical difficulties. For example, if a pre-assigned encoding scheme is used (e.g., based on frequency of occurrence of letters in English), Huffman encoding may greatly expand a file if the pre-assigned scheme assumes considerably different frequency statistics than are actually present in the file. Additionally, computing the encoding scheme based on the file contents not only requires two passes over the data as well as applying the Huffman algorithm to the frequency statistics (thus lowering system throughput), but it also requires that the encoding table be stored along with the data, which has a negative impact on the compression ratio. Furthermore, the relative frequency of bytes can easily change dynamically within the file, so that at any point the particular encoding assignment may perform poorly.
There have been many variations on the Huffman approach (e.g., Jones, Douglas W., "Application of Splay Trees to Data Compression" Communications of the ACM, pp. 996-1007, Vol. 31, No. 8, August 1988) and they usually involve dynamic code assignment based on the recent history of input bytes processed. Such schemes circumvent the problems discussed above. Other approaches include looking at two byte words (bi-grams) at the same time and performing Huffman encoding on the words.
A recent variation of Huffman encoding is present in U.S. Pat. No. 4,730,348 to MacCrisken (and other patents referenced therein). In MacCrisken, Huffman codes are assigned to bytes in the context of the previous byte. In other words, a plurality of encoding tables are used, each table being selected according to the previous byte. This approach is based on the observation that, for example, in English the letter "u" does not occur very frequently, but following a "q" it appears almost always. Thus, the code assigned to "u" would be different depending on whether or not the previous letter was "q" (or "Q"). For a similar scheme using multiple tables and dynamic code assignment see, Jones, Douglas W., "Application of Splay Trees to Data Compression".
The above described Huffman-type approaches tend to be computationally intensive and do not achieve exceptionally high compression ratios. One explanation for this observation is that a pure Huffman code based on 8-bit bytes can achieve at best an 8:1 compression ratio, and only in the optimal situation when the file consists of the same byte repeated over and over (i.e., entropy = 0). In the same scenario even a simple run-length encoding scheme could achieve better than a 50:1 compression ratio. The average performance will be some combination of best and worst case numbers, and limiting the best case must also limit the average. A well-known limitation of Huffman coding is that, if the probabilities are not exact negative powers of two (1/2, 1/4, 1/8, and so on), it cannot achieve the entropy, although it is guaranteed to come within one bit per symbol of the theoretical limit. This is due to the fact that all Huffman codes are an exact number of bits in length, while to achieve entropy in all cases would require fractional bit lengths. In other words, Huffman's algorithm suffers from rounding problems. In general, the problem worsens when there are tokens with high probabilities, since a fraction of a bit of "error" is a large percentage of the size of the code assigned.
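The rounding problem can be made concrete with a small numeric sketch. The two probability distributions below are made up for illustration; the code lengths are the ones any Huffman code assigns to them (1, 2, 3, 3 bits for the first, and 1 bit each for a two-symbol alphabet).

```python
import math


def entropy(probs):
    """Entropy in bits per symbol for a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_length(probs, lengths):
    """Average code length in bits per symbol."""
    return sum(p * n for p, n in zip(probs, lengths))


# Probabilities that are exact powers of 1/2: Huffman meets the entropy.
dyadic = [0.5, 0.25, 0.125, 0.125]
dyadic_gap = expected_length(dyadic, [1, 2, 3, 3]) - entropy(dyadic)

# One highly probable symbol: each symbol still costs a whole bit,
# although the entropy is under half a bit per symbol.
skewed = [0.9, 0.1]
skewed_gap = expected_length(skewed, [1, 1]) - entropy(skewed)
```

In the skewed case the entropy is about 0.47 bits per symbol while the Huffman code spends 1 bit per symbol, so the code is roughly twice as long as the theoretical limit.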
Arithmetic coding is a well-known technique that can actually overcome the rounding problem. However, the tables required for arithmetic coding are not as compressible as Huffman tables, and performing the arithmetic algorithm dynamically to overcome the table size problem, while possible, is very computationally intensive. The net result is that the gains achieved in practice using arithmetic coding are not as large as would be hoped from a theoretical standpoint.
A totally different approach to compression was developed by Lempel and Ziv (see, Ziv, J. and Lempel, A., "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, Vol. IT-24, pp. 530-536, September 1978) and then refined by Welch (see, Welch, Terry A., "A Technique for High-Performance Data Compression", IEEE Computer, pp. 8-19, June 1984). Instead of assigning variable length codes to fixed size bytes, the Lempel-Ziv algorithm ("LZ") assigns fixed-length codes to variable size strings. As input bytes from the file are processed, a table of strings is built up, and each byte or string of bytes is compressed by outputting only the index of the string in the table. Typically this index is in the range 11-14 bits, and 12 bits is a common number because it lends itself to a simple implementation. Since the table is constructed using only previously encoded bytes, both the compression and the decompression system can maintain the same table without any extra overhead required to transmit table information. Hashing algorithms are used to find matching strings efficiently. At the start of the file, the table is initialized to one string for each character in the alphabet, thus ensuring that all bytes will be found in at least one string, even if that string only has length one.
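The table-building scheme described above can be sketched roughly as follows. This is a simplified illustration, not the implementation from the cited papers: codes are returned as a list of integers rather than packed into 12-bit fields, a Python dictionary stands in for the hashing scheme, and the table simply stops growing once it is full.

```python
def lzw_compress(data: bytes, max_bits: int = 12) -> list:
    """Return a list of table indices; each would be written as max_bits bits."""
    table = {bytes([i]): i for i in range(256)}  # one entry per single byte
    max_entries = 1 << max_bits
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate            # keep extending the match
        else:
            codes.append(table[current])   # emit the longest known string
            if len(table) < max_entries:
                table[candidate] = len(table)
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list, max_bits: int = 12) -> bytes:
    """Rebuild the same table from the codes alone and recover the input."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    max_entries = 1 << max_bits
    prev = table[codes[0]]
    out = bytearray(prev)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The compressor used the string it had only just added.
            entry = prev + prev[:1]
        else:
            raise ValueError("invalid code")
        out += entry
        if len(table) < max_entries:
            table[len(table)] = prev + entry[:1]
        prev = entry
    return bytes(out)
```

No table is transmitted in this sketch: the decompressor adds each string one step after the compressor does, using only the codes it has already received.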
The Lempel-Ziv algorithm is particularly attractive because it adapts itself to the data and requires no pre-assigned tables predicated on the file contents. Furthermore, since a string can be extremely long, the best case compression ratio is very high, and in practice LZ out-performs Huffman schemes on most file types. It is also quite simple to implement, and this simplicity manifests itself in high throughput rates.
There are also some drawbacks, however, to the LZ compression method. The LZ string search is a "greedy" algorithm. For example, consider the string:

ABCDEFBCDEF

where A, B, C, D, E, F are any distinct bytes. The LZ string search would add the following strings to its string table: AB, BC, CD, DE, EF, FB, BCD, DEF. The only strings of length two or greater that can be output using this algorithm, up to the point shown, are BC and DE. In actuality the string BCDEF has already occurred in the input. Thus, while ideally the second BCDEF string would be referenced back to the original BCDEF, in practice this does not occur.
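The greedy behavior in this example can be traced with a short sketch. The function below is a simplified LZW-style string search written for illustration only: it works on text, seeds the table with just the characters that appear in the input, and records each string it adds and each string it outputs.

```python
def lzw_trace(text: str):
    """Return (strings output, strings added to the table) for a greedy
    LZW-style string search over text."""
    table = set(text)          # one single-character string per symbol used
    emitted, added = [], []
    current = ""
    for ch in text:
        if current + ch in table:
            current += ch      # greedy: extend the current match
        else:
            emitted.append(current)
            table.add(current + ch)
            added.append(current + ch)
            current = ch
    if current:
        emitted.append(current)
    return emitted, added


emitted, added = lzw_trace("ABCDEFBCDEF")
# emitted: A, B, C, D, E, F, BC, DE, F
# added:   AB, BC, CD, DE, EF, FB, BCD, DEF
```

The trace also records FB, which is added when the first F is followed by B. The second BCDEF is output as BC, DE and F, because the table never held a string longer than two bytes at the moment it was needed.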
A more significant disadvantage to the LZ approach is that the string table for holding the compressed data will tend to fill up on long files. The table size could be increased; however, this approach would require more bits to represent a string and thus it would be less efficient. One approach to handling this deficiency would be to discard all or part of the table when it fills. Because of the structure of the algorithm, the most recently found strings have to be discarded first, since they refer back to previous strings. However, it is the most recent strings that have been dynamically adapting to the local data, so discarding them is also inefficient. Basically, the LZ string table has infinite-length memory, so changes in the type of data within the file can cause great encoding inefficiencies if the string table is full.
It is also possible to design a compression system that utilizes more than one method simultaneously, dynamically switching back and forth depending on which method is most efficient within the file. From an implementation standpoint, such a scheme may be very costly (i.e., slow and/or expensive); however, the resulting compression rate could be very high.
One such method of dynamically switching back and forth is disclosed in MacCrisken. As mentioned above, a bi-gram Huffman method is utilized as the primary compression technique. Typically the compression and decompression system start with a pre-defined (i.e., static) set of code tables. There may be a set of such tables, perhaps one each for English, French, and Pascal source code. The compression unit (sender) first transmits or stores a brief description of which table is to be used. The decompression unit (receiver) interprets this code and selects the appropriate table. During compression, if it is determined that the current table is not performing well, the sender transmits a special ("escape") Huffman code that tells the receiver to either select another specific pre-defined table or to compute a new table based on the previous data it has decompressed. Both sender and receiver compute the table using the same algorithm, so there is no need to send the entire table, although it takes some time to perform the computation. Once the new table is computed, compression proceeds as before. It should be noted that although there is considerable computational overhead, there is no reason why this technique could not be further adapted to a dynamic Huffman scheme.
In addition to the Huffman encoding, MacCrisken uses a secondary string-based compression method. Both sender and receiver maintain a history buffer of the most recently transmitted input bytes. For each new input byte (A), the bigram Huffman code is generated, but an attempt is also made to find the string represented by the next three input bytes (ABC) in the history using a hashing scheme. The hash is performed on three byte strings and a doubly-linked hash list is maintained to allow discarding of old entries in the hash list. If a string is found, a special Huffman escape code can be generated to indicate that a string follows, and the length and offset of the string in the history buffer is sent. The offset is encoded in 10 bits, while the length is encoded into 4 bits, representing lengths from 3-18 bytes. Before such a string is sent however, the compression unit generates the Huffman codes for all the bytes in the string and compares the size of the Huffman codes with the size of the string bits. Typically the Huffman string escape code is four bits, so it takes 19 bits to represent a string. The smaller of the two quantities is sent.
Note that the MacCrisken string method avoids the problems of the Lempel-Ziv method in that the string "table" never fills up, since the old entries are discarded by removing them from the hash list. Thus, only the most recent (within 1K bytes) strings occupy the table. Also it is not "greedy" since in principle all matching strings can be found. In practice, a limit on the length of the string search is imposed. Additionally, the MacCrisken method is computationally inefficient because it is effectively performing two compression algorithms at once, and thus the computational overhead is quite high.
Other algorithms use a variant of MacCrisken's variation on the Lempel-Ziv technique: they maintain a "sliding window" of the most recently processed bytes of data and scan the window for strings of matching bytes. If a string is found, the length of the matching string and its offset within the window are output; otherwise, a "raw" byte is output. The encoder portion of the compression engine emits a tag to distinguish between strings and raw bytes, and the strings and raw bytes themselves may be encoded in many ways.
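A sliding-window string search of this kind can be sketched as below. The sketch is deliberately naive: it scans the whole window for every position instead of using a hash list, and it returns tagged tuples rather than an encoded bit stream. The window size of 1024 bytes and the length range of 3 to 18 are borrowed from the MacCrisken description above as example parameters only.

```python
def window_tokens(data: bytes, window: int = 1024,
                  min_len: int = 3, max_len: int = 18) -> list:
    """Return ("raw", byte) and ("string", offset, length) tokens."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i and overlap the bytes it encodes.
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_len:
            tokens.append(("string", best_off, best_len))
            i += best_len
        else:
            tokens.append(("raw", data[i]))
            i += 1
    return tokens


def window_expand(tokens: list) -> bytes:
    """Rebuild the data by copying strings out of the bytes already produced."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "raw":
            out.append(tok[1])
        else:
            _, offset, length = tok
            for _ in range(length):
                out.append(out[-offset])  # byte-by-byte copy handles overlap
    return bytes(out)
```

On the input ABCDEFBCDEF this search produces six raw bytes followed by a single string token with offset 5 and length 5, so the second BCDEF is referenced back to the first.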
Obviously, since various types of data will have different distributions of string lengths and offsets, a single fixed encoding cannot be optimal for all possible files. Thus, various techniques have been developed to determine the encoding based on the strings found. For example, Huffman coding can be used to encode the string lengths and offsets. In practice, not all lengths and offsets are given an individual Huffman code. Instead, ranges of lengths and offsets may be represented by a single Huffman code, with extra bits following the Huffman code to distinguish between values within the range. These ranges, or bins, are chosen to approximate the distributions typically observed in data.
The extremely compelling advantage of such an approach is that the encoding can be optimized, within the constraints of the bins chosen, for the data being processed so as to minimize the size of its compressed image. One disadvantage of such an approach is that a table of some type describing the encoding format must be sent along with the data, thus counteracting to some extent the extra compression gained by the variable encoding. In practice, for large enough data blocks, this overhead is more than compensated for by the gains in encoding. Another disadvantage is that this type of approach is inherently more complex to implement, whether in hardware or software, than a fixed encoding scheme. Again, the gain in compression ratio often is more important than the increase in complexity. It is possible to modify the encoding dynamically as each byte of data is processed, removing the need for a table, but such a scheme is considerably more complex, typically slowing compression and decompression throughput dramatically without a corresponding dramatic gain in compression ratio. A third disadvantage, which is not significant in many cases, is that this type of algorithm is essentially a two-pass approach, requiring all the data to be processed by the string search engine before any encoded tokens can be output.
In addition to encoding the strings, raw bytes may also be encoded. Using sliding window methods, every item output is either a string or a raw byte, so the raw bytes and strings may be encoded together. For example, a single Huffman code may represent either a raw byte or a string of a certain length. Including raw bytes in the encoding tends to further increase the size of the table which specifies the particular encoding used, but this increase in table size is typically overcome by the resulting gain in compression.
PKZIP version 2.0 and LHA version 2.13 are commonly available compression utilities for MS-DOS computers that use this type of compression method. Although the string searching techniques used by these programs are different, the resulting compression formats are extremely similar in style. Not surprisingly, very similar compression ratios result. Each program uses a sliding window and a minimum string length of three, and generates two Huffman tables that are stored as part of the compressed data. The first (and largest) Huffman table encodes raw bytes and string lengths. For example, PKZIP assigns Huffman codes 0-255 to raw bytes, and Huffman codes 257-285 to string lengths from 3 to 258, with a total of 29 length bins of various sizes.
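The idea of length bins with extra bits can be illustrated as follows. The bin boundaries used here are those of the DEFLATE format as specified in RFC 1951, which is assumed to correspond to the PKZIP 2.0 format mentioned above; they are not given in the text. The function name is illustrative.

```python
# Smallest string length covered by each of the 29 bins (codes 257..285),
# as tabulated in RFC 1951 (assumed here to match PKZIP 2.0).
LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
               35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
# Number of extra bits that follow the Huffman code for each bin.
LENGTH_EXTRA = [0] * 8 + [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4 + [0]


def length_to_bin(length: int):
    """Map a string length (3..258) to (code, extra bit count, extra value)."""
    if not 3 <= length <= 258:
        raise ValueError("length out of range")
    # Search from the largest bin down so that 258 gets its own code, 285.
    for idx in range(len(LENGTH_BASE) - 1, -1, -1):
        if length >= LENGTH_BASE[idx]:
            return 257 + idx, LENGTH_EXTRA[idx], length - LENGTH_BASE[idx]
```

For example, a length of 12 falls in the bin for code 265, which covers lengths 11 and 12 and is followed by one extra bit, here with the value 1. Short lengths each get their own code, while long, rarer lengths share a code and are told apart by up to five extra bits.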
A second Huffman table is employed by PKZIP and LHA to represent the string offsets, once the string length is specified. In other words, after the Huffman code corresponding to a string length (as opposed to a raw byte), a different Huffman code is used to specify the string offset. PKZIP has Huffman codes for 30 offset bins ranging from 1 to 32768, while LHA has 13 offset bins ranging from 1 to 8191. These algorithms are most effective when compressing blocks of data which are 8K bytes or more in size, so that the table overhead as a fraction of block size is minimized.
In these products, the Huffman tables are themselves stored in a compressed form, relying on the well-known fact that, given only the lengths of codes generated by Huffman's algorithm, it is possible to generate and assign a unique set of Huffman codes. Thus, only the lengths of the Huffman codes need to be stored, resulting in a table which is considerably smaller (and more compressible) than the codes themselves. In fact, the Huffman lengths are compressed using Huffman coding, so there is actually an initial (uncompressed) Huffman table which is used to extract the Huffman lengths, which are in turn used to generate the Huffman codes used in compressing and decompressing the data.
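The fact that code lengths alone determine a unique set of codes can be sketched as below. The assignment rule used here (shorter codes first, and symbols of equal length in increasing symbol order) is the canonical ordering described in RFC 1951; it is assumed for illustration, since the text does not say which rule these products follow.

```python
def canonical_codes(lengths: dict) -> dict:
    """Assign bit strings from code lengths alone.

    lengths maps each symbol to its code length in bits (0 means unused).
    """
    if not lengths:
        return {}
    max_len = max(lengths.values())
    # Count how many codes there are of each length.
    count = [0] * (max_len + 1)
    for n in lengths.values():
        if n:
            count[n] += 1
    # Compute the numerically smallest code value for each length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + count[bits - 1]) << 1
        next_code[bits] = code
    # Hand out consecutive values within each length, in symbol order.
    codes = {}
    for sym in sorted(lengths):
        n = lengths[sym]
        if n:
            codes[sym] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes


codes = canonical_codes({"A": 3, "B": 3, "C": 3, "D": 3,
                         "E": 3, "F": 2, "G": 4, "H": 4})
# F -> 00, A -> 010, ..., E -> 110, G -> 1110, H -> 1111
```

Because compressor and decompressor both apply the same rule, storing the eight lengths above is enough to reproduce all eight codes.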
Typically, these approaches can compress data to a size that is 10-15% smaller than fixed encoding techniques. Much of the literature and research in data compression has focused more on string search methods than on encoding techniques, but it is clear empirically that considerable gains can be achieved (at a cost in complexity) strictly by concentrating on how encoding is performed. Even ignoring the complexity aspect, however, fixed encoding is still important for many applications where tables cannot be sent. For example, in many communication systems, small packets of data (often less than 100 bytes) must be compressed. The table overhead would be significant in this case. Similarly, in some applications, the data must be compressed and transmitted as it is received, without waiting for an entire block to be received so that a table could be generated.
A major portion of the gain in compression ratio using a variable encoding scheme comes from the variable coding itself, which adapts to the distribution of raw bytes and strings. However, another important component of the gain is attributable to the larger window size (e.g., 8K bytes and above) afforded by the variable coding. Larger windows allow more strings to be found, since more history is available for string searching. For fixed encoding schemes, unfortunately, the increase in the encoded size of offsets tends to negate the fact that more strings are found, while with variable encoding schemes the extra strings will increase the overall compression ratio due to the adaptiveness of the offset encoding.
From an implementation standpoint, one problem with larger window sizes is that the cost of hardware required may be prohibitive, particularly if the entire compression and decompression engines are to be placed on a single integrated circuit. Similarly, software implementations usually require memory size proportional to the window size, and this may be unacceptable in some instances. In any case it is normally desirable to have compatible software and hardware versions of a compression algorithm. The cost and speed of both hardware and software must be taken into account, as well as the compression ratio achievable with the algorithm.