1. Field of the Invention
The present invention relates generally to data processing and more particularly to digital data compression.
2. Description of Prior Art
Information processing systems and data transmission systems frequently need to store large amounts of digital data in a mass memory device or to transfer large amounts of digital data using a resource which may only carry a limited amount of data at a time, such as a communications channel. Therefore, approaches have been developed to increase the amount of data that can be stored in memory and to increase the information carrying capacity of capacity-limited resources. Most conventional approaches to realizing such increases are costly in terms of equipment or monetary expense, because they require the installation of additional resources or the physical improvement of existing resources. Data compression, in contrast with other conventional approaches, provides such increases without incurring large costs. In particular, it does not require the installation of additional resources or the physical improvement of existing resources.
Data compression methods and apparatuses remove redundancy from a data stream, while still preserving the information content. The data compression methods and apparatuses which are of the greatest interest are those which are fully reversible, such that an original data stream may be reconstructed from compressed data without any loss of information content. Techniques, such as filtering, which are not fully reversible, are sometimes suitable for compressing visual images or sound data. They are, nevertheless, not suitable for compression of program image files, textual report files and the like, because the information content of such files must be preserved exactly.
There are two major goals in digital data compression. The first goal is to maximize compression by using the fewest possible bits to represent a given quantity of input data. The second goal is to minimize the resources required to perform compression and decompression. The second goal encompasses such objectives as minimizing computation time and minimizing the amount of memory required to compress and decompress the data. Data compression methods of the prior art typically achieve only one of these goals.
There are two major families of data compression methods currently in use. Both of these families are derived from methods developed by Ziv and Lempel. The first family of methods is based on a method of Ziv and Lempel which will be referred to hereinafter as LZ78. This method is described in detail in Ziv et al., "Compression of Individual Sequences Via Variable-Rate Coding," IEEE Transactions on Information Theory, IT-24-5, September, 1978, pp. 530-537. The second family of methods is based on another method of Ziv and Lempel which will be referred to hereinafter as LZ77. This method is described in detail in Ziv et al., "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, IT-23-3, May, 1977, pp. 337-343.
For the purpose of conveniently manipulating digital data, the data is usually divided into symbols, such as binary words, bytes or ASCII characters. LZ78 and LZ77 compress an input data stream of symbols by dividing the input data into substrings of symbols, and then replacing the substrings with short codes representing those substrings. So that a compressed data stream may be decompressed, a dictionary equating each substring with the code which replaces it is built as substrings are replaced. The division of the input data stream into substrings is performed so that each substring is the longest string for which an identical substring occurs earlier in the input data stream.
LZ78-based methods build the dictionary from substrings for which matches have been found. The codes used in the output data to represent substrings of the input data are simply indexes into this dictionary. As noted above, substrings selected for placement in the dictionary are the longest substrings for which a matching substring may be found earlier in the input data. The most popular derivative of LZ78 is Lempel-Ziv-Welch (hereinafter referred to as LZW), which is described in U.S. Pat. No. 4,558,302 issued to Welch on Dec. 10, 1985.
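By way of illustration only, the dictionary-building behavior described above may be sketched as follows. The sketch assumes the basic LZ78 output form, in which each code is a pair consisting of a dictionary index and the next input symbol. The function names `lz78_compress` and `lz78_decompress` are hypothetical and are not taken from any cited reference. LZW differs in detail and is not shown.

```python
def lz78_compress(data: str) -> list[tuple[int, str]]:
    """Parse data into (dictionary index, next symbol) pairs, LZ78 style."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for symbol in data:
        if phrase + symbol in dictionary:
            # Keep extending while the longer phrase is already known.
            phrase += symbol
        else:
            # Emit the longest known phrase plus the symbol that ended it,
            # then record the extended phrase as a new dictionary entry.
            output.append((dictionary[phrase], symbol))
            dictionary[phrase + symbol] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit its prefix and last symbol.
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output


def lz78_decompress(pairs: list[tuple[int, str]]) -> str:
    """Rebuild the input by regrowing the same dictionary from the pairs."""
    phrases = [""]  # index -> phrase, mirrors the compressor's dictionary
    pieces = []
    for index, symbol in pairs:
        phrase = phrases[index] + symbol
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)
```

The decompressor needs no transmitted dictionary, because it reconstructs each entry in the same order in which the compressor created it.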
LZ78-based methods, including LZW, are very popular because of their high speed of compression. However, the primary disadvantage of LZ78-based methods is that they require large amounts of memory to hold the input data and the dictionary. Solutions to this problem with LZ78-based methods are suggested by Miller, U.S. Pat. No. 4,814,740 and Clark et al., International Patent Application PCT/GB89/00752. These references concern methods for limiting the complexity of the tree structures which are often used to find the longest matching substrings in the input data.
In contrast with building an independent dictionary on the basis of matches found, LZ77-based methods use the previously compressed input data as the dictionary. Therefore, a buffer memory is reserved for retaining some portion of the previously compressed data which will be used as the dictionary. In these methods, the codes which replace the substrings are pointers to matching substrings held in the buffer. As in LZ78-based methods, the replacement codes represent the longest available previous occurrence of a matching substring. The contents of the buffer memory determine availability in this context.
The pointers which replace substrings of the input data each comprise an ordered pair of values representing an offset and a length. The offset indicates the number of symbols between the substring replaced by the pointer and the substring to which the pointer points, while the length indicates the number of symbols in the substring replaced by the pointer.
Since any embodiment of an LZ77-based method must have a finite amount of memory, the range of values representable by the offset and length is limited. Thus, LZ77-based methods differ from each other in two parameters, N, the maximum offset distance a pointer may represent, and F, the maximum length of a substring that may be replaced by a pointer. The parameter N defines a window of available input data which is used as the dictionary. In particular, the dictionary contains only the input data which is within the maximum offset N of the substring currently being compressed or decompressed. The contents of the dictionary are continuously replenished from the input data as data is manipulated.
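By way of illustration only, the expansion of a single pointer during decompression may be sketched as follows. The sketch assumes that the offset is counted backward from the end of the data decompressed so far to the first symbol of the matching substring, and that symbols are bytes. The function name `expand_pointer` and the default values chosen for N and F are hypothetical.

```python
def expand_pointer(history: bytearray, offset: int, length: int,
                   N: int = 4096, F: int = 18) -> None:
    """Append the substring named by (offset, length) to history in place."""
    # The pointer may only reach back into the window of the last N symbols.
    if not 1 <= offset <= min(N, len(history)):
        raise ValueError("offset lies outside the dictionary window")
    # The pointer may only replace a substring of at most F symbols.
    if not 1 <= length <= F:
        raise ValueError("length exceeds the maximum match length")
    start = len(history) - offset
    # Copy one symbol at a time so that a match may overlap the symbols
    # being produced (length greater than offset).
    for i in range(length):
        history.append(history[start + i])
```

For example, starting from a history of `abc`, the pointer (offset 3, length 6) yields `abcabcabc`, since the copied symbols themselves become part of the dictionary as they are produced.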
A derivative of the LZ77 algorithm was suggested by Storer and Szymanski. Their observation was that a pointer is sometimes longer than the substring it replaces. Thus, their suggestion, hereinafter denoted as LZSS, was to use literal symbols taken directly from the input stream whenever a pointer would take up more space than the substring it replaces. A flag bit is then added to each pointer and to each literal symbol to distinguish pointers from literal symbols.
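By way of illustration only, the choice between a pointer and literal symbols, and the decoding of a flagged stream, may be sketched as follows. The bit widths (12-bit offset, 4-bit length, 8-bit symbol), the token layout, and the names `pointer_pays_off` and `lzss_decode` are assumptions made for this sketch and are not taken from Storer and Szymanski.

```python
def pointer_pays_off(length: int, offset_bits: int = 12,
                     length_bits: int = 4, symbol_bits: int = 8) -> bool:
    """Return True if one pointer is shorter than `length` flagged literals."""
    pointer_cost = 1 + offset_bits + length_bits  # flag bit + offset + length
    literal_cost = length * (1 + symbol_bits)     # flag bit + symbol, each
    return pointer_cost < literal_cost


def lzss_decode(tokens) -> bytes:
    """Decode (flag, value) tokens: flag 0 = literal, flag 1 = pointer."""
    out = bytearray()
    for flag, value in tokens:
        if flag == 0:
            out.append(value)            # literal symbol copied as-is
        else:
            offset, length = value       # pointer into the data decoded so far
            start = len(out) - offset
            for i in range(length):
                out.append(out[start + i])
    return bytes(out)
```

Under these assumed widths a pointer costs 17 bits, so it is worthwhile only for matches of two or more symbols; a single matched symbol is cheaper to send as a 9-bit literal.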
LZSS-based methods provide excellent data compression, generally better than LZW, but also require significant computation time. This is caused by the well-known maximal matching substring problem, which is at the heart of LZ77-based compressors. In the context of LZ77-based data compressors, this problem calls for finding the longest substring in the dictionary which matches the input data stream at the current position.
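By way of illustration only, a straightforward exhaustive search for the maximal matching substring may be sketched as follows. The sketch assumes byte symbols and an offset counted backward from the current position; the name `longest_match` and the default values of N and F are hypothetical. It is intended to show why the problem is costly, not to represent any particular prior art compressor.

```python
def longest_match(data: bytes, pos: int,
                  N: int = 4096, F: int = 18) -> tuple[int, int]:
    """Return (offset, length) of the longest match for data[pos:] found
    within the previous N symbols, or (0, 0) if there is none."""
    best_offset, best_length = 0, 0
    window_start = max(0, pos - N)
    max_len = min(F, len(data) - pos)
    # Try every starting position in the window, most recent first.
    for candidate in range(pos - 1, window_start - 1, -1):
        length = 0
        while (length < max_len
               and data[candidate + length] == data[pos + length]):
            length += 1
        # Strict comparison keeps the most recent of equally long matches.
        if length > best_length:
            best_offset, best_length = pos - candidate, length
    return best_offset, best_length
```

Each call may perform on the order of N times F symbol comparisons, and a call is needed at every position at which a substring begins, which accounts for the significant computation time noted above.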
There have been many attempts to solve this problem, including, for example, that of Brent as taught in "A Linear Algorithm for Data Compression," The Australian Computer Journal, Volume 19, Number 2, May 1987. Brent uses a hashing technique to quickly locate potential matches in the history buffer. However, the solution of Brent fails to achieve the best possible results of fast compression with a high compression ratio, because Brent's method operates by hashing most of the substrings of the dictionary in order to find the maximal matching substring all of the time. This is both time-consuming and memory intensive, as the hash table must be capable of pointing anywhere in the history buffer.
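By way of illustration only, the general idea of using hashing to locate potential matches may be sketched as follows. This is a generic sketch and is not the method of Brent: it assumes a fixed key of three symbols, uses a Python dictionary as the hash table, and remembers only the most recent position for each key. The name `find_candidates` is hypothetical.

```python
def find_candidates(data: bytes, key_len: int = 3) -> list[tuple[int, int]]:
    """For each position, report the most recent earlier position whose
    first key_len symbols are identical, as (position, candidate) pairs."""
    last_seen: dict[bytes, int] = {}  # key -> most recent starting position
    hits = []
    for pos in range(len(data) - key_len + 1):
        key = data[pos:pos + key_len]
        if key in last_seen:
            # A candidate match start; its full length must still be
            # measured by comparing symbols beyond the key.
            hits.append((pos, last_seen[key]))
        last_seen[key] = pos
    return hits
```

A table of this kind holds an entry for every distinct key in the history buffer, and each entry must be able to address any position in that buffer, which illustrates the memory cost noted above.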
Furthermore, practical embodiments of Brent's method use a history buffer having a finite size. In such an embodiment, Brent's method will not always find the most recent, maximal match.
Thus, a general object of the present invention is to provide a method and apparatus for data compression that yields excellent compression for a variety of input data types including executable run files, report files, and document files while using a minimum of computation time.
Another object of the present invention is to provide a method and apparatus for data compression that uses a minimal amount of memory.
Yet another object of the present invention is to provide a method and apparatus for data compression that achieves the compression by means of an efficient solution to the maximal matching substring problem.