The present invention relates to data compression methods which are used to improve the way information is stored on a digital computer, or transmitted over communication channels.
The primary benefit of using a data compression system is to reduce storage space, which may lead to significant savings. For instance, compressing all the files on the hard disk of a PC may delay the need to purchase a new, larger hard disk. Similarly, a large database and the auxiliary files needed to turn it into a full-text information retrieval system may fit onto a single CD-ROM only after suitable compression has been applied.
In addition to reducing required storage space, data compression methods can also lead to savings in processing time. Although the use of compression implies a certain overhead for decompressing the data when the data is accessed, the rate-determining steps, or bottlenecks, of most systems are still the relatively slow I/O operations. In many cases, the overhead incurred in the decompression process is largely compensated for by the reduction in the number of read operations from external storage devices.
The present invention involves lossless compression methods, which are fully reversible and allow the reconstruction of the original data without the loss of a single bit. Not all compression methods are lossless. For example, most image compression techniques are lossy: they discard a significant portion of the data and thereby generally yield much higher compression ratios than are possible with lossless techniques.
Many compression methods have been proposed. One well-known method was offered by Huffman (see Huffman D. A., A Method for the Construction of Minimum-Redundancy Codes, Proceedings of the I.R.E., Vol. 40 (1952), pp. 1098-1101). The Huffman method is statistical and is based on the idea that frequently occurring characters are encoded by shorter codewords than rare characters. Huffman's algorithm shows how to assign such codewords optimally, under certain constraints, once the character distribution is given or has been determined.
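By way of illustration only, the codeword assignment described above can be sketched as follows. The function name `huffman_code` and the tie-breaking order are assumptions made for this example and do not appear in the cited reference; the sketch merges the two least frequent subtrees repeatedly, which is the standard textbook formulation of Huffman's construction.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return a dict mapping each character of `text` to a binary codeword."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, unique tie-breaker, {char: partial codeword}).
    # The unique tie-breaker ensures the dicts themselves are never compared.
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codewords with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


if __name__ == "__main__":
    for ch, code in sorted(huffman_code("aaaabbc").items()):
        print(ch, code)
```

In this sketch the most frequent character of the sample string receives a one-bit codeword and the two rarer characters receive two-bit codewords, so that no codeword is a prefix of another.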
Widely used are the various dictionary-based compression systems. These use a list, called a "dictionary", of variable length strings, such as frequent words or word fragments. Compression is achieved by replacing each occurrence in the text of a string found in the dictionary with a pointer to the corresponding dictionary entry, the pointer being shorter than the string it replaces.
Many modern data compression techniques are based on the pioneering works of A. Lempel and J. Ziv, often referred to as LZ methods. Two such methods are disclosed in Ziv J., Lempel A., A Universal Algorithm for Sequential Data Compression, IEEE Trans. on Inf. Th. IT-23 (1977) 337-343 (hereinafter "LZ77"), and Ziv J., Lempel A., Compression of Individual Sequences via Variable Rate Coding, IEEE Trans. on Inf. Th. IT-24 (1978) 530-536 (hereinafter "LZ78"). The innovation of LZ77 and LZ78, which are both dictionary-based, is that they build the dictionary adaptively while scanning the text, by using fragments of the text itself.
In the LZ77 method the dictionary is, in fact, the previously scanned text, rather than a separately stored dictionary, thereby obviating the need to store an explicit dictionary for this case. A pointer is used which is of the form (d,l), where d is the offset (i.e., the number of characters from the current location to the previous occurrence of the substring starting at the current location), and l is the length of the matching substring.
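The (d,l) pointer scheme may be illustrated by the following minimal sketch. It is a simplified variant and not the exact coding of the cited reference: unmatched characters are emitted as literals, and the names `lz77_encode`, `lz77_decode`, `window` and `min_len` are assumptions introduced for this example. The exhaustive backward search is used only for clarity.

```python
def lz77_encode(text, window=4096, min_len=2):
    """Parse `text` into literals and (d, l) pointers into the preceding text."""
    out = []
    i = 0
    while i < len(text):
        best_d, best_l = 0, 0
        # Try every earlier start position inside the window.
        for j in range(max(0, i - window), i):
            l = 0
            # The match may run past position i, i.e. overlap the current text.
            while i + l < len(text) and text[j + l] == text[i + l]:
                l += 1
            if l > best_l:
                best_d, best_l = i - j, l
        if best_l >= min_len:
            out.append((best_d, best_l))   # offset d, length l
            i += best_l
        else:
            out.append(text[i])            # literal character
            i += 1
    return out


def lz77_decode(tokens):
    """Rebuild the text by copying l characters from d positions back."""
    buf = []
    for tok in tokens:
        if isinstance(tok, tuple):
            d, l = tok
            for _ in range(l):
                buf.append(buf[-d])        # one at a time, so overlaps work
        else:
            buf.append(tok)
    return "".join(buf)


if __name__ == "__main__":
    tokens = lz77_encode("abcabcabc")
    print(tokens)
    print(lz77_decode(tokens))
```

Note that the decoder needs no dictionary other than the text it has already reconstructed, which is the property described above.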
LZ78 forms the basis of the LZW method described in Welch T. A., High Speed Data Compression and Decompression Apparatus and Method, U.S. Pat. No. 4,558,302, Dec. 10, 1985. In the LZW method each of the strings in the dictionary is obtained from one of the earlier elements by appending a new character to its right, such that the extended string matches the string currently processed.
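The way in which LZW dictionary entries grow by one character at a time may be sketched as follows. This is the commonly taught formulation rather than the apparatus of the cited patent; the function names, the initial dictionary of 256 single-character entries, and the use of unbounded integer codes are assumptions of this example, which therefore handles only characters with code points below 256.

```python
def lzw_encode(text):
    """Encode `text` (characters below code point 256) as a list of codes."""
    dictionary = {chr(c): c for c in range(256)}
    out = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                      # keep extending the match
        else:
            out.append(dictionary[current])
            # New entry: an existing entry extended by one character.
            dictionary[current + ch] = len(dictionary)
            current = ch
    if current:
        out.append(dictionary[current])
    return out


def lzw_decode(codes):
    """Invert lzw_encode, rebuilding the same dictionary on the fly."""
    if not codes:
        return ""
    dictionary = {c: chr(c) for c in range(256)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # The code refers to the entry the encoder created in its
            # previous step, which the decoder has not yet added.
            entry = prev + prev[0]
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[0]
        prev = entry
    return "".join(out)


if __name__ == "__main__":
    codes = lzw_encode("abababab")
    print(codes)
    print(lzw_decode(codes))
```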
A common difficulty of the various dictionary-based methods involves the efficient location of previous occurrences of substrings in the text. The method of backwardly scanning the entire text for each character to be processed is normally unacceptably slow.
Many schemes for more efficiently locating substrings have been suggested. These include the use of binary trees, as in Bell T. C., Better OPM/L Text Compression, IEEE Trans. on Communications, COM-34 (December 1986) 1176-1182; the use of hashing, as in Brent R. P., A Linear Algorithm for Data Compression, The Australian Computer Journal 19 (1987) 64-68, and Gibson & Graybill, Apparatus and Method for Very High Data Rate-Compression Incorporating Lossless Data Compression and Expansion Utilizing a Hashing Technique, U.S. Pat. No. 5,049,881, Sep. 17, 1991; and the use of Patricia trees, as in Fiala & Greene, Textual Substitution Data Compression with Finite Length Search Windows, U.S. Pat. No. 4,906,991, Mar. 6, 1990.
The question of how to parse the original text into a sequence of substrings is a problem which is common to all dictionary-based compression methods. Generally, the parsing is done by a "greedy" method, i.e., a method which, at each stage, seeks the longest matching element from the dictionary. While greedy methods have the advantage of speed, they do not always yield optimal parsing.
Because the elements of the dictionary are often overlapping (which is particularly true of LZ77 variants where the text, which also serves as the dictionary, consists of numerous overlapping fragments), a different way of parsing may, under some circumstances, yield better compression. For example, assume that the dictionary, D, consists of the strings {abc, ab, cdef, de, f} and that the text, T, is `abcdef`; assume further that the elements of D are encoded by some fixed-length code, which means that $\log_2 |D|$ bits are needed to refer to any of the elements of D, where $|D|$ denotes the number of elements in D. Then, parsing T using a greedy method, which always tries to match the longest available string, yields abc-de-f, which requires three codewords, whereas a better partition would have been ab-cdef, which requires only two codewords.
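The foregoing example can be reproduced with the short sketch below, which compares a greedy parse with a parse that minimizes the number of codewords. The minimizing parse is found here by simple dynamic programming over text positions; this is merely one conventional way of finding such a parse under the fixed-length code assumption, and the function names are hypothetical. Dictionary strings are assumed to be non-empty.

```python
def greedy_parse(text, dictionary):
    """At each position take the longest dictionary string that matches."""
    parts = []
    i = 0
    while i < len(text):
        candidates = [w for w in dictionary if text.startswith(w, i)]
        if not candidates:
            raise ValueError("no dictionary string matches at position %d" % i)
        longest = max(candidates, key=len)
        parts.append(longest)
        i += len(longest)
    return parts


def optimal_parse(text, dictionary):
    """Return a parse of `text` that uses the fewest dictionary strings."""
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)     # best[i]: fewest codewords needed for text[i:]
    choice = [None] * (n + 1)  # choice[i]: first string of such a parse
    best[n] = 0
    for i in range(n - 1, -1, -1):
        for w in dictionary:
            if text.startswith(w, i) and best[i + len(w)] + 1 < best[i]:
                best[i] = best[i + len(w)] + 1
                choice[i] = w
    if best[0] == inf:
        raise ValueError("text cannot be parsed with this dictionary")
    parts = []
    i = 0
    while i < n:
        parts.append(choice[i])
        i += len(choice[i])
    return parts


if __name__ == "__main__":
    D = ["abc", "ab", "cdef", "de", "f"]
    print(greedy_parse("abcdef", D))    # ['abc', 'de', 'f']
    print(optimal_parse("abcdef", D))   # ['ab', 'cdef']
```

Because every codeword has the same length under the stated assumption, minimizing the number of codewords is equivalent to minimizing the size of the encoded text.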
There is thus a widely recognized need for, and it would be highly advantageous to have, a dictionary-based data compression technique capable of optimally parsing a text in a manner which is not prohibitively slow.