With the advancement of computer technology, large-scale information transfer by remote computing and the development of massive information storage and retrieval systems have grown tremendously. This growth has created a need for efficient mechanisms for storing and transferring enormous volumes of data. Accordingly, data compression and decompression techniques have been developed that reduce the redundancy in data representation in order to decrease data storage requirements and data transfer costs. In particular, data compression techniques transform a body of data into a smaller form from which the original, or some approximation of the original, can be recovered at a later time. There are at least two types of data compression: (1) "lossless" data compression, where the data recovered after compression and decompression is identical to the original data; and (2) "lossy" data compression, where the decompressed data is some approximation of the original data. The present invention is primarily directed to the former, lossless data compression.
Several data compression algorithms differing in philosophy, complexity and scope of application have been developed to reduce the redundancy in data representation. Such algorithms include: (i) the Huffman method, (ii) the adaptive Huffman method, (iii) the multi-group compression method, (iv) run-length encoding, (v) the header compression method, (vi) the LZW algorithm, (vii) arithmetic coding, and (viii) dictionary-based methods. Further, a technique for enhancing the arithmetic and Huffman coding methods has been proposed by the inventors of the present invention. (See Bassiouni, M., Mukherjee, A., and Ranganathan, N., "Enhancing Arithmetic and Tree-Based Coding," Journal of Information Processing and Management, Vol. 25, No. 1, 1989.)
One particularly useful algorithm for the compression and decompression of text and/or image data was proposed by Lempel and Ziv in 1977 (hereinafter the "LZ" technique). A brief discussion of the technique follows. For the purposes of this invention, the following terminology is defined: An "alphabet" is a finite set containing at least one element. The elements of an alphabet are called "characters" or "symbols". A "string" over an alphabet is a sequence of characters, each of which is an element of that alphabet. All strings are assumed to be of finite length unless otherwise stated. A "substring" is a part of a string. This term is generally used to denote the part of the string that was matched. The degree of data reduction obtained as a result of the compression is the "compression ratio". This ratio measures the quantity of compressed data in comparison to the quantity of the original data and is given by:
$$\text{compression ratio} = \frac{\text{length of compressed data}}{\text{length of original data}}$$
Percentage compression (% compression) gives an estimate of how much compression is achieved. It is given by:
$$\%\ \text{compression} = \frac{\text{length of original data} - \text{length of compressed data}}{\text{length of original data}} \times 100$$
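As a small illustration of the two measures just defined, the following sketch computes them from data lengths. It assumes the compression ratio is the compressed length divided by the original length, which is how the definition above reads; the function names `compression_ratio` and `percent_compression` are hypothetical and are not taken from the cited references.

```python
def compression_ratio(original_len: int, compressed_len: int) -> float:
    """Compressed length relative to original length (assumed definition)."""
    if original_len <= 0:
        raise ValueError("original_len must be positive")
    if compressed_len < 0:
        raise ValueError("compressed_len must be non-negative")
    return compressed_len / original_len


def percent_compression(original_len: int, compressed_len: int) -> float:
    """Share of the original length that was saved, as a percentage."""
    return (1.0 - compression_ratio(original_len, compressed_len)) * 100.0


if __name__ == "__main__":
    # Hypothetical lengths, chosen only to exercise the two functions.
    print(compression_ratio(18, 9))
    print(percent_compression(18, 9))
```

Under this assumed definition, a smaller ratio means better compression, and a ratio above 1 (negative percentage compression) means the "compressed" data is larger than the original.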
The LZ technique proposed by Lempel and Ziv for data compression involves two basic steps: (i) parsing and (ii) coding. In the "parsing" step, a string of symbols is split into substrings of variable length according to certain rules. In the "coding" step, each substring is coded sequentially into a fixed-length code. A mathematical discussion of the technique is given in Ziv, J. and Lempel, A., "A Universal Algorithm for Sequential Data Compression," IEEE Trans. on Info. Theory, Vol. IT-23, No. 3, 1977, pp. 337-343; and Ziv, J. and Lempel, A., "Compression of Individual Sequences via Variable Rate Coding," IEEE Trans. on Info. Theory, Vol. IT-24, No. 5, 1978, pp. 530-536.
According to the LZ technique, a buffer of a preselected length is chosen, for example a buffer of length 18. The first half of the buffer contains the symbols already coded, and the second half contains the symbols that are yet to be coded. An alphabet of a preselected number of symbols is used, for example the three symbols 0, 1 and 2. Let "S" be the string to be compressed. For the purposes of this example, let S=010210211010210212.
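The parsing step over such a buffer can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention or of the cited papers: the history half is assumed to start filled with the symbol 0 (the text above does not state its initial contents), each parsed substring is represented as a hypothetical triple (match start in the history, match length, next symbol), and the names `lz_parse` and `lz_unparse` are made up for this example.

```python
def lz_parse(s: str, buffer_len: int = 18, fill: str = "0"):
    """Split s into (start, length, next_symbol) triples using a buffer whose
    first half holds coded symbols and second half holds symbols to be coded."""
    half = buffer_len // 2
    history = fill * half          # assumed initial contents of the coded half
    pos = 0
    triples = []
    while pos < len(s):
        lookahead = s[pos:pos + half]
        max_len = len(lookahead) - 1   # keep one symbol to send as a literal
        window = history + lookahead   # a match may run on into the lookahead
        best_start, best_len = 0, 0
        for start in range(half):
            length = 0
            while length < max_len and window[start + length] == lookahead[length]:
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        triples.append((best_start, best_len, lookahead[best_len]))
        consumed = best_len + 1
        # Slide the buffer: the newly coded symbols enter the history half.
        history = (history + s[pos:pos + consumed])[-half:]
        pos += consumed
    return triples


def lz_unparse(triples, buffer_len: int = 18, fill: str = "0") -> str:
    """Rebuild the string from the triples produced by lz_parse."""
    half = buffer_len // 2
    history = fill * half
    pieces = []
    for start, length, symbol in triples:
        window = list(history)
        for i in range(length):
            window.append(window[start + i])   # copy symbol by symbol
        window.append(symbol)
        chunk = "".join(window[half:])
        pieces.append(chunk)
        history = (history + chunk)[-half:]
    return "".join(pieces)


if __name__ == "__main__":
    S = "010210211010210212"
    parsed = lz_parse(S)
    print(parsed)
    print(lz_unparse(parsed) == S)
```

Because the start and the length each range over at most 9 values here and the next symbol over 3 values, every triple could be written with the same number of digits, which is the sense in which each variable-length substring maps to a fixed-length code.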