1. Field of the Invention
This invention relates in general to data processing systems and, more specifically, to a system and method for compressing data.
2. Background of the Invention
In general, data compression involves taking a stream of symbols and transforming them into codes. If the compression is effective, the resulting stream of codes will be smaller than the original symbol stream. The decision to output a certain code for a certain symbol or set of symbols is based on a model. The model is simply a collection of data and rules used to process input symbols and determine which code(s) to output. A computer program may use the model to accurately define the probabilities for each symbol in order to produce an appropriate code based on those probabilities.
Data compression techniques often need to be lossless (without incidences of error or loss of data). Exceptions to this include, for example, certain applications pertaining to graphic images or digitized voice. Lossless compression consists of those techniques guaranteed to generate an exact duplicate of the input data stream after a compress/expand cycle. This is the type of compression often used when storing database records, spreadsheets, or word processing files. In these applications, the loss of even a single bit can be catastrophic.
Lossless data compression is generally implemented using one of two different types of modeling: statistical or dictionary-based. Statistical modeling reads in and encodes a single symbol at a time using the probability of that symbol's appearance. Statistical models achieve compression by encoding symbols into bit strings that use fewer bits than the original symbols. The quality of the compression depends on how well the program develops its model: the model has to predict the correct probabilities for the symbols. The farther these probabilities are from a uniform distribution, the more compression can be achieved.
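The relationship between a skewed symbol distribution and achievable compression can be illustrated with a short calculation. The following is a minimal sketch, not part of any referenced method: it estimates the Shannon entropy of a string from its observed symbol frequencies, which gives the average number of bits per symbol an ideal statistical coder would need under that simple model. The function name and the two sample strings are assumptions chosen for illustration.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(data: str) -> float:
    """Shannon entropy of the empirical symbol distribution, in bits per symbol."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative inputs (assumed): same four-symbol alphabet, different distributions.
uniform = "abcd" * 4        # each symbol has probability 1/4
skewed = "a" * 13 + "bcd"   # 'a' dominates, so the distribution is far from uniform

print(entropy_bits_per_symbol(uniform))  # 2.0 bits per symbol
print(entropy_bits_per_symbol(skewed))   # fewer bits per symbol than the uniform case
```

With the uniform string, no code can do better on average than two bits per symbol under this model; with the skewed string, the estimate drops to about one bit per symbol, which is the sense in which a distribution farther from uniform allows more compression.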
In dictionary-based modeling, the coding problem is reduced in significance, making the model supremely important. Dictionary-based compression processes use a completely different method to compress data. This family of processes does not encode single symbols as variable-length bit strings; it encodes variable-length strings of symbols as single pointers. The pointers form an index to a phrase dictionary. If the pointers are smaller than the phrases they replace, compression occurs. In many respects, dictionary-based compression is easier for people to understand. In everyday life, people use phone numbers, Dewey Decimal numbers, and postal codes to encode larger strings of text. This is essentially what a dictionary-based encoder does.
In general, dictionary-based compression replaces phrases with pointers. If the number of bits in the pointer is less than the number of bits in the phrase, compression will occur. However, the methods for building and maintaining a dictionary are varied.
A static dictionary is built up before compression occurs, and it does not change while the data is being compressed. For example, a database containing all motor-vehicle registrations for a state could use a static dictionary with only a few thousand entries that concentrate on words such as "Ford," "Jones," and "1994." Once this dictionary is compiled, it is used by both the encoder and the decoder as required.
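A static dictionary of this kind can be sketched in a few lines. The example below is a hedged illustration only: the three-entry dictionary, the token format ("ptr" for a dictionary pointer, "lit" for a literal word), and the function names are all assumptions made for the sketch, not details taken from any particular system.

```python
# Hypothetical static dictionary, fixed before compression and shared by
# the encoder and the decoder.
STATIC_DICTIONARY = ["Ford", "Jones", "1994"]
INDEX_OF = {phrase: i for i, phrase in enumerate(STATIC_DICTIONARY)}


def static_encode(words):
    """Replace each known word with ('ptr', index); pass other words through as ('lit', word)."""
    return [("ptr", INDEX_OF[w]) if w in INDEX_OF else ("lit", w) for w in words]


def static_decode(tokens):
    """Invert static_encode using the same shared dictionary."""
    return [STATIC_DICTIONARY[value] if kind == "ptr" else value
            for kind, value in tokens]


record = ["Jones", "Ford", "1994", "Sedan"]  # illustrative registration record
tokens = static_encode(record)
print(tokens)
print(static_decode(tokens) == record)  # True: the round trip is lossless
```

The saving comes from pointer size: a dictionary of up to 4,096 entries can be indexed with a 12-bit pointer, whereas the word "Jones" stored as five 8-bit characters occupies 40 bits. Words absent from the dictionary, such as "Sedan" here, gain nothing and must be sent literally.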
Static dictionaries have both advantages and disadvantages. In any case, dictionary-based compression schemes using static dictionaries are mostly ad hoc, implementation dependent, and not general purpose.
Many of the well-known dictionary-based processes are adaptive. Instead of having a completely defined dictionary when compression begins, adaptive schemes start out either with no dictionary or with a default baseline dictionary. As compression proceeds, the processes add new phrases to be used later as encoded tokens.
For a further discussion of data compression in general, please refer to The Data Compression Book, by Mark Nelson, © 1992 by M&T Publishing, Inc., which is hereby incorporated by reference herein.
As mentioned, the history of past symbols of a sequence often provides valuable information about the behavior of the sequence in the future. Various universal techniques have been devised to use this information for data compression or prediction. For example, the Lempel-Ziv ("LZ") compression process, which is discussed within Compression of Individual Sequences by Variable Rate Coding, by J. Ziv and A. Lempel, IEEE Trans. Inform. Theory, IT-24: 530-536, 1978 (which is incorporated by reference herein), uses the past symbols to build up a dictionary of phrases and compresses the string using this dictionary. As Lempel and Ziv have shown, this process is universally optimal in that the compression ratio converges to the entropy for all stationary ergodic (of or related to a process in which every sequence or sizeable sample is equally representative of the whole) sequences. Thus, given an arbitrarily long sequence, such compression operates as well as if the distribution of the sequence were known in advance.
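The dictionary-building behavior described above can be illustrated with a minimal sketch in the style of the incremental-parsing scheme commonly called LZ78. This is a simplified illustration under stated assumptions, not a reproduction of the cited paper: output is kept as (index, symbol) pairs rather than packed bits, the dictionary grows without limit, and the function names are hypothetical.

```python
def lz78_encode(text):
    """Parse text into (dictionary index, next symbol) pairs, growing the dictionary as it goes."""
    dictionary = {"": 0}   # index 0 is the empty phrase (the root of the phrase tree)
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch   # keep extending the longest known phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)  # new phrase for later use
            phrase = ""    # return to the root
    if phrase:
        # Input ended inside a known phrase: emit it with an empty extension symbol.
        output.append((dictionary[phrase], ""))
    return output


def lz78_decode(pairs):
    """Rebuild the text by replaying the same dictionary growth as the encoder."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)


pairs = lz78_encode("abababab")
print(pairs)
print(lz78_decode(pairs) == "abababab")  # True
```

The decoder never receives the dictionary itself. It reconstructs each phrase from the pairs in the same order the encoder created them, so both sides hold identical dictionaries at every step.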
The Lempel-Ziv compression method has achieved great popularity because of its simplicity and ease of implementation. (The term Lempel-Ziv is often used to denote any dictionary-based universal coding scheme; the standard method described herein is therefore only one member of this large class.) It asymptotically achieves the entropy limit for data compression. However, the rate of convergence may be slow, and there is scope for improvement for short sequences. In particular, at the end of each phrase, the process returns to the root of the phrase tree, so that contextual information is lost. One approach to this problem was suggested by Plotnik, Weinberger and Ziv for finite state sources, as described within Upper Bounds on the Probability of Sequences Emitted by Finite-State Sources and on the Redundancy of the Lempel-Ziv Algorithm, by E. Plotnik, M. J. Weinberger and J. Ziv, IEEE Trans. Inform. Theory, IT-38(1): 66-72, January 1992, which is incorporated by reference herein. Their idea was to maintain separate LZ-like trees for each source state (or estimated state) of a finite state model of the source. Plotnik, Weinberger and Ziv showed that this procedure is asymptotically optimal.
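The general idea of keeping a separate phrase dictionary per state can be sketched as follows. This is an illustrative assumption-laden example, not the procedure of Plotnik, Weinberger and Ziv: the "state" is simply taken to be the last symbol of the previously completed phrase, each state gets its own LZ78-style dictionary, and the function names are hypothetical. Because the state depends only on symbols already processed, the decoder can determine which dictionary to use without being told.

```python
from collections import defaultdict


def context_lz_encode(text):
    """LZ78-style parsing with one dictionary per context (assumed: last symbol of the previous phrase)."""
    dictionaries = defaultdict(lambda: {"": 0})  # context -> phrase dictionary
    output = []
    context = ""   # no preceding symbol at the start of the input
    phrase = ""
    for ch in text:
        d = dictionaries[context]
        if phrase + ch in d:
            phrase += ch
        else:
            output.append((d[phrase], ch))
            d[phrase + ch] = len(d)
            context = ch   # the next phrase is coded in the context of this symbol
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit it with an empty extension symbol.
        output.append((dictionaries[context][phrase], ""))
    return output


def context_lz_decode(pairs):
    """Mirror the encoder: derive the context from already-decoded text."""
    phrase_lists = defaultdict(lambda: [""])  # context -> list of phrases by index
    out = []
    context = ""
    for index, ch in pairs:
        phrases = phrase_lists[context]
        phrase = phrases[index] + ch
        out.append(phrase)
        if ch:
            phrases.append(phrase)
            context = ch
    return "".join(out)


sample = "abracadabra abracadabra"
print(context_lz_decode(context_lz_encode(sample)) == sample)  # True
```

Unlike plain LZ78, returning to the root after each phrase does not discard all contextual information here, since the choice of tree itself carries the context. Whether this improves compression on a given input depends on the source and is not demonstrated by the sketch.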
U.S. patent application Ser. No. 08/253,047, filed on Jun. 2, 1994, and assigned to the same assignee as the present invention, describes a compression system wherein contextual information is implemented in conjunction with a dictionary-based compression process within a data processing system. Before encoding a next phrase within an input string of data, the compression system, through dynamic programming, derives the statistically best set of dictionaries for utilization in encoding the next phrase. The selection of this set of dictionaries is dependent upon each dictionary's past utilization within the process. Thus, the choice of dictionaries is directly related to each dictionary's compression performance, and not to a presumed model for the sequence, as is required in a compression technique utilizing a state machine description of the source that is available to both the encoder and decoder. The selected set of dictionaries is implemented by the data processing system to encode the next phrase of data. The context of the phrase is used to select a particular dictionary within the selected set for the encoding process, which computes a pointer to the phrase within the dictionary that corresponds to the next phrase.
The above technique has the property that the compressor and decompressor can each compute, from the preceding string, the dictionary that will be used for the next phrase. This has the advantage that the identity of the dictionary need not be included in the compressed output. However, the chosen dictionary may not in fact be the best one to use, as the decision of which to use is based on estimates. Further, in some cases improved compression could result from using other types of compression techniques.