The information age is upon us and, more and more, computers are absorbing the workload of gathering, storing, and manipulating this information. A problem arises since various forms of information require significant amounts of data storage, which is a resource that is often expensive and/or scarce. Further, the transmission of a great amount of information, or data, often requires a considerable amount of time, even at the high data transfer rates available with current data processing systems.
As a result, data compression is a valuable tool for conserving memory and accelerating the data transfer process. For most applications, however, data compression techniques need to be lossless (without incidences of error or loss of data); the usual exceptions are applications pertaining to graphic images or digitized voice. Lossless compression consists of those techniques guaranteed to generate an exact duplicate of the input data stream after a compress/expand cycle. This is the type of compression often used when storing database records, spreadsheets, or word processing files. In these applications, the loss of even a single bit can be catastrophic.
In general, data compression consists of taking a stream of symbols and transforming them into codes. If the compression is effective, the resulting stream of codes will be smaller than the original symbol stream. The decision to output a certain code for a certain symbol or set of symbols is based on a model. The model is simply a collection of data and rules used to process input symbols and determine which code(s) to output. A program uses the model to accurately define the probabilities for each symbol in order to produce an appropriate code based on those probabilities.
Data compression enters into the field of information theory because of its concern with redundancy. Information theory is a branch of mathematics that concerns itself with various questions about information, including different ways of storing and communicating messages. Redundant information in a message takes extra bits to encode, and if this extra information can be removed, the size of the message may be reduced.
Information theory uses the word "entropy" as a measure of how much information is encoded in a message. The higher the entropy of a message, the more information it contains. The entropy of a symbol is defined as the negative logarithm of its probability. To determine the information content of a message in bits, entropy is expressed using the base 2 logarithm:

$$\text{Number of bits} = -\log_2(\text{probability})$$
The entropy of an entire message is simply the sum of the entropy of all individual symbols.
Entropy fits with data compression in its determination of how many bits of information are actually present in a message. If the probability of the character "e" appearing in this document is 1/16, for example, then the information content of the character is 4 bits, and the character string "eeeee" has a total content of 20 bits. If standard 8-bit ASCII characters are used to encode this message, then 40 bits are actually used. The difference between the 20 bits of entropy and the 40 bits used to encode the message is where the potential for data compression arises.
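The arithmetic in this example can be checked with a minimal sketch. The function names `information_bits` and `message_bits` are illustrative assumptions, as is the choice to supply the symbol probabilities as a plain dictionary; the 1/16 probability for "e" is taken from the example in the text.

```python
import math

def information_bits(probability):
    """Information content of one symbol in bits: -log2(probability)."""
    return -math.log2(probability)

def message_bits(message, probabilities):
    """Sum the information content of every symbol in the message."""
    return sum(information_bits(probabilities[symbol]) for symbol in message)

# Assumed model: "e" occurs with probability 1/16, as in the example.
probabilities = {"e": 1 / 16}
message = "eeeee"

entropy_bits = message_bits(message, probabilities)  # information present
ascii_bits = 8 * len(message)                        # bits used by 8-bit ASCII
print(entropy_bits, ascii_bits)
```

Under this assumed model the five-character string carries 20 bits of information while its 8-bit encoding occupies 40 bits, which matches the figures given in the text.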
Using an automotive metaphor for data compression, coding would be the wheels, but modeling would be the engine. Regardless of the efficiency of the coder, if it does not have a model feeding it good probabilities, it will not compress data.
Lossless data compression is generally implemented using one of two different types of modeling: statistical or dictionary-based. Statistical modeling reads in and encodes a single symbol at a time using the probability of that character's appearance. Statistical models achieve compression by encoding symbols into bit strings that use fewer bits than the original symbols. The quality of the compression goes up or down depending on how good the program is at developing a model. The model has to predict the correct probabilities for the symbols. The farther these probabilities are from a uniform distribution, the more compression can be achieved.
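As one illustration of encoding symbols into variable-length bit strings, the sketch below builds a Huffman code from symbol counts. Huffman coding is not named in the text above; it is used here only as a familiar example of a statistical coder, and the function name `huffman_code` and the sample string are assumptions made for the illustration.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a {symbol: bit string} table built from symbol frequencies."""
    counts = Counter(text)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, unique id, partial code table); the unique
    # id breaks ties so the dictionaries themselves are never compared.
    heap = [(weight, i, {symbol: ""})
            for i, (symbol, weight) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Prefix "0" to codes in one subtree and "1" to codes in the other.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

sample = "abracadabra"
table = huffman_code(sample)
encoded = "".join(table[symbol] for symbol in sample)
print(table, len(encoded), 8 * len(sample))
```

In this sample the most frequent symbol receives the shortest bit string, so the encoded length falls well below the 8 bits per character of the original. A sample with a uniform symbol distribution would leave much less room for savings.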
Dictionary-based modeling uses a single code to replace strings of symbols. In dictionary-based modeling, the coding problem is reduced in significance, making the model supremely important. The dictionary-based compression processes use a completely different method to compress data. This family of processes does not encode single symbols as variable-length bit strings; it encodes variable-length strings of symbols as single pointers. The pointers form an index to a phrase dictionary. If the pointers are smaller than the phrases they replace, compression occurs. In many respects, dictionary-based compression is easier for people to understand. In everyday life, people use phone numbers, Dewey Decimal numbers, and postal codes to encode larger strings of text. This is essentially what a dictionary-based encoder does.
In general, dictionary-based compression replaces phrases with pointers. If the number of bits in the pointer is less than the number of bits in the phrase, compression will occur. However, the methods for building and maintaining a dictionary are varied.
A static dictionary is built up before compression occurs, and it does not change while the data is being compressed. For example, a database containing all motor-vehicle registrations for a state could use a static dictionary with only a few thousand entries that concentrate on words such as "Ford," "Jones," and "1994." Once this dictionary is compiled, it is used by both the encoder and the decoder as required.
There are advantages and disadvantages to static dictionaries. Nevertheless, dictionary-based compression schemes using static dictionaries are mostly ad hoc, implementation dependent, and not general purpose.
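The static-dictionary scheme described above can be sketched as follows. The three dictionary entries come from the motor-vehicle example in the text; the token layout (`"ref"` for a dictionary pointer, `"lit"` for a literal word), the function names, and the sample record are hypothetical choices made only for this illustration.

```python
# Hypothetical static dictionary, compiled in advance and shared unchanged
# by both the encoder and the decoder.
STATIC_DICTIONARY = ["Ford", "Jones", "1994"]
INDEX_OF = {phrase: index for index, phrase in enumerate(STATIC_DICTIONARY)}

def encode(words):
    """Replace each word found in the dictionary with a pointer to it."""
    tokens = []
    for word in words:
        if word in INDEX_OF:
            tokens.append(("ref", INDEX_OF[word]))  # small pointer
        else:
            tokens.append(("lit", word))            # pass through unchanged
    return tokens

def decode(tokens):
    """Expand pointers back into phrases using the same dictionary."""
    return [STATIC_DICTIONARY[value] if kind == "ref" else value
            for kind, value in tokens]

record = ["Jones", "Ford", "Mustang", "1994"]
tokens = encode(record)
print(tokens)
print(decode(tokens) == record)
```

Because the dictionary never changes, the decoder needs no information beyond the tokens themselves. The cost is that words absent from the dictionary, such as "Mustang" here, gain nothing.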
Most well-known dictionary-based processes are adaptive. Instead of having a completely defined dictionary when compression begins, adaptive schemes start out either with no dictionary or with a default baseline dictionary. As compression proceeds, the processes add new phrases to be used later as encoded tokens.
For a further discussion of data compression in general, please refer to The Data Compression Book, by Mark Nelson, © 1992 by M&T Publishing, Inc., which is hereby incorporated by reference herein.
As mentioned, the history of past symbols of a sequence often provides valuable information about the behavior of the sequence in the future. Various universal techniques have been devised to use this information for data compression or prediction. For example, the Lempel-Ziv ("LZ") compression process, which is discussed within Compression of Individual Sequences via Variable-Rate Coding, by J. Ziv and A. Lempel, IEEE Trans. Inform. Theory, IT-24:530-536, 1978 (which is incorporated by reference herein), uses the past symbols to build up a dictionary of phrases and compresses the string using this dictionary. As Lempel and Ziv have shown, this process is universally optimal in that the compression ratio converges to the entropy for all stationary ergodic (of or related to a process in which every sequence or sizeable sample is equally representative of the whole) sequences. Thus, given an arbitrarily long sequence, such compression operates as well as if the distribution of the sequence were known in advance.
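The way such a process builds its phrase dictionary from past symbols can be sketched with a minimal LZ78-style parser. This is an illustrative sketch and not a reproduction of the cited paper: the function names, the (phrase index, next symbol) output format, and the handling of a leftover phrase at the end of the input are assumptions.

```python
def lz78_encode(text):
    """Parse text into (index of longest known phrase, next symbol) pairs."""
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for symbol in text:
        candidate = phrase + symbol
        if candidate in dictionary:
            phrase = candidate    # keep extending a known phrase
        else:
            output.append((dictionary[phrase], symbol))
            dictionary[candidate] = len(dictionary)  # learn the new phrase
            phrase = ""           # restart from the root of the phrase tree
    if phrase:
        # Input ended inside a known phrase: emit its prefix and last symbol.
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decode(pairs):
    """Rebuild the text, growing the same phrase list the encoder built."""
    phrases = [""]
    pieces = []
    for index, symbol in pairs:
        phrase = phrases[index] + symbol
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)

pairs = lz78_encode("abababab")
print(pairs)
print(lz78_decode(pairs))
```

The decoder receives no dictionary. It reconstructs the same phrases in the same order as the encoder, which is what makes the scheme adaptive.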
The Lempel-Ziv compression method has achieved great popularity because of its simplicity and ease of implementation. (The name Lempel-Ziv is often used to denote any dictionary-based universal coding scheme; the standard method described herein is therefore only one member of this large class.) It asymptotically achieves the entropy limit for data compression. However, the rate of convergence may be slow and there is scope for improvement for short sequences. In particular, at the end of each phrase, the process returns to the root of the phrase tree, so that contextual information is lost. One approach to this problem was suggested by Plotnik, Weinberger and Ziv for finite state sources, as described within Upper Bounds on the Probability of Sequences Emitted by Finite-State Sources and on the Redundancy of the Lempel-Ziv Algorithm, by E. Plotnik, M. J. Weinberger and J. Ziv, IEEE Trans. Inform. Theory, IT-38(1): 66-72, January 1992, which is incorporated by reference herein. Their idea was to maintain separate LZ-like trees for each source state (or estimated state) of a finite state model of the source. Plotnik, Weinberger and Ziv showed that this procedure is asymptotically optimal.
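One possible reading of the per-state idea is sketched below: symbols are partitioned according to an estimated state, and each partition is parsed with its own LZ78-style dictionary. The state estimate used here (the previous symbol) is purely an assumption for illustration, as are all function names; the sketch does not reproduce the construction in the cited paper.

```python
from collections import defaultdict

def lz78_parse(symbols):
    """LZ78-style parse of a symbol list into (phrase index, symbol) pairs."""
    dictionary = {(): 0}
    pairs, phrase = [], ()
    for symbol in symbols:
        candidate = phrase + (symbol,)
        if candidate in dictionary:
            phrase = candidate
        else:
            pairs.append((dictionary[phrase], symbol))
            dictionary[candidate] = len(dictionary)
            phrase = ()
    if phrase:  # input ended inside a known phrase
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs

def lz78_unparse(pairs):
    """Inverse of lz78_parse: return the original symbol list."""
    phrases, symbols = [()], []
    for index, symbol in pairs:
        phrase = phrases[index] + (symbol,)
        phrases.append(phrase)
        symbols.extend(phrase)
    return symbols

def state_of(previous_symbol):
    # Hypothetical state estimate: the state is simply the previous symbol.
    return previous_symbol

def encode_per_state(text):
    """Route each symbol to the stream of its state; parse streams separately."""
    streams = defaultdict(list)
    previous = None
    for symbol in text:
        streams[state_of(previous)].append(symbol)
        previous = symbol
    return {state: lz78_parse(stream) for state, stream in streams.items()}

def decode_per_state(encoded):
    """Re-interleave the per-state streams by replaying the state sequence."""
    decoded = {state: lz78_unparse(pairs) for state, pairs in encoded.items()}
    total = sum(len(stream) for stream in decoded.values())
    readers = {state: iter(stream) for state, stream in decoded.items()}
    output, previous = [], None
    for _ in range(total):
        symbol = next(readers[state_of(previous)])
        output.append(symbol)
        previous = symbol
    return "".join(output)

encoded = encode_per_state("abracadabra abracadabra")
print(decode_per_state(encoded))
```

The decoder can follow the state sequence because, under this assumption, the state depends only on symbols it has already reconstructed. This mirrors the requirement that the state be known to both the encoder and the decoder.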
However, Plotnik, Weinberger and Ziv did not provide any procedure for finding the set of trees to use for the compression. They assumed that the state machine description of the source is available to both the encoder and decoder, and that the state of the machine is known by both parties. However, in general, one does not know the state machine description of the source. Additionally, even if one knew the state description of the source, using separate dictionaries for every state could perform worse than a common dictionary compression scheme if the number of states is very large.
Thus, there is a need in the art for an improved data compression technique, implemented within a data processing system, that has a faster rate of convergence. There is a further need in the art for a data compression technique that is more effective for short sequences. Finally, there is a need in the art for an improved data compression technique that utilizes contextual information in achieving the compression of the data.