Techniques for compressing data are commonly used in the communications and computer fields. In communications, it is often desirable to transmit compressed strings of data which, upon reception, can be reconstructed into their original form. Over a given channel, transmitting compressed data generally takes less time than transmitting the same data in an uncompressed format. In the computer field, compressed data offers a storage advantage over non-compressed data. Thus, for a storage device having a fixed storage capacity, more files can be stored therein if they are first compressed. Accordingly, the two main advantages of compressing data are increased storage capacity and decreased transmission time.
Data compression techniques can be divided into two major categories: lossy and lossless. Lossless data compression techniques are employed when it is imperative that no information is lost in the compression/decompression process. Lossy data compression techniques are less accurate than lossless techniques, but they generally are much faster. In short, they sacrifice some accuracy for speed in the data compression/decompression cycle. Lossy data compression techniques are typically employed in those processing applications (such as the transmission and storage of digitized video and audio data) that can tolerate some information loss. Lossy data compression typically yields greater degrees of data compression and quicker compression processing than lossless data compression techniques. Lossy data compression techniques have recently gained substantial importance in view of the growing popularity of audio and video applications made available to personal computers and related markets. The vast majority of all other applications employ lossless data compression techniques for the compression of data.
By definition, lossless compression techniques employ methods that guarantee the precise duplication of data after it has passed through the compression/decompression cycle. Lossless compression is most commonly associated with storage of digital data used in conjunction with a computer. Such applications include the storage of data base records, spread sheet information, word processing files, etc.
At their very core, all data compression techniques are linked to, and employ, a branch of mathematics known as Information Theory. This branch of mathematics concerns itself with questions regarding the expressing, storing, and communicating of information.
Data compression is linked to the field of Information Theory because of its concern with redundancy. If information in a data message is redundant (its omission does not reduce, in any way, the information encoded in the data) the message can be shortened without losing any portion of the information encoded therein. Thus lossless data compression reduces the size of the message without compromising the integrity of the information conveyed by the message.
Entropy is a term used to convey the measure of how much information is encoded in a message. A message having a high degree of entropy contains more information than a message of equal length having low entropy. The entropy of a symbol in a message is defined as the negative logarithm of its probability of occurrence in that message. To determine the information content of a character in bits, we express the entropy using base two logarithms as follows:
$$E_{\text{char}}(x) = -\log_2\bigl(\text{probability of char}(x)\bigr)$$
where:
$E_{\text{char}}(x)$ = Entropy of a given character in a message.
probability of char(x) = Probability of char(x) occurrence in that message.
The entropy of an entire message is simply the sum of the entropy of each character (or symbol) that is found in that message.
The concept of entropy guides us in our quest for optimizing data compression techniques because it, theoretically, determines how many bits of information are actually present in a message. If, in a given message, the character "Q" has a 1/16th probability of appearing, the information content carried by that character is 4 bits. The information content carried by a string of 2 "Qs" is 8 bits, etc. If we are using standard 8-bit ASCII characters to encode our "QQ" string, we need 16 bits. The difference between the 8 bits of entropy and the 16 bits used to encode our string is where the potential for data compression arises.
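The arithmetic above can be checked with a short Python sketch. It is only an illustration of the entropy definition given earlier; the function names `char_entropy_bits` and `message_entropy_bits` are hypothetical and do not come from any particular compression product.

```python
import math
from collections import Counter


def char_entropy_bits(probability):
    """Information content, in bits, of a character with the given probability."""
    return -math.log2(probability)


def message_entropy_bits(message):
    """Sum of the per-character entropies, with each probability taken
    from the character's frequency in this same message."""
    counts = Counter(message)
    total = len(message)
    return sum(char_entropy_bits(counts[ch] / total) for ch in message)


# A character with a 1/16 probability of appearing carries 4 bits.
q_bits = char_entropy_bits(1 / 16)
print(q_bits)            # 4.0

# Two such characters carry 8 bits, versus 16 bits in 8-bit ASCII.
entropy_bits = 2 * q_bits
ascii_bits = 2 * 8
print(ascii_bits - entropy_bits)   # 8.0 bits of potential saving
```

Note that `message_entropy_bits` derives each probability from the message it is given, so the result for a character depends on the message in which that character appears.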
Entropy, as it is used in information theory, is only a relative measure of information content, and can never be an absolute measure. This is because the information content of a character is based on that character's probability of occurrence in a given message. For two different messages, both of which contain at least one occurrence of the letter "E", the probability of the occurrence of "E" will, most likely, differ between the two messages. Thus, the information content of the character "E" is not a fixed value but varies from message to message according to its probability. Most data compression techniques focus on predicting symbols (or characters) within a given message that occur with high probabilities. A symbol that has a high probability necessarily has a low information content and will require fewer bits to encode than low probability symbols. Different techniques are known for establishing the probability of a given character's occurrence. For textual data the simplest approach is to establish (empirically) the probability of each character's occurrence and assign a binary code to each character wherein the length of the binary code is inversely proportional to the character's probability of occurrence (i.e., shortest binary codes are assigned to the characters that appear with the highest frequency).
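One well-known way of assigning shorter binary codes to more frequent characters is Huffman coding. The text above does not name a specific scheme, so the following Python sketch uses Huffman coding purely as an assumed example; the function names `huffman_code` and `encode_bits` are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(message):
    """Build a prefix-free code in which frequent characters get short codes."""
    counts = Counter(message)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct character still needs a one-bit code.
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {character: code so far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)   # two least frequent groups
        w2, _, codes2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in codes1.items()}
        merged.update({ch: "1" + code for ch, code in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode_bits(message, codes):
    """Concatenate the code of each character into one bit string."""
    return "".join(codes[ch] for ch in message)


codes = huffman_code("aaaabbc")
encoded = encode_bits("aaaabbc", codes)
print(len(codes["a"]), len(codes["b"]), len(codes["c"]))  # 1 2 2
print(len(encoded), "bits versus", 8 * len("aaaabbc"), "bits in 8-bit ASCII")
```

In this small example the most frequent character, "a", receives a one-bit code, while the rarer "b" and "c" receive two-bit codes.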
Dictionary based techniques use a slightly different approach in that a portion, or portions, of the data is first scanned to determine which characters, or character strings, occur most frequently. The characters, and character strings, are placed in a dictionary and assigned a predetermined code having a code length inversely proportional to the character's, or character string's, probability. The characters and character strings are read from the data file, matched up with their appropriate dictionary entry, and coded with the appropriate code.
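The dictionary approach described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method of any particular product: the strings are taken to be space-separated words, the whole text is scanned to build the dictionary, and each entry's code is simply its rank, so that the most frequent strings receive the smallest numbers (which a variable-length integer encoding could then store in fewer bits). The names `build_dictionary`, `encode` and `decode` are hypothetical.

```python
from collections import Counter


def build_dictionary(text, max_entries=256):
    """Scan the text and rank its words from most to least frequent."""
    words = text.split(" ")
    return [word for word, _ in Counter(words).most_common(max_entries)]


def encode(text, dictionary):
    """Replace each word by its dictionary rank; unknown words stay literal."""
    index = {word: rank for rank, word in enumerate(dictionary)}
    return [index.get(word, word) for word in text.split(" ")]


def decode(codes, dictionary):
    """Map ranks back to words; literals pass through unchanged."""
    return " ".join(
        dictionary[c] if isinstance(c, int) else c for c in codes
    )


text = "the cat and the dog and the bird"
dictionary = build_dictionary(text)
codes = encode(text, dictionary)
print(dictionary[:2])                      # ['the', 'and']
print(decode(codes, dictionary) == text)   # True
```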
Recently, data compression software has proliferated in the DOS community. This software suffers from a number of drawbacks. In particular, the programs are typically disc-intensive and, consequently, their performance is tied closely to the speed with which one can read from and write to the disc. For example, a popular computer compaction program known as PKZIP™, operating on a 25-MHz 386 GATEWAY 2000™ with a hard disc having an 18 millisecond random access speed, takes 8.5 seconds to compress a 1-megabit ASCII file to 1/2 of its original size. Data base and spread sheet files take approximately the same amount of time, but they can be reduced by as much as two-thirds of their original size. Binary files shrink the least (generally by between 10 and 40 percent of their original size) but require six times longer to compress than ASCII files of comparable length.
In view of the above deficiencies of known data compression methods, a need exists for more effective and more efficient data compression techniques.
It is believed that the present invention achieves levels of data compression which heretofore have never been accomplished. Known data compression products generally cannot obtain compression greater than 50 percent for text and graphic files and are even less successful (approximately 45 percent compression) on program execution files. With the present invention, data compression levels of 90 percent (and greater in certain applications) can be achieved in no more time than it takes presently available data compression products to compress the same data to 50 percent levels. The present invention achieves these high compression percentages by locating and separating ordered streams of information from what appears to be random (or chaotic) forms of information. Prior methods of data compression are largely unsuccessful in finding order (redundancy) within data which otherwise appears to be randomly arranged (without redundancy). Consequently, they are ineffective for compressing that which they cannot find.
As is well understood, once ordered forms of data are extracted from the random data, such ordered data can be easily compressed.
A second aspect of the present invention, which further enhances its ability to achieve high compression percentages, is its ability to be applied to data recursively. Specifically, the methods of the present invention are able to make multiple passes over a file, each time further compressing the file. Thus, a series of recursions is repeated until the desired compression level is achieved.