The present invention relates to a system and method for compressing and decompressing data in real time.
Data compression is possible because the data people normally use contains a lot of redundancy, that is, parts of the data are repeated throughout the whole message. Two major types of compression algorithms exist: lossless compressors and lossy compressors. The difference between the two is that lossy algorithms discard information about the data in the process of compression and therefore cannot restore the data exactly as it was. Lossless compression algorithms can be divided into two classes: statistical methods and dictionary methods. The present invention is based on a lossless dictionary algorithm.
Lossless compression and decompression algorithms are well known in the art. Reference may be made, for example, to the LZ77 algorithm and its variations (see for example U.S. Pat. Nos. 5,155,484 (Chambers), 4,701,745 (Waterworth) and 5,521,597 (Dimitri), to name a few).
The LZ77 compression algorithm reduces the length of messages by replacing repeated patterns of data with references to those patterns. Each reference is given as a pair of numbers: the length of the repeated data and its “offset” from the previous occurrence. For example, the string “abracadabra” would be compressed as “<abracad><4,7>”, meaning that the first seven characters contained no repeated data and that the last four could be referenced as a repetition of 4 characters located 7 characters back in the string. Four characters (32 bits) were replaced by a pair of numbers, sometimes called a reference. Compression therefore occurs if one is able to express the two numbers in fewer than 32 bits. Since bits have no meaning unless a given representation exists, one also has to be able to differentiate between a pair of numbers and a sequence of normal characters (this is denoted by the < > symbols). This means that additional bits are needed to determine whether bits represent a reference or actual characters.
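The “abracadabra” example can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `copy_reference` is hypothetical, and the offset is assumed to be counted back from the end of the text produced so far, as in the example above.

```python
def copy_reference(output: str, length: int, offset: int) -> str:
    """Append `length` characters that start `offset` characters back in `output`."""
    if offset <= 0 or offset > len(output):
        raise ValueError("offset must point inside the text produced so far")
    chars = list(output)
    start = len(chars) - offset
    # Copy one character at a time so that a reference may overlap
    # the characters it is itself producing.
    for k in range(length):
        chars.append(chars[start + k])
    return "".join(chars)


# "<abracad><4,7>": seven literal characters, then 4 characters from 7 back.
restored = copy_reference("abracad", length=4, offset=7)
print(restored)  # abracadabra

# The four replaced characters occupy 4 * 8 bits when stored as plain bytes.
print(4 * 8)  # 32
```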
This type of algorithm, and its derivatives, are efficient when it comes to compressing a whole file. Indeed, one of the characteristics of the LZ77 algorithm is that it requires a “static” file to be able to perform compression.
With the advent of the Internet, however, communications between two or more users are more and more frequent. These communications are either uncompressed or compressed. Compressed communications require the user to perform a compression of the file prior to sending the communication, and require the receiver to perform decompression to read the communication. On the other hand, uncompressed communications require a large bandwidth.
There is thus a need to preserve bandwidth on a communications network while still allowing for data compression, and more particularly, for a system and method to perform data compression and decompression in real time.
The present invention features a system and method for efficiently and quickly compressing and decompressing data in real time utilizing several novel approaches to data compression, including the dynamic representation of coded values.
One feature of the present invention processes characters by checking to see if they have been seen before in the input string. If they haven't, the present system and method adds them to a dictionary or table, and if they have, the system and method finds the longest match possible between the previous occurrence of the character(s) and the one(s) under examination. When a match is found, the present invention writes a representation of the characters that could not be compressed (if there are characters between the last match and this one) and a representation for the matching characters. This process is repeated until there are no characters left in the input string. The present system and method checks for previous encounters of one or more characters, and uses a lookup table for that purpose. The lookup table references singly linked lists of the same characters. The first node of the list references the most recent occurrence of the characters.
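The matching loop described above might be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the lookup table is keyed on a single character, matches shorter than a hypothetical `min_match` of 2 are treated as uncompressed, the offset is the distance back to the start of the earlier occurrence, and tokens are plain Python tuples rather than bit strings. The names `Node`, `compress`, `"nc"` and `"c"` are chosen for illustration only.

```python
class Node:
    """One entry of a singly linked list of positions for the same character."""
    __slots__ = ("pos", "next")

    def __init__(self, pos, next_node):
        self.pos = pos
        self.next = next_node


def compress(data: str, min_match: int = 2):
    """Return a list of ("nc", text) and ("c", mlen, offset) tuples."""
    table = {}          # character -> head node (most recent occurrence first)
    tokens = []
    literal_start = 0   # start of the pending run of uncompressed characters
    i, n = 0, len(data)
    while i < n:
        best_len, best_pos = 0, -1
        node = table.get(data[i])
        while node is not None:          # walk earlier occurrences, newest first
            length = 0
            while i + length < n and data[node.pos + length] == data[i + length]:
                length += 1
            if length > best_len:        # ties keep the most recent occurrence
                best_len, best_pos = length, node.pos
            node = node.next
        matched = best_len >= min_match
        if matched:
            if literal_start < i:        # flush characters that did not match
                tokens.append(("nc", data[literal_start:i]))
            tokens.append(("c", best_len, i - best_pos))
        step = best_len if matched else 1
        for j in range(i, i + step):     # record positions, newest at the head
            table[data[j]] = Node(j, table.get(data[j]))
        i += step
        if matched:
            literal_start = i
    if literal_start < n:                # trailing uncompressed characters
        tokens.append(("nc", data[literal_start:]))
    return tokens


print(compress("abracadabra"))  # [('nc', 'abracad'), ('c', 4, 7)]
```

Pushing each new position onto the head of its list means the first node always refers to the most recent occurrence, so the shortest offsets are examined first.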
The present invention also utilizes vectors of integers called expansion schemes, which determine the binary representation of the length of characters NOT compressed (nclen); the length of the matching characters (mlen); and the number of characters processed since the last match (offset). Expansion schemes produce variable length codes that are efficient in representing values that obey strictly decreasing probability distribution functions. Expansion schemes also provide a highly economical way, in terms of memory usage, of storing bit representations. Depending on the number of bytes that have been processed, the compressor changes its expansion schemes to reflect changes in the probability distribution of the lengths and offsets.
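The text above does not spell out how an expansion scheme maps a value to bits, so the following is only one plausible reading, offered as a hypothetical sketch. It assumes that the scheme is a list of field widths, that a field of all ones means "continue into the next, wider field", and that the last field has no escape value. The function names and the sample scheme `[1, 2, 4]` are invented for illustration.

```python
def encode(value: int, scheme: list[int]) -> str:
    """Encode a non-negative integer as a bit string (assumed escape-based layout)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    bits = ""
    for i, width in enumerate(scheme):
        is_last = i == len(scheme) - 1
        escape = (1 << width) - 1            # all ones: reserved unless last field
        if is_last:
            if value >= (1 << width):
                raise ValueError("value too large for this scheme")
            return bits + format(value, f"0{width}b")
        if value < escape:
            return bits + format(value, f"0{width}b")
        bits += "1" * width                  # escape into the next field
        value -= escape
    raise ValueError("scheme must contain at least one width")


def decode(bits: str, scheme: list[int], pos: int = 0) -> tuple[int, int]:
    """Return (value, next_position) for a code starting at `pos`."""
    base = 0
    for i, width in enumerate(scheme):
        chunk = int(bits[pos:pos + width], 2)
        pos += width
        escape = (1 << width) - 1
        if i == len(scheme) - 1 or chunk < escape:
            return base + chunk, pos
        base += escape
    raise ValueError("scheme must contain at least one width")


scheme = [1, 2, 4]                           # hypothetical widths, small values first
for v in (0, 1, 3, 4):
    print(v, encode(v, scheme))
```

Under these assumptions small values receive the shortest codes, which suits a strictly decreasing probability distribution, and the whole code table is described by a handful of integers. Switching schemes as more bytes are processed would then amount to passing a different list, provided the decompressor switches at the same point.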
The decompressor associated with the present invention utilizes an input string comprised of tokens representing the original string. The output string will contain the decompressed string. The decompressor reads a token. If the token is an nctoken (a bit indicating a non-compressed sequence, the nclen and the characters themselves), it reads the length of the sequence, copies those characters to the output string, then reads the ctoken that follows and copies mlen characters at the given offset to the output string. If the token is a ctoken (a bit indicating the sequence, if it immediately follows another ctoken, and the match length and offset pair or mlen and offset in short form), it copies mlen characters at the given offset to the output string. This continues until all of the tokens are read.
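The token loop described above can be sketched as follows. This is an illustration under assumptions rather than the actual bit-level format: tokens are modelled as Python tuples, `("nc", text)` for an nctoken and `("c", mlen, offset)` for a ctoken, the offset is counted back from the end of the output, and a final nctoken with no ctoken after it is accepted as the end of the stream. The flag bits and variable-length fields are deliberately left out.

```python
def decompress(tokens) -> str:
    """Rebuild the original string from ("nc", text) / ("c", mlen, offset) tuples."""
    out = []
    stream = iter(tokens)
    for token in stream:
        if token[0] == "nc":
            out.extend(token[1])             # copy the uncompressed characters
            token = next(stream, None)       # an nctoken is followed by a ctoken
            if token is None:                # assumption: trailing literals end the stream
                break
            if token[0] != "c":
                raise ValueError("expected a ctoken after an nctoken")
        _, mlen, offset = token
        start = len(out) - offset
        if offset <= 0 or start < 0:
            raise ValueError("offset points outside the output")
        for k in range(mlen):                # char by char, so overlaps are allowed
            out.append(out[start + k])
    return "".join(out)


print(decompress([("nc", "abracad"), ("c", 4, 7)]))  # abracadabra
```

Because an nctoken is always followed by a ctoken, only a ctoken that follows another ctoken needs a flag to identify it, which is consistent with the token description above.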