Data transferred over communication links between commercial computer systems generally contains significant redundancy. A number of mechanisms and procedures exist for lessening the redundancy and for creating substantially more efficient use of the transmission bandwidth. The term "data compression" refers to any process that converts data in a first given format into a second format having fewer bits than the original. Data compression systems are particularly effective if the original data contains substantial redundancy, such as symbols or strings of symbols which repetitively appear with high frequency.
Preferred data compression methods are transparent in that the application computer programmer is not aware of the existence of the compression method, except in system performance manifestations. As a result, decompressed data is an exact replica of the input data and the compression apparatus is given no special program information. So long as the transmission protocols are constant in the communication network, transparent compression can be readily accomplished. However, once various protocols and data formats find their way into a network, effective data compression becomes much more complex.
Such systems can be found today in wide area networks (WANs), which interconnect pluralities of local area networks (LANs). In general, internal LAN interconnections occur over wide bandwidth, hard-wired or optical interconnects that alleviate the requirements for data compression. By contrast, most WANs employ the telephone network for LAN interconnection purposes and, as a result, are significantly bandwidth-limited.
There are a number of general purpose data compression procedures described in the prior art. A popular compression method, known as "Huffman" encoding, translates fixed-size pieces of input data into variable-length symbols. The procedure assigns codes to input symbols such that each code length, in bits, is approximately -log2(symbol probability), where symbol probability is the relative frequency of occurrence of a given symbol, expressed as a probability. Huffman encoding exhibits a number of limitations. The bit-run size of input symbols is limited by the size of the translation table needed for compression. The decompression process is complex, and it is also necessary to know the frequency distribution for the group of possible input symbols.
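The relationship between symbol probability and code length can be illustrated with a minimal Python sketch. It is not taken from any cited reference; the function names `huffman_code_lengths` and `ideal_lengths` are hypothetical, and the input is assumed to be a byte string whose symbol frequencies are known in advance.

```python
import heapq
import math
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return {symbol: code length in bits} for a Huffman code built from data."""
    freq = Counter(data)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def ideal_lengths(data: bytes) -> dict[int, float]:
    """Return {symbol: -log2(probability)}, the length each code approximates."""
    total = len(data)
    return {s: -math.log2(n / total) for s, n in Counter(data).items()}
```

For the input `b"aaaabbcd"` the symbol probabilities are 1/2, 1/4, 1/8 and 1/8, so the Huffman code lengths of 1, 2, 3 and 3 bits coincide exactly with -log2 of each probability. For less convenient distributions the lengths are only approximations, since each code must be a whole number of bits.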
A further type of encoding is known as "run-length" encoding and causes sequences of identical characters to be encoded as a count field appended to an identifier of the repeated character. While this approach is effective in graphical images, it has virtually no value in text and has moderate value for data files.
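A minimal Python sketch of run-length encoding follows. It is illustrative only: the representation of each run as a (character, count) pair, the one-byte limit on the count, and the names `rle_encode` and `rle_decode` are assumptions made for the example.

```python
MAX_RUN = 255  # assumption: the count field is a single byte


def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Encode data as a list of (character, count) runs."""
    runs: list[tuple[int, int]] = []
    for byte in data:
        if runs and runs[-1][0] == byte and runs[-1][1] < MAX_RUN:
            # Same character as the current run: extend the count.
            runs[-1] = (byte, runs[-1][1] + 1)
        else:
            # A different character (or a full count field) starts a new run.
            runs.append((byte, 1))
    return runs


def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    """Expand (character, count) runs back into the original bytes."""
    return b"".join(bytes([byte]) * count for byte, count in runs)
```

Under these assumptions each run costs two bytes. An input such as `b"aaaabbbcc"` shrinks from nine bytes to three runs, whereas the text `b"hello world"` yields ten runs for eleven characters and therefore grows, which is consistent with the observation that the approach has little value for text.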
Recently, a method termed "adaptive" compression has appeared and has become, in various configurations, widely used. Algorithms for adaptive compression have been published by J. Ziv and A. Lempel, in "A Universal Algorithm For Sequential Data Compression", IEEE Transactions on Information Theory, Vol. IT-23, No. 3, May 1977, pp. 337-343 and in "Compression of Individual Sequences Via Variable-Rate Coding", IEEE Transactions on Information Theory, Vol. IT-24, No. 5, September 1978, pp. 530-536.
The Lempel-Ziv algorithm converts variable length strings of input symbols into fixed-length (or predictable length) codes. The symbol strings are selected so that all have almost equal probability of occurrence. Consequently, strings of frequently occurring symbols will contain more symbols than a string having infrequent symbols. This form of compression is effective at exploiting character frequency redundancy, character repetitions, and high usage pattern redundancy.
One of the first algorithms published by Lempel and Ziv (typically referred to as LZ77) maintains a history buffer of the last N characters from the input data stream (typically 4,096) and encodes the output data stream as a sequence of symbols. If a character is not found in the history buffer, it is encoded as a zero, followed by the unencoded eight bit character, resulting in a nine bit code. "Unencoded" in this sense means the eight bit binary character which corresponds to the alpha-numeric character. If a character or string of characters is found in the buffer, the string is encoded as a 1, followed by an index into the buffer and a match length. This enables the receiving end to reconstruct the transmitted data from its own copy of the buffer.
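The history-buffer scheme described above can be sketched in Python as follows. This is a simplified illustration rather than the published algorithm: it emits tokens instead of a packed bit stream, expresses the index as a backward offset from the current position, and uses a naive search. The constants `MIN_MATCH` and `MAX_MATCH` and the function names are assumptions.

```python
WINDOW = 4096    # history buffer size, as stated in the text
MIN_MATCH = 2    # assumption: shorter matches are sent as literals
MAX_MATCH = 17   # assumption: the length field is limited to a few bits


def lz77_encode(data: bytes) -> list[tuple]:
    """Return tokens: (0, char) for a literal, (1, offset, length) for a match."""
    tokens: list[tuple] = []
    pos = 0
    while pos < len(data):
        best_len, best_off = 0, 0
        # Naive search of the history buffer for the longest match.
        for cand in range(max(0, pos - WINDOW), pos):
            length = 0
            while (length < MAX_MATCH
                   and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        if best_len >= MIN_MATCH:
            tokens.append((1, best_off, best_len))   # flag 1: index and length
            pos += best_len
        else:
            tokens.append((0, data[pos]))            # flag 0: raw 8-bit character
            pos += 1
    return tokens


def lz77_decode(tokens: list[tuple]) -> bytes:
    """Rebuild the data from tokens using the receiver's copy of the buffer."""
    out = bytearray()
    for token in tokens:
        if token[0] == 0:
            out.append(token[1])
        else:
            _, offset, length = token
            for _ in range(length):
                # Byte-by-byte copy so a match may overlap its own output.
                out.append(out[-offset])
    return bytes(out)
```

For the input `b"abcabcabcabc"` the encoder emits three literals followed by a single match token with offset 3 and length 9, and the decoder reconstructs the input using only the data it has already produced.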
More recently, a modification of a subsequent Lempel-Ziv data compression technique (LZ78), published by T. Welch, has become known as the LZW algorithm. The LZW algorithm converts strings of varying lengths from an input data stream to fixed-length, or predictable length, codes, typically 12 bits in length. The premise of the algorithm is that frequently occurring strings contain more characters than infrequently occurring strings.
Initially an LZW dictionary or code table is empty, except for the first 256 character positions which contain basic alpha-numeric single character entries. A new entry is created whenever a previously unseen string is encountered. The compressor searches the input stream to determine the longest match to a string stored in the dictionary. Each stored string comprises a prefix string and an extension character. Each string has a code signal associated with it. A string is stored in the string table by, at least implicitly, storing the code signal for the string. When a longest match between an input data character stream and a stored stream is determined, the code signal for the longest match is transmitted as the "compressed" code signal and a new string is stored in the string table. The prefix of the new string is the longest match of string characters and the suffix is an extension character which is the next data character from the input data that resulted in the longest match. Thus, as each compression occurs, the string lengths are increased by the addition of the extension character. Additional details of this algorithm can be found in U.S. Pat. No. 4,558,302 to T.A. Welch, and in an article by Welch entitled "A Technique For High-Performance Data Compression" IEEE Computer, June 1984, pp. 8-19.
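The compression procedure just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited patent: the function name `lzw_compress` is hypothetical, codes are returned as integers rather than packed into 12-bit fields, and the table simply stops growing once it holds 4,096 entries.

```python
def lzw_compress(data: bytes, max_codes: int = 4096) -> list[int]:
    """Return the list of LZW codes for data (12-bit codes when max_codes=4096)."""
    # The first 256 entries hold the single-character strings.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes: list[int] = []
    prefix = b""
    for byte in data:
        candidate = prefix + bytes([byte])
        if candidate in table:
            # The match can still be extended; keep reading input.
            prefix = candidate
        else:
            # Longest match found: transmit its code signal.
            codes.append(table[prefix])
            # Store the new string: longest match plus the extension character.
            if next_code < max_codes:
                table[candidate] = next_code
                next_code += 1
            prefix = bytes([byte])
    if prefix:
        codes.append(table[prefix])
    return codes
```

For the input `b"ABABABA"` the sketch emits the codes 65, 66, 256 and 258. The first two are single-character codes, while 256 and 258 stand for the strings "AB" and "ABA" that were added to the table as the input was processed.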
While the LZW data compression algorithm is widely used, it does exhibit a number of shortcomings. For instance, during early stages in the construction of the dictionary, many data fragments (i.e., strings of 2, 3, or 4 characters) will occupy large parts of the available dictionary storage. Thus, the amount of compression available will be limited by the available strings. Often, in lieu of compressing the data, the algorithm will actually expand the data being transmitted. For instance, when only a single character is found to match, as will be the case in the early stages of dictionary construction, the outputting of a 12 bit code for an 8 bit input character will result in a 50% increase in data.
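The arithmetic behind the 50% figure can be checked with a short Python sketch. The function name `output_to_input_ratio` is hypothetical, and the code and character widths are the 12 and 8 bits given in the text.

```python
CODE_BITS = 12  # width of each transmitted LZW code
CHAR_BITS = 8   # width of each input character


def output_to_input_ratio(chars_per_code: float) -> float:
    """Ratio of output bits to input bits when each code covers chars_per_code characters."""
    return CODE_BITS / (CHAR_BITS * chars_per_code)
```

A ratio above 1.0 means expansion. With one character per code the ratio is 1.5, the 50% increase noted above; with two characters per code it falls to 0.75. On these figures the break-even point is an average match length of 1.5 characters, which suggests why a dictionary dominated by very short strings yields little compression.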
Accordingly, it is an object of this invention to provide a data compression method particularly adapted for use with packetized data.
It is another object of this invention to provide an improved data compression method which avoids the accumulation of short data strings in the compression dictionary.
It is still another object of this invention to provide a data compression method that accommodates a plurality of protocols employing different type packets.
Yet another object of this invention is to provide an altered method of operation for the LZW compression algorithm that enables the attainment of improved compression results.