The proliferation of computer networks coupled with the reduced cost of long distance services is resulting in a large volume of data being transferred over communication mediums. It has consequently become important to employ lossless data compression techniques to reduce the amount of traffic on networks, thereby effectively increasing channel capacity and reducing communication costs.
The term "data compression" refers to any process that converts data in a first given format into a second format having fewer bits than the original. "Lossless data compression techniques" refers to a data compression and decompression process in which the decompression process generates an exact replica of the original uncompressed data. "Application of data to a medium" or "applying data to a medium" refers to placing the data on a communications medium or a storage medium. This involves the step of generating physical signals (electrical, electromagnetic, light, or other) which are sent (for a communications medium) or stored (for a storage medium).
Data transferred over communication mediums between commercial computer systems generally contains significant redundancy. Data compression techniques have been proposed as a means of reducing the redundancy content of the data, such that it could be transmitted in less time over communication channels. In general, data compression systems are particularly effective if the original data contains substantial redundancy.
There are many approaches to performing general purpose data compression in the prior art. A data compression method, known as "Huffman" encoding (see Huffman D. A., "A Method for the Construction of Minimal-redundancy Codes", Proceedings IRE, Vol. 40, No. 9, pp. 1098-1101, September 1952), has received considerable attention in the prior art. In this method, it is assumed that each byte within a data file occurs with a certain frequency. Huffman encoding works by assigning to each byte a bit string, the length of which is inversely related to its frequency. Huffman proposed an algorithm for optimally assigning the bit strings, based on the relative frequency of their corresponding bytes in the file. The resulting bit strings are uniquely decodable. In practice, Huffman encoding translates fixed size pieces of input data into variable length symbols. Huffman encoding exhibits a number of limitations. In its generic form, it requires two passes over the data to determine the correct frequency of the bytes. For real time data transmission systems, such a requirement hinders the efficiency of the data compression sub-system. In actual implementation, the bit-run size of input symbols is limited by the size of the translation table needed for compression. The decompression process is very complex and computationally expensive.
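The Huffman construction described above can be illustrated with a minimal sketch. It assumes byte-oriented input and represents bit strings as Python text for readability; the function names are illustrative and do not come from the cited reference. The frequency count at the start is the first of the two passes mentioned above.

```python
import heapq
from collections import Counter


def huffman_code(data: bytes) -> dict:
    """Return a prefix code mapping each byte value to a bit string."""
    freq = Counter(data)  # first pass over the data: count byte frequencies
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct byte still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_encode(data: bytes, code: dict) -> str:
    """Second pass: translate each fixed-size byte into its variable-length code."""
    return "".join(code[b] for b in data)


def huffman_decode(bits: str, code: dict) -> bytes:
    """Decode bit by bit; the prefix property makes the result unique."""
    inverse = {v: k for k, v in code.items()}
    out, current = bytearray(), ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)
```

For the input b"abracadabra", the most frequent byte "a" receives a one-bit code while the rarer bytes receive longer codes, so the encoded length is well below the 88 bits of the original.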
A second popular approach to data compression is known as "run-length" encoding. This method encodes repeated characters in a file in a format that consists of an escape character, repeat count and the character. The rest of the characters are encoded as plain text. The escape character is chosen as a seldom used character. The value of run-length encoding is dependent on the file type. Run-length encoding performs well on graphical image files, but has virtually no value in text files, and only moderate value in data files.
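A minimal sketch of the escape-based format described above follows. The escape byte value, the minimum run length worth encoding, and the one-byte repeat count are assumptions made for illustration; the text does not fix them.

```python
ESC = 0x1B     # assumed escape byte: chosen as a seldom used character
MIN_RUN = 4    # assumed: shorter runs are cheaper to leave as plain text


def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        run = 1
        # Measure the run, capped so the count fits in one byte.
        while i + run < len(data) and data[i + run] == b and run < 255:
            run += 1
        if run >= MIN_RUN or b == ESC:
            # Escape character, repeat count, then the character itself.
            # A literal escape byte is always written in this form.
            out += bytes([ESC, run, b])
        else:
            out += bytes([b]) * run  # plain text
        i += run
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC:
            count, b = data[i + 1], data[i + 2]
            out += bytes([b]) * count
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

Ordinary text such as b"hello world" contains no run of four or more identical bytes, so it passes through unchanged, which is consistent with the observation that the method has little value for text files.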
Another method of data compression is based on the idea of arithmetic coding. The term "arithmetic coding" refers to a method for performing an encoding operation during the process of compressing the data, and a decoding operation during the process of decompressing the compressed data. The term "arithmetic coder" refers to the means for performing arithmetic coding. The method of arithmetic coding was suggested by Elias and presented by Abramson (see Abramson, N., "Information Theory and Coding", McGraw-Hill, 1963). Practical implementations of Elias's technique were suggested by Rissanen (see Rissanen, J., "Generalized Kraft Inequality and Arithmetic Coding", IBM Journal of Research and Development, Vol. 20, pp. 198-203, May 1976), and most recently by Witten et al. (see Witten, I. H. et al., "Arithmetic Coding for Data Compression," Communications of the ACM, Vol. 30, No. 6, pp. 520-540, June 1987). Arithmetic coding works by representing the source data as a fraction that assumes a value between zero and one. The encoding algorithm is a recursive one that repeatedly subdivides an interval and retains the selected sub-interval as the new interval for the next encoding step of the recursion. The recursive subdivision of the interval is done in proportion to probabilistic estimates of the symbols as generated by a given model. The decoder identifies the next symbol, using a division and a search, by first looking at the position of the received value within the current coding interval, and then mimics the operation of the encoder to generate the new coding sub-interval. The arithmetic coder works in conjunction with a model that generates probability estimates of the symbols that have occurred in the data. The strength of arithmetic coding resides in the separation of the model from the coder. Arithmetic coding can be used with static or adaptive models.
Static models assume fixed probability distributions of the symbols which are determined beforehand. The encoder and the decoder have access to the same model. Adaptive models are one pass methods which are dynamic in nature since they learn the frequency distribution of the data over time. In theory, arithmetic coding is optimal, and in practice it approaches the theoretical limit of the entropy of the model. In practice, arithmetic coding is far superior to techniques based on the better known Huffman method. Basically, Huffman coding produces an encoding with an average length that only approximates the entropy of the probabilities being generated by the model, while arithmetic coding has the ability to encode symbols using minimal average code length. In general, Huffman coding can be shown to be a special case of arithmetic coding. The main drawback of arithmetic coding is its high computational complexity. The standard implementation of the algorithm requires up to two multiplications and one division for encoding each symbol, and up to two multiplications and two divisions to decode a single event. The optional operation of updating the model is not included in this estimate. Such computational complexity hinders the implementation of arithmetic coding in practical real time data transmission and networking systems.
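The interval subdivision described above can be sketched as follows. This is an illustrative sketch only: it uses exact rational arithmetic and a static model, whereas practical coders such as the one by Witten et al. use fixed precision registers and incremental output. The model PROBS and all function names are assumptions introduced for the example.

```python
from fractions import Fraction

# Assumed static model, shared by the encoder and the decoder.
PROBS = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}


def build_intervals(probs):
    """Assign each symbol a sub-interval of [0, 1) proportional to its probability."""
    intervals, start = {}, Fraction(0)
    for sym, p in probs.items():
        intervals[sym] = (start, start + p)
        start += p
    return intervals


def ac_encode(message, probs):
    """Return a fraction in [0, 1) that identifies the whole message."""
    intervals = build_intervals(probs)
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        s_lo, s_hi = intervals[sym]
        # Keep only the sub-interval that belongs to this symbol.
        low, width = low + width * s_lo, width * (s_hi - s_lo)
    return low + width / 2  # any value inside the final interval would do


def ac_decode(value, length, probs):
    """Recover `length` symbols by mimicking the encoder's subdivisions."""
    intervals = build_intervals(probs)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        target = (value - low) / width  # the division
        for sym, (s_lo, s_hi) in intervals.items():  # the search
            if s_lo <= target < s_hi:
                break
        out.append(sym)
        low, width = low + width * s_lo, width * (s_hi - s_lo)
    return "".join(out)
```

For the message "abac" under this model, the final interval has width 1/2 x 1/4 x 1/2 x 1/4 = 1/64, so about 6 bits suffice to identify it, which equals the information content of the message under the model. The per-symbol multiplications and the decoder's division are visible in the two update lines.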
Several approaches have been suggested to improve on the computational speed of arithmetic coding. Witten et al. presented a practical implementation of the algorithm that uses fixed precision registers and allows for incremental transmission and reception of compressed data bits. The statistical model used by the Witten et al. arithmetic coder achieves high compression efficiency but is computationally very expensive to maintain. This is because the model is expected to update the symbol counts and the cumulative frequencies at every iteration of the encoding process. Moffat (see Moffat, A., "Linear Time Adaptive Arithmetic Coding," IEEE Transactions on Information Theory, Vol. 36, No. 2, pp. 401-406, March 1990) proposed a modification to the adaptive model of Witten et al., and proposed an algorithm that enabled adaptive arithmetic coding to be performed in linear time as a function of the number of inputs and outputs. However, the approach does not address the time consuming divide instructions that are required by the algorithm. Howard (see Howard P. G. and J. S. Vitter, "Practical Implementation of Arithmetic Coding" published in Image and Text Compression, edited by James A. Storer, Kluwer Academic Publications) proposed a variant on arithmetic coding called "quasi-arithmetic" coding, which is an arithmetic coder with a reduced number of states. The work shows that it is possible to achieve faster execution speed of the generic algorithm at the expense of some loss in compression efficiency. The main drawback of this approach is that it is suitable for small size alphabets, but is impractical for applications that require large size alphabets. There are other variations to arithmetic coding, which are specifically designed to handle binary alphabets (see Langdon, Jr., "An Introduction to Arithmetic Coding", IBM Journal of Research and Development, Vol. 28, No. 2, pp. 135-149, March 1984).
They can achieve high compression efficiency for binary images but are generally not extensible to multi-symbol alphabets. Radford Neal (see Neal R. M., "Fast Arithmetic Coding Using Low-Precision Division," Manuscript, 1987--source code available by anonymous ftp from fsa.cpsc.ucalgary.ca.) introduced an alternative approach that reduces the computational requirements of the arithmetic coder by using low precision division. The approach improves the speed of the algorithm with minimal reduction in compression efficiency. However, the approach uses the same statistical model proposed by Witten et al., and therefore is computationally expensive, particularly for real time networking systems.
Another approach for data compression, referred to as "ZL", was developed by Ziv and Lempel (see J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Vol. IT-23, No. 3, May 1977, pp. 337-343), together with its variants, such as the "LZW" method introduced by Welch (see Welch Terry A., "A Technique for High-performance Data Compression", IEEE Computer, pp. 8-19, June 1984). The ZL method assigns fixed-length codes to variable size strings. The ZL method maintains a history buffer of the last N (typically 4096) characters from the input data stream and encodes the output data stream as a sequence of symbols. If a character or stream of characters is found in the dictionary (history buffer), the stream is encoded as one followed by an index and length in the dictionary. If the character string is not found, it is encoded as zero followed by the original eight bit character, resulting in a nine bit code. The encoding procedure enables the receiving end to reconstruct the transmitted data from its copy of the buffer, without the overhead of transmitting table information. In a typical implementation of the ZL method, the size of the index is in the range of 11-14 bits, with 12 bits as the most common due to the ease of its implementation. Hashing functions are generally used for the efficient matching of strings. The ZL method can achieve high compression efficiency, particularly on files containing data consisting of long repetitive strings. The main drawback of the ZL method is that for long data files the dictionary tends to fill up. In this case, different approaches could be used to solve the problem. In one approach, the size of the dictionary could be increased. But this in turn requires the use of more bits to represent the index, and hence reduces the compression efficiency. In an alternative approach, all or part of the dictionary could be discarded.
However, due to the nature of the algorithm, which basically has infinite memory, it is difficult to come up with a table reduction strategy that minimizes the loss of compression ratio.
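The history-buffer scheme described above can be sketched as follows. The sketch makes several assumptions not fixed by the text: a 4-bit length field, a minimum match length of three characters, and a brute-force search in place of the hashing functions used in practice. Tokens are kept as Python tuples rather than packed bits so that the flag convention stays visible.

```python
WINDOW = 4096    # history buffer size N from the text
MIN_MATCH = 3    # assumed: shorter matches are cheaper as 9-bit literals
MAX_MATCH = 18   # assumed: a 4-bit length field storing (length - MIN_MATCH)


def zl_encode(data: bytes):
    """Return tokens (0, byte) for literals and (1, offset, length) for matches."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Brute-force search of the history buffer for the longest match.
        for j in range(max(0, i - WINDOW), i):
            k = 0
            while k < MAX_MATCH and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= MIN_MATCH:
            tokens.append((1, best_off, best_len))
            i += best_len
        else:
            tokens.append((0, data[i]))
            i += 1
    return tokens


def zl_decode(tokens) -> bytes:
    """Rebuild the data from the decoder's own copy of the history buffer."""
    out = bytearray()
    for token in tokens:
        if token[0] == 0:
            out.append(token[1])
        else:
            _, offset, length = token
            for _ in range(length):
                out.append(out[-offset])  # byte-wise copy permits overlapping matches
    return bytes(out)


def encoded_bits(tokens) -> int:
    """9 bits per literal; 1 flag + 12-bit index + 4-bit length per match."""
    return sum(9 if token[0] == 0 else 1 + 12 + 4 for token in tokens)
```

For the 18-byte input b"abcabcabcabcabcabc", the encoder emits three literals followed by one match of length 15 at offset 3, for a total of 44 bits against 144 bits uncompressed. No table is transmitted; the decoder needs only the tokens.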
The LZW algorithm converts strings of varying lengths from an input data stream to fixed-length, or predictable length codes, typically 12 bits in length. The premise of the algorithm is that frequently occurring strings contain more characters than infrequently occurring strings. The LZW algorithm starts with an initial dictionary that is empty except for the first 256 character positions, which contain the basic alpha-numeric single character entries. A new entry is created whenever a previously unseen string is encountered. The encoder searches the input stream to determine the longest match to a string stored in the dictionary. In the dictionary, each string comprises a prefix string and an extension character. Each string has a code signal associated with it. A string is stored in the string table by, at least implicitly, storing the code signal for the string. When a longest match between an input data character string and a stored string is found, the code signal for the longest match is transmitted as the compressed code signal and a new string is stored in the string table. The LZW algorithm exhibits a number of shortcomings. Particularly during the initial stages of the construction of the dictionary, many data fragments will occupy large parts of the available dictionary space. This in turn will reduce the amount of achievable compression. In some cases, the method will actually expand data by up to 50% as opposed to compressing it.
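A minimal sketch of the longest-match procedure described above follows. It assumes a 4096-entry table (12-bit codes) that simply stops growing when full; the text does not prescribe what happens at that point, and the function names are illustrative. Codes are returned as integers rather than packed into 12-bit fields.

```python
def lzw_encode(data: bytes, max_codes: int = 4096):
    """Emit one code per longest dictionary match."""
    # Initial dictionary: the 256 single-character strings.
    table = {bytes([i]): i for i in range(256)}
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate  # keep extending the match
        else:
            codes.append(table[current])  # transmit code for the longest match
            if len(table) < max_codes:
                # New entry: prefix string plus extension character.
                table[candidate] = len(table)
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decode(codes, max_codes: int = 4096) -> bytes:
    """Rebuild the same table on the receiving side while decoding."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # The code refers to the entry the encoder created on its last step.
            entry = previous + previous[:1]
        out += entry
        if len(table) < max_codes:
            table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)
```

If no string in the input ever repeats, every 8-bit character is emitted as its own 12-bit code, which is the source of the 50% expansion noted above.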
Another method of data compression is used by the commercially available Stacker LZS™ compressor (see U.S. Pat. No. 5,016,009). In this method, an input data character stream is converted into a variable length encoded data stream. The method uses an array of history tables, and a hashing function that maps the characters into a string list, where a mechanism for finding the longest match is employed. This technique encodes variable length strings into variable length code strings that are further encoded using run-length encoding. The method is relatively computationally inexpensive, but suffers from the limitations of run-length encoding techniques. Consequently, the resulting compression ratios are very moderate.
Preferred data compression methods in a computer networking system are generally transparent to the end user. That is, the user is not aware of the existence of the compression method, except in system performance manifestations. As a result, decompressed data is an exact replica of the input data, and the compression apparatus is given no special program information. For optimal performance, reduced hardware costs and effective link utilization, it is preferred that the compression method be computationally inexpensive while achieving high compression efficiency.
In this regard, U.S. Pat. No. 5,293,379 teaches a data processing system for transmitting compressed data from one Local Area Network (LAN) to another LAN across Wide Area Networks (WANs). The data processing system employs an efficient mechanism that rearranges the protocol header fields and user data portions in LAN packets for efficient information compression and transmission over WANs. The preferred technique considers the packets to be composed of static and dynamic fields, where static fields contain information that often remains constant during a multi-packet communication interval and dynamic fields contain information that changes for each packet. U.S. Pat. No. 5,293,379 describes a compression method which includes reformatting each data packet by associating its static fields with a first packet region and its dynamic fields with a second packet region. The process then assembles a static table that includes static information from at least an initial data packet's first packet region. It then identifies static field information in a subsequent data packet's first packet region that is common to the information in the static table. Such common information is encoded so as to reduce its data length. The common static information is then replaced in the modified data packet with the encoded common static information and the modified data packet is then transmitted. A similar action occurs with the user data information. A single dictionary table is created for all packet headers, while separate dictionary tables are created for each user-data portion of a packet-type experienced in the communication network, thereby enabling better compression.
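The static-table idea described above can be loosely illustrated as follows. This is not the method of U.S. Pat. No. 5,293,379; it is a simplified sketch in which packets are Python dictionaries, the field names in STATIC_FIELDS are hypothetical, and the "encoding" of common static information is simply a table index. Dynamic fields are passed through unchanged.

```python
# Hypothetical field names; a real protocol header would define its own.
STATIC_FIELDS = ("src", "dst", "proto")


class StaticTableEncoder:
    def __init__(self):
        self.table = {}  # static field values -> table index

    def encode(self, packet: dict):
        # First region: static fields. Second region: everything else.
        static = tuple(packet[name] for name in STATIC_FIELDS)
        dynamic = {k: v for k, v in packet.items() if k not in STATIC_FIELDS}
        if static in self.table:
            # Already known: send a short index instead of the values.
            return ("ref", self.table[static], dynamic)
        self.table[static] = len(self.table)
        return ("new", static, dynamic)


class StaticTableDecoder:
    def __init__(self):
        self.table = []  # index -> static field values

    def decode(self, message) -> dict:
        kind, head, dynamic = message
        if kind == "new":
            self.table.append(head)
            static = head
        else:
            static = self.table[head]
        packet = dict(zip(STATIC_FIELDS, static))
        packet.update(dynamic)
        return packet
```

In this sketch, the first packet of a conversation carries its static fields in full, and each later packet with the same static fields carries only an index, while both ends keep their tables in step without exchanging them.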
Accordingly, one object of the present invention is to provide a data compression method for use with packetized data in a computer networking and communication system or data storage system that is computationally inexpensive to implement while achieving high compression efficiency.
Another object of the invention is to provide improved statistical data modelling methods that are computationally inexpensive to maintain while providing very good estimates of the probabilities of the processed data during the process of encoding or decoding.
Another object of the invention is to provide a two stage apparatus and method for performing data compression that achieves high compression ratios with reduced complexity in the implementation of the arithmetic coder.
Another object of the invention is to provide a data compression method that accommodates a plurality of protocols employing different types of packets in any computer network.