Data compression algorithms convert data defined in a given format to another format so that the resulting format contains fewer data bits (i.e., the ones and zeros that define digital data) than the original format. Hence, the data is compressed into a smaller representation. When the original data is needed, the compressed data is decompressed using an algorithm that is complementary to the compression algorithm.
Data compression techniques are used in a variety of data processing and data networking applications. Personal computer operating systems use data compression techniques to reduce the size of data files stored in the hard disk drives of the computer. This enables the operating system to store more files on a given disk drive. Data networking equipment uses data compression techniques to reduce the amount of data sent over a data network. For example, when a web browser retrieves a file from a web server, the file may be sent over the Internet in a compressed format. This reduces the transmission time for sending the file and reduces the usage of the network, thereby reducing the cost of transmission.
The performance of data compression techniques is mainly determined by three major factors. The first factor is the amount of compression achieved, or the ratio of the number of starting data bits to the number of bits produced. The second factor is the speed of compression, or the time needed to produce these bits. The third factor is the amount of computational overhead, in particular the requirement for computer resources such as memory. Generally, the following relation holds among these factors: the more compression achieved, the slower the process and the greater the overhead required; conversely, the faster the process, the less compression achieved.
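As a rough illustration of the first factor, the compression ratio can be computed directly from the input and output sizes. The sketch below uses Python's zlib module (one DEFLATE-based implementation, chosen here only for convenience) on a deliberately repetitive input; the exact ratio will vary with the data and the compression level:

```python
import zlib

# Highly repetitive sample input; repetition is what compressors exploit.
original = b"AAAAABBBCC" * 100

# level=9 trades speed for maximum compression, illustrating the
# compression-versus-speed relation described above.
compressed = zlib.compress(original, level=9)

# Compression ratio: starting bits divided by bits produced.
ratio = (len(original) * 8) / (len(compressed) * 8)
print(f"{len(original)} bytes -> {len(compressed)} bytes, ratio {ratio:.1f}:1")
```

Running the same input through a faster, lower compression level (e.g. `level=1`) would typically yield a larger output, consistent with the trade-off stated above.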
Normally, a particular compression technique is chosen according to the characteristics of the application. For example, "off-line" applications, which are not performed in real time, typically give up speed and overhead to achieve better compression. On the other hand, "on-line" applications, and in particular communication applications, typically settle for lesser compression to gain more speed.
Packet-based communication networks (such as the Internet) transfer information between computers and other equipment using a data transmission format known as packetized data. The stream of data from a data source (e.g., a host computer) is divided into variable or fixed length "chunks" of data (i.e., packets). Routers in the network route the packets from the source to the appropriate data destination. In many cases, the packets may be relayed through several routers before they reach their destination. Once the packets reach their destination, they are reassembled to regenerate the stream of data.
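The chunking and reassembly described above can be sketched as follows. This is a minimal illustration assuming fixed-length packets delivered in order; real networks must also handle reordering, loss, and addressing, which are omitted here:

```python
def packetize(data: bytes, size: int) -> list[bytes]:
    """Divide a byte stream into fixed-length chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def reassemble(packets: list[bytes]) -> bytes:
    """Regenerate the original stream by concatenating packets in order."""
    return b"".join(packets)

stream = b"The quick brown fox jumps over the lazy dog"
packets = packetize(stream, 8)
assert reassemble(packets) == stream
```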
Conventional packet-based networks use a variety of protocols to control data transfer throughout a network. For example, the Internet Protocol ("IP") defines procedures for routing data through a network. To this end, IP specifies that the data is organized into frames, each of which includes an IP header and the associated data. The routers in the network use the information in the IP header to forward the packet through the network. In the IP vernacular, each router-to-router (or switch-to-router, etc.) link is referred to as a hop.
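The header-plus-data frame structure can be sketched as below. The field layout here is hypothetical and deliberately simplified for illustration; it is not the actual IP header format, which contains many more fields:

```python
import struct

def make_frame(src: int, dst: int, payload: bytes) -> bytes:
    # Hypothetical simplified header (NOT real IP): 4-byte source address,
    # 4-byte destination address, 2-byte payload length, big-endian.
    header = struct.pack("!IIH", src, dst, len(payload))
    return header + payload

def parse_frame(frame: bytes):
    # A router would read only the header to decide where to forward the frame.
    src, dst, length = struct.unpack("!IIH", frame[:10])
    return src, dst, frame[10:10 + length]

frame = make_frame(0x0A000001, 0x0A000002, b"hello")
assert parse_frame(frame) == (0x0A000001, 0x0A000002, b"hello")
```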
Communication applications, or programs which facilitate the transmission of data on a communication channel, have certain characteristics which should be considered when choosing a technique for compression. If compression is desired, each packet should be compressed before transmission by the selected compression technique. Since communication channels between computers, particularly networks employing telephone system connections, have limited capacity, greater compression of the data increases the total amount of information which can be transmitted on the available bandwidth. On the other hand, since data compression for communication systems is typically needed on-line, the need for greater compression must be balanced against the increased amount of time and resources required for the compression process as the amount of compression increases. These competing requirements can be balanced by the choice of the proper data compression technique.
In general, data compression techniques encode the original data according to a translation data dictionary referred to herein as the "encoding table". An encoding table contains a series of mappings between the original data and the compressed representations of the actual data. For example, the letter "A" may be represented by the binary string "010." The encoding table is typically derived from the data according to a selected scheme relating to various statistical information gathered therefrom, such as the frequencies of certain patterns in the data. Normally, the length of the bit representation in the encoding table for encoded data patterns is inversely related to the frequency of occurrence of these patterns.
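The inverse relation between pattern frequency and code length can be made concrete with Huffman's classic construction, which the references below discuss. This is a minimal sketch assuming single-character symbols, not a production encoder:

```python
import heapq
from collections import Counter

def huffman_table(text: str) -> dict[str, str]:
    """Build an encoding table in which more frequent symbols get shorter codes."""
    freq = Counter(text)
    # Each heap entry: (frequency, unique tie-breaker, {symbol: partial code}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        # Prefix "0" onto one subtree's codes and "1" onto the other's.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

table = huffman_table("AAAABBC")
# "A", the most frequent symbol, receives the shortest bit string.
assert len(table["A"]) <= len(table["B"]) <= len(table["C"])
```

Note that a degenerate input containing a single distinct symbol would need special handling, which this sketch omits.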
Hereinafter, the term "text" refers to a stream of data bits which is provided as a unit to the compression algorithm, and includes but is not limited to, word data from a document, image data and other types of data. As noted above, the text can have features or characteristics such as internal patterns of data. The text can be compressed according to a number of different types of compression algorithms.
Hereinafter, the term "static compression algorithm" refers to algorithms which do not affect, update or otherwise change the encoding table for a given unit of text. Hereinafter, the term "dynamic compression algorithm" refers to algorithms for which the encoding table is constantly updated or changed according to features or characteristics of the text by a selected scheme. Hereinafter, the term "semi-static compression algorithm" refers to algorithms for which the encoding table is occasionally updated or changed according to the text by a selected scheme. Hereinafter, the term "adaptive compression algorithm" refers to a dynamic or semi-static algorithm in which the encoding table is either constantly or occasionally updated or changed according to data pattern variations encountered in the text.
The last class of algorithms, adaptive algorithms, has a number of advantages. For example, these algorithms permit the encoding table to be adjusted to best reflect the data patterns in the text, providing a "learning" capability. Furthermore, the encoding table need not necessarily be transmitted along with the encoded data, but rather can be fully rebuilt at the receiving end from the encoded data during decompression. Thus, this class of techniques is particularly well suited for data compression in a communication system.
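The table-rebuilding property can be illustrated with LZW, a well-known member of the Lempel-Ziv family, chosen here only as a compact example. The encoder grows its table as it scans the text, and the decoder grows an identical table from the codes alone, so no table is ever transmitted:

```python
def lzw_compress(data: str) -> list[int]:
    """LZW: the encoding table is grown adaptively as the text is scanned."""
    table = {chr(i): i for i in range(256)}   # initial table: single bytes
    w, out = "", []
    for ch in data:
        wc = w + ch
        if wc in table:
            w = wc                            # extend the current match
        else:
            out.append(table[w])
            table[wc] = len(table)            # "learn" the new pattern
            w = ch
    if w:
        out.append(table[w])
    return out

def lzw_decompress(codes: list[int]) -> str:
    """The receiver rebuilds the same table from the encoded data alone."""
    table = {i: chr(i) for i in range(256)}
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        # The one code that may not yet be in the table is w + w[0].
        entry = table[code] if code in table else w + w[0]
        out.append(entry)
        table[len(table)] = w + entry[0]      # mirror the encoder's update
        w = entry
    return "".join(out)

msg = "TOBEORNOTTOBEORTOBEORNOT"
assert lzw_decompress(lzw_compress(msg)) == msg
```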
Examples of such adaptive data compression techniques include the well-known Lempel-Ziv algorithms, known respectively as LZ77 and LZ78, for constructing the encoding table (Ziv J., Lempel A.: A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, Vol IT-23, (1977) pp. 337-343; Ziv J., Lempel A.: Compression of individual sequences via variable rate coding, IEEE Transactions on Information Theory, Vol IT-24, (1978) pp. 530-536). Waterworth (Waterworth J. R.: Data compression system, U.S. Pat. No. 4,701,745, Oct. 20, 1987) and Whiting et al. (Whiting D. L., George G. A., Ivey G. E.: Data compression apparatus and method, U.S. Pat. No. 5,016,009, May 14, 1991; Whiting D. L., George G. A., Ivey G. E.: Data compression apparatus and method, U.S. Pat. No. 5,126,739, Jun. 30, 1992) provide efficient implementations of the Lempel-Ziv LZ77 technique for identifying data patterns in the text. A similar fast implementation is given by Williams (Williams R. N.: An extremely fast Ziv-Lempel data compression algorithm, Proceedings Data Compression Conference DCC'91, Snowbird, Utah, Apr. 8-11, 1991, IEEE Computer Society Press, Los Alamitos, Calif., pp. 362-371). In addition, Huffman (Huffman D.: A method for the construction of minimum redundancy codes, Proceedings IRE, Vol 40, (1952) pp. 1098-1101) provides an optimal encoding scheme. Finally, Brent (Brent R. P.: A linear algorithm for data compression, The Australian Computer Journal, Vol 19, (1987) pp. 64-68) provides a static technique that takes advantage of both LZ77 and the Huffman encoding scheme.
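A naive sketch of the LZ77 idea, sliding-window matching, follows. The cited implementations use hashing and other optimizations for speed; this brute-force version is only meant to show the (offset, length, next-byte) structure of the output:

```python
def lz77_compress(data: bytes, window: int = 4096) -> list[tuple[int, int, int]]:
    """Naive O(n * window) LZ77: emit (offset, length, next-byte) triples."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the sliding window for the longest match with the lookahead.
        for j in range(max(0, i - window), i):
            k = 0
            # Reserve the final byte of the input as a literal "next byte".
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        nxt = data[i + best_len]
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

def lz77_decompress(triples: list[tuple[int, int, int]]) -> bytes:
    buf = bytearray()
    for off, length, nxt in triples:
        for _ in range(length):
            buf.append(buf[-off])   # byte-wise copy also handles overlaps
        buf.append(nxt)
    return bytes(buf)

data = b"abcabcabcabcx"
assert lz77_decompress(lz77_compress(data)) == data
```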
Although these well-known data compression techniques have been successfully employed, they have a number of disadvantages for communication systems. For example, the implementations of Whiting do not use statistical information from previous data packets to more efficiently compress current packets. Furthermore, the static technique of Brent requires the encoding table to be transmitted with the encoded data, thereby consuming valuable bandwidth. Some other methods of compression do not take advantage of the basic structure of data transmissions in communication systems, in which data are transmitted in packets rather than as a continuous stream. Thus, many of the currently available data compression techniques have significant disadvantages, particularly with regard to communication systems. Consequently, a need exists for an improved data compression scheme for data transmission applications.