1. Field of the Invention
The invention relates to a technique of compressing data for transmission over a communications network, and more particularly, to a compression method based on the statistical probability of a word or character occurring.
2. Description of the Related Art
Various compression algorithms have been utilized to reduce the amount of data that must be transmitted between two devices connected by a transmission line or a communications network. Compression utilities such as PKZIP, PKUNZIP and PKARC, which use the .ZIP and .ARC file formats, are widely available on on-line services, such as CompuServe, and other electronic bulletin boards. Another widely used type of algorithm is the Lempel-Ziv data compression algorithm. The Lempel-Ziv algorithm is well known in the art and a good overview is provided in Ross N. Williams, An Extremely Fast Ziv-Lempel Data Compression Algorithm, DCC '91 Data Compression Conference pp. 362-371 (1991), which is hereby incorporated by reference. The Lempel-Ziv algorithm dynamically converts strings of input symbols or characters into fixed length codes. This algorithm builds a table of strings with their corresponding codes during the compression process. The table of strings is referred to as an uncompressed destination block. Thus, each time a new string is encountered, that string is placed into the uncompressed destination block and provided with an equivalent code. As a result, the first occurrence of any string is not compressed. The uncompressed string is referred to as a literal.
The Lempel-Ziv algorithm generates control words to indicate whether a string is compressed or uncompressed. The control word consists of an offset count, a literal count, and a copy count. The literal count specifies the number of bytes following the control word that are literals. The number of bytes specified by the literal count are copied to the uncompressed destination block directly.
Thus, the offset count and the copy count are the fields through which actual compression is accomplished. If it is determined that the string is already in the uncompressed destination block, then the offset count and copy count are assigned non-zero values. The offset count specifies how far back into the previously written destination block, known as "the history," to look to find the string. The copy count indicates the number of bytes to copy from that offset count to retrieve the string. Thus, for example, a copy count of 3 and an offset count of 4 would cause the algorithm to look back 4 bytes into the history, and from there, copy 3 bytes to the current location when the compressed data is being decoded. Since the offset count is represented by a limited number of bits, it can access only a portion of the entire history. That portion of the history is known as the "Lempel."
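By way of illustration only, the decoding step described above may be sketched in Python as follows. The representation of each control word as a tuple of (literal bytes, offset count, copy count) and the function name decode are assumptions made for this sketch; the actual bit layout of the control word is not specified here.

```python
def decode(records):
    """Rebuild the uncompressed destination block from control records.

    Each record is a hypothetical (literals, offset_count, copy_count) tuple.
    """
    history = bytearray()
    for literals, offset_count, copy_count in records:
        # Literals are copied to the destination block directly.
        history += literals
        if copy_count == 0:
            continue
        start = len(history) - offset_count
        if offset_count <= 0 or start < 0:
            raise ValueError("offset count points outside the history")
        # Copy byte by byte so that a copy may overlap the bytes it writes
        # (an assumption of this sketch, not stated in the text above).
        for i in range(copy_count):
            history.append(history[start + i])
    return bytes(history)
```

With the example values given above, decode([(b"abcd", 4, 3)]) looks back 4 bytes into the history and copies 3 bytes, returning b"abcdabc".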
To reduce the amount of time required to determine if a matching word is in the history, Lempel-Ziv algorithms typically use a hash function and hash table to generate compressed data. See Donald E. Knuth, The Art of Computer Programming, pp. 506 et seq. (1973), which is hereby incorporated by reference. In one implementation of the Lempel-Ziv algorithm, for example, the hash function maps any 3-byte key to a single pointer located in the hash table that can point anywhere in memory, but which is likely to point to a matching key somewhere in the Lempel. The 3-byte key consists of the next 3 bytes of input data that are to be compressed. Thus, the 3-byte key accesses a single location in the hash table. The pointer address contained at that location is retrieved, and the data pointed to is then compared with the data to be compressed. If the data pointed to in the Lempel actually matches the data to be compressed, then a copy count and an offset count are generated to represent the compressed data.
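By way of illustration only, the hash-table lookup described above may be sketched in Python as follows. The table size, the particular hash function hash3, and the rule that a match must cover at least the 3-byte key are assumptions made for this sketch and are not taken from any particular implementation.

```python
TABLE_SIZE = 4096  # assumed number of hash table slots


def hash3(key):
    """Map a 3-byte key to one slot of the hash table (illustrative hash)."""
    return ((key[0] << 8) ^ (key[1] << 4) ^ key[2]) % TABLE_SIZE


def find_match(data, pos, table):
    """Return (offset_count, copy_count) for a match at pos, or None."""
    if pos + 3 > len(data):
        return None  # fewer than 3 bytes remain, so no key can be formed
    slot = hash3(data[pos:pos + 3])
    candidate = table[slot]  # pointer retrieved from the single slot
    table[slot] = pos        # remember the current position for later keys
    if candidate is None:
        return None
    # The pointer is only likely to be a match, so compare the actual data.
    length = 0
    while pos + length < len(data) and data[candidate + length] == data[pos + length]:
        length += 1
    if length < 3:
        return None  # hash collision or stale pointer
    return pos - candidate, length
```

For the input b"abcdabcx" and a table initialized as [None] * TABLE_SIZE, calling find_match at positions 0 through 4 in order yields None for the first four positions and (4, 3) at position 4, corresponding to an offset count of 4 and a copy count of 3.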
With the Lempel-Ziv algorithm, since the first occurrence of a character or word is not compressed, the compression ratio during the initial portion of a document or text file is poor. The compression ratio also depends upon the redundancy characteristics of the document or text file. Thus, if the document or text file is not homogeneous and its redundancy characteristics shift, then compression efficiency declines if the length of the document or text file significantly exceeds the length of the Lempel.
Another popular compression algorithm is the V.42bis procedure, described in detail in Data Communication Over the Telephone Network: Data Compression Procedures for Data Circuit Terminating Equipment (DCE) Using Error Correction Procedures, The International Telegraph and Telephone Consultative Committee (CCITT) Recommendation V.42bis, pp. 1-27 (1990), which is hereby incorporated by reference. Data is compressed according to the V.42bis algorithm by encoding a string of characters as a fixed length code word. The algorithm uses dictionaries in which the strings are stored. The dictionaries are dynamically updated during normal operation. A feature of the V.42bis algorithm is that infrequently used strings are deleted from the dictionary to conserve storage capacity.
The dictionary is represented as a set of trees as shown in FIG. 1. Each tree contains root nodes and leaf nodes, wherein a root node represents the first character in a string and a leaf node represents the last character in a string. A node that has no dependent nodes is a leaf node and a node that has no parent represents a root node. In FIG. 1, the root nodes are nodes A, B, C, and D, and the leaf nodes are nodes G, R, T, N, and G. Node B is the parent of nodes A and I, node I is the parent of N, and so forth. Each tree represents a set of known strings beginning with one specific character and each node or point in the tree represents one of the set of strings. Thus the trees shown in FIG. 1 represent the strings A, B, BA, BAG, BAR, BAT, BI, BIN, C, D, DE, DO and DOG. Each node is identified with a fixed length code word.
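By way of illustration only, the tree representation described above may be sketched in Python as follows. The class name Node and the helper functions build_trees and strings_of are hypothetical names introduced for this sketch; code word assignment is omitted.

```python
class Node:
    """One node of a dictionary tree; it stands for the string ending here."""

    def __init__(self, char, parent=None):
        self.char = char
        self.parent = parent  # None for a root node
        self.children = {}    # empty for a leaf node


def build_trees(strings):
    """Build one tree per distinct first character; return the root nodes."""
    roots = {}
    for s in strings:
        level, parent = roots, None
        for ch in s:
            node = level.get(ch)
            if node is None:
                node = level[ch] = Node(ch, parent)
            level, parent = node.children, node
    return roots


def strings_of(roots):
    """List every string represented by a node, visiting parents first."""
    out = []

    def walk(node, prefix):
        s = prefix + node.char
        out.append(s)
        for ch in sorted(node.children):
            walk(node.children[ch], s)

    for ch in sorted(roots):
        walk(roots[ch], "")
    return out
```

Building the trees from the strings BAG, BAR, BAT, BIN, DE, DOG, A and C and then calling strings_of recovers the thirteen strings listed above, because every prefix of a stored string is itself a node.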
In the V.42bis algorithm, the dictionary generated by the device transmitting the string of characters is transmitted along with the compressed data to the receiving device so that the receiving device is able to decompress the received data. Thus, the V.42bis algorithm shares the disadvantage of the Lempel-Ziv algorithm in requiring that the first occurrence of any string be transmitted.
The V.42bis algorithm dynamically updates the dictionary as it receives input strings. The dictionary is initialized to an initial condition, where each tree in the dictionary consists only of a root node. The code word associated with each root node is initially assigned a length of 9 bits. The maximum length of the code word is programmable by the user. The procedure for matching a sequence of characters with a dictionary entry is as follows. Initially, the first character of the sequence is used to form a string which is placed into the last entry of the dictionary. Next, the string is compared with entries in the dictionary to determine if a match occurs. If so, it is determined if the match is with the entry that was just created. If not, it is confirmed that a match has occurred, and the next character from the input sequence of characters is appended to the string. The two-character string is then compared again with entries in the dictionary to determine if a match occurs. This process is continued until the string does not match a dictionary entry or it matches the last entry that was just created in the dictionary. If either of these two conditions occurs, then a matching entry has not been found in the dictionary. In that case, the last character that was appended to the string is removed, and the resulting string of characters represents the longest matching string that exists in the dictionary. The character that has just been removed from the matched string is then considered to be the unmatched character for the purpose of updating the dictionary and for restarting the string matching procedure.
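By way of illustration only, the string matching procedure described above may be sketched in Python as follows. The dictionary is modeled as a plain set of strings rather than a set of trees, and the function name longest_match and its parameters are assumptions made for this sketch. The sketch also assumes that every single character is already present as a root node, so that a one-character string always matches.

```python
def longest_match(data, pos, dictionary, last_created=None):
    """Return (matched_string, unmatched_character) starting at data[pos].

    A match with the entry that was just created (last_created) does not
    count, as described in the procedure above.
    """
    end = pos + 1  # the first character is assumed to match a root node
    while end < len(data):
        candidate = data[pos:end + 1]  # append the next input character
        if candidate not in dictionary or candidate == last_created:
            break  # the appended character is the unmatched character
        end += 1
    matched = data[pos:end]
    unmatched = data[end] if end < len(data) else None  # None at end of input
    return matched, unmatched
```

Using the strings of FIG. 1 as the dictionary, longest_match("BATH", 0, dictionary) returns ("BAT", "H"). If "BAT" were the entry just created, the same call with last_created="BAT" would return ("BA", "T").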
A new string is formed in the dictionary by appending a single character to an existing string, thereby adding a new node onto a tree. The single character is the unmatched character resulting from the string matching operation. However, if it is determined that appending the single character to an existing string would exceed a predetermined maximum string length, then the new string is not added.
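By way of illustration only, the dictionary update described above may be sketched in Python as follows. The dictionary is modeled as a mapping from strings to code words, and the function name add_string, the parameter names, and the policy of handing out code words sequentially are assumptions made for this sketch.

```python
def add_string(dictionary, matched, unmatched, max_string_len, next_code):
    """Append the unmatched character to an existing string.

    Returns the next free code word, which is unchanged when nothing is added.
    """
    if matched not in dictionary:
        raise ValueError("a new string must extend an existing string")
    new_string = matched + unmatched
    # The new string is not added if it would exceed the maximum length.
    if len(new_string) > max_string_len or new_string in dictionary:
        return next_code
    dictionary[new_string] = next_code  # new node attached below `matched`
    return next_code + 1
```

For example, with a maximum string length of 3, appending "T" to "BA" creates the entry "BAT", whereas appending "H" to "BAT" is refused and leaves the dictionary unchanged.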
When all the entries in the dictionary have been filled, that is, the storage space for the dictionary has been used up, the V.42bis algorithm provides for a procedure for removing unused entries from the dictionary. The procedure accesses each entry in the dictionary to determine if a particular node is not being used. Upon identification of such an unused node, it then determines if the node is a leaf or a root node. If it is a leaf node, then the node is detached from its parent. Otherwise, the procedure continues on to the next entry.
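By way of illustration only, one pass of the removal procedure described above may be sketched in Python as follows. The text does not state how a node is determined to be unused, so the sketch assumes a caller-supplied set of nodes that are in use. The parent mapping and the function name prune_unused_leaves are likewise hypothetical.

```python
def prune_unused_leaves(parent, used):
    """Detach unused leaf nodes in a single pass over the dictionary.

    parent maps each node to its parent node, or to None for a root node.
    used is the assumed set of nodes currently considered in use.
    Returns the list of removed nodes.
    """
    # Nodes that have at least one dependent node are not leaves.
    has_child = {p for p in parent.values() if p is not None}
    removed = []
    for node in list(parent):
        if node in used:
            continue  # node is in use, go on to the next entry
        if parent[node] is None or node in has_child:
            continue  # root or interior node, go on to the next entry
        del parent[node]  # detach the leaf from its parent
        removed.append(node)
    return removed
```

Because the set of non-leaf nodes is computed once before the pass, a parent that becomes a leaf during the pass is not removed until a later pass.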
As noted above, the size of the code word used to represent each node in the tree can be dynamically increased as necessary. As new characters are added to the dictionary, that is, new nodes are added to the tree, the size of the code word representing the node or character is compared to the current code word size. If it is determined that the size of the code word for the new character is longer than the current code word size, the current code word size is increased accordingly to allow for the representation of the new character. Consequently, one disadvantage of the V.42bis algorithm is that, if a character requiring a large code word size is encountered at the beginning of the document, then the compression algorithm will be relatively inefficient.
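By way of illustration only, the code word size adjustment described above may be sketched in Python as follows. The function name update_code_word_size, the use of the number of bits in the code value as the required size, and the clamping to a user-programmed maximum are assumptions made for this sketch.

```python
def update_code_word_size(current_bits, new_code, max_bits):
    """Return the code word size after a node with code new_code is added."""
    needed = max(1, new_code.bit_length())  # bits required for the new code
    if needed > current_bits:
        # Grow the current size, but never past the programmed maximum.
        current_bits = min(needed, max_bits)
    return current_bits
```

Starting from the initial size of 9 bits, a code value of 511 still fits, while a code value of 512 raises the current code word size to 10 bits for every code word transmitted thereafter.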
Therefore, it is desired that a more efficient compression algorithm be developed to reduce the amount of data that need be transferred between devices connected via a communications network.