1. Field of the Invention
The present invention relates to a method and apparatus for compressing and decompressing textual data stored in digital form in a lossless manner. In other words, the original data is reconstructed in its original form after having first undergone the compression and then the decompression processes. The data is assumed to be drawn from a particular alphabet which is specified in advance, such as the ASCII code, which consists of a 7 or 8 bit representation of a particular set of characters.
2. Description of the Prior Art
Many different types of text compression techniques are described in the prior art. The text compression techniques described herein are based on two techniques developed by Lempel and Ziv, which are similar but have important differences. These two methods were outlined in papers entitled "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Vol. IT-23, No. 3, pp. 337-343, and "Compression of Individual Sequences via Variable-Rate Coding," IEEE Transactions on Information Theory, Vol. IT-24, No. 5, pp. 530-536, and are commonly referred to as LZ77 and LZ78, respectively.
LZ77 is a text compression technique in which pointers to previously compressed material within a fixed size window are used to compress new material. The fixed size "compression window" is moved across the text data as it is being compressed to exploit the principle of locality, i.e., that data is likely to be most similar to proximal data. An example of LZ77 will now be described with respect to FIGS. 1(A) and 1(B).
In FIGS. 1(A) and 1(B), a small window size of 8 characters is assumed for illustrative purposes. As shown in FIG. 1(A), the text that has not yet been compressed is compared to the contents of the (up to) 8 character window containing the (up to) 8 characters of the compressed text immediately preceding the text which has not yet been compressed. The longest match starting at the beginning of the text which has not yet been compressed with a sequence in the 8 character window is identified. In FIG. 1(A), the longest match from the 8 character window is "BB." A pointer to this sequence ("BB"), its length (2), and an extension (the next character after the match in the text to be compressed) are then either transmitted or compressed and stored locally, depending on the application of the data compression algorithm. However, if no matching sequence is found in the 8 character window, then a literal character is transmitted. Once the block of data pointed to by the pointer has been compressed and the information regarding its compression has been transmitted, the window is moved by the number of characters referred to by the pointer (BB) plus the extension, if any. In addition, the pointer to the region of text being compressed is updated by this number of characters. As illustrated in FIG. 1(B), this process repeats for the shifted data until the data being compressed is exhausted.
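The LZ77 procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation shown in the figures: the function names, the (offset, length, extension) token layout, and the use of (0, 0, character) to stand for a literal character are all choices made for this example.

```python
from typing import List, Tuple

# (distance back from the current position, match length, extension character)
Token = Tuple[int, int, str]


def lz77_compress(text: str, window_size: int = 8) -> List[Token]:
    """Greedy LZ77 sketch: longest match that lies wholly inside the window."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(text):
        window_start = max(0, pos - window_size)
        best_offset, best_length = 0, 0
        for start in range(window_start, pos):
            length = 0
            # Stop at the window edge and always leave one character
            # of input to serve as the extension.
            while (start + length < pos
                   and pos + length < len(text) - 1
                   and text[start + length] == text[pos + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = pos - start, length
        extension = text[pos + best_length]
        # No match found: (0, 0, extension) plays the role of a literal.
        tokens.append((best_offset, best_length, extension))
        # Advance by the matched characters plus the extension.
        pos += best_length + 1
    return tokens


def lz77_decompress(tokens: List[Token]) -> str:
    """Rebuild the text by copying from the already decoded output."""
    out: List[str] = []
    for offset, length, extension in tokens:
        start = len(out) - offset
        out.extend(out[start:start + length])
        out.append(extension)
    return "".join(out)
```

A larger window_size would let the search reach further back at the cost of more comparisons per position; the value 8 simply mirrors the illustrative window used in the description.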
LZ78 differs from LZ77 in that the text compression is achieved by parsing the data being compressed into phrases which are entered into a compression dictionary. Pointers to these phrases or dictionary entries are then used to compress new data. Initially, the dictionary contains only an empty string (phrase of length zero). The phrase to be compressed, at each step, is the longest phrase at the start of the new data such that the prefix of this phrase is an entry in the dictionary, where the prefix is defined to be the phrase with its final character removed. The remaining character is called the extension. Thus, when the first phrase is seen, it is encoded as a reference to a dictionary entry consisting of the empty string (which is the only entry initially found in the dictionary), followed by the last and only character of the phrase. This character is then placed in the dictionary (assuming the dictionary is not full), and the process of identifying a phrase and transmitting its prefix (as a reference to the dictionary entry matching the prefix) and extension is repeated. The not yet compressed data will then be compared to both dictionary entries: the empty string and the phrase consisting of the already encountered character. If the next character in the new data does not match the already compressed character, then it, too, will be compressed as the empty string plus the character being compressed. In this way, each phrase of the data is compressed as a prefix, which is found in the dictionary and is chosen to be as long as possible, and an extension, which is the character which follows the prefix in the input data. An example of LZ78 will now be described with respect to FIG. 2.
FIG. 2 shows a sample of LZ78 compression for a short string of characters. As shown, the dictionary initially starts with no entries aside from the empty string, which is referred to as ε, and a pointer indicating the start of the character sequence to be compressed is placed at the beginning of the sequence to be compressed. The longest initial phrase whose prefix is in the dictionary is one character long, since the prefix of this phrase, namely ε, is the only entry in the dictionary. The first character is therefore encoded as a reference to ε and the first character of the sequence being transmitted. Then, the dictionary is updated to contain the entry consisting of the concatenation of the used dictionary entry and the character following it in the compression sequence. The current pointer is then moved by the number of characters compressed, and the process repeats itself, repeatedly identifying the next phrase and transmitting a compressed version of it, until the stream of data to be compressed is empty. As will be appreciated by those skilled in the art, the LZ78 technique provides substantially more compression once the dictionary is formed. A more detailed description of a particular implementation of the LZ78 text compression technique is given in U.S. Pat. No. 4,464,650--Eastman et al., while a good general description of Lempel-Ziv coding techniques may be found in the text entitled "Text Compression," Bell et al., Englewood Cliffs, N.J., Prentice Hall, 1990.
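The LZ78 parsing described above can be sketched as follows. This is an illustrative sketch only: the function names and the (dictionary index, extension) token layout are assumptions, index 0 stands for the empty string, and the dictionary is allowed to grow without limit rather than being capped at a fixed size.

```python
from typing import List, Tuple

Token = Tuple[int, str]  # (index of the prefix in the dictionary, extension)


def lz78_compress(text: str) -> List[Token]:
    """Parse text into phrases, each a known prefix plus one extension."""
    dictionary = {"": 0}  # initially only the empty string
    tokens: List[Token] = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the longest known prefix
        else:
            tokens.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: send its prefix and last character.
        tokens.append((dictionary[phrase[:-1]], phrase[-1]))
    return tokens


def lz78_decompress(tokens: List[Token]) -> str:
    """Rebuild the same dictionary while decoding."""
    phrases = [""]
    out: List[str] = []
    for index, ch in tokens:
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)
```

Because the decoder adds an entry for every token it receives, it reconstructs the encoder's dictionary without that dictionary ever being transmitted.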
Numerous data compression systems have been described in the prior art which utilize the concept of a compression dictionary as described by Lempel and Ziv.
For example, Giltner et al. describe in U.S. Pat. No. 4,386,416 a system for use in transmitting data over a Telex or similar network. The system described by Giltner et al. uses two dictionaries. The first is pre-filled with frequent words from the data's language, while the second dictionary is initially empty and is filled with words which are encountered in the data but which are not present in the first dictionary. When transmitting data, if a word is found in the first dictionary, an escape code and the number of the word's entry in the first dictionary are transmitted. If a word is not found in the first dictionary, it is compressed using Huffman coding and it is added to the second dictionary for later use. As a result, if the word is encountered again, it can be transmitted by sending an escape code indicating that the entry number which follows refers to the second dictionary, followed by the number of the word's entry in the second dictionary. Giltner et al. define a "word" to be either a predetermined number of characters or a sequence of characters surrounded by white space or a combination of white space and punctuation. A small, but fixed number of words common to all the types of messages handled by the Telex or similar network is provided in the first dictionary, and additional "words" are stored in the second dictionary. However, Giltner et al. do not address frequently occurring sequences of text which fall outside the limited definition of a valid word. As a result, Giltner et al. do not take advantage of the fact that the similarity of texts is greater when the comparison between them is made at the level of character sequences. Also, Giltner et al. do not address how words which occur frequently in the text can be chosen for the first dictionary whereby it is filled with valid words which are frequent within the type of text being transmitted. Giltner et al. also fail to teach how to identify the most appropriate library of text or the genre of the document to be compressed. Furthermore, Giltner et al.'s library is fixed; users cannot create their own pre-filled dictionaries as needed.
Similarly, Weng describes in U.S. Pat. No. 4,881,075 an "adaptive" data compression technique which uses two dictionaries. The first dictionary is used to perform compression or decompression while the second is being rebuilt to better reflect the local characteristics of the most recent input data. The second dictionary is then used to compress and decompress the input data while the first dictionary is being rebuilt using the most current input data. Weng repeatedly switches between dictionaries until compression is completed.
Kato et al. describe in U.S. Pat. No. 4,847,619 a modification to adaptive compression techniques in which the compression system's degree of compression is monitored and the dictionary is reset when the degree of compression drops below a threshold. The reset is not permitted to occur before the dictionary is sufficiently full in order to prevent the dictionary from resetting prematurely. This technique could be used in conjunction with an LZ compression technique, or any other adaptive technique.
In U.S. Pat. No. 5,153,591, Clark describes a modification to the Lempel-Ziv compression algorithm in which the dictionary is stored as a tree data structure. This allows large dictionaries to be stored in less space than in the original embodiment described in U.S. Pat. No. 4,464,650. In addition, it allows these dictionaries to be searched more easily and more quickly.
In U.S. Pat. No. 5,243,341, Seroussi et al. outline a Lempel-Ziv variant in which two dictionaries are used. The first dictionary is used until it is filled, then it is replaced with a standby dictionary, which is filled with those entries from the first dictionary which yield the most compression before compression continues.
Many other modifications to the original Lempel-Ziv compression techniques appear in the prior art.
For example, Welch describes in U.S. Pat. No. 4,558,302 an implementation of Lempel-Ziv in which the encoding and decoding processes require less complicated computation and, therefore, are faster than in the implementation described in U.S. Pat. No. 4,464,650--Eastman et al.
Miller et al. suggest in U.S. Pat. No. 4,814,746 several modifications to the Lempel-Ziv algorithm. The first of these modifications is to include all possible characters in the dictionary before compression actually begins. As a result, it is not necessary to transmit a flag which indicates that the following datum is a character rather than a pointer. In addition, Miller et al. also associate a time stamp with each dictionary entry in order to facilitate the removal of the least recently used entry when the dictionary becomes full. These modifications are aimed at reducing the memory requirements by limiting the dictionary to a fixed size and improving compression by allowing the dictionary to more accurately reflect the current characteristics of the data being compressed.
Storer describes in U.S. Pat. No. 4,876,541 a compression technique which does not suffer from some of the same difficulties as prior Lempel-Ziv techniques. In particular, unencoded characters never need to be transmitted, since the compression dictionary initially contains all of the characters in the alphabet, as in U.S. Pat. No. 4,814,746--Miller et al. In addition, a least recently used queue is maintained so that the dictionary can be purged of less useful entries. The encoding and decoding dictionary in Storer's system can vary in size, and there may be several active at a time. The compression ratio of each of the dictionaries is monitored, and the one which yields the best compression is used.
In U.S. Pat. No. 4,906,991, Fiala et al. describe a substitution style data compression technique which is somewhat similar to Lempel-Ziv compression. Their technique relies on searching a fixed window of characters (e.g., 4096 characters) which have already been compressed in order to determine whether the text being compressed can be encoded as a pointer to a location within the window. If the text being compressed can be encoded in this manner, a pointer to the starting location, along with the length of the overlap between the text being compressed and the location within the window, is generated. If the text being compressed cannot be encoded in this manner, it is encoded as a length followed by a literal string of that length. Like LZ78 compression, this technique achieves little compression at the beginning of documents, since the window is initially devoid of strings which can be pointed to in order to bring about compression.
O'Brien suggests in U.S. Pat. No. 4,988,998 modifications to the Lempel-Ziv algorithm which allow for enhanced compression of data which contains long strings of repeated characters. Since the Lempel-Ziv algorithm adds entries to the compression dictionary by appending a single character to an existing dictionary entry, it will take many occurrences of a repeated string of characters before such strings are found in the dictionary. Accordingly, O'Brien preprocesses the data using a run-length encoding technique in which the run-lengths are inserted into the text. The resulting combination of text and run-lengths for repeated characters is then compressed using the Lempel-Ziv technique.
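The run-length preprocessing idea can be illustrated with a short sketch. The exact way in which run-lengths are inserted into the text is not specified here, so the representation below, a list of (character, run length) pairs, is purely an assumption for illustration, as are the function names.

```python
from itertools import groupby
from typing import List, Tuple


def run_lengths(text: str) -> List[Tuple[str, int]]:
    """Collapse each run of a repeated character into (character, count)."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


def expand_runs(runs: List[Tuple[str, int]]) -> str:
    """Inverse of run_lengths: repeat each character count times."""
    return "".join(ch * count for ch, count in runs)
```

A long run then reaches the dictionary-based stage as a single item rather than as many repeated characters, which is the effect the preprocessing is intended to achieve.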
In U.S. Pat. No. 5,049,881, Gibson et al. describe a data compression system which creates its own pointers from the sequence of characters previously processed and emphasizes maximizing the product of the data rate and the compression ratio. Thus, previously input data is used as the dictionary and is combined with a hashing algorithm to find candidates for string matches without the requirement of a string matching table.
In U.S. Pat. No. 5,058,137, Shah describes a Lempel-Ziv decoder which has memories for storing code words and data separately. Upon receipt of a code word, the decoder stores the previously received code word, applies the newly received code word to the code word memory to obtain the location of the last data element which is part of the data represented by the newly received code word, and another code word associated with the prefix. Upon completion of decoding the latest code word, the first data element of the decoded word is appended to the next previously received code word, and the combination is stored as the equivalent of a code word which is next after the highest code word already received. At least one memory is shared for use during encoding and decoding.
In U.S. Pat. No. 5,087,913, Eastman describes a Lempel-Ziv algorithm which uses a searchtree database to allow later portions of the data to be decompressed without having to decompress all preceding portions. The searchtree database is grown to a fixed, predetermined size and is allowed to grow no further. The fact that the compression searchtree database is established in advance of decompression allows decompression of portions of the data without decompressing the entire preceding portion of the data.
In U.S. Pat. No. 5,140,321, Jung details a Lempel-Ziv modification which allows for enhanced compression speed at the cost of a reduction in compression. Rather than attempting to find the optimal matching substring in the entire compressed portion of the data, the principle of locality is exploited and only the most recent compressed data in a first-in-first-out buffer is examined to find a matching sequence. A hash table is used to store the strings which have been compressed recently and to allow for fast retrieval of matching strings.
In U.S. Pat. No. 5,179,378, Ranganathan et al. describe a Lempel-Ziv implementation which uses a systolic array of processors to improve performance by forming fixed-length codewords from a variable number of data symbols.
In U.S. Pat. No. 5,262,776, Kutka describes an implementation of the Lempel-Ziv algorithm which takes advantage of a tree data structure to avoid the search step normally required by the compression process. The sequence of elements in a primary sequence is converted into elements in a reduced set of elements using escape sequences. This technique is particularly suited to compressing data representing the coefficients of a discrete cosine transform of an image.
In addition to the above modifications to the Lempel-Ziv compression technique which are described in the patent literature, others appear in technical journals.
For example, in "Linear Algorithm for Data Compression via String Matching," Rodeh et al. describe a modification to the LZ77 technique in which the size of the window is not fixed. As a result, pointers to previous strings in the compressed portion of the data grow in length and are encoded in a variable-length code.
In "Better OPM/L Text Compression," Bell describes a Lempel-Ziv variant referred to as LZSS in which not all compression is done using a combination of a prefix and an extension. Instead, if the cost of transmitting a pointer is higher than that of merely transmitting a character or sequence of characters, then the character or sequence of characters is transmitted. A binary search tree is used to find the longest string match, without a limit on the length of the match.
In "A Technique for High-Performance Data Compression," Welch describes a modification to LZ78 in which only pointers to previously compressed data are used, rather than a combination of pointers and characters. A string table is formed which maps strings of input characters into fixed-length codes. For every string in the table, its prefix string is also in the table. The string table contains strings that have been encountered previously in the message being compressed. It consists of a running sample of strings in the message so that available strings reflect the statistics of the message. This technique is commonly referred to as LZW. It uses a "greedy" parsing algorithm in which the input string is examined character-serially in one pass, and the longest recognized input string is parsed off each time. The strings added to the string table are determined by this parsing.
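The greedy, single-pass parsing described above can be sketched as follows. This is a minimal illustration with assumed names: the alphabet parameter is hypothetical, every input character is assumed to belong to it, and codes are plain integers rather than fixed-length bit fields.

```python
from typing import List


def lzw_compress(text: str, alphabet: str) -> List[int]:
    """Emit only string-table codes; the table grows by one entry per code."""
    table = {ch: i for i, ch in enumerate(alphabet)}
    codes: List[int] = []
    current = ""
    for ch in text:
        if current + ch in table:
            current += ch  # longest recognized string so far
        else:
            codes.append(table[current])
            table[current + ch] = len(table)
            current = ch
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: List[int], alphabet: str) -> str:
    """Rebuild the string table one step behind the encoder."""
    if not codes:
        return ""
    table = list(alphabet)
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        else:
            # The code refers to the entry the encoder created on its
            # previous step, which the decoder has not added yet.
            entry = previous + previous[0]
        out.append(entry)
        table.append(previous + entry[0])
        previous = entry
    return "".join(out)
```

Each new table entry is an existing entry plus one character, so the prefix of every string in the table is itself in the table, as noted above.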
In "Variations on a Theme by Ziv and Lempel," Miller and Wegman describe another variant of LZ78. In their version, the dictionary is filled in advance with all strings of length 1 (that is, all of the characters in the alphabet over which compression is taking place), which helps to reduce, but does not eliminate, the problem of starting with a dictionary devoid of useful entries. Also, rather than reset the compression dictionary when it becomes full, they propose to delete strings from the dictionary which were least recently used. However, the largest contribution of their version, which will be referred to as LZMW, is that extensions are never transmitted. Since the dictionary begins with all strings of length 1, it is possible to encode all of the data using the initial dictionary alone, but this would result in no compression. Instead, the dictionary is grown by adding entries to the dictionary which consist of the concatenation of the previous two matches.
In FIG. 3, a sample of LZMW compression is shown. The alphabet is assumed to contain only 3 characters (A,B,C) for purposes of illustration. As shown, the compression dictionary initially contains the characters of the alphabet. The pointer to the character at which compression will begin is placed at the first character in the sequence of text being compressed. The longest match within the dictionary is found and a pointer to this entry is transmitted. Alternatively, the pointer may be stored locally depending on the application of the compression algorithm. The pointer is then moved by the number of characters transmitted. Normally, the dictionary is updated to contain the concatenation of the two previously transmitted dictionary entries, but since there was no previous transmission, this step is skipped. On the second iteration through the processing cycle, the largest match is found, its dictionary entry number is transmitted, the pointer is moved by the appropriate number of characters, and the concatenation of the two previous dictionary entries transmitted is added to the dictionary as a new entry. This process then repeats until the data to be compressed has been exhausted.
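The LZMW cycle described above can be sketched as follows. This is an illustrative sketch with assumed names: the dictionary is unbounded (no deletion of least recently used entries), a concatenation that is already present is simply not added again, and every input character is assumed to belong to the alphabet.

```python
from typing import List, Optional


def lzmw_compress(text: str, alphabet: str = "ABC") -> List[int]:
    """Emit the index of the longest dictionary match at each step."""
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    longest = 1  # length of the longest dictionary entry so far
    codes: List[int] = []
    previous: Optional[str] = None
    pos = 0
    while pos < len(text):
        # The dictionary is not prefix-closed, so try lengths longest first.
        for length in range(min(longest, len(text) - pos), 0, -1):
            match = text[pos:pos + length]
            if match in dictionary:
                break
        codes.append(dictionary[match])
        if previous is not None:
            # New entry: concatenation of the two most recent matches.
            entry = previous + match
            if entry not in dictionary:
                dictionary[entry] = len(dictionary)
                longest = max(longest, len(entry))
        previous = match
        pos += len(match)
    return codes


def lzmw_decompress(codes: List[int], alphabet: str = "ABC") -> str:
    """Mirror the encoder's dictionary growth while decoding."""
    entries = list(alphabet)
    known = set(entries)
    out: List[str] = []
    previous: Optional[str] = None
    for code in codes:
        match = entries[code]
        out.append(match)
        if previous is not None:
            entry = previous + match
            if entry not in known:
                entries.append(entry)
                known.add(entry)
        previous = match
    return "".join(out)
```

On the first iteration there is no previous match, so no entry is added, which corresponds to the skipped update step in the description.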
Commonly used compression algorithms also use variations of the LZ77 and LZ78 algorithms. For example, the compression algorithm used in commercially available "zip" and freely available "gzip" is a variation of LZ77 which finds duplicated strings in the input data. In "gzip," the second occurrence of a string is replaced by a pointer to the previous string in the form of a pair (distance, length). When a string does not occur anywhere in the previous number of bytes within a designated distance, such as 32 Kbytes, it is transmitted as a sequence of literal bytes. Literals or match lengths are compressed with one Huffman tree, and match distances are compressed with another tree. The trees are stored in a compact form at the start of each block. The blocks can have any size, except that the compressed data for one block must fit in available memory. A block is terminated when "gzip" determines that it would be useful to start another block with fresh trees. Duplicated strings are found using a hash table. All input strings of length 3 are inserted in the hash table, and a hash index is computed for the next 3 bytes. If the hash chain for this index is not empty, all strings in the chain are compared with the current input string, and the longest match is selected. The hash chains are searched, starting with the most recent strings, to favor small distances and thus take advantage of the Huffman coding. The hash chains are singly linked. There are no deletions from the hash chains; the algorithm simply discards matches that are too old. To avoid a worst-case situation, very long hash chains are arbitrarily truncated at a certain length as determined by a runtime option. As a result, "gzip" does not always find the longest possible match but generally finds a match which is long enough.
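The hash-chain search described above can be sketched as follows. This is a simplified illustration and not the actual "gzip" source: a Python dictionary of position lists stands in for the hash table and its singly linked chains, the Huffman coding stage is omitted, and the function names, the max_chain default, and the token format are assumptions. The 258-byte cap on match length is the limit of the deflate format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union

MIN_MATCH = 3    # strings of length 3 are indexed
MAX_MATCH = 258  # longest match the deflate format can express

Token = Union[int, Tuple[int, int]]  # literal byte or (distance, length)


def find_longest_match(data: bytes, pos: int, chains: Dict[bytes, List[int]],
                       max_distance: int = 32768,
                       max_chain: int = 16) -> Tuple[int, int]:
    """Return (distance, length) of the best earlier match, or (0, 0)."""
    if pos + MIN_MATCH > len(data):
        return (0, 0)
    candidates = chains.get(data[pos:pos + MIN_MATCH], [])
    best_distance, best_length = 0, 0
    # Most recent positions first; the chain is truncated at max_chain.
    for candidate in reversed(candidates[-max_chain:]):
        if pos - candidate > max_distance:
            break  # this and all older candidates are too far back
        length = 0
        while (length < MAX_MATCH and pos + length < len(data)
               and data[candidate + length] == data[pos + length]):
            length += 1
        if length > best_length:
            best_distance, best_length = pos - candidate, length
    return (best_distance, best_length)


def tokenize(data: bytes) -> List[Token]:
    """Split data into literal bytes and (distance, length) pairs."""
    chains: Dict[bytes, List[int]] = defaultdict(list)
    tokens: List[Token] = []
    pos = 0
    while pos < len(data):
        distance, length = find_longest_match(data, pos, chains)
        step = length if length >= MIN_MATCH else 1
        tokens.append((distance, length) if length >= MIN_MATCH else data[pos])
        # Index every position just consumed; nothing is ever deleted.
        for i in range(pos, pos + step):
            if i + MIN_MATCH <= len(data):
                chains[data[i:i + MIN_MATCH]].append(i)
        pos += step
    return tokens


def detokenize(tokens: List[Token]) -> bytes:
    """Expand literals and (distance, length) pairs back into bytes."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])  # byte-wise copy allows overlap
        else:
            out.append(token)
    return bytes(out)
```

Because the strict comparison keeps the first of several equally long candidates and the chain is walked from the most recent position, ties are resolved in favor of the smaller distance.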
Unfortunately, despite the large number of Lempel-Ziv variants in the prior art, none adequately addresses the problem that beginning with a dictionary completely devoid of words virtually prevents small files from being compressed at all and prevents larger files from being compressed further. U.S. Pat. Nos. 4,814,746 and 4,876,541 and the work of Miller and Wegman begin to address this problem by beginning with a dictionary containing all of the characters in the character set in which the data is encoded. This solves the problem of needing to transmit escape codes to indicate that what follows is not a dictionary entry number, but a character.
However, the present inventors have found that further compression may be obtained by observing that many documents, especially those textual documents written in either natural human languages or computer programming languages, such as English or C, have a small number of words which are statistically extremely frequent and which should be part of the compression dictionary. Zipf has shown this to be true for language in a book entitled Human Behavior and the Principle of Least Effort. In fact, the frequency of words in English obeys what has come to be known as a Zipfian distribution. That is, the product of the rank of a word and its frequency is approximately constant. Thus, the second most common word will appear roughly half as many times as the most frequent word. This implies that the most common words will comprise a large fraction of the total occurrences of all words in a document. For instance, in the Wall Street Journal data collected as part of the treebank project described in an article by Marcus et al. entitled Building a Large Annotated Corpus of English: The Penn Treebank, Computational Linguistics, Vol. 19, No. 2, pp. 313-330 (1993), the first ten "words", which are ",", "the", ".", "of", "to", "a", "and", "in", "'s", "is" and "that" account for slightly more than 25 percent of all of the words in the corpus. The 100 most frequent words alone account for 48.1 percent of all words in the corpus, while the 500 most frequent words account for 63.6 percent. This means that the remaining 80901 words account for the remaining 37.4 percent of all words in the corpus. Thus, the average word found in the top five hundred words is approximately 275 times more frequent than the average word not found in the top five hundred words.
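The Zipfian relationship stated above, that rank times frequency is approximately constant, can be illustrated with an idealized model. The sketch below is not fitted to the Wall Street Journal data and is not expected to reproduce the corpus figures quoted above; the function names and the vocabulary size used in the usage note are assumptions.

```python
from typing import List


def zipf_frequencies(vocabulary_size: int,
                     top_frequency: float = 1.0) -> List[float]:
    """Ideal Zipfian frequencies: rank * frequency equals top_frequency."""
    return [top_frequency / rank for rank in range(1, vocabulary_size + 1)]


def top_share(frequencies: List[float], k: int) -> float:
    """Fraction of all word occurrences covered by the k most frequent words."""
    return sum(frequencies[:k]) / sum(frequencies)
```

For example, top_share(zipf_frequencies(50000), 500) gives the share of all occurrences that the 500 most frequent words would cover if a 50,000-word vocabulary followed the ideal distribution exactly, which shows how heavily the total is concentrated in the highest-ranked words.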
It is thus desired to modify the Lempel-Ziv text compression techniques described in the prior art to take advantage of this observation so as to allow further compression of large documents as well as significant compression of smaller documents.