The rapidly growing use of computer-based information systems interconnected with communication networks has dramatically increased the use of digital storage and digital transmission systems. Data compression is concerned with the compaction of data before storage or transmission. Such compaction is useful for conserving memory or communication resources. When the data source can be modeled by a statistical system, optimal coding schemes have been constructed to achieve desired compaction criteria. However, for real-world data, the source statistics are not always known to the data compressor. In fact, real-world data usually does not conform to any statistical model. Therefore it is important in most practical data compaction techniques to have an adaptive arrangement which can compress the data without knowing the statistics of the data source.
Much stored or transmitted data is redundant. The English language, for example, or a programming language, includes "words" which are often reused. One type of coding which takes advantage of this redundancy is the well-known Huffman code. In the Huffman scheme, variable length code words are used, with the length of the code word being inversely related to the frequency of occurrence of the encoded symbol. Unfortunately, the Huffman approach requires two passes over the data, one to establish the frequency of occurrence of the symbols or tokens and another to do the actual encoding. Moreover, the Huffman technique requires temporary storage of the entire data block while the first pass is taken, thereby incurring a corresponding time delay.
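By way of illustration only, the following sketch (in Python, and not drawn from any cited reference) shows the two-pass character of a minimal Huffman coder: a first pass counts symbol frequencies, and only then can the second pass perform the actual encoding.

    # Illustrative sketch only: a minimal two-pass Huffman coder.
    import heapq
    from collections import Counter

    def huffman_code(symbols):
        # Pass 1: establish the frequency of occurrence of each symbol.
        freq = Counter(symbols)
        # Build the Huffman tree bottom-up with a min-heap keyed on frequency;
        # the integer tiebreaker keeps the code tables from being compared.
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]            # symbol -> variable-length code word

    data = list("abracadabra")
    table = huffman_code(data)
    # Pass 2: the actual encoding, which must wait for the table from pass 1.
    encoded = "".join(table[s] for s in data)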
In June, 1984, Welch published a paper entitled "A Technique for High-Performance Data Compression" in the IEEE Computer Magazine. The paper treated an algorithm, which had become known as the Lempel-Ziv algorithm, in a practical way, and proposed an implementation for data compression based on hashing for fast on-line processing. U.S. Pat. No. 4,558,302, having Welch as the sole inventor, covers the details of the implementation first introduced in theoretical form in his paper. More recently, U.S. Pat. No. 4,906,991, issued to Fiala and Greene, disclosed a sophisticated modification to the Lempel-Ziv algorithm which achieves better compression on most text files, but at the cost of significantly increased complexity.
In April, 1986, Bentley, Sleator, Tarjan and Wei published a paper entitled "A Locally Adaptive Data Compression Scheme" in the Communications of the ACM. In the paper, the authors proposed the use of a self-adjusting data structure to achieve data compression of text data. One of their main schemes used a "move-to-front" rule; this concept will be expanded upon below.
More recently, the disclosure of U.S. Pat. No. 4,796,003, issued to Bentley, Sleator and Tarjan (Bentley et al), indicates that it is possible to compress data with a compaction factor comparable to Huffman coding, but with a one-pass procedure. More particularly, a system and an algorithm are used in which a word list is maintained with the position of each word on the word list being encoded in a variable length code, the shortest code representing the beginning of the list. When a word is to be transmitted in communication applications (or stored in memory applications), the list or codebook is scanned for the word. If the word is on the list, the variable length code representing the position of the word on the list is sent (or stored) instead of the word itself and the word is moved to the head of the word list. If the word is not on the word list, the word itself is transmitted (or stored), and then that word is moved to the head of the word list while all other words on the word list are "pushed down," maintaining their relative order.
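The following sketch (in Python; an illustration of the behavior described above, not the patented implementation itself) captures the move-to-front rule over a linear word list: a position is emitted for a word already on the list, the literal word otherwise, and in either case the word is moved to the head of the list.

    # Illustrative encoder sketch of the move-to-front rule.
    def mtf_encode(words):
        word_list = []              # the codebook; position 0 is the head
        output = []
        for w in words:
            if w in word_list:
                pos = word_list.index(w)
                output.append(pos)  # the position would be sent in a
                                    # variable-length code, with short codes
                                    # for positions near the head of the list
                word_list.pop(pos)
            else:
                output.append(w)    # word not on the list: send the word itself
            word_list.insert(0, w)  # move (or add) the word to the head; all
                                    # other words are "pushed down" in order
        return output

    # mtf_encode(["the", "cat", "sat", "the"]) yields ["the", "cat", "sat", 2]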
The receiver (or retriever in memory storage applications) decodes the data by repeating the same actions performed by the transmitter (or the storing mechanism). That is, a word list is constructed and the variable length codes are used to recover the proper words from the word list.
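A corresponding decoder sketch (again illustrative only) mirrors the encoder's actions, so that the receiver's word list stays synchronized with the transmitter's.

    # Illustrative decoder sketch: the receiver repeats the encoder's actions.
    def mtf_decode(stream):
        word_list = []
        recovered = []
        for item in stream:
            if isinstance(item, int):   # a position: recover the word from the list
                w = word_list.pop(item)
            else:                       # a literal word, seen for the first time
                w = item
            recovered.append(w)
            word_list.insert(0, w)      # same update rule as the transmitter
        return recovered

    # mtf_decode(["the", "cat", "sat", 2]) == ["the", "cat", "sat", "the"]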
In the scheme of Bentley et al, the most often used words will automatically congregate near the front of the word list and hence be transmitted or stored with the smallest number of bits. Moreover, arbitrary prefix codes can be used to transmit or store word positions on the list, low positions being encoded with the shortest codewords. Also, the list-organization heuristic can be varied, for example, by moving the selected word ahead a fixed number of places or by transposing it one position forward. Finally, the list positions themselves can be treated as new input data and the compaction scheme applied recursively to its own output, creating a new list and new variable length codes.
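For concreteness, two of the alternative list-organization heuristics mentioned above might be sketched as follows (illustrative only; the function names are hypothetical).

    # Illustrative variants of the list-organization heuristic.
    def transpose_update(word_list, pos):
        # Transpose the selected word one position forward.
        if pos > 0:
            word_list[pos - 1], word_list[pos] = word_list[pos], word_list[pos - 1]

    def move_ahead_k_update(word_list, pos, k):
        # Move the selected word ahead a fixed number (k) of places.
        new_pos = max(0, pos - k)
        word_list.insert(new_pos, word_list.pop(pos))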
As alluded to, the encoder of the move-to-front implementation of Bentley et al has two operations, namely, (1) Search: for each input word, search for it in the codebook; and (2) Update: reorganize the codebook for further use. The implementation of Bentley et al organizes the codebook as a linear list. Both the search and update operations are done in linear fashion, i.e., they use linear search and linear update algorithms. The time complexity of each operation is in proportion to the codebook size, which is typically in the thousands to tens of thousands. Thus, the complexity is high. In the earlier paper by Bentley, Sleator, Tarjan, and Wei, the codebook is organized as a doubly-linked double tree. The trees are adjusted after each input word to maintain depth balance. Thus either the search or the update operation can be accomplished in complexity proportional to the logarithm of the codebook size. But the complicated data structure results in extremely large memory requirements, and the coefficient of the logarithmic complexity can also be large. Thus, the complexity of this latter scheme may not even be less than that of the linear approach for codebook sizes of practical interest.
Even more recently, the disclosure of U.S. Pat. No. 5,239,238, issued to Wei, presents an approach wherein only a small, constant number of steps is required to process each source symbol. In Wei, the codebook is organized as a collection of varying-size doubly-linked lists, designated the multiple-doubly-linked (MDL) list. For a codebook size of 2^m - 1, there is a single list which is subdivided into sublists of size 2^0 = 1, 2^1 = 2, 2^2 = 4, . . . , 2^(m-1). For the Search operation, an associative memory is searched to determine if each incoming symbol is present or absent in the codebook. The associative memory is a memory arrangement which is accessed by symbol, rather than by address. In a hardware implementation, the associative memory is realized by a Content Addressable Memory (CAM), whereas in a software realization the associative memory is effected via a hashing function operation. If a symbol is present, recency rank information about the symbol is converted to a data stream for propagation on the communication medium. In addition, the recency rank of the symbol is changed to reflect its recent appearance. The recency rank is changed by merely altering entries in the MDL list. These alterations, for example, are effected using a class promotion technique wherein the given symbol, when present, is generally moved to the top-most position in the next highest class. The symbol previously occupying this top-most position is moved, for instance, to the bottom of the class previously occupied by the given symbol. In another example, the symbol is moved half-way to the top of the class list and the symbol occupying that half-way location is moved to the location vacated by the symbol. If a symbol is not present, then the symbol is stored in an empty location in the associative memory or, if the associative memory is full, an overwrite of an occupied location occurs. The time complexity of the Search is just one step, namely, just one read for a hardware CAM, or one hash for the software version of the associative memory. The Update operation involves merely a constant number of pointer operations on the MDL list.
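The following simplified sketch (in Python, illustrative only and not the implementation of the Wei patent) conveys the flavor of the MDL organization: classes of sizes 2^0, 2^1, . . . , 2^(m-1), a dictionary standing in for the associative memory, and a constant-work class-promotion step. For brevity the sketch swaps the promoted symbol with the occupant of the target slot, rather than moving the displaced symbol to the bottom of the vacated class as in one of the examples above, and it places new symbols in the largest class by hashing; both choices are simplifications of this sketch.

    # Simplified, illustrative sketch of an MDL-style codebook.
    class MDLCodebook:
        def __init__(self, m):
            # Classes of size 2^0, 2^1, ..., 2^(m-1); 2^m - 1 slots in total.
            self.classes = [[None] * (1 << i) for i in range(m)]
            self.lookup = {}                      # associative-memory stand-in:
                                                  # symbol -> (class index, slot)

        def rank(self, c, slot):
            # Recency rank = slots in all higher classes + position within the class.
            return sum(len(cls) for cls in self.classes[:c]) + slot

        def access(self, symbol):
            if symbol in self.lookup:             # Search: one associative lookup
                c, slot = self.lookup[symbol]
                r = self.rank(c, slot)            # rank to be variable-length coded
                if c > 0:
                    # Class promotion (simplified): swap with the top-most entry
                    # of the next highest class; a constant number of updates.
                    other = self.classes[c - 1][0]
                    self.classes[c - 1][0] = symbol
                    self.classes[c][slot] = other
                    self.lookup[symbol] = (c - 1, 0)
                    if other is not None:
                        self.lookup[other] = (c, slot)
                return r                          # present: emit the recency rank
            # Absent: store the symbol in the largest class (placement by hashing
            # is an assumption of this sketch), overwriting any previous occupant.
            c = len(self.classes) - 1
            slot = hash(symbol) % len(self.classes[c])
            old = self.classes[c][slot]
            if old is not None:
                del self.lookup[old]
            self.classes[c][slot] = symbol
            self.lookup[symbol] = (c, slot)
            return None                           # absent: the symbol itself is sent

    book = MDLCodebook(m=4)                       # a 15-entry codebook
    book.access("the")                            # None: absent, the word is sent literally
    book.access("the")                            # now present: a recency rank is returned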
Basically the prior art addresses data streams that exhibit "temporal locality of reference", that is, if a word appears at a certain point in the data stream, then there is a likelihood that the same word will appear again soon thereafter. When a word appears more than once in the data stream, the various prior art compression techniques represent the second and subsequent appearances of the word by compacted data blocks that denote the number of different words appearing between the previous occurrence of the word and its reappearance.
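As a concrete illustration of this measure (a sketch only, not taken from any cited reference), the count of distinct intervening words can be computed as follows.

    # Illustrative sketch: number of distinct words seen since the previous
    # occurrence of the word at position `index` in the stream.
    def recency_distance(stream, index):
        word = stream[index]
        seen = set()
        for prev in reversed(stream[:index]):
            if prev == word:
                return len(seen)
            seen.add(prev)
        return None                     # the word has not appeared before

    # recency_distance(["a", "b", "c", "b", "a"], 4) == 2   (distinct: "b", "c")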
The art is devoid of teachings or suggestions that exploit what might be called "spatial locality of reference", that is, if a first word appears near a second word in the data stream, then there is a likelihood that when the first word reappears in the data stream, it will appear in the vicinity of the second word again.