1. Field of the Invention
This invention relates to computer-implemented systems for adaptive compression of data and, more specifically, to a system for building a static compression dictionary which can be used for software or hardware compression procedures.
2. Discussion of the Related Art
For many years, data compression has been implemented as a host software task. Recently, there has been a trend toward implementing hardware data compression, especially within data storage subsystems and devices. This strategy reduces the host workload and increases effective storage capacity and transfer rate. Increases in VLSI density and continuing improvement of sophisticated data compression procedures that automatically adapt to different data have encouraged this trend.
The problems presented by data compression procedures include the difficulties with updating adaptive dictionaries and the processor overhead associated with developing and adapting such a dictionary over time. Practitioners in the art have proposed powerful adaptive compression procedures, such as the Ziv-Lempel adaptive parse-tree, for compressing data and for evolving the code dictionary responsive to data characteristics.
The Ziv-Lempel algorithm was first described by Ziv, et al ("A Universal Algorithm For Sequential Data Compression", IEEE Trans. Info. Theory, IT-23, No. 3, pp. 337-343, May 1977). The basic Ziv-Lempel encoder has a code dictionary in which each source sequence entry has an associated index (code) number. Initially, the dictionary contains only the null-string root and perhaps the basic source alphabet. During the source data encoding process, new dictionary entries are formed by appending single source symbols to existing dictionary entries whenever the new entry is encountered in the source data stream. The dictionary can be considered as a search tree or parse-tree of linked nodes, which form paths representing source symbol sequences making up an "extended" source alphabet. Each node within the parse-tree terminates a source symbol sequence that begins at the null-string root node of the tree. The source data stream is compressed by first recognizing sequences of source symbols in the uncompressed input data that correspond to nodes in the parse-tree and then transmitting the index (code symbol) of a memory location corresponding to the matched node. A decoder dictionary is typically constructed from the parse-tree to recover the compressed source sequence in its original form. The Ziv-Lempel parse-tree continuously grows during the encoding process as additional and increasingly longer sequences of source symbols are identified in the source data stream, thereby both adapting to the character of the input data and steadily improving the compression ratio.
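The parse-tree behavior described above can be illustrated with a minimal sketch. It assumes an LZW-style variant in which the tree is preloaded with the 256 single-byte symbols and capped at 4096 nodes; the function names `encode` and `decode` and the parameter `max_codes` are hypothetical and are not taken from any cited reference.

```python
def encode(data: bytes, max_codes: int = 4096):
    """Parse `data` against a growing parse-tree and return the emitted node indices."""
    # children[n] maps a source symbol to the index of a child node.
    # Node 0 is the null-string root; nodes 1..256 hold the single-byte alphabet.
    children = [{}]
    for sym in range(256):
        children[0][sym] = len(children)
        children.append({})

    codes = []
    node = 0
    for sym in data:
        nxt = children[node].get(sym)
        if nxt is not None:
            node = nxt                      # keep extending the current match
            continue
        codes.append(node)                  # emit the index of the longest match
        if len(children) < max_codes:       # grow the tree by one child node
            children[node][sym] = len(children)
            children.append({})
        node = children[0][sym]             # restart the match at this symbol
    if node != 0:
        codes.append(node)                  # flush the final partial match
    return codes


def decode(codes, max_codes: int = 4096) -> bytes:
    """Rebuild the same dictionary from the code stream and recover the source."""
    strings = [b""] + [bytes([s]) for s in range(256)]
    out = []
    prev = None
    for code in codes:
        if code < len(strings):
            entry = strings[code]
        else:
            # The code names the entry the encoder created on its previous step.
            entry = prev + prev[:1]
        out.append(entry)
        if prev is not None and len(strings) < max_codes:
            strings.append(prev + entry[:1])
        prev = entry
    return b"".join(out)


if __name__ == "__main__":
    sample = b"abababab"
    emitted = encode(sample)
    print(len(sample), "source symbols ->", len(emitted), "codes:", emitted)
    print(decode(emitted) == sample)
```

In this sketch, the eight-symbol input is parsed into five codes, and the decoder reconstructs its dictionary from the code stream alone, so no dictionary needs to be transmitted.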
The ideal Ziv-Lempel compression procedure is difficult to implement in practice because it requires an indefinitely large memory to store the parse-tree. Practitioners have introduced data structures designed to ease this problem, including the "TRIE" structure discussed by Kent Anderson ("Methods of Data Compression After The Manner of Lempel and Ziv", Optical Information Systems, January-February 1990). Terry A. Welch ("A Technique For Higher Performance Data Compression", IEEE Computer, Vol. 17, No. 6, pp. 8-19, June 1984) discusses data structures that improve the efficiency of the basic Ziv-Lempel technique, trading off compression efficiency for simplified implementation. Also, in U.S. Pat. No. 4,814,746, Victor S. Miller, et al disclose a variation on the Ziv-Lempel data compression method that improves compression efficiency using a fixed parse-tree size. However, the Miller, et al method employs a hash table that requires significant memory and processing time, thereby negating much of the speed advantage sought with hardware-based dictionaries.
The related art is generally documented by other practitioners and can be clearly understood with reference to Allen Clark's disclosure in European Patent Application 89306808.0 published on Jan. 10, 1990. Also, reference is made to Willard Eastman's disclosure in U.S. Pat. No. 5,087,913, which extends his earlier work disclosed in U.S. Pat. No. 4,464,650, and Terry Welch's disclosure in U.S. Pat. No. 4,558,302, all of which are entirely incorporated herein by this reference.
A fundamental problem presented by hardware-based compression systems is how best to exploit the speed advantage of hardware encoders and decoders while enjoying the compression efficiency offered by the Ziv-Lempel class of dictionaries. The parse-tree data structures proposed by Clark, by Miller, et al, and by Welch in the above-cited references offer some improvement in encoding and decoding speed but are generally intended for software implementation. Also, the Ziv-Lempel technique generally relies on continuing adaptation of the parse-tree responsive to an incoming source data stream, so the resulting dictionary must be continuously updated by software processes, a wasteful procedure for hardware-based systems.
Several practitioners, particularly Miller, et al cited above, have also considered the problem of poor compression efficiency during the early tree-building process. The Ziv-Lempel parse-tree is initialized either with the null-string root node alone or with the root node and a single set of source alphabet child-nodes. The initial parse-tree has only this inefficient dictionary with which to encode the early portion of the input data stream. Ziv, et al cited above showed that this early inefficiency is inconsequential over the long term. However, in databases where the input data stream consists of a series of individually-encoded records of relatively short length, Miller, et al cited above argue that up to one-third of the entire data stream can require more storage space in its encoded form than in its original form.
Consider, for example, that the very first 8-bit source symbols must be encoded as 12-bit encoder symbols in a parse-tree designed for Variable-to-Fixed (V-F) encoding of up to 4K extended-alphabet source symbols. If the parse-tree is restarted at the beginning of each record and the record is not long enough to eventually overcome the early parse-tree inefficiency by adding longer source symbol strings for conversion to single 12-bit encoder symbols, the encoded data may well require more storage space than does the source data.
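The arithmetic behind this example can be checked with a short sketch. The helper names `code_width_bits` and `encoded_size_ratio` are hypothetical; the only figures taken from the text are the 8-bit source symbols and the 4K-entry dictionary.

```python
def code_width_bits(dictionary_entries: int) -> int:
    """Bits needed for a fixed-width index into a dictionary of the given size."""
    return (dictionary_entries - 1).bit_length()


def encoded_size_ratio(source_symbols: int, emitted_codes: int,
                       symbol_bits: int = 8,
                       dictionary_entries: int = 4096) -> float:
    """Encoded size divided by source size; a value above 1.0 means expansion."""
    encoded_bits = emitted_codes * code_width_bits(dictionary_entries)
    return encoded_bits / (source_symbols * symbol_bits)


if __name__ == "__main__":
    print(code_width_bits(4096))            # width of one encoder symbol
    # Worst case early in a record: every code covers a single source symbol.
    print(encoded_size_ratio(100, 100))
    # Break-even: each code must cover 1.5 source symbols on average.
    print(encoded_size_ratio(150, 100))
```

Under these assumptions, a record whose codes each cover only one source symbol grows by half, and the encoding breaks even only once the average match reaches 1.5 source symbols per 12-bit code.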
Another problem presented by implementation of the Ziv-Lempel parse-tree is memory space limitations. In the above-cited patent, Miller, et al discuss the use of a replacement procedure that updates the dictionary responsive to recent samples of the source data stream without overflowing a fixed dictionary size. They test the dictionary for an empty slot and delete the least recently used (LRU) source-symbol string from the dictionary if no empty slot is found. Unfortunately, a simple LRU replacement scheme may eliminate an entry that was used many times, although not recently.
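The weakness of a simple LRU policy can be seen in a minimal sketch of a fixed-size dictionary. This is an illustration only, not the Miller, et al implementation; the class name `LRUDictionary`, its methods, and the sample strings are hypothetical.

```python
from collections import OrderedDict


class LRUDictionary:
    """Fixed-capacity string dictionary with least-recently-used replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        # Maps string -> use count; ordering runs from least to most recently used.
        self.entries = OrderedDict()

    def use(self, string: bytes) -> bool:
        """Record a match against an existing entry; return False if absent."""
        if string not in self.entries:
            return False
        self.entries[string] += 1
        self.entries.move_to_end(string)
        return True

    def add(self, string: bytes):
        """Insert a new entry; return the evicted (string, use_count) or None."""
        evicted = None
        if string not in self.entries and len(self.entries) >= self.capacity:
            # No empty slot: delete the least recently used entry.
            evicted = self.entries.popitem(last=False)
        self.entries.setdefault(string, 0)
        self.entries.move_to_end(string)
        return evicted


if __name__ == "__main__":
    d = LRUDictionary(capacity=3)
    d.add(b"the")
    for _ in range(50):
        d.use(b"the")          # heavily used early in the stream
    d.add(b"qx")               # two rarely useful entries arrive later
    d.add(b"zj")
    print(d.add(b"kw"))        # the entry with 50 uses is the one evicted
```

Because recency is the only criterion, the entry with fifty recorded uses is discarded in favor of entries that have never been matched.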
Implicit in such a replacement procedure is the understanding that both the encoder and decoder dictionaries are updated simultaneously in accordance with the modified parse-tree and that any data already encoded by the deleted entry is no longer in existence, having already been decoded. This assumption is reasonable for a communication channel but is unlikely to hold in a database storage system. Miller, et al ("Variations On A Theme by Ziv and Lempel", Combinatorial Algorithms, A. Apostolico, et al, Eds., pp. 131-140, Springer Verlag, 1984) suggest that maintaining the parse-tree and dictionary data structures can be difficult when nodes and strings are to be deleted using the LRU strategy.
Thus, there exists a need in the art for an optimal strategy for creating a Ziv-Lempel dictionary adapted to the source data that can be stored for use in a hardware system for compressing data tables of the type employed by database systems such as the International Business Machines Corporation Database 2. The related unresolved problems and deficiencies are clearly felt in the art and are solved by this invention in the manner described below.