1. Field of the Invention
This invention relates to procedures for storage and transfer of data in computer systems, and more particularly, to a method for extended alphabet data compression that decodes efficiently and preserves sort order in compressed form.
2. Description of the Related Art
Variable-length source coding is a well-known solution to the problems known for fixed-block data compression techniques, including large dictionary memory and extensive search requirements. The application of variable-length source coding methods to data storage compression leads to two new problems that do not concern practitioners in the data channel compression arts. These are (a) the loss of sort order among the stored records following compression, and (b) the code string termination problem, which relates to designation of record boundaries and the requirement to terminate strings at byte boundaries in byte-oriented storage systems.
The fundamental features of variable-length source coding are set forth by R. G. Gallager, "Information Theory and Reliable Communication", John Wiley and Sons, New York, 1968. Also, reference is made to Masud Mansuripur, "Information Theory", Prentice-Hall, Inc., Englewood Cliffs, N.J., 1987, pp. 25-35. The general solution to the sort-order problem of compressed records in a storage resource has not heretofore been found, particularly because of the difficulties inherent in the code string termination requirement for padding the encoded record in a byte-oriented storage resource. Because there is little interest in the sort order of data in compressed form during transmission through a channel, little attention is given to this problem by practitioners in the data transmission art.
An alphabet is defined in the art as a set of symbols, e.g., the eight-bit per symbol EBCDIC alphabet. Some variable-length source coding techniques employ a form of source alphabet extension, which is a technique for extending a source alphabet to include "words" composed of more than one symbol from the source alphabet. The elements of the extended source alphabet or source words can be assigned code elements or code words related to the relative frequencies of the respective source words. This coding technique involves first "parsing" a string of source symbols into "words" of one or more symbols and then "encoding" each source word by substituting the corresponding code word. Extended source alphabets offer a speed advantage. Decoding is also relatively fast because a plurality of source symbols are decoded for each code word.
Extended source alphabets also provide the opportunity for additional compression through the use of higher-order symbol dependencies. A simple example of an extended alphabet is the "3-gram" block method proposed by Claude Shannon (see Shannon et al. "The Mathematical Theory of Communication", University of Illinois Press, Urbana (1964)). Another general example is the dictionary approach. In the prior art, a dictionary is a collection of words referenced by a "greedy parser", which determines in one pass the longest word in the dictionary corresponding to the next portion of a source string or record to be compressed by encoding. Practitioners have also developed techniques using "non-greedy" optimal parsers for a data string that can be parsed more than one way. Each word in such a dictionary is assigned an index and the dictionary is shared between the encoder and decoder processes.
An important form of variable-length source coding is known in the art as Variable-to-Fixed (V-to-F) coding. In V-to-F coding, a set of 2^k variable-length source words is mapped to the set of fixed-length k-bit numbers. Because there are 2^k such k-bit numbers, the dictionary approach becomes V-to-F coding if each dictionary index is a k-bit binary code word. As is well-known, the best compression in V-to-F coding occurs when the source word probabilities are equal; that is, when the probability of each word approaches the ideal value of 2^-k. The corresponding code words can be decoded rapidly because the fixed-length k-bit code word serves as an index that retrieves from a table (dictionary) the variable-length source string that was encoded. These compression techniques also achieve some of the compression advantage of higher-order Markov techniques when the data string is actually a concatenation of source words.
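The table-lookup decoding described above can be sketched as follows. This is a minimal illustration only: the four-word dictionary, the value k = 2, and the names DICTIONARY, K and decode are assumptions chosen for the example and do not come from any cited reference.

```python
# Hypothetical dictionary of 2**K variable-length source words (K = 2).
K = 2
DICTIONARY = ["a", "ba", "bb", "bc"]


def decode(bits):
    """Decode a string of '0'/'1' characters, K bits per code word."""
    if len(bits) % K != 0:
        raise ValueError("code string is not a whole number of code words")
    words = []
    for i in range(0, len(bits), K):
        index = int(bits[i:i + K], 2)   # the k-bit code word is a table index
        words.append(DICTIONARY[index])  # one lookup yields several symbols
    return "".join(words)


print(decode("000110"))  # code words 00, 01, 10 -> "a" + "ba" + "bb" = "ababb"
```

Each code word costs one table lookup regardless of how many source symbols it represents, which is the source of the decoding speed noted above.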
An adaptive form of V-to-F encoding is the Ziv-Lempel technique described by J. Ziv et al., "Compression of Individual Sequences Via Variable-Rate Coding", IEEE Trans. Information Theory, Vol. IT-24(5), pp. 530-536, September 1978. Reference is also made to the analysis of the Ziv-Lempel code compression capabilities provided by G. Langdon, "A Note on the Ziv-Lempel Model for the Compression of Individual Sequences", IEEE Trans. Information Theory, Vol. IT-29(2), March 1983. For a discussion of the popular "LZW" version of the Ziv-Lempel code, reference is made to T. Welch, "A Technique for High-Performance Data Compression", Computer, June 1984, pp. 8-19.
In U.S. Pat. No. 4,464,650, W. L. Eastman et al. disclose an improved "LZW" version of the Ziv-Lempel compression algorithm wherein the source string is parsed into adaptively growing segments, each segment consisting of an exact copy of an earlier portion of the source string suffixed by one new symbol from the source string. The copy that is to be made is the longest possible and is constrained to coincide with an earlier parsed segment. By this method, a dictionary is "grown" in response to the source string characteristics, adaptively. Eastman et al. provide extensive discussion of an incomplete parse-tree encoding method and their patent is included herein in its entirety by this reference.
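The segment-growing parse described above (each segment is a copy of an earlier segment suffixed by one new symbol) can be sketched as follows. This is the textbook incremental Ziv-Lempel parse, offered only as an illustration under that assumption; it is not taken from the Eastman et al. patent, and the name lz78_parse is hypothetical.

```python
def lz78_parse(source):
    """Split source into segments, each an earlier segment plus one new symbol."""
    segments = []
    seen = {""}      # the empty segment is always available to copy
    current = ""
    for symbol in source:
        candidate = current + symbol
        if candidate in seen:
            current = candidate          # keep extending the copied portion
        else:
            segments.append(candidate)   # longest earlier segment + new symbol
            seen.add(candidate)          # the dictionary grows adaptively
            current = ""
    if current:
        segments.append(current)         # tail that merely repeats a segment
    return segments


print(lz78_parse("aababcabcd"))  # ['a', 'ab', 'abc', 'abcd']
```

The set named seen plays the role of the adaptively grown dictionary; a practical encoder would emit an index for the copied segment together with the new symbol rather than the segment text itself.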
In the general dictionary approach, each symbol alphabet dictionary can be described as a "parse-tree", such as the tree shown in FIG. 1. To recognize the next dictionary word from a source string of source symbols, the parse-tree is entered at the root node and the path to the child of the root node having the first source symbol is taken. Next, if the second source symbol in the string matches a child symbol that is subtended from the current node corresponding to the first symbol, the path to the child node having the second symbol is traversed and this new node becomes the current node. At some point, a current node is reached that is either a leaf node, defined as a node without children, or is an interior node having some children but lacking a child node assigned the next source symbol in the string. At this point, the current node corresponds to the end of the longest word in the source string that is present in the source "dictionary". This node is assigned a code word from a code word dictionary. If this end node is not a leaf node, the parse-tree is said to be "incomplete" (e.g. FIG. 2).
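The parse-tree walk described above can be sketched as follows. The tree representation (nested Python dictionaries), the example word list, and the names build_tree and greedy_parse are assumptions made for illustration; every node other than the root is treated as a dictionary word.

```python
def build_tree(words):
    """Build a parse-tree as nested dicts; each non-root node is a word."""
    root = {}
    for word in words:
        node = root
        for symbol in word:
            node = node.setdefault(symbol, {})
    return root


def greedy_parse(source, root):
    """Repeatedly take the longest dictionary word that matches the source."""
    words = []
    i = 0
    while i < len(source):
        node = root
        start = i
        # Descend while the next source symbol is a child of the current node.
        while i < len(source) and source[i] in node:
            node = node[source[i]]
            i += 1
        if i == start:
            raise ValueError(f"symbol {source[i]!r} is not a child of the root")
        words.append(source[start:i])
    return words


# Hypothetical incomplete tree: node "a" has only the child "b".
tree = build_tree(["a", "b", "c", "aba"])
print(greedy_parse("abacac", tree))  # ['aba', 'c', 'a', 'c']
```

In this example the third word ends at the interior node "a", which has a child for "b" but none for "c"; that is the incomplete case described above.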
If each interior node of the parse-tree has all source symbols as children, then the parse-tree is said to be "full" (e.g. FIG. 1). In a full parse-tree, all parsing operations terminate at a leaf node unless the end of the source string is first attained. Thus, the code word dictionary corresponding to a full parse-tree contains only leaf node code words. No code word in this dictionary is the prefix of another code word in the same dictionary. Any coding scheme having this absence of prefixes is referred to as a "prefix code".
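The two properties defined above can be checked mechanically. The following sketch assumes a dictionary given as a set of source words and a source alphabet given as a string; the names is_prefix_free and is_full and the example dictionaries are hypothetical.

```python
def is_prefix_free(words):
    """True if no word is a prefix of another word."""
    ordered = sorted(words)
    # In sorted order, a word that prefixes any other word prefixes its successor.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


def is_full(leaf_words, alphabet):
    """True if every interior node has every alphabet symbol as a child."""
    nodes = {w[:i] for w in leaf_words for i in range(len(w) + 1)}
    interior = nodes - set(leaf_words)   # includes the root, ""
    return all(p + s in nodes for p in interior for s in alphabet)


full_leaves = {"aa", "ab", "b"}             # leaves of a full tree over {a, b}
incomplete_words = {"a", "ab", "b", "c"}    # "a" is both a word and interior

print(is_full(full_leaves, "ab"), is_prefix_free(full_leaves))   # True True
print(is_prefix_free(incomplete_words))                          # False
```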
Unfortunately, in the general case, the parse-tree has some interior nodes that are limited only to the most popular children and therefore the parse-tree is said to be "incomplete". Many important dictionary techniques, including the Ziv-Lempel compression method, result in incomplete parse-trees because adding all source symbols at each generation is unnecessarily expensive. The majority of these source symbols occur infrequently and the paths to their assigned nodes are seldom used. The infrequently used children nodes are omitted to permit efficient dictionary storage and rapid decompression of code words.
An "alphabetical" order can be postulated for any symbol alphabet and such order extends to any extended alphabet that represents the leaves of a full parse-tree. The leaves of a full parse-tree have an alphabetical or "sort" order according to the order of the corresponding symbol alphabet. A first approach to order-preserving codes may be found by referring to Gilbert et al., Bell System Technical Journal, 1959, pp. 933-967. Gilbert et al. discuss "alphabetical-order encodings" having a "strong alphabetical" property where the source record order is preserved as the numerical order of their binary codes. Although an order-preserving encoding technique limited to full parse-trees can be deduced from the methods known to practitioners in the art (except for source words formed by EOR source string truncations), an efficient coding technique that preserves the sort order of source records in encoded form for incomplete parse-tree dictionary coding methods is unknown.
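The difference between the two cases can be illustrated with a small experiment. The sketch below numbers the dictionary words in alphabetical order and assigns each a fixed-length k-bit code word; this naive numbering, the two example dictionaries, and the names greedy_parse and encode are assumptions for illustration, not a method taken from the cited art. Records are chosen so that none ends in a truncated word.

```python
def greedy_parse(record, words):
    """Longest-match parse of record into dictionary words."""
    out, i = [], 0
    longest = max(len(w) for w in words)
    while i < len(record):
        for length in range(min(longest, len(record) - i), 0, -1):
            if record[i:i + length] in words:
                out.append(record[i:i + length])
                i += length
                break
        else:
            raise ValueError("record ends in a truncated word")
    return out


def encode(record, words, k):
    """Number the words alphabetically and emit k bits per parsed word."""
    index = {w: n for n, w in enumerate(sorted(words))}
    return "".join(format(index[w], f"0{k}b") for w in greedy_parse(record, words))


# Full parse-tree over {a, b}: the words are exactly the leaves.
full = {"aa", "ab", "b"}
records = ["baa", "aab", "b", "abb", "aa"]
by_code = sorted(records, key=lambda r: encode(r, full, 2))
print(by_code == sorted(records))   # True: code order equals source order

# Incomplete parse-tree: "a" is a word and also the parent of "ab".
incomplete = {"a", "ab", "b", "c"}
print(encode("ab", incomplete, 2), encode("ac", incomplete, 2))   # 01 0011
```

In the second case "ab" sorts before "ac" as source text, yet its code string "01" sorts after "0011", so alphabetical numbering alone does not preserve record order once interior nodes carry code words.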
Much of the relevant prior art concerns methods for the compression of search keys, search indexes and pointer tables designed to point to the actual source record of interest. Such compressed indexes do nothing to preserve sort order in a file of compressed records. Indeed, in U.S. Pat. No. 3,651,483, W. A. Clark et al. disclose the use of a compressed index for pointing to an uncompressed record. That is, the actual records are never compressed.
W. A. Clark et al. also discuss the generation of compressed keys in U.S. Pat. No. 3,593,309 and E. Loizides teaches a similar compressed index method in U.S. Pat. Nos. 3,602,895 and 3,613,086. These teachings do not sort the compressed keys in terms of their lexical value. The compression techniques also do not consider the probability distribution of the keys but merely use shortened representations of each index obtained by dropping the prefix or higher-order portion of the index pointer in groups of adjacent indexes having identical higher-order terms.
In U.S. Pat. No. 4,782,325, Hakan Jeppsson et al. teach a key encoding method that reduces the redundancy ratio of the encoded key set. This again considers only keys and does not teach any method for compressing the actual records, which must be sorted in uncompressed form. Jeppsson et al. use a Huffman code to generate a code word dictionary, thereby reducing the redundancy ratio by optimizing the encoding procedure.
In U.S. Pat. No. 4,386,416, Michael E. Giltner et al. teach a data compression method suitable for use in data transmission over a standard network. Giltner et al. also suggest the Huffman code for optimal dictionary generation and base their binary sequences on escape codes. Neither Jeppsson et al. nor Giltner et al. consider the problem of sort ordering of records in encoded form.
Reference is also made to D. R. Hicks, "Compressing Sort Keys While Maintaining Sort Sequence", IBM Technical Disclosure Bulletin, Vol. 26, No. 1, June 1983, pp. 388-391. Hicks discloses a compression method that retains a sort-order property for a set of sort keys but teaches the use of run-length encoding of like source characters. This encoding method merely replaces a sequence of two or more repeated characters with two occurrences of the repeated character followed by a count of the repetitions. Hicks neither teaches nor suggests how to use the more powerful adaptive V-to-F encoding method discussed above while preserving source record sort order in compressed form.
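The run-length replacement described above can be sketched as follows. The form of the count is not specified in the description, so this sketch assumes the count is the total run length written as a single decimal digit, with longer runs split into runs of at most nine; the names MAX_RUN and rle_encode are hypothetical, and the sketch makes no attempt to reproduce the details of the Hicks disclosure.

```python
from itertools import groupby

MAX_RUN = 9  # assumption: the count is one decimal digit


def rle_encode(key):
    """Replace each run of 2+ equal characters by the character twice plus a count."""
    out = []
    for char, group in groupby(key):
        run = len(list(group))
        while run > 0:
            chunk = min(run, MAX_RUN)
            if chunk >= 2:
                out.append(char * 2 + str(chunk))  # doubled character, then count
            else:
                out.append(char)                   # single characters pass through
            run -= chunk
    return "".join(out)


print(rle_encode("aaaab"))  # aa4b
print(rle_encode("abc"))    # abc
print(rle_encode("aab"))    # aa2b
```

As the last example shows, a run of exactly two characters grows by one character under this scheme, so the method saves space only on keys with longer runs.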
Reference is made to P. Franaszek, "Optimized Key Compression", IBM Technical Disclosure Bulletin, Vol. 16, No. 10, March 1974, pp. 3466-3470. Franaszek teaches a powerful key-compression algorithm based on a parsing tree wherein, at each internal node, a mechanism is provided for locating the sub-tree containing the key being searched. However, Franaszek's method does not preserve sort order following compression.
Thus, the powerful and effective compression methods known in the art that rely on incomplete parse-trees do not preserve sort order. Those order-preserving data compression techniques that are known in the art are relatively ineffective at saving storage space. Accordingly, there is a clearly felt need in the art for a data compression method that combines the efficient compression features of V-to-F codes using incomplete parse-trees with the source record order-preserving features available from full parse-tree dictionary encoding techniques.
The unresolved problems and deficiencies related to incomplete parse-tree dictionary encoding data compression methods are clearly felt in the data storage arts and are solved by the present invention in the manner described below.