Ziv-Lempel compression compares the next characters from an input data stream to strings in a dictionary until the longest matching string is found, and it then outputs a code for the string, usually an index of the position of the string in the dictionary. Dictionaries have commonly been adaptive: when the longest matching string is found, a new string consisting of the matched string plus one or more additional characters is added. The adaptive process is such that it can be repeated during expansion, provided that the data is expanded in the order in which it was compressed. An adaptive dictionary may grow without bound, which increases the number of bits needed to express its indices; may grow to a predetermined size, after which it stops being adapted; or may have entries deleted from it to make room for new entries, with the deletion commonly being done by a least-recently-used algorithm. There are various ways of representing a dictionary in storage.
An article entitled "Compression of Individual Sequences via Variable Rate Coding," by Ziv and Lempel, published in September, 1978 in the IEEE Transactions on Information Theory IT-24, pages 530-536, discloses the basic Ziv-Lempel algorithm. A dictionary begins with a single null entry. When the longest string S that matches the next characters from the input is found in the dictionary, a new entry S+c is formed, where c is the input character after the string that matched S; a code for S and the uncompressed character c are emitted as output data; and matching of input characters is resumed beginning at the input character after c. A disadvantage is that the c characters in the output data are not compressed.
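The parsing rule described above can be illustrated with a minimal Python sketch. It is not taken from the article: the function names `lz78_compress` and `lz78_expand`, the use of index 0 for the null entry, the (index, character) output pairs, and the handling of input that ends in the middle of a match are all assumptions made for illustration.

```python
def lz78_compress(data: str) -> list[tuple[int, str]]:
    """Parse data into (index of S, uncompressed character c) pairs."""
    dictionary = {"": 0}              # the single null entry, at index 0 (assumed)
    output = []
    s = ""                            # current match S
    for c in data:
        if s + c in dictionary:
            s += c                    # S+c is known, so keep extending the match
        else:
            output.append((dictionary[s], c))    # code for S, then raw c
            dictionary[s + c] = len(dictionary)  # form the new entry S+c
            s = ""                    # resume matching after c
    if s:
        # Input ended inside a match (assumed handling): emit the match minus
        # its last character as S, and the last character as c.
        output.append((dictionary[s[:-1]], s[-1]))
    return output


def lz78_expand(pairs: list[tuple[int, str]]) -> str:
    """Rebuild the data by repeating the same entry formation."""
    entries = [""]                    # index 0 is the null entry
    out = []
    for index, c in pairs:
        s = entries[index] + c
        out.append(s)
        entries.append(s)             # same S+c entry the compressor formed
    return "".join(out)
```

Every output pair carries one raw character c, which makes the disadvantage noted above visible: for the input "abababab" this sketch emits five pairs, each containing an uncompressed character.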
U.S. Pat. No. 4,464,650 to Willard L. Eastman, et al, issued Aug. 7, 1984, discloses an initially null dictionary (called a search tree) in which, after a match on S, a new entry S+c is formed, with matching then resumed at the character after c. Characters of an alphabet of predetermined size are assigned position numbers in accordance with the order in which the characters are first encountered. A new entry is assigned the next available entry number (called a label), and the next available alphabet-sized set of indices (called virtual addresses) is assigned to the positionally ordered possible future dependent entries of the new entry S+c. The jth potential child of node i has the index iA-(A-j)+1, where A is the number of characters in the alphabet. For example, with a four-character alphabet, the null root node has number and index 1, and its four potential children have the indices 2-5. A child will be assigned a number if and when the child is created. A hash table correlates entry numbers to indices during compression or indices to entry numbers during expansion.
When S is matched and entry S+c is formed, a coded form of the index of S+c is emitted, and a coded form of c is emitted if this is the first encountering of c. A coded form of c is not required to be emitted if c has already been encountered because then c is determinable from the index of S+c. Note that the dictionary has many more indices than nodes, which is why encoding of the indices is required.
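The index arithmetic stated above for the Eastman search tree can be checked with a short Python sketch. The function `child_index` transcribes the formula iA-(A-j)+1 from the text; the inverse function `parent_and_position` is not described in the text and is derived here from that formula purely for illustration. Both function names are hypothetical.

```python
def child_index(i: int, j: int, alphabet_size: int) -> int:
    """Index (virtual address) of the jth potential child of node number i."""
    return i * alphabet_size - (alphabet_size - j) + 1


def parent_and_position(index: int, alphabet_size: int) -> tuple[int, int]:
    """Inverse of child_index (derived, not from the patent text).

    Since index - 2 == (i - 1) * A + (j - 1), integer division and remainder
    recover the parent's entry number i and the character position j.
    """
    i = (index - 2) // alphabet_size + 1
    j = (index - 2) % alphabet_size + 1
    return i, j
```

With a four-character alphabet, the root (number 1) has potential children at indices 2 through 5, matching the example in the text, and node number 2 has potential children at indices 6 through 9. Recovering the character position from an index is what allows c to be omitted from the output once it has been encountered.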
U.S. Pat. No. 4,558,302 to Terry A. Welch, issued Dec. 10, 1985, discloses a dictionary that optionally may be initialized with all characters of an alphabet (and it is assumed here that it is so initialized). After a match on S, a new entry S+c is formed, with matching then resumed at c rather than after c. The index of S is emitted, but c is not emitted, because c will be the first character of the next match and is therefore recoverable from the index of that match. A dictionary entry contains simply the index of a prefix (S) and an extension character (c).
During compression, the index of entry S+c is found by hashing the index for S and the character c. During expansion, when entry S+c is identified by an index in the compressed data, c is extracted from the S+c entry, and then the index of S in the S+c entry is used to access the S entry; hashing is not required during expansion. The Welch patent cites U.S. Pat. No. 4,464,650 (Eastman) as being unsuitable for high-performance implementations because it uses time-consuming and complex mathematical procedures such as multiplication and division to effect compression and expansion (column 3, line 44).
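The Welch scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the alphabet is assumed to be the 256 byte values occupying indices 0 through 255, a Python `dict` keyed by (prefix index, character) stands in for the hash, and the names `lzw_compress` and `lzw_expand` are hypothetical. The expander also handles a code that refers to the entry still being formed, a well-known special case of this scheme that the text above does not discuss.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Emit the index of each longest match S; c is never emitted."""
    table = {}                  # (index of S, byte c) -> index of S+c
    next_index = 256            # indices 0..255 are the single-byte entries
    out = []
    if not data:
        return out
    s = data[0]                 # index of the current match S
    for c in data[1:]:
        key = (s, c)
        if key in table:
            s = table[key]      # S+c exists: extend the match
        else:
            out.append(s)       # emit the index of S only
            table[key] = next_index
            next_index += 1
            s = c               # matching resumes at c, not after it
    out.append(s)
    return out


def lzw_expand(codes: list[int]) -> bytes:
    """Rebuild the data by following prefix indices; no hashing is needed."""
    entries = [(None, b) for b in range(256)]   # (prefix index, extension byte)

    def string_of(index: int) -> bytes:
        chars = []
        while index is not None:                # walk prefix links to the root
            index, c = entries[index]
            chars.append(c)
        return bytes(reversed(chars))

    out = bytearray()
    prev = None
    for code in codes:
        if code < len(entries):
            s = string_of(code)
        elif code == len(entries) and prev is not None:
            p = string_of(prev)                 # entry is still being formed:
            s = p + p[:1]                       # previous match + its first byte
        else:
            raise ValueError(f"invalid code {code}")
        out += s
        if prev is not None:
            entries.append((prev, s[0]))        # entry S+c for the previous match
        prev = code
    return bytes(out)
```

For the input `b"aaaa"` this sketch emits the three indices 97, 256, 97, and the second index exercises the special case in the expander.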
U.S. Pat. No. 4,814,746 to Victor S. Miller, et al, issued Mar. 21, 1989, assigned to the same assignee as the present application, makes a disclosure similar to that of U.S. Pat. No. 4,558,302 (Welch) and also discloses elimination of dictionary entries, to make room for new entries, by means of a least-recently-used algorithm that may delete entries having no dependent entries (leaves of the tree that is the dictionary). The Miller patent also discloses formation of a new entry from S'+S, where S is the current match and S' is the previous match. After a match on S, matching is resumed at the character following S in the input data stream. Formation from S'+S hastens adaptation to long strings. The embodiment includes a discriminator tree and an array of strings that is the actual dictionary.
An entry in the string array represents either a single character (it contains the character) or S'+S (it contains pointers to the S' and S entries).
A node in the discriminator tree points to a string array entry and contains the length of the represented string. The discriminator tree is traversed during matching by hashing the current node and the next input character after the string whose length is given by the current node. A final match may be on either the array entry designated by a discriminator node or the S' prefix of that entry.
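The S'+S entry formation described above can be illustrated with a simplified Python sketch. It deliberately omits the discriminator tree, the string array of pointers, and least-recently-used deletion: a plain `dict` of byte strings is swapped in for the dictionary, and the longest match is found by trying candidate lengths from longest to shortest. The names `lzmw_compress` and `lzmw_expand`, the 256 single-byte initial entries, and the rule of skipping an S'+S string that is already present are assumptions made for illustration.

```python
def lzmw_compress(data: bytes) -> list[int]:
    """Emit the index of each longest match S and form the entry S'+S."""
    index_of = {bytes([b]): b for b in range(256)}   # single-byte entries
    max_len = 1                  # length of the longest string in the dictionary
    out = []
    pos = 0
    prev = b""                   # previous match S'
    while pos < len(data):
        # An entry's prefixes are not all entries themselves, so try the
        # longest candidate first; a length-1 candidate always matches.
        for length in range(min(max_len, len(data) - pos), 0, -1):
            s = data[pos:pos + length]
            if s in index_of:
                break
        out.append(index_of[s])
        if prev and prev + s not in index_of:
            index_of[prev + s] = len(index_of)       # new entry S'+S
            max_len = max(max_len, len(prev) + len(s))
        prev = s
        pos += len(s)            # resume at the character following S
    return out


def lzmw_expand(codes: list[int]) -> bytes:
    """Rebuild the data by forming the same S'+S entries in the same order."""
    strings = [bytes([b]) for b in range(256)]
    known = set(strings)
    out = bytearray()
    prev = b""
    for code in codes:
        s = strings[code]
        out += s
        if prev and prev + s not in known:
            strings.append(prev + s)
            known.add(prev + s)
        prev = s
    return bytes(out)
```

On the input `b"abababab"` the sketch forms the entries "ab", "bab", and "abab" after only four matches, which shows how concatenating whole matches lengthens dictionary strings faster than adding one character at a time.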
A system in which a child node always represents only one extension character (a character on the right of the prefix represented by the parent) is called character extension. A system in which a child may represent multiple extension characters is called symbol extension.
European Patent Application 350,281 by Alan D. Clark, filed Jul. 4, 1989, forms a new entry from S+c and structures the dictionary as a tree. It discloses a down pointer in a parent node to the first child of the parent, a right pointer in a child to the next sibling of the child, and a parent pointer in each child to the parent of the child, with the parent pointer necessary only for expansion.
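The three pointers described above can be sketched as a small Python structure. This is only an illustration of the down, right, and parent links: the class name `Node`, the helper names, and the choice to insert a new child at the head of its parent's sibling list are assumptions, not details from the Clark application.

```python
class Node:
    """A dictionary tree node with down, right, and parent pointers."""
    __slots__ = ("char", "parent", "down", "right")

    def __init__(self, char, parent=None):
        self.char = char        # extension character represented by this node
        self.parent = parent    # parent pointer (needed only for expansion)
        self.down = None        # first child of this node
        self.right = None       # next sibling of this node


def find_child(parent: Node, c: str):
    """Compression step: follow down, then right pointers, looking for c."""
    child = parent.down
    while child is not None and child.char != c:
        child = child.right
    return child


def add_child(parent: Node, c: str) -> Node:
    """Form entry S+c; the new child goes at the head of the sibling list."""
    child = Node(c, parent)
    child.right = parent.down
    parent.down = child
    return child


def string_of(node: Node) -> str:
    """Expansion step: follow parent pointers to the root to recover S."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))
```

Finding a child costs a walk along the sibling list, so the lookup time grows with the number of children of a node, in exchange for a node that needs only a few pointers.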
A paper by H. D. Jacobson, titled "Some Measured Performance Bounds and Implementation Considerations for the Lempel-Ziv-Welch Data Compaction Algorithm," in International Telemetering Conference Proceedings v 28 1992, published by International Foundation for Telemetering, Woodland Hills, Calif., describes a character extension 2K-entry dictionary structured as a 2K times 256 array of 11-bit entries. This structure permits any of 256 possible child nodes of a parent node, each child representing a different extension character, immediately to be tested for existence and located.
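The direct-lookup structure described above can be sketched in Python as a flat array indexed by parent node and extension character. The paper's actual layout is not reproduced here: the 16-bit storage type (wide enough to hold an 11-bit value), the use of 0 as a "no child" marker, and the names `make_table`, `get_child`, and `set_child` are assumptions made for illustration.

```python
from array import array

NODES = 2048        # 2K dictionary entries, so a node number fits in 11 bits
ALPHABET = 256      # one slot per possible extension character
NO_CHILD = 0        # assumed marker; node 0 is taken to be a root, never a child


def make_table() -> array:
    """Allocate the 2K x 256 table, with every slot marked as having no child."""
    return array("H", [NO_CHILD]) * (NODES * ALPHABET)


def get_child(table: array, parent: int, c: int) -> int:
    """Test for and locate the child of `parent` for character c in one access."""
    return table[parent * ALPHABET + c]


def set_child(table: array, parent: int, c: int, child: int) -> None:
    """Record that entry `child` extends `parent` by character c."""
    if not 0 < child < NODES:
        raise ValueError("child number must fit in 11 bits and be nonzero")
    table[parent * ALPHABET + c] = child
```

The table holds 2048 x 256 slots regardless of how many entries exist, so the single-access lookup is paid for with storage that is mostly empty.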
All of the above-referenced patents and papers pertain to an adaptive dictionary that is useful for compressing and expanding long sequential data streams for either archiving or network transmission. Data must be expanded in the order in which it was compressed so that the dictionary during expansion will have, for each string processed, the same contents it had during compression.
U.S. Pat. No. 5,087,913 to Willard L. Eastman, issued Feb. 11, 1992, uses the same dictionary (search tree) and adaptive entry-formation processing as in the above-referenced U.S. Pat. No. 4,464,650 (Eastman), but it discloses entry formation by a preprocessor from a sample of the data to be compressed, and then freezing of the dictionary (no further adaptation) when either the sample is exhausted or the storage space for the dictionary is full. The advantage is that after an input data stream has been compressed, individual short records in the compressed data can be expanded, examined, and possibly changed and recompressed in random order, which is appropriate for a data base of records that are constantly being read and updated in random order.