Data compression while preserving sort ordering makes practicable the compression of certain types of data. For example, the data in a data base may consist of two thirds data records and one third keys corresponding to the data records. The keys are normally processed extensively by means of less than, equal to, or greater than comparisons in order to locate corresponding records. While the data records often are compressed to save space, the keys are commonly not compressed because of the processing time that would be required to uncompress them in order to do said comparisons. If the same said comparisons could be done on compressed keys as on uncompressed keys, it would be practicable to compress the keys in order to save space beyond that saved by compressing just the data records, and it would also reduce processing time because the compressed keys would be shorter than the uncompressed keys.
Said comparisons can be done on compressed keys if the sort ordering of the keys has been preserved, that is, if a first compressed key has a lower value than a second compressed key when the corresponding first uncompressed key is lower in the collating sequence than the second uncompressed key.
Data compression has been done in the prior art by means of a Ziv-Lempel (ZL) method and a ZL dictionary, which are introduced as follows:
Ziv-Lempel (ZL) compression compares the next characters from an input data stream to strings in a dictionary until the longest matching string is found, and the method then outputs a code for the string, usually an index of the position of the string in the dictionary. In adaptive ZL dictionaries when each longest matching string is found, a new string consisting of the matched string plus one or more additional characters is added into the dictionary. The adaptive process is such that it can be repeated during expansion, provided that the data is expanded in the order in which it was compressed. An adaptive dictionary may grow without bounds, which increases the number of bits needed to express its indices; may grow to a predetermined size, after which it stops being adapted; or may have entries deleted from it to make room for new entries, with the deletion commonly being done by a least-recently-used algorithm. There are various ways of representing a dictionary in storage.
An article entitled "Compression of Individual Sequences via Variable Rate Coding," by Ziv and Lempel, published in Sep., 1978 in the IEEE Transactions on Information Theory, Vol. IT-24, No. 5, pages 530-536, discloses the basic Ziv-Lempel algorithm. A dictionary begins with a single null entry. When the longest string S that matches the next characters from the input is found in the dictionary, then a new entry S+c is formed, where c is the input character after the string that matched S, a code for S and the uncompressed character c are emitted as output data, and matching of input characters is resumed beginning at the input character after c. There is the disadvantage that the c characters in the output data are not compressed.
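The basic algorithm just described can be illustrated by the following minimal Python sketch. The function names, the use of a Python dict as the dictionary, and the handling of input that ends in the middle of a match are assumptions made for illustration and are not taken from the article.

```python
def lz78_compress(data):
    """Return (index of S, character c) pairs; index 0 is the null entry."""
    dictionary = {"": 0}              # the single initial null entry
    output = []
    s = ""                            # current matched string S
    for c in data:
        if s + c in dictionary:       # the match can still be extended
            s += c
        else:
            output.append((dictionary[s], c))    # code for S, then raw c
            dictionary[s + c] = len(dictionary)  # new entry S+c
            s = ""                               # resume after c
    if s:
        # Assumption: if the input ends during a match, emit the prefix of
        # the match and its last character so that the decoder recovers it.
        output.append((dictionary[s[:-1]], s[-1]))
    return output


def lz78_expand(pairs):
    """Rebuild the same dictionary, in the same order, while expanding."""
    entries = [""]                    # entry 0 is the null entry
    out = []
    for index, c in pairs:
        entry = entries[index] + c
        entries.append(entry)
        out.append(entry)
    return "".join(out)


pairs = lz78_compress("ABABABA")
print(pairs)
print(lz78_expand(pairs))
```

Each emitted pair carries one uncompressed character c, which is the disadvantage noted above.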
U.S. Pat. No. 4,464,650 to Willard L. Eastman, et al, issued Aug. 7, 1984, discloses an initially null dictionary (called a search tree) in which, after a match on S, a new entry S+c is formed, with matching then resumed at the character after c. Characters of an alphabet of predetermined size are assigned position numbers in accordance with the order in which the characters are first encountered. A new dictionary entry is assigned the next available entry number (called a label), and the next available alphabet-sized set of indices (called virtual addresses) is assigned to the positionally ordered possible future dependent entries of the new entry S+c. The jth potential child of node i has the index iA-(A-j)+1, where A is the number of characters in the alphabet. For example, with a four-character alphabet, the null root node has number and index 1, and its four potential children have the indices 2-5. A child will be assigned a number if and when the child is created. A hash table correlates entry numbers to indices during compression or indices to entry numbers during expansion.
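The index formula given above can be checked with a short Python sketch. The function name child_index is hypothetical; the formula and the four-character example are those stated in the preceding paragraph.

```python
def child_index(i, j, alphabet_size):
    """Index (virtual address) of the jth potential child of node number i."""
    return i * alphabet_size - (alphabet_size - j) + 1


# Four-character alphabet: the root is node 1 with index 1.
print([child_index(1, j, 4) for j in range(1, 5)])   # children of node 1
print([child_index(2, j, 4) for j in range(1, 5)])   # children of node 2
```

The sets of indices reserved for successive node numbers are contiguous and do not overlap, which is why each new entry can be given the next available alphabet-sized set.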
When S is matched and entry S+c is formed, a coded form of the index of S+c is emitted, and a coded form of c is emitted if this is the first encountering of c. A coded form of c is not required to be emitted if c has already been encountered because then c is determinable from the index of S+c. Note that the dictionary has many more indices than nodes, which is why encoding of the indices is required.
U.S. Pat. No. 4,558,302 to Terry A. Welch, issued Dec. 10, 1985, discloses a dictionary that optionally may be initialized with all characters of an alphabet (and it is assumed here that it is so initialized). After a match on S, a new entry S+c is formed, with matching then resumed AT c. The index of S is emitted, but c is not emitted since the value of c will be known by means of the index of the next match since c will be the first character of the next match. A dictionary entry contains simply the index of a prefix (S) and an extension character (c).
During compression, the index of entry S+c is found by hashing the index for S and the character c. During expansion, when entry S+c is identified by an index in the compressed data, c is extracted from the S+c entry, and then the index of S in the S+c entry is used to access the S entry; hashing is not required during expansion. U.S. Pat. No. 4,464,650 (Eastman) is cited as being unsuitable for high-performance implementations because of utilizing time consuming and complex mathematical procedures such as multiplication and division to effect compression and expansion (column 3, line 44).
U.S. Pat. No. 4,814,746 to Victor S. Miller, et al, issued Mar. 21, 1989, assigned to the same assignee as the present application, discloses similarly to U.S. Pat. No. 4,558,302 (Welch) and also discloses elimination of dictionary entries, to make room for new entries, by means of a least-recently-used algorithm that may delete entries having no dependent entries (leaves of the tree that is the dictionary). The Miller patent also discloses formation of a new entry from S'+S, where S is the current match and S' is the previous match. After a match on S, matching is resumed at the character following S in the input data stream. Formation from S'+S hastens adaptation to long strings. The embodiment includes a discriminator tree and an array of strings that is the actual dictionary.
An entry in the string array represents either a single character (it contains the character) or S'+S (it contains pointers to the S' and S entries).
A node in the discriminator tree points to a string array entry and contains the length of the represented string. The discriminator tree is traversed during matching by hashing the current node and the next input character after the string whose length is given by the current node. A final match may be on either the array entry designated by a discriminator node or the S' prefix of that entry.
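The formation of new entries from S'+S can be illustrated by the following Python sketch. It is a simplification under stated assumptions: the dictionary is a plain mapping from strings to indices, the longest match is found by scanning all entries, and the discriminator tree, the string array, and least-recently-used deletion are omitted. The function names are hypothetical.

```python
def sprime_s_compress(data, alphabet):
    index_of = {ch: i for i, ch in enumerate(alphabet)}
    output = []
    prev = ""                                  # previous match S'
    pos = 0
    while pos < len(data):
        # Longest dictionary string matching the input at pos (match S).
        s = max((e for e in index_of if data.startswith(e, pos)), key=len)
        output.append(index_of[s])
        if prev and prev + s not in index_of:
            index_of[prev + s] = len(index_of)  # new entry S'+S
        prev = s
        pos += len(s)                           # resume after S
    return output


def sprime_s_expand(codes, alphabet):
    entries = list(alphabet)
    known = set(entries)
    out = []
    prev = ""
    for code in codes:
        s = entries[code]
        if prev and prev + s not in known:      # mirror the compressor
            entries.append(prev + s)
            known.add(prev + s)
        out.append(s)
        prev = s
    return "".join(out)


codes = sprime_s_compress("ABABABAB", "AB")
print(codes)
print(sprime_s_expand(codes, "AB"))
```

Because each new entry concatenates two complete matches, entry length can grow much faster than one character per match, which is the hastened adaptation referred to above.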
A system in which a child node (child) always represents only one extension character (a character on the right of the prefix represented by the parent) is called character extension. A system in which a child may represent multiple extension characters is called symbol extension.
European Patent Application 350,281 by Alan D. Clark, filed Jul. 4, 1989, forms a new entry from S+c and structures the dictionary as a tree. It discloses a down pointer in a parent node to the first child of the parent, a right pointer in a child to the next sibling of the child, and a parent pointer in each child to the parent of the child, with the parent pointer necessary only for expansion.
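The pointer arrangement described for the Clark application can be sketched in Python as follows. The class and function names are hypothetical, and inserting a new child at the head of the sibling list is an assumption; the description above specifies only the three kinds of pointer.

```python
class Node:
    def __init__(self, char, parent=None):
        self.char = char        # extension character of this node
        self.parent = parent    # parent pointer (needed only for expansion)
        self.down = None        # down pointer to the first child
        self.right = None       # right pointer to the next sibling


def find_child(parent, c):
    """Walk the sibling list under parent looking for character c."""
    child = parent.down
    while child is not None and child.char != c:
        child = child.right
    return child


def add_child(parent, c):
    """Form entry S+c (assumption: new child goes first in the list)."""
    child = Node(c, parent)
    child.right = parent.down
    parent.down = child
    return child


def string_of(node):
    """Expansion: follow parent pointers up to the root."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))


root = Node(None)
a = add_child(root, "A")
b = add_child(root, "B")
ac = add_child(a, "C")
print(string_of(ac))
```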
A paper by H. D. Jacobson, titled "Some Measured Performance Bounds and Implementation Considerations for the Lempel-Ziv-Welch Data Compaction Algorithm," in International Telemetering Conference Proceedings v 28 1992, published by International Foundation for Telemetering, Woodland Hills, Calif., describes a character extension 2K-entry dictionary structured as a 2K times 256 array of 11-bit entries. This structure permits any of 256 possible child nodes of a parent node, each child representing a different extension character, immediately to be tested for existence and located.
All of the above referenced patents and paper pertain to an adaptive dictionary that is useful for compressing and expanding long sequential data streams for either archiving or network transmission. Data must be expanded in the order in which it was compressed so that the dictionary during expansion will have, for each string processed, the same contents it had during compression.
U.S. Pat. No. 5,087,913 to Willard L. Eastman, issued Feb. 11, 1992, uses the same dictionary (search tree) and adaptive entry formation processing as in the above referenced U.S. Pat. No. 4,464,650 (Eastman), but it discloses entry formation by a preprocessor from a sample of the data to be compressed, and then freezing of the dictionary (no further adaptation) when either the sample is exhausted or the storage space for the dictionary is full. The advantage is that after an input data stream has been compressed, individual short records in the compressed data can be expanded and examined and possibly changed and recompressed in random order, which is appropriate for a data base of records that are constantly being read and updated in random order.
U.S. Pat. No. 5,270,712 to B. Iyer, et al, issued Dec. 14, 1993, entitled "Sort Order Preserving Method for Data Storage Compression," assigned to the same assignee as the present application, teaches a sort order preserving method for a ZL type dictionary, and this patent is incorporated by reference herein in its entirety.
FIG. 1 herein shows art prior to U.S. Pat. No. 5,270,712. It shows a dictionary tree based on only the character symbols A, B, and C. It shows the following character strings in the dictionary, listed in sort order: A, AA, AC, B, C, and CB. It shows a code word assigned to each of those strings (code word 1 to A, 2 to AA, etc.) The code words assigned retain the sort order of the individual dictionary strings but result in loss of sort order for longer strings of concatenated dictionary strings. For example, the string AAA (not in the dictionary of FIG. 1) precedes a string ABA (also not in the dictionary) in sort order, but AAA parses, using the FIG. 1 dictionary, as AA and A resulting in code words 2 and 1 (represented as 2,1), while ABA parses as A, B, and A resulting in code words 1, 4, and 1, and, thus, the code word sequence for AAA (2,1) is higher in the sort order than that for ABA (1,4,1).
It should be noted that the method of determining the sort order of code word sequences is the same as for character strings and is as follows: Beginning at the beginning of the sequences (or strings), compare one element (code word) of one sequence to the corresponding element of the other sequence. If the two elements are equal, move on to the next pair of elements and repeat the comparison. If one element is missing because its sequence has ended, or if one element is lower than the other, the sequence having the missing or lower element is lower in the sort order.
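The loss of sort order described for FIG. 1 can be reproduced with a short Python sketch. The code word assignments are those stated above for FIG. 1; the function name parse and the greedy longest-match search over a plain mapping are assumptions for illustration. Python's built-in list comparison follows the same element-by-element rule just described, with a sequence that ends first being lower.

```python
FIG1_CODES = {"A": 1, "AA": 2, "AC": 3, "B": 4, "C": 5, "CB": 6}


def parse(source, codes):
    """Greedy longest-match parse of source into a list of code words."""
    out = []
    pos = 0
    while pos < len(source):
        match = max((s for s in codes if source.startswith(s, pos)), key=len)
        out.append(codes[match])
        pos += len(match)
    return out


aaa = parse("AAA", FIG1_CODES)
aba = parse("ABA", FIG1_CODES)
print(aaa, aba)
print("AAA" < "ABA")    # order of the source strings
print(aaa < aba)        # order of the code word sequences
```

The source strings compare in one direction and their code word sequences in the other, which is the problem that the method of U.S. Pat. No. 5,270,712 addresses.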
FIG. 2 shows the solution to the code word sequence problem, described for FIG. 1, according to the method of U.S. Pat. No. 5,270,712. Code words are not assigned to interior nodes of the tree, only to leaf nodes. Nodes representing end of record (EOR) and other nodes representing a fictitious Zil-symbol (or "Zilch," represented as Z) are added to the tree. An EOR or Zilch is placed as the first child of each interior and root node. If the first-ordered symbol of the source alphabet is present as a child node, then the left-most child is designated as EOR, or otherwise as Zilch. If two adjacent children are assigned source symbols that are non-adjacent in the source alphabet, then these nodes are separated by a new Zilch node. Besides showing a tree, FIG. 2 also shows the meaning of all of the code words when they are encountered by a decoder during an expansion process.
The Zilch holds the proper position of the code word sequence so that, when followed with the next source character, which appears as the first character down from the root of the tree, the code word sequence will retain sort order.
FIG. 3 shows the encoding of the first 39 possible source strings of the A, B, and C alphabet when either the tree of FIG. 1 or the tree of FIG. 2 is used. Commas are used to separate the characters of a source string that are parsed together because of the strings available in each tree, and they are also used to separate the corresponding code words that are generated as a result of the parsing. It can be seen that the FIG. 1 codes do not retain sort order in all cases while those of FIG. 2 do.
U.S. Pat. No. 5,442,350, issued Aug. 15, 1995 to B. Iyer, et al, entitled "Method and Means Providing Static Dictionary Structures for Compressing Character Data and Expanding Compressed Data," assigned to the same assignee as the present application, discloses novel structures of separate static compression and expansion dictionaries, and it discloses compression and expansion processing that uses those structures. U.S. Pat. No. 5,442,350 is herein incorporated by reference in its entirety. Because the preferred embodiment of the present application is attuned to the requirements and capabilities of the separate dictionaries of the preferred embodiment of U.S. Pat. No. 5,442,350, that preferred embodiment will now be described in some detail.
The dictionaries of U.S. Pat. No. 5,442,350 are structured as nodes of a downward growing tree stemming from an imaginary null root entry. The actual 256 topmost entries are children of the root entry, are numbered 0 through 255, and are called alphabet entries, with each alphabet entry representing the character whose code is the number of the entry. Each alphabet entry may be a parent entry (parent) having one or more child entries, with each child representing one or more additional characters. Each of those children may in turn be a parent, etc. The one or more characters represented by each entry are called extension characters since they are extensions on the right of the characters represented by the entries in the path from the subject entry up to the root of the tree.
Compression of an input character string is accomplished by using the value (the code) of the first character of the string as a number to locate the identically numbered alphabet entry, then matching the further characters of the string to the extension characters represented by the descendants of the alphabet entry until the last possible match is found, and then outputting as the compressed data the number, called an index, of the last matching entry. A dictionary has some power-of-2 number of entries, and the index of an entry is a bit string of a length equal to that power (the exponent of 2). For example, the index of an entry in a 4K-entry (4,096-entry) dictionary is 12 bits. With this example dictionary, an input string consisting of some number (the number determined by the last possible match) of eight-bit characters can be compressed to one 12-bit index. However, in the worst case, when the last possible match is on only the alphabet entry, the compressed data is actually larger than the input data since the number of the alphabet entry must be expressed as a 12-bit index.
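The arithmetic in the preceding paragraph can be made concrete with a small Python sketch. The function names are hypothetical, and match_lengths is an assumed list giving the number of eight-bit characters covered by each successive match.

```python
def index_bits(entries):
    """Bits per index for a dictionary with a power-of-2 number of entries."""
    assert entries > 0 and entries & (entries - 1) == 0, "must be a power of 2"
    return entries.bit_length() - 1


def compressed_bits(match_lengths, entries):
    """One index is emitted per match, whatever the match length."""
    return len(match_lengths) * index_bits(entries)


def input_bits(match_lengths):
    return 8 * sum(match_lengths)


print(index_bits(4096))
# Worst case: every match is a single character on an alphabet entry.
print(input_bits([1, 1, 1]), compressed_bits([1, 1, 1], 4096))
# Longer matches: two matches covering five and three characters.
print(input_bits([5, 3]), compressed_bits([5, 3], 4096))
```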
The entries in the dictionary described above, including the alphabet entries, are called character entries. Each entry represents not only one or more extension characters but also the complete character string consisting of the extension characters of the subject entry and the extension characters, concatenated to the left, of all of the ancestors of the subject entry. That character string is called a character symbol. The index of an entry is sometimes called an index symbol. Expansion occurs by taking an index symbol, using it to locate the designated entry, and then outputting the character symbol represented by the entry. The character symbol may be contained in the designated entry, or it may be necessary to proceed upwards through the path of ancestors to collect the extension characters represented by them.
The compression dictionary (as described in U.S. Pat. No. 5,442,350) contains two types of entries: character entry and sibling descriptor entry. The first extension character (EC) represented by any character entry never appears in the entry. For an alphabet entry, the first (and only) EC is implied by the number (index) of the entry. For a nonalphabet entry, the first EC appears as a child character (CC) in the parent of the entry or as a sibling character (SC) in a sibling descriptor that is in the child list (the set of children) of the parent. If an entry represents more than one EC, the ECs after the first are called additional ECs (AECs) and appear in the entry.
A parent contains a child pointer (CPTR) that is the index of its first child. The other children of the same parent, if any, follow the first child contiguously in storage, meaning they have the next higher indexes. If a parent has more children than the number of byte positions available in the parent to contain child characters (CCs), a sibling descriptor follows the last child corresponding to a CC, the sibling descriptor contains the sibling characters (SCs) of the next children, and the next children follow the sibling descriptor. If there are more children than the number of byte positions available in the sibling descriptor to contain SCs, another sibling descriptor follows the children designated by the first sibling descriptor, etc.
The significance of child characters (CCs) and sibling characters (SCs) is that the initial step of attempting to match on a child of a parent can be performed simply by comparing the next character of the input string to each of the CCs in the parent and then the SCs in the sibling descriptors under the parent, without performing storage references to examine the children themselves. However, a child may finally need to be accessed, as will be described. The entries in the compression dictionary (and those in the expansion dictionary) are each eight bytes in length.
FIG. 4 illustrates much of the above description. FIG. 4 is slightly abstract since it does not show count fields and other bits, yet to be described, in the parent entries. FIG. 4 is described as follows. Parent A has children representing the character symbols AAD, ABF, and ACJK. The first extension characters of those children are A, B, and C and appear in the parent as child characters (CCs). AAD has one additional extension character (AEC), D, which appears in the entry. ABF has one AEC, F, which appears in the entry. ACJK has two AECs, JK, that appear in the entry. AAD and ACJK are themselves parents. Entries AADX, AADY, and ACJKZ do not have AECs or children and so contain nothing.
FIG. 5 is a similarly abstract illustration of a sibling descriptor in a child list.
The alphabetical ordering in the figures is only to make the figures easier to read. U.S. Pat. No. 5,442,350 does not require alphabetical ordering.
In the figures to be described next, the brackets around EC or SC indicate that the EC (extension character) or SC (sibling character) may or may not be present, and an ellipsis (three periods) indicates that the preceding field may be repeated.
FIG. 6 shows the three possible actual forms of a character entry in the compression dictionary. Each form begins with a three-bit child-count (CCT) field whose contents indicate the number of child characters (CCs) in the entry and whether a sibling descriptor follows the last child corresponding to a CC. When CCT is zero, the entry can contain zero to four additional extension characters (AECs), with the number of AECs being indicated in the three-bit AEC-count (ACT) field. When CCT is 1, there are a child pointer (CPTR) and one CC in the entry, and the entry can again contain from zero to four AECs. The entry also contains an examine-child bit (X) for the single child. This bit indicates, when one, that if there is a match on the CC, the matching process must be continued by examining (accessing) the child. The bit indicates, when one, that the child has either an AEC or a child, or both.
If there is a match on a CC (or SC) when the corresponding examine-child bit is zero, it is immediately known that the last possible match has been found.
When CCT is greater than 1 in a compression dictionary character entry, the entry can contain zero or one AEC, as indicated by the D (double-character) bit (which is a subset of the ACT field). If D is zero, the entry can contain two to five CCs, and a CCT of 6 indicates both that the entry contains five CCs and that a sibling descriptor follows the fifth child. If D is one, the entry can contain two to four CCs, and a CCT of 5 indicates both that the entry contains four CCs and that a sibling descriptor follows the fourth child. The entry contains an X bit for each of the five possible CCs. The entry also contains two examine-child bits (YY) for the last two children designated by the first sibling descriptor in the child list.
FIG. 7 shows a sibling descriptor. The entry contains a three-bit sibling-count (SCT) field whose contents indicate the number of SCs in the entry and whether another sibling descriptor follows the last child designated by the first sibling descriptor. An SCT of zero indicates that the entry contains seven SCs and that there is another sibling descriptor. The entry contains an examine-child bit (Y) for each of the first five SCs. The Y bits for the last two SCs are in the parent if this is the first sibling descriptor in the child list. If this is not the first sibling descriptor, there are no Y bits for the last two SCs, and the corresponding children must be examined if there is a match on those SCs.
There are two possible forms of a character entry in the expansion dictionary of U.S. Pat. No. 5,442,350, and FIG. 8 shows those two forms. A character entry begins with a three-bit partial-symbol-length (PSL) field whose contents indicate the number of ECs in the entry if the entry does not contain a complete character symbol.
An expansion dictionary character entry is called an unpreceded entry if PSL in it is zero. The entry contains a three-bit complete-symbol-length (CSL) field whose contents indicate the number of ECs in the entry, which can be from one to seven. The entry contains a complete character symbol.
An expansion dictionary character entry is called a preceded entry if PSL is greater than zero. PSL can be from 1 to 5, indicating the number of ECs in the entry. The entry also contains a predecessor pointer (PPTR) and an offset field (OFST). The entry contains either the rightmost or a right-hand part of a character symbol. Given that the next character symbol to be generated by the expansion process is to be placed beginning at a current position in the output area, the ECs from a preceded entry are to be placed at an offset from that current position as indicated by the OFST field in the entry. For example, if the entry contains five ECs and OFST is 255 (its largest possible value), the five ECs are to be placed at an offset of 255 from the current position in the output area (which indicates that the largest possible character symbol is 260 characters). After this placement has been done, the PPTR is used to access the predecessor entry, which may be either another preceded entry or an unpreceded entry. If it is another preceded entry, processing is as for the first one. If it is an unpreceded entry, the ECs in the entry are placed at the current position in the output area. The expansion of an index symbol is concluded when an unpreceded entry has been processed. At this time, if the index symbol designated a preceded entry, the pointer to the current position in the output area is incremented by the sum of the PSL and OFST in that first preceded entry. If the symbol designated an unpreceded entry, the pointer is incremented by the CSL in the entry.
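The processing of preceded and unpreceded entries can be sketched in Python as follows. This is an illustrative model only: the ExpansionEntry class, the use of None in place of a zero PSL, the entry numbers 300 and 301, and the growable output buffer are assumptions, and the eight-byte entry layout is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExpansionEntry:
    chars: bytes                 # the ECs held in the entry
    pptr: Optional[int] = None   # predecessor pointer; None models PSL == 0
    ofst: int = 0                # offset from the current output position


def place(output, start, chars):
    """Store chars at output[start:], growing the buffer if necessary."""
    end = start + len(chars)
    if len(output) < end:
        output.extend(b"\0" * (end - len(output)))
    output[start:end] = chars


def expand_symbol(dictionary, index, output, position):
    """Expand one index symbol; return the new current output position."""
    first = dictionary[index]
    entry = first
    while entry.pptr is not None:            # preceded entry
        place(output, position + entry.ofst, entry.chars)
        entry = dictionary[entry.pptr]
    place(output, position, entry.chars)     # unpreceded entry ends the chain
    if first.pptr is not None:
        return position + len(first.chars) + first.ofst   # PSL + OFST
    return position + len(first.chars)                    # CSL


# Hypothetical entries for the character symbols A, ACJK, and ACJKZ.
EXAMPLE = {
    65: ExpansionEntry(b"A"),
    300: ExpansionEntry(b"CJK", pptr=65, ofst=1),
    301: ExpansionEntry(b"Z", pptr=300, ofst=4),
}
out = bytearray()
pos = expand_symbol(EXAMPLE, 301, out, 0)
pos = expand_symbol(EXAMPLE, 65, out, pos)
print(bytes(out), pos)
```

The rightmost characters are written first and the leftmost last, so a symbol is assembled from right to left without any intermediate buffer.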
The compression and expansion dictionaries each can contain 0.5K, 1K, 2K, 4K, or 8K entries.
Following is additional detail about the matching process in U.S. Pat. No. 5,442,350. After a match has been found on a parent entry, and if the parent has children, the next character of the string is compared in a left-to-right order against the CCs in the parent and the SCs in the sibling descriptors that are among the children of the parent until a match is found or all CCs and SCs have been compared against. If a match is found, the next characters of the string are compared against the AECs, if any, in the child designated by means of the matched CC or SC. If the AECs match, or if there are no AECs, the matching process is repeated using the matched child as the next parent. If there are AECs that do not match the next characters of the string, then, except in one case, the matching process is ended, with the match on the current parent being the last match. The one exceptional case is when the designated child and also the next child are in a set of children designated by means of consecutive identical CCs (not SCs) beginning with the first CC in the parent. In this case, an attempt is made to match on the following children in the set until either a match is found or all CCs for those children have been compared against.
The effect of the operation just described is that it is useful for two or more identical characters to appear as CCs in a parent or SCs in a sibling descriptor under a parent only when the characters are all consecutive CCs beginning with the first CC in the parent. In any case where the identical characters are not consecutive CCs beginning with the first CC in the parent, the second and any subsequent identical CCs or SCs are wasted since they will never be compared against a string character equal to them. The rule that has been described in this paragraph is called the duplicate-CC rule.
The above description of the matching process points out another rule, which is called the AAB-before-AA rule. If parent A has children representing the character symbols AA and AAB, those children must be in the order AAB followed by AA; otherwise, if an input string AAB is matched against, there will always be a match on the AA child, and the AAB child is wasted. The AAB-before-AA rule is illustrated in FIG. 9.
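The matching process, the duplicate-CC rule, and the AAB-before-AA rule can be illustrated with a simplified Python sketch. The Entry class and function names are hypothetical. As a simplifying assumption, every child is treated as being designated by a CC (sibling descriptors are not modeled), and the first extension character is stored in the child rather than in the parent.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    first_ec: str                       # models the CC held in the parent
    aecs: str = ""                      # additional extension characters
    children: list = field(default_factory=list)


def select_child(parent, text, pos):
    """Return the child matched at text[pos:], or None if matching ends."""
    if pos >= len(text):
        return None
    c = text[pos]
    kids = parent.children
    for k, child in enumerate(kids):
        if child.first_ec != c:
            continue
        if text.startswith(child.aecs, pos + 1):
            return child
        if k != 0:
            return None                 # AEC mismatch outside the leading run
        # Exceptional case: try the following identical leading CCs.
        for sibling in kids[1:]:
            if sibling.first_ec != c:
                break
            if text.startswith(sibling.aecs, pos + 1):
                return sibling
        return None
    return None


def longest_match(alphabet_entry, text):
    """Return the character symbol of the last possible match."""
    parent, matched = alphabet_entry, 1
    while True:
        child = select_child(parent, text, matched)
        if child is None:
            return text[:matched]
        matched += 1 + len(child.aecs)
        parent = child


good = Entry("A", children=[Entry("A", "B"), Entry("A")])   # AAB before AA
bad = Entry("A", children=[Entry("A"), Entry("A", "B")])    # AA before AAB
print(longest_match(good, "AAB"), longest_match(good, "AAC"))
print(longest_match(bad, "AAB"))
```

With the children in the wrong order, the AAB child is never reached, which is the waste that the AAB-before-AA rule avoids.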
The compression and expansion processing of U.S. Pat. No. 5,442,350 is performed by an instruction named COMPRESSION CALL (CMPSC). CMPSC uses general registers that contain the addresses and lengths of an input area and an output area. It uses another general register, register 1, that contains the address of a compression or expansion dictionary and a compressed data bit number (CBN). The CBN is a three bit number that designates the next bit to be processed within the next byte of the compressed data operand (since an index symbol can begin and end at any bit position within a byte). It uses another general register, register 0, that contains bits indicating whether compression or expansion is to be performed and the number of entries in the dictionary. When performing either compression or expansion, CMPSC processes the contents of the input area and places the results in the output area. CMPSC processes until either the contents of the entire input area have been processed or the output area has become full. CMPSC in effect processes one input record to produce one output record. CMPSC does not recognize or take any action on account of any special kind of end of record (EOR) indicator. This concludes the description of U.S. Pat. No. 5,442,350.
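The role of the compressed data bit number (CBN) can be illustrated by a Python sketch that packs index symbols of arbitrary width into bytes. The function name is hypothetical, and the leftmost-bit-first ordering within a byte is an assumption for illustration; the sketch does not model the CMPSC registers.

```python
def pack_indices(indices, bits):
    """Pack index symbols contiguously; return (bytes, final bit number)."""
    out = bytearray()
    bit_pos = 0                               # total number of bits written
    for index in indices:
        for k in range(bits - 1, -1, -1):     # most significant bit first
            if bit_pos % 8 == 0:
                out.append(0)                 # start a new output byte
            if (index >> k) & 1:
                out[-1] |= 0x80 >> (bit_pos % 8)
            bit_pos += 1
    # The bit number within the last byte plays the role of the CBN: it
    # designates where the next index symbol would begin.
    return bytes(out), bit_pos % 8


print(pack_indices([0xABC], 12))
print(pack_indices([0xABC, 0xDEF], 12))
```

With 12-bit indices the bit number alternates between 4 and 0, showing why an index symbol can begin and end at any bit position within a byte.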
U.S. Pat. No. 5,323,155 to B. Iyer, et al, issued Jun. 21, 1994, entitled "Semi-Static Data Compression/Expansion Method," assigned to the same assignee as the present application, discloses means for determining when, and signalling to the receiving station that (so adaptation can be stopped at the station), a character extension adaptive dictionary is to be transformed to a static compression dictionary so as to make use of hardware that compresses in accordance with U.S. Pat. No. 5,442,350. It also discloses a symbol translation means that deals with the requirement in U.S. Pat. No. 5,442,350 that some entries in the static compression dictionary cannot be character entries because they must instead be sibling descriptors, which requirement prevents a simple one-to-one transformation of entries in the adaptive dictionary to entries in the static dictionary.
The use of symbol translation by U.S. Pat. No. 5,323,155 is further explained as follows. Assume that the adaptive dictionary has 4K entries. Each of those entries is equivalent to a character entry, that is, it represents an extension character. (The subject adaptive dictionary uses only character extension, not symbol extension.) When the adaptive dictionary is transformed to one that can be used by the CMPSC instruction, the CMPSC compression dictionary most probably must contain some number of sibling descriptors and, therefore, cannot contain 4K character entries if it is only a 4K-entry dictionary. Therefore, the CMPSC compression dictionary must be an 8K-entry dictionary, with only some small number of entries (equal to the number of sibling descriptors) used within the second set of 4K entries.
For use with U.S. Pat. No. 5,323,155, CMPSC has an option (optional function) named the symbol translation option. The symbol translation option is active when an additional bit in general register 0 is one. When the option is active, the offset of the symbol translation table from the beginning of the dictionary whose address is in general register 1 is also in general register 1. When an index symbol is generated by the matching process when symbol translation is active, the symbol is used by CMPSC as an index into the symbol translation table (containing two-byte entries). The contents of the selected entry in the symbol translation table are then output as the code word that would have been output if the adaptive dictionary were still being used. Those contents are called an interchange symbol. This method of symbol translation is shown in FIG. 10. Symbol translation does not affect the expansion process since, in this example, a 4K-entry expansion dictionary can be formed that will translate an interchange symbol to the correct character symbol.
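The table lookup performed by symbol translation can be sketched in Python as follows. The function name and the sample table contents are hypothetical, and big-endian storage of the two-byte entries is an assumption made for illustration.

```python
import struct


def translate(table, index_symbol):
    """Return the interchange symbol stored for the given index symbol."""
    (interchange,) = struct.unpack_from(">H", table, 2 * index_symbol)
    return interchange


# Hypothetical four-entry table; a real table has one two-byte entry for
# each entry of the static compression dictionary.
table = struct.pack(">4H", 10, 11, 12, 13)
print(translate(table, 2))
```

Entries of the static dictionary that are sibling descriptors are never produced as index symbols, so their positions in the table are simply unused.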