1. Field of Invention
This invention relates to data storage, specifically to an improved data compression method.
2. Description of Prior Art
Data compression gives data processing systems several performance advantages over non-compressed systems:
1. It allows larger data sets to be contained entirely in main memory. This allows faster processing than systems that must access the disk.
2. It allows a task to be performed while processing fewer bytes. This further speeds processing.
3. It provides a more compact form for archival, transmission, or reading/writing between memory and disk.
Previous patents have described variants of a hierarchical compression scheme. It is first necessary to describe the approaches used in the prior art. FIGS. 1 and 2 illustrate the data structures of a scheme that represents features common to U.S. Pat. Nos. 5,023,610 (1991), 5,245,337 (1993), 5,293,164 (1994), 5,592,667 (1994), 5,966,709 (1999), and 5,983,232 (1999).
FIG. 1 shows a set of records to be compressed. Each record has four fields: City, First Name, Last Name, and Shoe Size. These fields can be considered to be four parallel data input sequences. Each sequence is an ordered set of values for one field over the record set.
FIG. 2 shows the tree structure used in prior art to represent this record set. At the bottom of FIG. 2, the “leaves” of the tree are dictionaries (50) that each correspond to one field in the record. A dictionary contains one entry for each unique value of the corresponding field. Each entry consists of the unique value and a count of the number of times that value occurred in the field's sequence of input values. For example, in the City dictionary, the value of one entry is “Detroit” and its count is 6.
When a value is encountered that was previously seen in the input sequence, it is not added to the dictionary. Instead, the count associated with that value in the dictionary is incremented.
A token is the (zero-based) order of a value in a dictionary. A token uniquely identifies a value in a dictionary. For example, the tokens 0 and 1 identify the values “John” and “Bill” respectively, in the First-Name dictionary in FIG. 2.
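The dictionary-building and token-assignment steps described above can be sketched as follows. This is an illustrative reading of the prior-art scheme, not code from the cited patents; the class and method names are hypothetical.

```python
# Hypothetical sketch of a leaf-dictionary build: values receive zero-based
# tokens in first-seen order, and repeated values increment a count.
class Dictionary:
    def __init__(self):
        self.values = []   # token -> value (the token is the list index)
        self.counts = []   # token -> number of occurrences
        self.tokens = {}   # value -> token

    def add(self, value):
        """Return the token for value, adding a new entry on first sight."""
        if value in self.tokens:
            tok = self.tokens[value]
            self.counts[tok] += 1
        else:
            tok = len(self.values)
            self.tokens[value] = tok
            self.values.append(value)
            self.counts.append(1)
        return tok

first_names = Dictionary()
for name in ["John", "Bill", "Bill", "John"]:
    first_names.add(name)
# first_names.values is ["John", "Bill"]: tokens 0 and 1, as in FIG. 2.
```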
The nonleaf nodes (“interior nodes”) (51) in FIG. 2 represent tuples of tokens from lower (leaf or interior) nodes. Here, for simplicity, the tuples are all pairs of tokens, each consisting of a left and a right member. (Higher-order tuples could also be used.)
Each interior node here maintains a list of token pairs from its left and right child nodes. For example, the interior node above the Last-Name and Shoe-Size leaves (52) contains pairs of tokens from these fields' dictionaries on the left and right. These token pairs are in the order their corresponding values were first seen in the Last-Name and Shoe-Size input sequences.
For example, the first left/right pair in (52) is (0, 0). This denotes token 0 from the LastName dictionary and token 0 from the Shoe-Size dictionary. This stands for the values “Smith” and “9” for the Last-Name and Shoe-Size fields in the first record.
Likewise, the second left/right pair in (52), (1, 1), stands for the value pair, (“Doe”, “8”), in the second record. Each pair has a count of 1. If a token pair is the same as one recorded earlier, a new entry is not made. Instead, the count for that pair is incremented.
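An interior node can be built with the same first-seen-order bookkeeping, applied to (left, right) token pairs rather than field values. The following is a minimal sketch of that reading; the function name and data layout are hypothetical, not taken from the cited patents.

```python
# Illustrative sketch: an interior node applies the same first-seen-order
# bookkeeping as a leaf dictionary, but to (left, right) token pairs.
def build_interior(left_tokens, right_tokens):
    """Return (pairs, counts, output_tokens) for two parallel token streams."""
    pairs, counts, index = [], [], {}
    for pair in zip(left_tokens, right_tokens):
        if pair in index:                 # pair seen before: bump its count
            counts[index[pair]] += 1
        else:                             # new pair: next token in order
            index[pair] = len(pairs)
            pairs.append(pair)
            counts.append(1)
    output = [index[p] for p in zip(left_tokens, right_tokens)]
    return pairs, counts, output

# Last-Name and Shoe-Size tokens for the first two records:
pairs, counts, out = build_interior([0, 1], [0, 1])
# pairs is [(0, 0), (1, 1)], each with count 1; output tokens are [0, 1].
```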
Each unique left/right token pair in an interior node is also assigned a token, representing the order in which that pair was first encountered from the left and right child nodes. For example, the tokens 0 and 1 in (55) correspond to the left/right token pairs, (0, 0) and (0, 1).
Likewise, the root node, (56), represents unique token pairs from nodes (52) and (55), in the order they were first encountered. The root node (56) represents every unique record in the tree. For example, to reconstruct the third record, we look at the third token (token 2) in the root node (56). This has left and right values of 1 and 2.
We look up token 1 in the root's left child (55) and get left and right values of 0 and 1. These are tokens for values in the City and First-Name dictionaries, respectively, which are “Plymouth” and “Bill”, from the third record. A similar lookup with token 2 in interior node (52) gives “Smith” and “7” for the rest of the record.
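The reconstruction walk just described can be sketched as a recursive lookup from the root. The node layout and the small token tables below are hypothetical stand-ins consistent with the lookups described above, not data taken from the cited patents.

```python
# Hypothetical node layout: a leaf holds a "values" list; an interior node
# holds a "pairs" list plus references to its left and right children.
def reconstruct(node, token):
    """Return the tuple of leaf values reachable from `token` at `node`."""
    if "values" in node:                          # leaf dictionary
        return (node["values"][token],)
    left_tok, right_tok = node["pairs"][token]    # interior node
    return (reconstruct(node["left"], left_tok)
            + reconstruct(node["right"], right_tok))

# Toy tables consistent with the third-record lookup described above:
city    = {"values": ["Plymouth"]}
first   = {"values": ["John", "Bill"]}
last    = {"values": ["Smith", "Doe"]}
shoe    = {"values": ["9", "8", "7"]}
node_55 = {"pairs": [(0, 0), (0, 1)], "left": city, "right": first}
node_52 = {"pairs": [(0, 0), (1, 1), (0, 2)], "left": last, "right": shoe}
root_56 = {"pairs": [(0, 0), (1, 1), (1, 2)], "left": node_55, "right": node_52}

# Token 2 at the root yields the third record:
# reconstruct(root_56, 2) == ("Plymouth", "Bill", "Smith", "7")
```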
Note that all the counts for token pairs at the root node are 1. Also note that there are consecutive runs in which the left and right token numbers are each one more than the left and right token numbers of the previous entry. This can be seen for tokens 1 through 10 in the root node (56). For any given token in this range, we can obtain the left and right pair of the next token in the range by adding one to each element of the given token's pair.
For example, in node (56), the left/right pair for token 5 (4, 5) has each element one more than the left/right pair for token 4 (3, 4).
Exhaustively representing all the token pairs in a sequence with such a regular pattern wastes a considerable amount of space.
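The redundancy noted above suggests a run-length treatment: a run in which both pair elements advance by one per entry can be summarized by its starting token, starting pair, and length. The following sketch merely detects such runs; it is an illustration of the observation, not a method from the cited patents.

```python
# Sketch: collapse consecutive entries whose left and right elements each
# advance by one into (start_index, start_pair, run_length) triples.
def find_runs(pairs):
    runs = []
    i = 0
    while i < len(pairs):
        j = i
        # Extend the run while both elements step by exactly one.
        while (j + 1 < len(pairs)
               and pairs[j + 1][0] == pairs[j][0] + 1
               and pairs[j + 1][1] == pairs[j][1] + 1):
            j += 1
        runs.append((i, pairs[i], j - i + 1))
        i = j + 1
    return runs

# An initial pair followed by ten incrementing pairs collapses to two runs:
runs = find_runs([(0, 0)] + [(k, k + 1) for k in range(10)])
# runs == [(0, (0, 0), 1), (1, (0, 1), 10)]
```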
Also if other trees are constructed to represent subsets of the data in the first tree, the dictionary values used must be duplicated in the leaves of these other trees. This redundancy also wastes a considerable amount of space. It also wastes the processing time it takes to duplicate the dictionaries.
U.S. Pat. No. 5,966,709 (1999) described a method of optimizing said tree structure. Said method used a variant of the Huffman algorithm, which can produce sub-optimal tree designs when the value function is complex or nonmonotonic. Said method also calculates the exact size of a parent node by counting the tuples formed when the child nodes are joined, which is computationally expensive.
U.S. Pat. Nos. 5,023,610 (1991), 5,245,337 (1993), and 5,293,164 (1994) described the compression of a single stream of data, while this invention describes the compression of multiple parallel sequences of data.