In computer systems, translating data is a frequent and often time-consuming task. Data can be natural language text, a program, an index to a database, a set of numbers, or a representation of a physical phenomenon, for example, an image. Translating data may consume a large amount of computer resources such as memory and processing cycles. During translation, data can be compressed. It is well known that compressing data can reduce the amount of computer resources consumed.
Many techniques are known for "lossless" data compression. Lossless means that the original data is bit-by-bit fully recoverable from the compressed data. However, most compression techniques do not preserve the ordering of the data, that is, comparing two compressed items does not necessarily give the same result as comparing the original items. There are important reasons for wanting data compression that is order preserving. Order preserving compression techniques facilitate both sorting and searching.
For example, key sorting is a method where one extracts the sort value, i.e., the key, of each record of a database, and stores the key with a record pointer or address in an index as a set of character strings. The strings are sorted according to the value of the keys. The sorted index can be used to retrieve the records in the sorted order. This is much more efficient than sorting the entire records.
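Key sorting as described above can be sketched as follows. This is a minimal illustration only; the record layout, the sample data, and the name build_sorted_index are assumptions made for the example and are not taken from any particular database system.

```python
from typing import List, Tuple

# Hypothetical records: (name, city, balance). The sort key is the name field.
records = [
    ("miller", "Boston", 120),
    ("adams", "Denver", 75),
    ("zhou", "Austin", 310),
    ("baker", "Boston", 40),
]

def build_sorted_index(recs) -> List[Tuple[str, int]]:
    """Extract (key, record position) pairs and sort them by key only."""
    index = [(rec[0], pos) for pos, rec in enumerate(recs)]
    index.sort(key=lambda entry: entry[0])
    return index

# Only the small (key, pointer) pairs are moved during the sort.
index = build_sorted_index(records)

# The sorted index is then used to retrieve the full records in key order.
in_order = [records[pos] for _, pos in index]
```

Because only the keys are compared and moved, any reduction in key size translates directly into cheaper comparisons and moves, provided the reduced keys still compare in the same order.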
In key sorting, order preserving data compression can reduce the size of the keys, and can speed up comparisons and moves. There is an especially large payoff for compressing multi-field keys which are frequently extended to a fixed length with blank padding characters.
Arithmetic compression could be used for order preserving data compression. Arithmetic compression is based on knowing the probabilities of the data to be encoded. Arithmetic compression works by adding cumulative probabilities to the calculated results of prior encodings. The probability calculations preserve ordering when the cumulative probabilities are based on data organized in a sorted order. Arithmetic encoding techniques require slow-to-execute arithmetic instructions, and typically operate on the data to be compressed one byte at a time. However, in key sorting, where every key needs to be compressed prior to sorting, one needs a compression technique which works at a very high speed.
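The order preserving behavior of arithmetic compression can be illustrated with a small sketch. It is an assumption-laden toy, not a practical coder: it uses exact rational arithmetic instead of fixed-precision integers, the symbol probabilities in PROBS are invented, and "\0" is a hypothetical end-of-key marker chosen to sort before every letter. The names cumulative and encode are likewise illustrative.

```python
from fractions import Fraction

# Assumed symbol probabilities. "\0" is a hypothetical end-of-key marker
# that sorts before every letter.
PROBS = {
    "\0": Fraction(1, 4),
    "a": Fraction(1, 4),
    "b": Fraction(1, 4),
    "c": Fraction(1, 4),
}

def cumulative(probs):
    """Map each symbol to the total probability of all smaller symbols."""
    total, table = Fraction(0), {}
    for sym in sorted(probs):  # sorted symbol order is what preserves ordering
        table[sym] = total
        total += probs[sym]
    return table

CUM = cumulative(PROBS)

def encode(key):
    """Return the lower bound of the arithmetic-coding interval for key."""
    low, width = Fraction(0), Fraction(1)
    for sym in key + "\0":
        low += width * CUM[sym]   # add the scaled cumulative probability
        width *= PROBS[sym]       # narrow the interval for the next symbol
    return low

keys = ["cab", "a", "abc", "b", "ab", ""]
encoded_order = sorted(keys, key=encode)
```

Sorting the keys by their encoded values gives the same sequence as sorting the original strings, because the cumulative table is built over the symbols in sorted order. The multiplication and addition performed for every symbol is the kind of per-byte arithmetic that makes this approach slow.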
Dictionary approaches, which do simple look-ups on units, e.g., tokens, of multiple symbols, can be much faster than arithmetic compression techniques. Furthermore, a dictionary approach can be very closely tailored to a specific data compression problem. For example, one might compress only specific sequences of data for which there are dictionary entries.
For ordered data, it is important that the data compression technique be "static" and order preserving. In non-static or adaptive compression, an encoding dictionary can be built dynamically, detecting and adapting to localized frequency patterns as the data are compressed. Non-static compression can be used for large natural language texts, where there is no requirement to compare the compressed text with other compressed texts.
In contrast, if it is desired to compress many small sets of data, such as index keys, the compressed form of the data needs to deliver the same result upon comparison as the original uncompressed data. Therefore, the encoding dictionary must be built once, and remain unchanged, i.e., static. In this case, adaptation is not possible because no individual property of the data can influence the structure of the encoding dictionary.
Only a static compression technique will preserve the order over time, which is important in both searching and sorting. This rules out many of the most powerful adaptive techniques, such as those based on the well-known Ziv-Lempel method; see J. Ziv and A. Lempel, "Compression of Individual Sequences Via Variable-Rate Coding", IEEE Trans. Information Theory, IT-24, 5 (Sept. 1978) 530-536. Such methods are more attuned to the compression of large natural language texts.
Also, the need to preserve order eliminates many dictionary techniques such as Huffman encoding; see D. Knuth, "The Art of Computer Programming", Vol. 1, Fundamental Algorithms, and Vol. 3, Sorting and Searching, Addison Wesley (1973) Reading, Mass.
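A short sketch shows why Huffman encoding generally fails to preserve order. The frequencies below are invented for illustration, and huffman_codes is a textbook-style construction written for this example, not code from the cited references.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build Huffman codes by repeatedly merging the two lightest subtrees."""
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(w, next(tie), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next(tie), merged))
    return heap[0][2]

# Assumed frequencies: "b" is by far the most common symbol.
freqs = {"a": 1, "b": 5, "c": 1, "d": 3}
codes = huffman_codes(freqs)

# Order of the symbols when sorted by their code words.
by_code = sorted(freqs, key=codes.get)
order_preserved = by_code == sorted(freqs)
```

With these frequencies the most common symbol, "b", receives the shortest code word, so its code sorts ahead of the code for "a" and order_preserved is false. Huffman's construction is free to place any symbol anywhere in the tree, which is exactly the freedom an order preserving code cannot use.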
One systematic method of performing order preserving data compression is the so-called Hu-Tucker method, see Knuth cited above. In the Hu-Tucker method, a translation dictionary is built, usually in the form of an optimal weighted binary tree, where the weights of the nodes of the tree are based on the frequency of occurrence of the dictionary entries. The entries constitute a "dictionary" of tokens to be compressed. The weighted nodes cannot be re-arranged arbitrarily because the order of the compressed forms needs to be the same as the order of the original entries.
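The ordering constraint can be illustrated with a simplified sketch. Note that this is not the Hu-Tucker method: it substitutes a simpler weight-balanced splitting heuristic, which yields an order preserving binary tree but not necessarily an optimal one. The function name alphabetic_codes, the entries, and the weights are assumptions made for the example.

```python
def alphabetic_codes(entries, weights):
    """Assign order-preserving bit strings by weight-balanced splitting.

    entries must already be in sorted order; weights are assumed
    frequencies of occurrence, one per entry.
    """
    codes = {}

    def split(lo, hi, prefix):
        if hi - lo == 1:
            codes[entries[lo]] = prefix or "0"
            return
        total = sum(weights[lo:hi])
        # Pick the cut between adjacent entries that best balances the
        # weight on the two sides; entries are never reordered.
        best_cut, best_diff, running = lo + 1, None, 0
        for cut in range(lo + 1, hi):
            running += weights[cut - 1]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = cut, diff
        split(lo, best_cut, prefix + "0")
        split(best_cut, hi, prefix + "1")

    split(0, len(entries), "")
    return codes

entries = ["a", "b", "c", "d"]
weights = [1, 5, 1, 3]
codes = alphabetic_codes(entries, weights)
sorted_by_code = sorted(entries, key=codes.get)
```

Because each split only cuts the sorted list into a left part and a right part, the code words compare in the same order as the entries themselves. The price is that a frequent entry cannot be moved to a shallow position if doing so would disturb the order.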
However, the Hu-Tucker method also has limitations. The Hu-Tucker method does not address the problem of how to parse the data. This problem needs to be solved so as to permit correct ordering of dictionary entries. In particular, one needs to solve the problem of ordering entries when one entry is a "prefix" of another entry.
Because of limitations during the decoding of Hu-Tucker compressed data, the entries in the Hu-Tucker dictionary are required to observe the "prefix property". The prefix property holds that no entry of the dictionary can be a prefix of any other entry of the dictionary. Effective and flexible data compression can be better performed if the prefix property is not required for the encoded data.
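The prefix property itself is simple to test, as the following sketch suggests. The function name has_prefix_property and the sample dictionaries are hypothetical and serve only to make the definition concrete.

```python
def has_prefix_property(entries):
    """Return True if no entry is a prefix of any other entry.

    After sorting, an entry that is a prefix of some other entry is
    always a prefix of its immediate successor, so it is enough to
    check adjacent pairs.
    """
    ordered = sorted(set(entries))
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))

# Hypothetical dictionaries of tokens.
ok_dictionary = ["ab", "ac", "b"]
bad_dictionary = ["a", "the", "then"]   # "the" is a prefix of "then"

ok_result = has_prefix_property(ok_dictionary)
bad_result = has_prefix_property(bad_dictionary)
```

A dictionary of natural tokens such as "the" and "then" fails this test, which shows how restrictive the prefix property is for dictionaries tailored to real data.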
Therefore, there is a need for a translating method which can be used for order preserving data compression. It would be advantageous if the translating method can also be applied to traditional non-order preserving compression techniques. Further advantages can be gained if the same translating method can be used for encoding and decoding data.