Computing devices use various data compression techniques to reduce the number of bytes needed to represent data, thereby using less memory and other computing device resources to store, process, maintain, and/or communicate the data. Conventional data compression techniques may be inefficient from a processing resources standpoint and/or may be unreliable at finding data matches (e.g., repeated byte sequences) to compress the data. For example, a key challenge for any LZ77-based compression implementation, such as LZX and LZMA, is to efficiently and reliably find the data matches that produce the smallest compressed data.
Various LZ77 compression algorithms attempt to determine repeated byte sequences and encode each match with a (distance, length) pair. The distance indicates how far back in the buffer, in bytes, the matching data begins, and the length indicates the number of data bytes that match. As a compression algorithm processes a buffer from beginning to end, at each position, the possible matches are the byte sequences from earlier in the buffer that are the same as the bytes at the current position of the buffer. Shorter distances back into the buffer can be encoded with fewer bits, while longer lengths cover more data. To achieve a good compression ratio, an algorithm should be able to enumerate the shortest distance for each possible length, for each position in the buffer. To be fast, the algorithm should not expend time enumerating matches that are not the shortest distance for their length. For example, at some position in a buffer, the full set of possible matches might be (distance=50, length=3), (100, 4), (120, 3), (150, 4), and (200, 5). The algorithm would only enumerate (50, 3), (100, 4), and (200, 5), because the other two matches, (120, 3) and (150, 4), are superseded by matches that are at least as long but closer in distance: (50, 3) supersedes (120, 3), and (100, 4) supersedes (150, 4). In terms of optimization, the algorithm should quickly enumerate the Pareto frontier of matches, where the two optimization criteria are longer lengths and shorter distances.
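The filtering in the example above can be sketched in a few lines of Python. This is only an illustration of the Pareto-frontier criterion, not a match finder from any LZ77 implementation; the function name pareto_matches and the representation of matches as (distance, length) tuples are assumptions made for the sketch.

```python
def pareto_matches(matches):
    """Keep only the matches that are the shortest distance for their length.

    `matches` is an iterable of (distance, length) tuples (a hypothetical
    representation chosen for this sketch).
    """
    frontier = []
    best_length = 0
    # Visit candidates from nearest to farthest; on equal distance, longest first.
    for distance, length in sorted(matches, key=lambda m: (m[0], -m[1])):
        # A farther match is only worth keeping if it is strictly longer
        # than every closer match already kept.
        if length > best_length:
            frontier.append((distance, length))
            best_length = length
    return frontier


candidates = [(50, 3), (100, 4), (120, 3), (150, 4), (200, 5)]
print(pareto_matches(candidates))  # [(50, 3), (100, 4), (200, 5)]
```

This sketch filters a complete candidate list after the fact, which is exactly the work a fast match finder should avoid: ideally the superseded matches are never examined at all.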
The LZX algorithm uses a splay tree to determine compression matches and solve this match-finding problem. Splay trees are self-adjusting binary search trees in which a newly inserted element is moved to the root. This provides the property that the most recent, and therefore shortest-distance, matches are encountered first when the algorithm searches the tree to determine the matches. However, the algorithm performs poorly if the tree becomes unbalanced, such as when strings are inserted in alphabetical order, and in practice, the LZX algorithm scales poorly to large match histories.
The LZMA algorithm can use variants of hash chains, binary trees, and Patricia tries to determine compression matches and solve the same problem. There are also space-efficient tree implementations that can solve the problem if they are modified to record the most recently inserted data string at each node of the tree. However, these techniques traverse the tree structure from the root down to the lower-level nodes, and they are suboptimal when the most recent match is also a long match.
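For orientation, a simplified hash-chain-style match finder can be sketched as follows. It is not the LZMA implementation: the names find_matches, MIN_MATCH, and max_chain are hypothetical, a Python dictionary keyed by the first three bytes stands in for a real hash table, and the chain-length limit is an assumed tuning parameter. Earlier positions are visited newest first, so distances increase during the walk and a candidate is reported only when it is longer than every closer match.

```python
from collections import defaultdict

MIN_MATCH = 3  # hypothetical minimum match length for this sketch


def find_matches(data: bytes, max_chain: int = 64):
    """Map each position to its (distance, length) matches, nearest first."""
    chains = defaultdict(list)  # 3-byte prefix -> earlier positions, oldest first
    result = {}
    for pos in range(len(data) - MIN_MATCH + 1):
        key = data[pos:pos + MIN_MATCH]
        best_len = MIN_MATCH - 1
        found = []
        # Walk the most recent candidates first so that distances increase.
        for cand in reversed(chains[key][-max_chain:]):
            length = 0
            while pos + length < len(data) and data[cand + length] == data[pos + length]:
                length += 1
            # Report only matches longer than every closer match.
            if length > best_len:
                found.append((pos - cand, length))
                best_len = length
        chains[key].append(pos)
        if found:
            result[pos] = found
    return result


print(find_matches(b"abcd_abc_abcd")[9])  # [(4, 3), (9, 4)]
```

Even in this sketch, every candidate on the chain is compared byte by byte against the current position, including candidates that turn out to be superseded, which is the kind of wasted work described above.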