Compression techniques, such as some variations of Lempel-Ziv (LZ), reduce an input string into a self-referencing “dictionary” where a second occurrence of a string of characters is replaced by a copy command referencing a prior occurrence of the string as a number of characters (i.e., a length) at a certain offset prior to a current position. For example, the string “bcdezbcde” could be compressed to the string “bcdez” followed by a command to copy a length of 4 characters starting at an offset 5 characters previous to the current position (i.e., “Copy(4,5)”). If the “Copy(4,5)” instruction takes fewer bits to output than the corresponding string of characters, the output is more compressed than the input.
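The copy command can be illustrated with a minimal decoder sketch. The token representation here (a literal string, or a `(length, offset)` tuple standing in for “Copy(length, offset)”) and the name `expand` are assumptions made for this example, not part of any particular LZ format.

```python
from typing import List, Tuple, Union

# A token is either a run of literal characters or a (length, offset) copy command.
Token = Union[str, Tuple[int, int]]


def expand(tokens: List[Token]) -> str:
    """Rebuild the original string from literals and copy commands."""
    out: List[str] = []
    for tok in tokens:
        if isinstance(tok, str):
            # Literal characters are emitted unchanged.
            out.extend(tok)
        else:
            length, offset = tok
            if offset < 1 or offset > len(out):
                raise ValueError("offset points before the start of the output")
            start = len(out) - offset
            # Copy one character at a time so a copy may overlap its own output.
            for i in range(length):
                out.append(out[start + i])
    return "".join(out)


print(expand(["bcdez", (4, 5)]))  # prints "bcdezbcde"
```

Copying one character at a time lets a command reference characters that the same command has just produced, which is how a short repeated pattern can expand into a long run.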
The result of compression is sometimes called a “bitstream” because there are no fixed character boundaries in the result, and the quantities encoded in the bitstream are often variable in size. A bitstream can be generated in one or more passes. For example, a first pass can create a hash table indicating all occurrences of three-character or four-character strings in the input string. The hash table allows matches to be found more quickly. A second (compression) pass uses the hash table to find the best “matches” earlier in the input string for any repeated sub-strings. Based on the matches, some of the characters (i.e., repeats of earlier sub-strings) are replaced with copy instructions, creating the self-referencing dictionary. A third pass then turns the remaining characters and the copy instructions into an output bitstream. Between the second and third passes, statistics are used to determine how best to generate the output bitstream (generally, “best” means using the fewest bits). For example, the set of characters used in the input can be compressed in a number of ways. Similarly, knowledge of the distribution of lengths and offsets in the copy instructions can be used to compress those values.
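The first two passes and the statistics step can be sketched as follows. This is a simplified illustration under stated assumptions: the names `build_hash_table`, `find_matches`, and `gather_statistics` are hypothetical, a plain dictionary keyed by three-character strings stands in for the hash table, and the match search is a simple greedy longest-match rather than any particular production strategy. The third pass (bit-level output) is omitted.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple, Union

Token = Union[str, Tuple[int, int]]  # literal character, or (length, offset)
MIN_MATCH = 3  # assumed minimum match length for this sketch


def build_hash_table(data: str) -> Dict[str, List[int]]:
    """Pass 1: record every position at which each 3-character string occurs."""
    table: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(data) - MIN_MATCH + 1):
        table[data[i:i + MIN_MATCH]].append(i)
    return table


def find_matches(data: str, table: Dict[str, List[int]]) -> List[Token]:
    """Pass 2: replace repeats with (length, offset) copies, greedily."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(data):
        best_len, best_off = 0, 0
        for cand in table.get(data[pos:pos + MIN_MATCH], []):
            if cand >= pos:
                break  # positions are stored in ascending order
            length = 0
            while pos + length < len(data) and data[cand + length] == data[pos + length]:
                length += 1
            if length >= best_len:  # on ties, prefer the nearer occurrence
                best_len, best_off = length, pos - cand
        if best_len >= MIN_MATCH:
            tokens.append((best_len, best_off))
            pos += best_len
        else:
            tokens.append(data[pos])
            pos += 1
    return tokens


def gather_statistics(tokens: List[Token]) -> Tuple[Counter, Counter, Counter]:
    """Between passes 2 and 3: count literals, lengths, and offsets."""
    literals, lengths, offsets = Counter(), Counter(), Counter()
    for tok in tokens:
        if isinstance(tok, str):
            literals[tok] += 1
        else:
            lengths[tok[0]] += 1
            offsets[tok[1]] += 1
    return literals, lengths, offsets


data = "bcdezbcde"
tokens = find_matches(data, build_hash_table(data))
print(tokens)  # ['b', 'c', 'd', 'e', 'z', (4, 5)]
```

The counters returned by `gather_statistics` are the kind of information a third pass could use when deciding how many bits to spend on each character, length, and offset.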
While the characters themselves can be compressed easily and efficiently (such as by using tables based on known statistical occurrences of characters), the lengths and offsets also need to be compressed efficiently. Table-based approaches are very data-specific and do not produce good results across a wide range of data patterns. Even the conventional approach of using multiple tables and selecting the best result can be sub-optimal.
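The multiple-table approach mentioned above can be sketched as follows. The three tables, their names, and their code lengths are invented purely for illustration (each maps a match length to an assumed number of bits) and do not come from any real compression format; `pick_best_table` is likewise a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Hypothetical code-length tables: match length -> number of bits to encode it.
TABLES: Dict[str, Dict[int, int]] = {
    "short_biased": {3: 1, 4: 2, 5: 3, 6: 4, 7: 5, 8: 5},
    "flat": {v: 3 for v in range(3, 9)},
    "long_biased": {3: 5, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1},
}


def cost_in_bits(values: List[int], table: Dict[int, int]) -> int:
    """Total bits needed to encode all values with one table."""
    return sum(table[v] for v in values)


def pick_best_table(values: List[int], tables: Dict[str, Dict[int, int]]) -> Tuple[str, int]:
    """Try every table and keep the one with the lowest total cost."""
    costs = {name: cost_in_bits(values, t) for name, t in tables.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


print(pick_best_table([3, 3, 4, 3, 5], TABLES))  # ('short_biased', 8)
print(pick_best_table([3, 8, 3, 8], TABLES))     # ('short_biased', 12)
```

In the second call all three hypothetical tables cost 12 bits, so selecting the “best” one gains nothing. Data that mixes very short and very long lengths fits none of the fixed tables well, which is the kind of limitation the paragraph above describes.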
It would be desirable to implement optimized bitstream encoding for compression.