A variety of data compression algorithms derive from work published in Ziv, Jacob and Lempel, Abraham, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory 23(3):337-343, May 1977. These algorithms are commonly referred to as LZ77 compression schemes. LZ77 compression schemes are based on the principle that repeated strings of characters can be replaced by a pointer to the earlier occurrence of the string. A pointer is typically represented by an indication of the position of the earlier occurrence (typically an offset from the start of the repeated string) and the number of characters that match (the length). The pointers are typically represented as &lt;offset, length&gt; pairs. For example, the following string
"abcdabcdacdacdacdaeaaaaaa"
may be represented in compressed form by the following
"abcd&lt;4,5&gt;&lt;3,9&gt;ea&lt;1,5&gt;"
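Since the characters "abcd" do not match any previous characters, they are output as raw (literal) characters. The pair &lt;4,5&gt; indicates that the string starting at an offset of 4 and extending for 5 characters, "abcda", is repeated. The pair &lt;3,9&gt; indicates that the string starting at an offset of 3 and extending for 9 characters, "cdacdacda", is repeated. In both pairs the length exceeds the offset, so the repeated string overlaps the characters that the pointer itself produces. The characters "ea" are again output raw, and the pair &lt;1,5&gt; repeats the preceding "a" five times.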
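The expansion of the example above can be checked with a short decoder. The following is a minimal sketch, not a definitive implementation: the function name `lz77_decode` and the token representation (a Python string for raw characters, an `(offset, length)` tuple for a pointer) are assumptions made for illustration. Characters are copied one at a time so that a pointer whose length exceeds its offset can read characters it has just written.

```python
def lz77_decode(tokens):
    """Expand a list of raw strings and (offset, length) pointers.

    Hypothetical token format: a str is raw text, a tuple is a pointer whose
    offset counts backward from the current end of the output.
    """
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.extend(tok)  # raw characters are copied through unchanged
            continue
        offset, length = tok
        if offset < 1 or offset > len(out):
            raise ValueError(f"offset {offset} points outside the output")
        start = len(out) - offset
        # Copy one character at a time: when length > offset the source
        # region overlaps the characters being appended.
        for i in range(length):
            out.append(out[start + i])
    return "".join(out)


tokens = ["abcd", (4, 5), (3, 9), "ea", (1, 5)]
print(lz77_decode(tokens))  # abcdabcdacdacdacdaeaaaaaa
```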
Compression is achieved by representing each repeated string as a pointer that takes fewer bits than repeating the string itself would take. Typically, a single byte is not represented as a pointer. Rather, a single byte is output as a tag bit indicating a single-byte encoding, followed by the byte. A pointer is differentiated from a single-byte encoding by a different tag bit value, followed by the offset and the length. The offset and length can be encoded in a variety of ways.
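One simple way to realize the tag-bit scheme is sketched below. The field widths are assumptions chosen only for illustration (a tag bit of 0 followed by an 8-bit byte for a raw character, a tag bit of 1 followed by a 12-bit offset and a 4-bit length for a pointer), and the names `pack_tokens` and `unpack_tokens` are hypothetical. The output is a string of '0' and '1' characters for readability rather than packed bytes.

```python
def pack_tokens(tokens, offset_bits=12, length_bits=4):
    """Encode raw strings and (offset, length) pointers as a bit string.

    Assumed layout: '0' + 8-bit byte for a raw character,
    '1' + fixed-width offset + fixed-width length for a pointer.
    """
    bits = []
    for tok in tokens:
        if isinstance(tok, str):
            for ch in tok:
                if ord(ch) > 255:
                    raise ValueError("raw characters must fit in one byte")
                bits.append("0" + format(ord(ch), "08b"))
        else:
            offset, length = tok
            if not 0 < offset < (1 << offset_bits):
                raise ValueError("offset does not fit in the offset field")
            if not 0 < length < (1 << length_bits):
                raise ValueError("length does not fit in the length field")
            bits.append("1" + format(offset, f"0{offset_bits}b")
                        + format(length, f"0{length_bits}b"))
    return "".join(bits)


def unpack_tokens(bits, offset_bits=12, length_bits=4):
    """Inverse of pack_tokens; each raw character becomes its own token."""
    tokens = []
    pos = 0
    while pos < len(bits):
        tag = bits[pos]
        pos += 1
        if tag == "0":
            tokens.append(chr(int(bits[pos:pos + 8], 2)))
            pos += 8
        else:
            offset = int(bits[pos:pos + offset_bits], 2)
            pos += offset_bits
            length = int(bits[pos:pos + length_bits], 2)
            pos += length_bits
            tokens.append((offset, length))
    return tokens


packed = pack_tokens(["abcd", (4, 5), (3, 9), "ea", (1, 5)])
print(len(packed))  # 105 bits, versus 25 * 8 = 200 bits uncompressed
```

Under these assumed widths a raw character costs 9 bits and a pointer costs 17 bits, so a pointer only saves space when it replaces at least two characters, which is consistent with not encoding single bytes as pointers.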
The efficiency of a compression technique can be measured by the time it takes to compress the data and the time it takes to decompress the data. Another measure of efficiency is the amount of compression actually achieved. Generally, speed of compression and amount of compression trade off against each other: achieving greater compression typically takes more time. Various compression techniques have been developed in attempts to reach an optimal balance between time and amount of compression for a given situation. The book "Text Compression" by Bell, Cleary, and Witten, published by Prentice Hall, provides an overview of various text compression techniques and is hereby incorporated by reference. The time of compression for LZ77-based schemes depends primarily on the time needed to determine whether a given string is a repeat of an earlier string, generally referred to as string searching. It would be desirable to have a data compression system that minimizes the time of compression while still achieving a considerable amount of compression. It would also be desirable to have an encoding scheme that minimizes the number of bits needed for the offset and length encodings.
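To illustrate why string searching dominates compression time, the following sketch performs the search by brute force: at every position it compares the upcoming characters against every earlier starting point in a window. This is a deliberately naive illustration, not the scheme any particular system uses; the names `longest_match` and `lz77_compress` and the parameters `window`, `max_length`, and `min_length` are assumptions introduced for the example.

```python
def longest_match(data, pos, window=4096, max_length=15):
    """Brute-force search for the longest earlier match of data[pos:].

    Returns (offset, length); length is 0 when nothing matches.
    The match may run past pos, i.e. overlap the string being encoded.
    """
    best_offset, best_length = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        while (length < max_length
               and pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_length:
            best_offset, best_length = pos - cand, length
    return best_offset, best_length


def lz77_compress(data, min_length=2, window=4096, max_length=15):
    """Greedy compressor: emit a pointer when the match is long enough,
    otherwise emit a single raw character."""
    tokens = []
    pos = 0
    while pos < len(data):
        offset, length = longest_match(data, pos, window, max_length)
        if length >= min_length:
            tokens.append((offset, length))
            pos += length
        else:
            tokens.append(data[pos])
            pos += 1
    return tokens


print(lz77_compress("abcdabcdacdacdacdaeaaaaaa"))
# ['a', 'b', 'c', 'd', (4, 5), (3, 9), 'e', 'a', (1, 5)]
```

Each call to `longest_match` does work proportional to the window size times the match length, so the total cost grows quickly with the input. Reducing that search cost without giving up long matches is the balance between time and amount of compression described above.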