Traditional entropy encoding compression algorithms (such as Huffman coding, adaptive Huffman coding or arithmetic coding) depend on having a statistical model of the input stream they are compressing. The more accurately the model represents the actual statistical properties of symbols in the input stream, the better the algorithm is able to compress the stream. Loosely speaking, the model is used to make a prediction about what input symbol will come next in the input stream. For example, if the input stream is English-language text, the model would assign a higher probability to the letter ‘e’ than to the letter ‘Q’ (usually). This corresponds physically to making the ‘e’ segment of the pin longer than the ‘Q’ segment.
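To make the role of the model concrete, the following minimal sketch builds a static model by relative frequency and computes the ideal code length, -log2(p), for a symbol. The function names `build_static_model` and `ideal_code_length` and the sample sentence are hypothetical, chosen only for illustration; a practical compressor would train on far more text.

```python
import math
from collections import Counter

def build_static_model(sample):
    """Estimate each symbol's probability as its relative frequency in the sample."""
    counts = Counter(sample)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}

def ideal_code_length(model, symbol):
    """Ideal (possibly fractional) number of bits for a symbol of probability p."""
    return -math.log2(model[symbol])

# Hypothetical sample text; 'e' occurs far more often than 'Q'.
sample = "The Queen's speech was quite eloquent, everyone agreed."
model = build_static_model(sample)

for sym in ("e", "Q"):
    print(sym, round(model[sym], 3), round(ideal_code_length(model, sym), 2), "bits")
```

Under this model the more probable symbol ‘e’ receives the shorter ideal code, which is the counterpart of the longer ‘e’ segment in the physical example.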
The probability model can be static (i.e., unchanging for the duration of a compression process) or adaptive (i.e., evolving as the compressor processes the input data stream). The probability model can also take into account one or more of the most recently encountered input symbols to take advantage of local correlations. For example, in English text, encountering a letter ‘Q’ or ‘q’ in the input stream makes it more likely that the next character will be ‘u’.
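The use of a preceding symbol as context can be sketched as follows. This is a minimal order-1 model, assuming that the context is the single previous character folded to lower case so that ‘Q’ and ‘q’ share statistics; the class name `Order1Model` and the training text are hypothetical.

```python
from collections import Counter, defaultdict

class Order1Model:
    """Counts how often each symbol follows a one-character context."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, context, symbol):
        # Fold the context to lower case so 'Q' and 'q' share one entry.
        self.counts[context.lower()][symbol] += 1

    def probability(self, context, symbol):
        seen = self.counts.get(context.lower())
        if not seen:
            return 0.0  # context never observed, so no prediction is available
        return seen[symbol] / sum(seen.values())

def train(text):
    model = Order1Model()
    for prev, sym in zip(text, text[1:]):
        model.update(prev, sym)
    return model

model = train("the quick queen quietly quit")
print(model.probability("q", "u"))  # every 'q' in the sample is followed by 'u'
print(model.probability("e", "u"))  # 'e' is never followed by 'u' in the sample
```

Because the counts are updated as each symbol is seen, the same structure serves as an adaptive model if `update` is called during compression rather than in a separate training pass.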
An adaptive model typically works by matching the current input symbol against its prediction context and, if it finds the symbol there, generating a code that represents the probability range assigned to that symbol. For example, if the current input symbol is ‘e’ and the model assigns ‘e’ the range 0.13 to 0.47, then the compressor would generate an output code representing that range. This corresponds, in the physical example, to adjusting the mark to fall within the noted range of the current sub-segment of the pin. Once the symbol is encoded, the compressor updates the probability model. This “code and update” cycle is repeated until there are no more input symbols to compress.
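The “code and update” cycle can be sketched with exact rational arithmetic. This is a simplified illustration, not a production arithmetic coder: it assumes a known alphabet in which every symbol starts with a count of 1 (so no escape is needed), represents the output as a single fraction rather than a bit stream, and passes the message length to the decoder separately. The names `AdaptiveModel`, `encode`, and `decode` are hypothetical.

```python
from fractions import Fraction

class AdaptiveModel:
    """Order-0 adaptive model: every symbol starts with count 1."""

    def __init__(self, alphabet):
        self.symbols = sorted(alphabet)
        self.counts = {s: 1 for s in self.symbols}

    def interval(self, symbol):
        """Return the (low, high) probability range currently assigned to symbol."""
        total = sum(self.counts.values())
        low = 0
        for s in self.symbols:
            if s == symbol:
                return Fraction(low, total), Fraction(low + self.counts[s], total)
            low += self.counts[s]
        raise KeyError(symbol)

    def symbol_at(self, point):
        """Return the symbol whose range contains point (0 <= point < 1)."""
        for s in self.symbols:
            lo, hi = self.interval(s)
            if lo <= point < hi:
                return s
        raise ValueError(point)

    def update(self, symbol):
        self.counts[symbol] += 1

def encode(message, alphabet):
    model = AdaptiveModel(alphabet)
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        sym_low, sym_high = model.interval(sym)   # code: narrow to the symbol's range
        width = high - low
        low, high = low + width * sym_low, low + width * sym_high
        model.update(sym)                         # update: adapt the model
    return low, high

def decode(point, length, alphabet):
    model = AdaptiveModel(alphabet)
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        width = high - low
        sym = model.symbol_at((point - low) / width)
        sym_low, sym_high = model.interval(sym)
        low, high = low + width * sym_low, low + width * sym_high
        model.update(sym)                         # same update keeps both sides in step
        out.append(sym)
    return "".join(out)

low, high = encode("eee e", " e")
point = (low + high) / 2   # any value inside the final range identifies the message
print(decode(point, 5, " e"))
```

The decoder recovers the message only because it applies exactly the same model updates as the encoder, in the same order.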
When the compressor encounters a new symbol for which its model has no prediction, it cannot code the symbol from the model. Instead, it encodes a special “escape” symbol to signal to the decompressor that the next symbol is a literal value. This escape-and-literal output has two problems. First, because the literal value is sent as a whole number of bits (e.g., eight bits), compression algorithms that can send symbols in fractional bits lose a small amount of compression efficiency when a fractional bit is rounded up to a whole-bit boundary. Second, the fact that these escaped literal symbols failed to match the current probability model provides some residual information that could be used to compress the literals themselves. This compression opportunity is wasted if the literals are distributed throughout the compressed data stream wherever the unmatched symbols happen to appear. Compression algorithm modifications to plug these efficiency “leaks” may be able to improve overall compression performance.
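The escape-and-literal behavior described above can be sketched at the level of coding events. This is an illustrative order-0 model under stated assumptions: the escape symbol is given a fixed count of 1, each literal is charged exactly eight bits, and the output is a list of events with an estimated cost rather than an actual bit stream. The names `encode_events` and `decode_events` are hypothetical.

```python
import math
from collections import Counter

def encode_events(message):
    """Return (events, estimated_bits) for an order-0 adaptive model with escape."""
    counts = Counter()
    events = []
    bits = 0.0
    for sym in message:
        total = sum(counts.values()) + 1       # +1 is the assumed escape count
        if counts[sym] > 0:
            # Known symbol: coded from the model, possibly in fractional bits.
            events.append(("symbol", sym))
            bits += -math.log2(counts[sym] / total)
        else:
            # Unknown symbol: escape, then the literal as a whole eight bits.
            events.append(("escape", None))
            events.append(("literal", sym))
            bits += -math.log2(1 / total) + 8
        counts[sym] += 1                       # update the model either way
    return events, bits

def decode_events(events):
    """Rebuild the message; a literal is only valid directly after an escape."""
    out = []
    expect_literal = False
    for kind, value in events:
        if kind == "escape":
            expect_literal = True
        elif kind == "literal":
            if not expect_literal:
                raise ValueError("literal without preceding escape")
            out.append(value)
            expect_literal = False
        else:
            out.append(value)
    return "".join(out)

events, bits = encode_events("abab")
print(events)
print(decode_events(events))
```

In this sketch the literals are interleaved with the model-coded symbols wherever the unmatched symbols occur, which is the arrangement the paragraph above identifies as wasting a compression opportunity.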