Data compression can be used to write all of human knowledge on a pin (in theory, at least). Here's how: take a pin (FIG. 3, 300) and divide it into twenty-six segments (one segment for each letter, from A to Z). Make a mark 310 along the pin, in the segment corresponding to the first letter of all of human knowledge. In this example, the first letter is ‘T’. It doesn't matter where in the ‘T’ segment the mark is made; anywhere between the end of ‘S’ and the beginning of ‘U’ is fine. The pin now contains the first letter of all of human knowledge.
Next, divide the ‘T’ segment into twenty-six segments (expanded view 320, showing the subdivided ‘T’ segment, the preceding ‘S’ segment 330 and following ‘U’ segment 340). Adjust the mark 350 so that it falls somewhere in the segment corresponding to the second letter of all human knowledge. In this example, the second letter is ‘H’.
Repeat this process to encode the rest of the letters of all human knowledge. Doubly-expanded view 360 shows the ‘H’ segment, between ‘G’ 370 and ‘I’ 380, where the mark 390 has been adjusted to indicate that the third letter of all human knowledge is ‘E’. Note that this procedure only works to write on the side of the pin. Writing on the head of a pin requires a different technique.
To recover the message written on the pin, one simply measures—very, very carefully—how far along the pin the mark is located, and works through the same segment-dividing method to extract the message. Since so much information (all human knowledge!) is encoded in a single mark on the pin, the information has been compressed.
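The marking and measuring steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pin is modeled as the interval from 0 to 1, exact fractions stand in for an infinitely precise ruler, the alphabet is the twenty-six uppercase letters, and the decoder is told how many letters to read (the description above does not say how the reader knows when to stop). The names `encode` and `decode` are hypothetical.

```python
from fractions import Fraction
import string

ALPHABET = string.ascii_uppercase  # twenty-six equal segments, A to Z


def encode(message):
    """Return (low, width): the stretch of the pin in which the mark may fall."""
    low, width = Fraction(0), Fraction(1)
    for ch in message:
        width /= len(ALPHABET)             # divide the current segment into 26
        low += width * ALPHABET.index(ch)  # move to the sub-segment for ch
    return low, width


def decode(mark, length):
    """Recover `length` letters from a mark position in [0, 1)."""
    letters = []
    low, width = Fraction(0), Fraction(1)
    for _ in range(length):
        width /= len(ALPHABET)
        index = int((mark - low) / width)  # which sub-segment holds the mark
        letters.append(ALPHABET[index])
        low += width * index
    return "".join(letters)


low, width = encode("THE")
mark = low + width / 2      # anywhere inside the final segment is fine
print(decode(mark, 3))      # THE
```

The width of the final segment shrinks by a factor of twenty-six with every letter, which is why a physical pin runs out of precision after only a few letters.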
In the real world, of course, the physical properties of matter prevent this method from being used to write more than a few letters on any reasonably-sized pin. However, there is no mathematical reason why ideal intervals cannot be subdivided indefinitely, and the foregoing method is a simplified form of entropy encoding compression, a viable and effective way of compressing data.
Traditional entropy encoding compression algorithms (such as Huffman coding, adaptive Huffman coding or arithmetic coding) depend on having a statistical model of the input stream they are compressing. The more accurately the model represents the actual statistical properties of symbols in the input stream, the better the algorithm is able to compress the stream. Loosely speaking, the model is used to make a prediction about what input symbol will come next in the input stream. For example, if the input stream is English-language text, the model would assign a higher probability to the letter ‘e’ than to the letter ‘Q’ (usually). This corresponds physically to making the ‘e’ segment of the pin longer than the ‘Q’ segment.
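The effect of unequal segment lengths can be illustrated with a small sketch. The four-symbol alphabet and its probabilities in `MODEL` are invented for illustration and are not real English letter frequencies; `segments`, `encode` and `ideal_bits` are hypothetical helper names. The sketch shows that a message made of likely symbols leaves a wide final segment, which takes fewer bits to identify than the narrow segment left by unlikely symbols.

```python
from fractions import Fraction
from math import log2

# Hypothetical probabilities for a tiny alphabet; they sum to 1.
MODEL = {
    "e": Fraction(1, 2),
    "t": Fraction(1, 4),
    "a": Fraction(3, 16),
    "q": Fraction(1, 16),
}


def segments(model):
    """Map each symbol to its (start, width) share of the unit interval."""
    table, start = {}, Fraction(0)
    for symbol, p in model.items():
        table[symbol] = (start, p)
        start += p
    return table


def encode(message, model):
    """Narrow the interval once per symbol; likelier symbols narrow it less."""
    table = segments(model)
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        start, p = table[symbol]
        low += width * start
        width *= p
    return low, width


def ideal_bits(width):
    """Approximate number of bits needed to single out a segment of this width."""
    return -log2(float(width))


print(ideal_bits(encode("eee", MODEL)[1]))  # 3.0
print(ideal_bits(encode("qqq", MODEL)[1]))  # 12.0
```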
The probability model can be static (i.e., unchanging for the duration of a compression process) or adaptive (i.e., evolving as the compressor processes the input data stream). The probability model can also take into account one or more of the most recently encountered input symbols to take advantage of local correlations. For example, in English text, encountering a letter ‘Q’ or ‘q’ in the input stream makes it more likely that the next character will be ‘u’.
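A model that conditions on the single most recent symbol might be sketched as below. The class name `Order1Model`, its methods and the sample text are all hypothetical; a practical model would also need to handle contexts and symbols it has not yet seen. Because the counts are updated as the text is read, the same class also illustrates what "adaptive" means here.

```python
from collections import Counter, defaultdict


class Order1Model:
    """Counts how often each symbol follows each preceding symbol."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, prev, symbol):
        self.counts[prev][symbol] += 1

    def probability(self, prev, symbol):
        seen = self.counts[prev]
        total = sum(seen.values())
        # With no observations for this context there is no prediction yet.
        return seen[symbol] / total if total else 0.0


model = Order1Model()
text = "the quick queen quietly quit"
for prev, symbol in zip(text, text[1:]):
    model.update(prev, symbol)

print(model.probability("q", "u"))  # 1.0
print(model.probability("t", "h"))  # 0.5
```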
An adaptive model typically works by matching the current input symbol against its prediction context; if the symbol is found in that context, the compressor generates a code representing the probability range that the model assigns to the symbol. For example, if the current input symbol is ‘e’ and the model assigns ‘e’ the range 0.13 to 0.47, then the compressor generates an output code representing that range. This corresponds, in the physical example, to adjusting the mark to fall within the noted range of the current sub-segment of the pin. Once the symbol is encoded, the compressor updates the probability model. This “code and update” cycle is repeated until there are no more input symbols to compress.
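The “code and update” cycle can be sketched as follows. This is a simplified illustration, not a production coder: the model is a plain table of symbol counts, every symbol of a known alphabet starts with a count of one (an assumption that sidesteps unseen symbols), exact fractions replace the bit-level output of a real coder, and the decompressor is told the message length. The names `interval_for`, `compress` and `decompress` are hypothetical. The decompressor repeats the same updates in the same order, so its model stays in step with the compressor's.

```python
from fractions import Fraction


def interval_for(counts, symbol):
    """Return (start, width) of the symbol's share of [0, 1) under the counts."""
    total = sum(counts.values())
    start = 0
    for s, c in counts.items():
        if s == symbol:
            return Fraction(start, total), Fraction(c, total)
        start += c
    raise KeyError(symbol)


def compress(message, alphabet):
    counts = {s: 1 for s in alphabet}      # assumed starting model
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        start, share = interval_for(counts, symbol)  # code
        low += width * start
        width *= share
        counts[symbol] += 1                          # update
    return low, width


def decompress(mark, length, alphabet):
    counts = {s: 1 for s in alphabet}      # same starting model as compress
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        # Find the symbol whose sub-segment contains the mark.
        for s in counts:
            start, share = interval_for(counts, s)
            if low + width * start <= mark < low + width * (start + share):
                break
        out.append(s)
        low += width * start
        width *= share
        counts[s] += 1                     # mirror the compressor's update
    return "".join(out)


low, width = compress("banana", "abn")
print(decompress(low + width / 2, 6, "abn"))  # banana
```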
When the compressor encounters a new symbol for which its model has no prediction, it must do something else: it encodes a special “escape” symbol to signal to the decompressor that the next symbol is a literal value. This escape-and-literal output has two problems. First, since the literal value will be sent as a whole number of bits (e.g., eight bits), compression algorithms that can send symbols in fractional bits will lose a small amount of compression efficiency when a fractional bit is rounded up to a whole-bit boundary. Second, the fact that these escaped literal symbols failed to match the current probability model provides some residual information that could be used to compress the literals themselves. This compression opportunity is wasted if the literals are distributed throughout the compressed data stream wherever the unmatched symbols happen to appear. Compression algorithm modifications to plug these efficiency “leaks” may be able to improve overall compression performance.
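A rough cost model shows where the escape-and-literal bits go. The sketch below makes several assumptions that are not taken from the description above: the escape symbol is given a fixed count of one, each literal costs exactly eight bits, the model is a simple table of counts with no context, and the cost of a coded symbol is taken as the negative base-two logarithm of its probability. The names `symbol_cost` and `message_cost` are hypothetical.

```python
from math import log2

LITERAL_BITS = 8  # assumed whole-bit size of an escaped literal


def symbol_cost(counts, symbol, escape_count=1):
    """Estimated bits to code one symbol, including any escape and literal."""
    total = sum(counts.values()) + escape_count
    if symbol in counts:
        return -log2(counts[symbol] / total)
    # Unseen symbol: pay for the escape, then for the uncompressed literal.
    return -log2(escape_count / total) + LITERAL_BITS


def message_cost(message):
    """Total estimated bits for a message under an adaptive count model."""
    counts, bits = {}, 0.0
    for symbol in message:
        bits += symbol_cost(counts, symbol)
        counts[symbol] = counts.get(symbol, 0) + 1  # update after coding
    return bits


print(message_cost("ab"))  # 17.0
```

In this sketch every literal costs the same eight bits no matter what the model already knows, which is the second inefficiency noted above.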