A dictionary coder, also sometimes known as a substitution coder, is any of a number of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure (called the ‘dictionary’) maintained by the encoder. When the encoder finds such a match, it substitutes a reference to the string's position in the data structure. Commonly used algorithms such as LZ77/78, LZW, LZO, DEFLATE, LZMA and LZX are geared towards finding small repetitions in the data that is to be compressed.
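The match-and-substitute idea described above can be illustrated with a minimal sketch of LZW, one of the algorithms listed. The function names `lzw_compress` and `lzw_decompress` are hypothetical, and the sketch assumes an unbounded dictionary and emits plain integers rather than packed bits, so it shows the principle rather than any production format.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Replace byte sequences with integer codewords (illustrative LZW)."""
    # The dictionary starts with one entry per possible byte value.
    dictionary = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            # Keep extending the match while the dictionary knows it.
            current = candidate
        else:
            # Emit a reference to the longest known match ...
            codes.append(dictionary[current])
            # ... and remember the new, one-byte-longer sequence.
            dictionary[candidate] = len(dictionary)
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the original bytes by regenerating the same dictionary."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    previous = dictionary[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid codeword: {code}")
        out += entry
        dictionary[len(dictionary)] = previous + entry[:1]
        previous = entry
    return bytes(out)
```

Calling `lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")` yields fewer codewords than there are input bytes, because the repeated substrings are replaced by references to dictionary entries created earlier in the same pass.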
The problem with the aforementioned dictionary coders lies in how they manage the dictionary. They build a dictionary of the byte sequences they have processed and assign a codeword to each sequence. There is generally an upper limit on the number of codewords that can be used. Once every codeword has been assigned, the algorithm must decide how to proceed when it wants to add a new sequence to the dictionary. In many cases it simply resets the mapping of codewords to byte sequences and restarts compression of the remaining data as if the first part of the data had never been processed. This situation is triggered by a sufficiently long run of bytes with very little repetition.
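The reset behaviour can be sketched with an LZW-style encoder whose code space is bounded. The name `lzw_compress_bounded` and the `max_codes` parameter are assumptions made for illustration. A real format would also have to signal each reset to the decoder (for instance with a reserved codeword), which is omitted here for brevity.

```python
def lzw_compress_bounded(data: bytes, max_codes: int = 512) -> tuple[list[int], int]:
    """LZW-style encoder that resets its dictionary when codewords run out.

    Returns the emitted codewords and the number of resets performed.
    Assumes max_codes >= 256 so the single-byte entries always fit.
    """
    def fresh_dictionary() -> dict[bytes, int]:
        return {bytes([i]): i for i in range(256)}

    dictionary = fresh_dictionary()
    current = b""
    codes = []
    resets = 0
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate
            continue
        codes.append(dictionary[current])
        if len(dictionary) < max_codes:
            # A codeword is still free: learn the new sequence.
            dictionary[candidate] = len(dictionary)
        else:
            # Code space exhausted: forget everything learned so far.
            dictionary = fresh_dictionary()
            resets += 1
        current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes, resets
```

Highly repetitive input adds dictionary entries slowly and may never trigger a reset, whereas a few kilobytes of data with little repetition consume the free codewords quickly and cause the dictionary to be discarded repeatedly.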
When a print stream is compressed with one of the aforementioned dictionary coders, the coder may reset quite often. Typically this occurs when the print stream includes large amounts of image and font data. Each time the dictionary coder processes image data, it runs out of codewords and performs a reset. The reset causes the coder to forget any sequences it saw before the image data, even though those sequences may recur after the image data. More importantly, it also forgets the sequences found in the image data itself. The next time the dictionary coder encounters an image, even an exact copy of the previous image, it treats the image as new data, resulting in very little compression of the print stream.
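A comparable effect can be observed with DEFLATE through Python's standard `zlib` module. DEFLATE does not reset a codeword table; it forgets old data because its sliding window reaches back at most 32 KiB, but the consequence for a repeated large image is similar. In this sketch the "image" is a block of pseudo-random bytes standing in for already-compressed image data, and the page layout, sizes, and function names are assumptions chosen for illustration.

```python
import random
import zlib


def make_page(image_size: int = 100_000, seed: int = 0) -> bytes:
    """Build a hypothetical page: some template text followed by an 'image'."""
    rng = random.Random(seed)
    # Pseudo-random bytes stand in for image data with little internal repetition.
    image = bytes(rng.getrandbits(8) for _ in range(image_size))
    text = b"Dear customer, thank you for your interest.\n" * 50
    return text + image


def compressed_size(data: bytes) -> int:
    """Size in bytes of the DEFLATE-compressed data at the highest level."""
    return len(zlib.compress(data, 9))


one_page = make_page()
two_pages = one_page * 2  # the second page is an exact copy of the first

print("one page: ", compressed_size(one_page))
print("two pages:", compressed_size(two_pages))
```

Although the second page is byte-for-byte identical to the first, the compressed size of two pages comes out at roughly twice that of one page, because the first copy of the image lies outside the window by the time the second copy is reached.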
This is unfortunate, because many print streams contain large amounts of repeated data: each page is generated, manually or automatically, from a template that includes similar text (such as address information and a salutation) and imagery (such as logos or signatures). In a direct mail application, for instance, each page of the print stream may be a letter to a potential customer. Such letters are typically generated from a template in which the only variable parts are the address and the salutation. The main text and imagery (logos, signatures, product photos, etc.) are therefore often exactly the same for every recipient, so a large amount of identical text and image data is encoded on every page of the print stream. It should therefore be both possible and advantageous to compress such files considerably.
It is not uncommon for a print stream to contain the print data for thousands of recipients. Storing such files uncompressed on hard disk before sending them to the printer may therefore require large amounts of storage. Transferring such a print stream to the printer over a network may also take quite a while, depending on the available bandwidth. Although internal networks commonly use 100 Mbit/s or 1 Gbit/s connections, the connections between geographically separated locations often have bandwidth limits imposed on them for cost reasons. Compressing the print stream may therefore be the only viable way to transfer files across a bandwidth-limited network in a reasonable time frame and at a reasonable cost.