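Lempel-Ziv-Welch (LZW) compression is a dictionary-based compression algorithm. Its distinct advantage is that the dictionary is (i) built as the data is transmitted and (ii) tailored to the actual data. When transmission starts, the dictionary contains only the single characters (for 8-bit data, all 256 byte values, which include the standard ASCII characters). The dictionary then grows one entry at a time: each time the compressor sends a code, it adds the string it just matched, extended by the next character, and assigns the new entry its own shorthand code. If the transmitted data includes the string “I think so”, the first pass adds only short pieces such as “I ”, “th”, and “in”; each time the text recurs, longer pieces such as “I t” and “thi” are added, and after enough repetitions a single entry can cover the whole phrase. The receiver builds the identical dictionary from the codes it receives, so the dictionary itself is never transmitted. Whenever the compressor sees a string that is stored in the dictionary, it transmits just the shorthand code. So, for “I think so”, rather than transmitting 10×8=80 bits, it might eventually transmit a single 12-bit code. The code length depends on the dictionary size: a dictionary with 4,096 entries needs 12-bit codes.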
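The dictionary-building step is easier to follow in code. The following is a minimal sketch of textbook LZW in Python, not the exact procedure of V.42bis or any other product; the function names lzw_encode and lzw_decode are chosen here for illustration, and the dictionary is assumed to be unbounded and to start with the 256 single-byte values.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Textbook LZW encoder (illustrative sketch, unbounded dictionary)."""
    # Initial dictionary: one entry per possible byte value, codes 0..255.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            # Keep extending the match while it is still in the dictionary.
            current = candidate
        else:
            # Send the code for the longest match, then learn match + next byte.
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decode(codes: list[int]) -> bytes:
    """Rebuild the data, and the same dictionary, from the codes alone."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The code refers to the entry the encoder created in this very step.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        output.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)


message = b"I think so"
once = lzw_encode(message)       # first copy: nothing to reuse yet
twice = lzw_encode(message * 2)  # second copy reuses two-character entries
print(len(once), len(twice))
assert lzw_decode(twice) == message * 2
```

In this sketch the first copy of the phrase costs one code per character, while the second copy is sent as pairs such as “I ” and “th”. Longer entries, and with them larger savings, appear only as the phrase keeps recurring.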
LZW compression works because most commonly transmitted data (text, spreadsheets, databases, etc.) contain a lot of repetition. Data with little or no repetition (for example, pure random numbers) do not compress, and the output can even be slightly larger than the input, because each code is wider than the character it replaces. Some file types, such as PDF, JPG/JPEG, MP3, and ZIP, have already been compressed, and LZW will not make them smaller. These file types can be created with Adobe Acrobat, digital cameras, MPEG encoders, and PKZIP, respectively. The V.42bis modem compression standard and the “Shrink” method of early PKZIP versions are examples of applications of LZW compression; current ZIP files normally use the DEFLATE method instead.
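The difference between repetitive and random data can be checked with a small experiment. The sketch below is an estimate under stated assumptions, not a measurement of any real product: the helper lzw_bits is a name invented here, the dictionary is unbounded, each code is charged the number of bits needed to address the dictionary at that moment, and seeded pseudo-random bytes stand in for data without repetition.

```python
import random


def lzw_bits(data: bytes) -> int:
    """Estimate the LZW output size in bits (illustrative sketch)."""
    # Only membership matters for counting, so a set is enough here.
    dictionary = {bytes([i]) for i in range(256)}
    current = b""
    bits = 0
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate
        else:
            # Charge one code, as wide as the current dictionary requires.
            bits += (len(dictionary) - 1).bit_length()
            dictionary.add(candidate)
            current = bytes([value])
    if current:
        bits += (len(dictionary) - 1).bit_length()
    return bits


text = b"I think so. " * 200  # 2,400 bytes of highly repetitive text
rng = random.Random(0)        # fixed seed so the run is repeatable
noise = bytes(rng.randrange(256) for _ in range(len(text)))

text_bits = lzw_bits(text)
noise_bits = lzw_bits(noise)
print("input bits:       ", 8 * len(text))
print("repetitive text:  ", text_bits)
print("pseudo-random data:", noise_bits)
```

Under these assumptions the repetitive text shrinks to a small fraction of its original size, while the pseudo-random bytes come out larger than they went in, since nearly every code still stands for a single byte but occupies more than 8 bits.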
One obvious drawback of LZW compression is that the dictionary has a finite size; once the dictionary is full, new strings can be added only by discarding old ones, and the compression effectiveness can decline. For example, suppose you email the same text document to two different people (two separate emails) over a modem that uses V.42bis. The modem compresses the text from the first email. When the second email follows on the heels of the first, the dictionary already contains the strings required to compress the document (remember, it is the same document), so the compression ratio is very high. Now suppose each email contains both the text document and a JPEG file. The modem again compresses the text from the first email. The JPEG file, however, cannot be compressed further, so the V.42bis compressor keeps adding more and more strings (bits of the JPEG picture) to the dictionary until none of the entries from the text document remain. When the second email arrives, the dictionary no longer contains any part of the text document, and the compressor has to begin all over; the compression ratio is therefore not as good.
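The two-email scenario can be imitated with a bounded dictionary. The following is a rough sketch under loud assumptions: the class BoundedLZW and its send method are hypothetical names, the dictionary is limited to 4,096 entries and is simply cleared when it fills up (an assumed, simplified policy, not the exact recovery rule of V.42bis), and seeded pseudo-random bytes stand in for the JPEG file. It only counts codes; it does not produce a decodable stream.

```python
import random


class BoundedLZW:
    """LZW code counter with a finite dictionary (illustrative sketch).

    Assumption: when the dictionary is full it is cleared back to the
    256 single-byte entries. Real standards use their own rules.
    """

    def __init__(self, max_entries: int = 4096):
        self.max_entries = max_entries
        self._reset()

    def _reset(self) -> None:
        self.dictionary = {bytes([i]) for i in range(256)}

    def send(self, data: bytes) -> int:
        """Compress one message with the shared dictionary; return the code count."""
        codes = 0
        current = b""
        for value in data:
            candidate = current + bytes([value])
            if candidate in self.dictionary:
                current = candidate
                continue
            codes += 1
            if len(self.dictionary) >= self.max_entries:
                self._reset()  # overflow: everything learned so far is lost
            self.dictionary.add(candidate)
            current = bytes([value])
        return codes + (1 if current else 0)


text = "".join(
    f"Line {i}: I think so, and so do they.\n" for i in range(50)
).encode()
rng = random.Random(0)
jpeg_like = bytes(rng.randrange(256) for _ in range(6000))  # stand-in for a JPEG

# Case 1: the same text twice in a row.
modem = BoundedLZW()
first = modem.send(text)
back_to_back = modem.send(text)

# Case 2: text, then incompressible data, then the text again.
modem = BoundedLZW()
modem.send(text)
modem.send(jpeg_like)
after_noise = modem.send(text)

print("first copy:", first)
print("second copy, sent immediately:", back_to_back)
print("second copy, sent after the JPEG stand-in:", after_noise)
```

In this sketch the second copy needs far fewer codes when it follows the first directly than when the incompressible data has been sent in between, which is the effect described above.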
The same problem can occur in Internet routers, where many different streams of data are sent to one user. For example, when a webpage is opened in a browser on a PC, the browser immediately starts downloading an interleaved mix of text, banner ads, and pictures. A brute-force compression algorithm (such as V.42bis) tries to compress everything, and may wind up compressing nothing because the dictionary keeps filling up with incompressible JPEG data.