Building an adaptive Huffman codeword tree by analyzing the symbols in a document to be compressed to determine their frequency, grouping the frequencies into bands and assigning, to all of the symbols in each band, codewords having the same number of bits.
Huffman coding is a compression scheme in which the most frequently used symbols are assigned the codewords having the smallest number of bits. A symbol can be a set of characters, a word, a byte or the like. Thus, taking the words in a document as an example, a commonly used long word will be assigned a codeword having fewer bits than a less frequently used short word.
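The relationship between frequency and codeword length can be illustrated with a minimal sketch of the textbook Huffman construction. The function name `huffman_code_lengths` and the sample word counts are assumptions made for illustration only; they do not come from any particular implementation.

```python
import heapq


def huffman_code_lengths(frequencies):
    """Return {symbol: codeword length in bits} for a {symbol: count} mapping."""
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        # A lone symbol still needs one bit to be representable.
        return {symbol: 1 for symbol in frequencies}
    # Heap entries: (total count, unique tie-breaker, symbols in this subtree).
    heap = [(count, i, [symbol])
            for i, (symbol, count) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(frequencies, 0)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol inside
        # them moves one level deeper, so its codeword grows by one bit.
        count_a, _, symbols_a = heapq.heappop(heap)
        count_b, _, symbols_b = heapq.heappop(heap)
        for symbol in symbols_a + symbols_b:
            lengths[symbol] += 1
        heapq.heappush(heap, (count_a + count_b, tie, symbols_a + symbols_b))
        tie += 1
    return lengths


# Hypothetical word counts: a frequent long word and rare short words.
word_counts = {"compression": 8, "the": 4, "of": 2, "a": 1, "is": 1}
lengths = huffman_code_lengths(word_counts)
print(lengths)
# {'compression': 1, 'the': 2, 'of': 3, 'a': 4, 'is': 4}
```

In this sample the long but common word "compression" receives a 1-bit codeword, while the short but rare word "a" receives a 4-bit codeword.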
To make a permanent table for the compression of text documents, for example, a typical sampling of text is analyzed, the frequency of each word is determined, and a permanent table is constructed in which the most frequently used words in the sampling are assigned the shortest codewords.
An improvement is to make the system "adaptive" by tailoring the set of codewords to the specific document being transmitted. In other words, codewords are assigned not to the words that are used in a typical sampling, but to the words that are actually used in the particular document. In this case the compressor builds a code table that is optimized for the document, and also sends the code table to the decompressor so that the document can be decoded. In this way, words not found in the actual document need not be assigned a codeword.
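The adaptive approach can be sketched as follows, assuming whitespace-separated words as the symbols and a plain dictionary as the code table that would be sent to the decompressor. The helper names `build_code_table`, `encode` and `decode` are hypothetical, and the way the table is serialized for transmission is left out.

```python
import heapq
from collections import Counter


def build_code_table(words):
    """Build {word: bit string} from the words of one specific document."""
    counts = Counter(words)
    if not counts:
        return {}
    if len(counts) == 1:
        return {word: "0" for word in counts}
    # Heap entries: (count, unique tie-breaker, {word: partial codeword}).
    heap = [(count, i, {word: ""})
            for i, (word, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        count_a, _, codes_a = heapq.heappop(heap)
        count_b, _, codes_b = heapq.heappop(heap)
        # Prefix one subtree with 0 and the other with 1.
        merged = {w: "0" + c for w, c in codes_a.items()}
        merged.update({w: "1" + c for w, c in codes_b.items()})
        heapq.heappush(heap, (count_a + count_b, tie, merged))
        tie += 1
    return heap[0][2]


def encode(words, table):
    """Compressor side: replace each word by its codeword."""
    return "".join(table[word] for word in words)


def decode(bits, table):
    """Decompressor side: needs the same table the compressor built."""
    reverse = {code: word for word, code in table.items()}
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # codewords are prefix-free
            words.append(reverse[current])
            current = ""
    return words


document = "the cat saw the dog and the dog saw the cat".split()
table = build_code_table(document)   # tailored to this document only
bits = encode(document, table)
assert decode(bits, table) == document
assert "elephant" not in table       # unused words get no codeword
```

Because the table is derived from the document itself, it contains exactly the words that occur in it, which is why the decompressor must receive the table before it can decode.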
The Huffman encoding process has two general steps. The first is to form groups of symbols based on frequency of use, and to assign each group to a corresponding group of codewords that all have the same number of bits; the result is a Huffman tree. At this point there may be any number of groups. However, the industry standard is for the largest set of codewords to have no more than 15 bits and, of course, for no group to contain more symbols than it has capacity for. If either of these limits is exceeded, the second step is to adjust the symbols down into the smaller codewords, where space is available, to reduce the size of the largest codeword and to remove any overflow in a single group.
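The two limits that trigger the second step can be checked with a short sketch. It assumes that "capacity" is interpreted through the Kraft inequality, which is one common way to express whether a set of codeword lengths fits in a binary tree; the function name `check_code_lengths` is hypothetical, and the adjustment step itself is not shown.

```python
from collections import Counter

MAX_BITS = 15  # limit on codeword length cited in the text


def check_code_lengths(lengths, max_bits=MAX_BITS):
    """Return (too_long, oversubscribed) for a non-empty {symbol: bit length}."""
    group_sizes = Counter(lengths.values())  # number of symbols per length
    longest = max(group_sizes)
    too_long = longest > max_bits
    # Kraft inequality in integer form: a codeword of length L occupies
    # 2**(longest - L) of the 2**longest leaf slots of a full binary tree.
    used = sum(n << (longest - length) for length, n in group_sizes.items())
    oversubscribed = used > (1 << longest)
    return too_long, oversubscribed


print(check_code_lengths({"a": 1, "b": 2, "c": 2}))   # (False, False)
print(check_code_lengths({"a": 1, "b": 1, "c": 1}))   # (False, True)
```

A result of `(False, False)` means the grouping from the first step can be used as is; either flag being `True` corresponds to a case where the second step would have to move symbols between groups.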
This type of Huffman coding is used in GZIP, a freely distributed compression program that is an industry standard, and is described in The Data Compression Book by Mark Nelson and Jean-Loup Gailly as well as in the GZIP source code and its comments.
One problem with the Huffman encoding process is its complexity, which results in costly hardware and time-consuming processing.