Huffman codes (https://en.wikipedia.org/wiki/Huffman_coding) are variable-length codes that represent a set of symbols with the goal of minimizing the total number of bits needed to encode a stream of those symbols. They achieve this by assigning shorter codes to symbols that occur more frequently and longer codes to rarer symbols.
One use of Huffman codes is in the “DEFLATE” compression algorithm, which forms the basis for formats such as gzip and zlib, as well as WinZip and PKZIP. The DEFLATE data format consists of a series of blocks, each compressed using a combination of the LZ77 algorithm and Huffman coding. Huffman coding is also used elsewhere, for example in JPEG, MPEG, and MP3 codecs.
The generation of Huffman codes, at least within the context of DEFLATE compression, consists of taking an array of histogram data (weights), where each entry is a count of the number of times that symbol or token appears in the output, and then computing a code length for each token such that the dot product of the weights and the code lengths is minimized. Typically, the sum of the weights is guaranteed to be less than 64K (65,536), so the weights can be stored as 16-bit integers. The time needed to compute the codes depends on how many non-zero weights there are. In DEFLATE, there are up to 30 values for the distance codes (the D-tree) but up to 286 values for the Literal-Length codes (the LL-tree), so in general generating the LL-tree takes the most time.
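The quantity being minimized can be made concrete with a small sketch. The helper names `histogram` and `total_bits` and the four-symbol alphabet are assumptions chosen for illustration; they do not come from any particular DEFLATE implementation.

```python
from collections import Counter

def histogram(tokens, alphabet_size):
    """Count how many times each symbol index appears in the token stream."""
    counts = Counter(tokens)
    return [counts.get(sym, 0) for sym in range(alphabet_size)]

def total_bits(weights, lengths):
    """Dot product of weights and code lengths: the coded stream size in bits."""
    return sum(w * l for w, l in zip(weights, lengths))

# Hypothetical token stream over a 4-symbol alphabet.
weights = histogram([0, 0, 0, 0, 1, 1, 2, 3], alphabet_size=4)

# A fixed 2-bit code versus a variable-length assignment that gives
# the most frequent symbol the shortest code.
fixed_cost = total_bits(weights, [2, 2, 2, 2])
variable_cost = total_bits(weights, [1, 2, 3, 3])
```

For this stream the weights are `[4, 2, 1, 1]`, so the fixed-length code costs 16 bits and the variable-length assignment costs 14 bits.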
The classical way to compute Huffman codes uses a heap data structure (https://en.wikipedia.org/wiki/Heap_(data_structure)). This is fairly efficient, but traditional software implementations contain many data-dependent branches, which are hard for general-purpose CPU hardware to predict. On modern processors with deep pipelines or superscalar execution, the cost of these branch mispredictions can become the performance limiter.
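As a rough illustration of the heap-based approach, the following sketch computes code lengths from an array of weights using Python's standard `heapq` module. The function name `huffman_code_lengths` is hypothetical, and the sketch does not enforce the 15-bit maximum code length that DEFLATE imposes, so it should be read as an outline of the textbook method rather than a production implementation.

```python
import heapq

def huffman_code_lengths(weights):
    """Return a code length per symbol; zero-weight symbols get length 0."""
    lengths = [0] * len(weights)
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, sym, [sym]) for sym, w in enumerate(weights) if w > 0]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A lone symbol still needs one bit to be representable.
        lengths[heap[0][2][0]] = 1
        return lengths
    next_id = len(weights)  # tie-breakers for merged nodes stay unique
    while len(heap) > 1:
        # Merge the two lightest subtrees; the pop/push comparisons are
        # the data-dependent branches mentioned above.
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        merged = syms1 + syms2
        for sym in merged:
            lengths[sym] += 1  # every symbol below the new node gets one bit deeper
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return lengths

example_lengths = huffman_code_lengths([4, 2, 1, 1])
```

With the weights `[4, 2, 1, 1]` this yields the lengths `[1, 2, 3, 3]`. Tracking the symbol list in each heap entry keeps the sketch short; a real implementation would more likely build parent links and derive the depths afterward.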