The present invention relates generally to lossless data compression, and more particularly to code design and code selection for infinite symbol sets.
Lossless compression or entropy coding is used in communication and storage of digital information. In the method generally referred to as Huffman coding, see David A. Huffman, “A Method for the Construction of Minimum-Redundancy Codes”, Proc. of the IRE, vol. 40(10), September, 1952, doc. D1, the source data is separated into symbols 20 (see FIG. 1). The entropy coder 22 of FIG. 1 represents symbols 20 as variable length bit strings 26 (i.e., codewords) looked up in a codebook 24. Compression relies on assigning short codewords to frequent symbols and reserving longer codewords for infrequent symbols. The performance of a coder depends on its ability to match the probability distribution of symbols in the source with the probability distribution corresponding to the codebook.
A fundamental codebook design problem is how to design a codebook given the probability distribution of the source. Two methods are in general use, one for “finite” codes and the other for “infinite” codes. Finite codes can represent a known number of symbols, each symbol having a known non-zero probability of occurrence. Infinite codes, on the other hand, are infinite in the sense that they are “infinitely expandable”, i.e., a given infinite code can be expanded as far as needed to represent a desired set of input symbols. To be a valid infinite code, no codeword can be a prefix for any other codeword. An “infinite symbol set” is a symbol set (which may itself be finite) that is represented by an infinite code.
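By way of illustration only (the language and function name below are assumptions for this sketch, not part of the disclosure), the prefix condition stated above can be checked mechanically for any finite collection of codewords:

```python
def is_prefix_free(codewords):
    """Check that no codeword is a proper prefix of another.

    Lexicographic sorting places any codeword immediately before the
    codewords that extend it, so only adjacent pairs need checking.
    """
    words = sorted(codewords)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))
```

For example, the set {"0", "10", "11"} passes this check, while {"0", "01"} fails because "0" is a prefix of "01".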
Finite codebooks can be designed using the Huffman algorithm to generate a codebook matched to the source statistics. This algorithm constructs the set of codewords by starting from the least probable symbol and moving upward in probability. One suboptimal but reduced-complexity finite Huffman coding example is disclosed in U.S. Pat. No. 4,560,976, entitled “Data Compression” and issued Dec. 24, 1985 to Finn. Finn describes a finite codebook consisting only of one-, two-, and three-subword-long codewords. Finn uses a symbol-appearance-counting method to continuously re-rank his symbols and to decide which symbols should be assigned to which of his three lengths of codewords.
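As an illustrative sketch (not part of the disclosure; the language and names are assumptions), the bottom-up Huffman construction described above can be expressed as repeatedly merging the two least probable nodes:

```python
import heapq

def huffman_codebook(probs):
    """Build a Huffman codebook for a finite symbol set.

    probs: dict mapping symbol -> probability.
    Repeatedly merges the two least probable nodes, prepending a bit to
    every codeword in each merged subtree, per Huffman's bottom-up rule.
    """
    # Each heap entry: (probability, tiebreak counter, partial codebook).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # least probable node
        p2, _, c2 = heapq.heappop(heap)   # second least probable node
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]
```

For a source with probabilities 0.5, 0.25, 0.125, 0.125, this construction yields codeword lengths 1, 2, 3, 3, and the resulting codeword lengths satisfy the Kraft equality for an optimal finite code.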
Finite-codebook methods such as those above are not well suited to infinite codes. One reason is that symbol appearance counting will not generally be able to adequately estimate the probability of occurrence of a large number of symbols having non-zero but near-zero symbol probabilities.
Several families of infinite codes have been developed based on regular structures or algorithms. Golomb codes are one such family of codes. See S. W. Golomb, “Run length encodings”, IEEE Trans. Inform. Theory, vol. 12, pp. 399-401, July, 1966. Geometrically distributed (i.e., exponential) codes are another such family. See R. Gallager and D. C. Van Voorhis, “Optimal source codes for geometrically distributed integer alphabets”, IEEE Trans. Inform. Theory, vol. 21, pp. 228-30, March, 1975. Other algorithmic infinite codes exist as well. See, e.g., N. Merhav et al., “Optimal Prefix Codes for Sources with Two-Sided Geometric Distributions”, IEEE Trans. Inform. Theory, vol. 46, pp. 121-35, January, 2000; M. J. Weinberger et al., “The LOCO-I Lossless Image Compression Algorithm: Principles and Standardization into JPEG-LS”, IEEE Trans. Image Processing, vol. 9, no. 8, August, 2000.
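For illustration, a Golomb code with parameter m encodes a nonnegative integer n as a unary quotient followed by a truncated-binary remainder. The following sketch (the language and function name are assumptions, not part of the disclosure) shows one conventional form of the encoder:

```python
def golomb_encode(n, m):
    """Encode nonnegative integer n with Golomb parameter m >= 1.

    The quotient n // m is sent in unary (q ones then a zero); the
    remainder n % m is sent in truncated binary using k or k+1 bits,
    where k = floor(log2(m)).
    """
    q, r = divmod(n, m)
    bits = "1" * q + "0"                # unary quotient, zero-terminated
    k = m.bit_length() - 1              # floor(log2(m))
    threshold = (1 << (k + 1)) - m      # 2**(k+1) - m short-remainder slots
    if r < threshold:
        bits += format(r, "b").zfill(k) if k > 0 else ""
    else:
        bits += format(r + threshold, "b").zfill(k + 1)
    return bits
```

When m is a power of two this reduces to a Rice code; e.g. golomb_encode(5, 4) yields "1001" (unary quotient "10", two-bit remainder "01").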
The regular-structured algorithmic approach provides a way to generate an infinite number of codewords but gives limited control over the resulting probability distribution. Another publication describes a method to generate an optimal codebook for an infinite distribution through a recursive sequence of calculations. A. Kato et al., “Huffman Coding with an Infinite Alphabet”, IEEE Trans. Inform. Theory, vol. 42, pp. 977-84, May, 1996. This approach serves more as a theoretical construction since the complexity of encoding and decoding this code can be significant.
An alternative to matching a probability distribution with a fixed code is to use adaptive coding. In adaptive coding, the codeword used to represent a symbol is changed dynamically while the encoder operates. Changes in the encoder are either signaled by introducing overhead, or tracked from the decoded data. Overhead subtracts from the compression efficiency, and tracking the encoder state from decoded data is error prone. In both cases, additional complexity is introduced into the encoding and decoding operations.
Arithmetic coding can also be used to efficiently code a general probability distribution, but the additional complexity of an arithmetic coder can be a drawback in some applications.
The present disclosure addresses the problem of developing an infinite code tailored to a given probability distribution. Using the methods described herein, it is possible to generate a low complexity code that provides good compression performance.
Generally, the described embodiments model a given probability using a combination of two or more existing infinite codes. Codes of low complexity are preferred. For instance, two Golomb codes may be selected, one that matches the source distribution well for symbol indices near zero, and another that matches the source distribution well for large symbol indices. The two codes are “grafted” together such that the first is used for the symbol indices near zero, and the second is used for all other indices. When properly grafted as taught herein, only a few parameters are needed to describe such a distribution.
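A minimal sketch of the grafting idea, under simplifying assumptions: a one-bit flag selects between a code tuned for indices below a grafting point T and a second code, applied to the re-based index, for all larger indices. The disclosure's grafting does not require an explicit flag bit; the flag, the restriction to Rice codes (Golomb codes with m = 2**k), and all names and parameter values here are illustrative assumptions:

```python
def rice_encode(n, k):
    """Golomb code with m = 2**k: unary quotient, then k-bit remainder."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + ("" if k == 0 else format(r, "b").zfill(k))

def grafted_encode(n, T=4, k_head=0, k_tail=2):
    """Graft two Rice codes at index T (illustrative flag-bit variant).

    Indices below T use the 'head' code directly; indices at or above T
    are re-based to n - T and use the 'tail' code. The leading flag bit
    keeps the combined code prefix-free.
    """
    if n < T:
        return "0" + rice_encode(n, k_head)
    return "1" + rice_encode(n - T, k_tail)
```

The head code spends its shortest codewords on indices near zero, while the tail code's larger parameter slows codeword growth for large indices; the combined code remains infinitely expandable and is described by only the parameters T, k_head, and k_tail.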
The disclosure generalizes the idea explained above to accommodate a wide range of existing infinite codes and any desired number of code intervals. The described embodiments pertain generally to constructing a grafted codebook, encoding symbols using a grafted codebook, and decoding codewords formed using a grafted codebook.