1. Field of the Invention
The present invention relates to data compression, and more specifically to lossless data compression using variable length coding.
2. Relevant Background
Data compression is playing an increasingly important role in technology as more and more data is collected, stored, and communicated in the modern information age. High definition televisions, cellular phones, and compact disk players are just a few examples of everyday products which use data compression to represent data.
Data compression is the process of reducing a large amount of information into a smaller-sized representation of the information. The process of compressing the information can be either lossy or lossless. Lossy compression, sometimes called irreversible compression, means that some of the original information is lost during the compression process and thus the original data cannot be perfectly restored from the compressed data representation. Lossy compression is typically used to compress images and sounds where a small amount of data loss is generally acceptable to people viewing or listening to the restored information. In a lossless compression mechanism no data is lost during compression. Lossless compression is reversible since the original information can be perfectly reconstructed from the compressed data representation. Lossless compression is mandatory for many types of data and program code information, and is well suited for text and text formatting compression, as well as images and sounds.
Typically, digital data is stored in fixed length units of bits such as 8 bits (byte). One common method of lossless compression called variable length coding involves representing fixed length data with variable length codewords. If shorter length codewords are selected for the most frequently occurring data and longer length codewords are used to represent infrequent data, the average number of bits used is typically reduced. This technique, for example, is similar to Morse code where frequently occurring letters of the alphabet are represented by short codewords (an “E” is “·”) and lesser used letters are assigned longer codewords (an “X” is “-··-”). To restore the original message from the compressed message, the codewords are simply matched to the original letters using a lookup table. In a similar fashion, fixed length binary data can be compressed using variable length binary codewords to represent data. To restore the original binary message, the binary codewords are matched to the original binary data using a lookup table.
In order for variable length coding compression to work, the code must be uniquely decodable such that the original message can be decoded in one and only one way. Consider, for example, a code mapping of {x,y,z}={1,11,0}. This code is not uniquely decodable because it is impossible to determine from a compressed message containing two sequential one's whether the original message has two x's or a y. A uniquely decodable code is said to be a prefix-free code if every codeword can be decoded without having to read beyond the immediate codeword being examined. Thus, the binary code {x,y,z}={0,10,110} is a prefix-free code since reading a “0” in the present codeword indicates a codeword ending. On the other hand, the binary code {x,y,z}={0,01,011} is not a prefix-free code since a “0” could mean the message contains x, y, or z, and reading the next codeword is necessary to decode the present codeword.
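The prefix-free property of the three example codes above can be checked mechanically. The following is a minimal sketch; the function name is_prefix_free is chosen here for illustration and does not appear in the original text.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a proper prefix of a different codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True


# The three code mappings for {x, y, z} discussed in the text.
print(is_prefix_free(["1", "11", "0"]))    # False: "1" is a prefix of "11"
print(is_prefix_free(["0", "10", "110"]))  # True
print(is_prefix_free(["0", "01", "011"]))  # False: "0" is a prefix of "01"
```

The check only tests the prefix condition; it does not decide unique decodability in general, which is a weaker requirement.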
A well known procedure for generating prefix-free codes is called the Huffman coding procedure. Huffman codes are typically generated using a message tree structure dependent on the probability distribution of each letter in the message (frequency of occurrence of the letter divided by the message length). By way of the message tree structure, letters having the highest probability distribution are assigned the shortest codes. Table 1 shows a sample Huffman code mechanism for a set of letters with a hypothetical probability distribution in a message.
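As an illustration of the tree-based construction, the sketch below builds a Huffman code by repeatedly merging the two least probable subtrees. The letters and probabilities are hypothetical and are not the values of Table 1; the function name huffman_code is likewise an assumption made for this example.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Build a Huffman code (letter -> bit string) from letter probabilities."""
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(p, next(tiebreak), {letter: ""}) for letter, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single letter still needs one bit to be representable.
        return {letter: "0" for letter in heap[0][2]}
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codes with 0 and 1.
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {letter: "0" + code for letter, code in codes0.items()}
        merged.update({letter: "1" + code for letter, code in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical probability distribution for four letters.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
codes = huffman_code(probs)
print({letter: len(code) for letter, code in codes.items()})
# codeword lengths: a -> 1, b -> 2, c -> 3, d -> 3
```

The most probable letter receives the shortest codeword, as described above. Decoding still requires the letter-to-codeword mapping, which is the lookup table discussed in the next paragraph.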
Although Huffman coding is an effective technique for achieving high levels of data compression, it has the disadvantage of generally requiring a large lookup table to encode and decode data. A lookup table is needed to reverse the compression process and restore the original data from the coded data. Thus, the lookup table must typically be stored alongside the compressed data, decreasing the effective compression factor. For example, to represent about 128 codewords, one may need codes that are as much as 16 bits in length. This will require a table of size 64K entries. The table size may be reduced by hashing or exploiting the properties of the particular Huffman code used, but such a reduction increases compute time and severely limits the code's adaptability. A large lookup table can also make Huffman coding prohibitive in many embedded systems applications, where the memory available is relatively small. Furthermore, when the characteristics of the data change, the code's optimality is lost and hence a new table may be needed, decreasing the compression efficiency and possibly compute performance.
To avoid storing large lookup tables, Golomb coding techniques have been developed. Golomb codes can be thought of as a special set of variable length prefix-free codewords optimized for non-negative numbers having an exponentially decaying geometric probability distribution. The codewords are constructed such that they can be directly decodable without the need of a lookup table.
Golomb codes are composed of two parts: the integer portion of n/m, represented using a unary code, and the n modulo m portion, represented using a binary code, where n is a non-negative integer within the original source data and m is a coding factor based on the probability distribution of the data. The bit length of the binary code (n mod m) can be either $\lfloor \log_2 m \rfloor$ or $\lceil \log_2 m \rceil$, where $\lfloor x \rfloor$ denotes a floor function returning the greatest integer less than or equal to x, and $\lceil x \rceil$ denotes a ceiling function returning the least integer greater than or equal to x. The following conditions determine the bit length for the binary code:
$\lfloor \log_2 m \rfloor$ bits if (n mod m) < $2^{\lceil \log_2 m \rceil} - m$, and
$\lceil \log_2 m \rceil$ bits otherwise.
Table 2 shows Golomb codewords for several values of parameter m where the unary code (integer(n/m)) is represented using zero runs followed by a one, and an inverse binary code is used to represent n mod m.
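A Golomb encoder along these lines might be sketched as follows. The unary part is written as a run of zeros followed by a one, as described for Table 2. The remainder is written here in the standard truncated binary form, which is an assumption: the text refers to an "inverse binary code" whose exact bit pattern is not given, so the codewords produced below may differ from those of Table 2. The function name golomb_encode is chosen for illustration.

```python
def golomb_encode(n, m):
    """Return the Golomb codeword for non-negative integer n, parameter m >= 1."""
    q, r = divmod(n, m)             # q = integer(n/m), r = n mod m
    unary = "0" * q + "1"           # zero run followed by a one
    b = (m - 1).bit_length()        # ceil(log2(m)) for m >= 1
    if b == 0:
        return unary                # m == 1: the remainder is always 0, no bits needed
    cutoff = (1 << b) - m           # 2**ceil(log2 m) - m
    if r < cutoff:
        # Short remainder: floor(log2 m) = b - 1 bits.
        return unary + format(r, "0{}b".format(b - 1))
    # Long remainder: ceil(log2 m) = b bits, shifted past the short codes.
    return unary + format(r + cutoff, "0{}b".format(b))


print(golomb_encode(3, 3))  # 010   (q=1, r=0, one remainder bit)
print(golomb_encode(4, 3))  # 0110  (q=1, r=1, two remainder bits)
print(golomb_encode(7, 4))  # 0111  (q=1, r=3, two remainder bits)
```

When m is a power of two the cutoff is zero, so every remainder uses the same number of bits.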
To make the binary implementation simple, m is restricted to be a power of 2 such that $m = 2^k$. This subset of the Golomb codes, commonly referred to as Rice codes, leads to very simple encoding and decoding procedures. The code for n is constructed by appending the k least significant bits of n (i.e. n mod m) to the unary representation of the number formed by the remaining higher order bits of n (i.e. integer(n/m)). Thus the binary portion of the Rice code has a fixed bit-length of k bits, and the total bit-length of the Rice codeword for n is given by
$$\mathrm{bitLength}(n) = \left\lfloor \frac{n}{2^k} \right\rfloor + 1 + k.$$
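The Rice construction reduces to shifts and masks, as the following sketch suggests. It assumes the unary part is a run of zeros terminated by a one and that the k low-order bits are written most significant bit first; the function names are chosen for illustration.

```python
def rice_encode(n, k):
    """Rice codeword for non-negative n: unary(n >> k) followed by the k low bits of n."""
    unary = "0" * (n >> k) + "1"
    if k == 0:
        return unary
    return unary + format(n & ((1 << k) - 1), "0{}b".format(k))


def rice_decode(bits, k):
    """Decode a single Rice codeword located at the start of the bit string."""
    q = bits.index("1")                         # length of the zero run
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return (q << k) | r


def rice_bit_length(n, k):
    """Total codeword length: floor(n / 2**k) + 1 + k."""
    return (n >> k) + 1 + k


word = rice_encode(9, 2)
print(word)                    # 00101
print(rice_decode(word, 2))    # 9
print(rice_bit_length(9, 2))   # 5
```

No lookup table is involved in either direction; the only shared state between encoder and decoder is the parameter k.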
Generally, the optimal average codeword length for an input message, known as the message entropy, is calculated as the sum of each letter's probability distribution multiplied by its self-information. That is,
$$H = -\sum_{i=1}^{m} P(A_i) \times \log_2 P(A_i),$$
where $P(A_i)$ is the probability distribution for letter $A_i$ and m here denotes the number of distinct letters. Entropy defines the smallest possible average codeword length achievable using variable length coding, and is generally expressed in units of bits per codeword or bits per sample.
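As a worked example of the entropy formula, the sketch below evaluates H for a hypothetical four-letter distribution. The probabilities are illustrative only and are not taken from the original text.

```python
import math


def entropy(probabilities):
    """Entropy in bits per sample: -sum(P * log2(P)) over letters with P > 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical distribution: 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
```

Letters with zero probability are skipped, since they contribute nothing to the sum.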
The actual average codeword length, also measured in bits per codeword, can be found by
$$l = \sum_{i=1}^{m} P(A_i) \times n(A_i),$$
where $n(A_i)$ is the number of bits in the codeword for letter $A_i$. While Golomb-Rice coding mechanisms are simple to implement and do not require a table, they do not generate optimal codeword lengths for most data distributions, since they assume an exponentially decaying geometric probability distribution.
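The gap between the average codeword length and the entropy can be seen with a small numerical example. The sketch below assumes a hypothetical uniform distribution over the values 0 to 7, which is far from geometric, and evaluates the Rice codeword length formula for several values of k. The function names are chosen for illustration.

```python
import math


def entropy(probs):
    """Entropy in bits per sample."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def average_length(probs, lengths):
    """Average codeword length: sum of P(A_i) * n(A_i)."""
    return sum(p * n for p, n in zip(probs, lengths))


def rice_bit_length(n, k):
    """Rice codeword length for value n with parameter k."""
    return (n >> k) + 1 + k


# Hypothetical non-geometric source: eight equally likely values 0..7.
values = list(range(8))
probs = [1 / 8] * 8

print(entropy(probs))  # 3.0
for k in range(4):
    lengths = [rice_bit_length(n, k) for n in values]
    print(k, average_length(probs, lengths))
# k=0 -> 4.5, k=1 -> 3.5, k=2 -> 3.5, k=3 -> 4.0
```

For this source no choice of k brings the Rice code below 3.5 bits per sample, whereas the entropy is 3 bits per sample.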
There is therefore a need for a compression coding and decoding mechanism where the compression performance comes closer to the entropy of the input data. There is also a need for a compression mechanism with a small lookup table, thereby allowing implementation in lower memory embedded systems and reducing the compression overhead of the mechanism. Such a compression mechanism should be easily adaptable to a change in data distribution and be implemented simply and quickly in either software or hardware.
Briefly stated, the present invention involves a method for encoding a data element in a data stream suitable for use in a data compression mechanism. The method includes the operations of determining a bin having a range of values which includes the data element value, wherein a bin number is associated to the bin, computing an offset of the data element from a minimum bin value, wherein the minimum bin value is associated to the bin, and encoding the bin number and the offset.
The encoding operation may further include representing the bin number in a uniquely decodable code, such as a prefix-free code or a suffix-free code. The bin number may also be encoded using a unary code or a context prediction mechanism. The encoding operation may further include representing the offset in a binary code or Gray code. The offset computing operation may further include computing the offset from an algebraic function which includes the minimum bin value, such as by subtracting the minimum bin value from the data element.
The method may further include storing a bin size in a bin lookup table, wherein the bin size is associated to the bin, and the bin size may be limited to a power of two. In addition, the method may include storing the minimum bin value and/or maximum bin value in a bin lookup table. The method may further include outputting the encoded bin number and offset.
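One possible reading of the encoding method is sketched below. It assumes a hypothetical bin lookup table of power-of-two bin sizes, a unary code (zero run followed by a one) for the bin number, and a plain binary code for the offset. The table contents, the names BIN_SIZES, bin_minimums, and encode_value, and these particular code choices are assumptions made for illustration, not details given in the text.

```python
# Hypothetical bin lookup table: one size per bin, each a power of two.
# With these sizes the bins cover the values 0, 1, 2-3, 4-7, and 8-15.
BIN_SIZES = [1, 1, 2, 4, 8]


def bin_minimums(bin_sizes):
    """Minimum value of each bin, obtained by accumulating the preceding bin sizes."""
    minimums, start = [], 0
    for size in bin_sizes:
        minimums.append(start)
        start += size
    return minimums


def encode_value(value, bin_sizes=BIN_SIZES):
    """Encode a data element as unary(bin number) followed by binary(offset)."""
    for bin_number, (minimum, size) in enumerate(zip(bin_minimums(bin_sizes), bin_sizes)):
        if minimum <= value < minimum + size:
            offset = value - minimum              # offset from the minimum bin value
            offset_bits = size.bit_length() - 1   # log2(size) for a power-of-two size
            prefix = "0" * bin_number + "1"
            suffix = format(offset, "0{}b".format(offset_bits)) if offset_bits else ""
            return prefix + suffix
    raise ValueError("value is outside the range covered by the bin table")


print(encode_value(0))  # 1       (bin 0, no offset bits)
print(encode_value(2))  # 0010    (bin 2, offset 0 in one bit)
print(encode_value(5))  # 000101  (bin 3, offset 1 in two bits)
```

The table holds one small entry per bin rather than one entry per codeword, which is the source of the memory saving relative to a full Huffman lookup table.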
Still another aspect of the invention is a method for decoding a coded element in a data stream suitable for restoring compressed data, wherein the coded element includes a bin number field and an offset field. The method includes the operations of determining a minimum bin value from the bin number field and calculating an original data value from the minimum bin value and the offset field.
The determining operation may further include calculating the minimum bin value by recursively adding bin size values. The method may further include the operation of decoding the bin number field and offset field. The calculating operation can include computing the original data value from an algebraic function which includes the minimum bin value, such as by adding the minimum bin value to the offset field. The method may also include outputting the original data value.
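The decoding method can be sketched in the same spirit. The example assumes a unary bin number field (zero run followed by a one), a binary offset field whose width is log2 of the bin size, and a hypothetical table of power-of-two bin sizes; the names BIN_SIZES and decode_value are illustrative assumptions.

```python
# Hypothetical bin lookup table: one power-of-two size per bin.
BIN_SIZES = [1, 1, 2, 4, 8]


def decode_value(bits, bin_sizes=BIN_SIZES):
    """Decode one coded element (unary bin number, then binary offset)."""
    bin_number = bits.index("1")                 # length of the zero run
    minimum = sum(bin_sizes[:bin_number])        # add up the sizes of all earlier bins
    offset_bits = bin_sizes[bin_number].bit_length() - 1
    start = bin_number + 1
    offset = int(bits[start:start + offset_bits], 2) if offset_bits else 0
    return minimum + offset                      # original value = minimum + offset


print(decode_value("1"))       # 0
print(decode_value("0010"))    # 2
print(decode_value("000101"))  # 5
```

Here the minimum bin value is recomputed from the bin sizes on each call; storing the minimum values in the table instead would trade a little memory for fewer additions.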
In accordance with another aspect of the invention, the invention is a data structure for use in a data compression mechanism of a source data element. The data structure includes a bin number field associated to a bin, where the bin has a range of values which includes the value of the source data element, and an offset field computed using a minimum bin value associated to the bin. The bin number field can be represented by a uniquely decodable code, such as a prefix-free code or a suffix-free code. Additionally, the bin number field can be represented by a unary code or a context prediction mechanism. The offset field can be represented by a binary code or a Gray code, and can be calculated from an algebraic function which includes the minimum bin value, such as by subtracting the minimum bin value from the data element.
The data structure may further include a bin lookup table, wherein the lookup table includes a bin size associated to the bin, and the bin size may be limited to a power of two. The bin lookup table may further include the bin minimum value, or a maximum bin value associated to the bin.
In accordance with yet another aspect of the invention, the invention is an apparatus for encoding a data element in a data stream suitable for use in a data compression mechanism. The apparatus includes a locating unit capable of locating a bin having a range of values which includes the data element value, wherein a bin number is associated to the bin; a computing unit capable of computing an offset of the data element from a minimum bin value, wherein the minimum bin value is associated to the bin; and an encoding unit capable of encoding the bin number and the offset. The apparatus may further include a memory unit capable of storing a representation of the minimum bin value.
Still another aspect of the invention is an apparatus for decoding a coded element in a data stream suitable for restoring compressed data, wherein the coded element includes a bin number field and an offset field. The apparatus includes a lookup unit capable of determining a minimum bin value from the bin number field and a calculating unit capable of calculating an original data value from the minimum bin value and the offset field. The apparatus may further include a memory unit capable of storing a representation of the minimum bin value.
Still another aspect of the invention is a computer program product embodied in a tangible medium suitable for use in a data compression mechanism. The tangible medium may include a magnetic disk, an optical disk, a propagating signal, or a random access memory device.