In overview, as illustrated in FIG. 1, known encoding methods of encoding input data D1 to generate corresponding encoded output data E2 involve applying one or more transformations T to the input data D1 to generate corresponding transformed encoded output data E2, wherein the transformed encoded output data E2 has associated therewith coding table data C information which is indicative of one or more coding tables defining the one or more transformations T employed. The encoded transformed data E2 and the coding table data C information, namely collectively the encoded output data E2, are often communicated via a data carrier and/or via a data communication network to one or more decoders which are operable to apply one or more inverse transformations T−1 to decode the encoded output data E2 to generate corresponding decoded data D3. It is often desirable that the encoded output data E2 is compressed relative to the input data D1, for example to reduce communication network capacity load when communicating the encoded output data E2. Moreover, it is also desirable that the encoded output data E2 is compressed in a substantially lossless manner, so that the decoded data D3 is an accurate reproduction of information included in the input data D1. Data compression achievable in the encoded output data E2 relative to the input data D1 is potentially inefficient when the coding table data C information is significant in size relative to the encoded transformed data E2, namely the coding table data C information corresponds to a significant data overhead in the transformed encoded data E2.
There are several known methods of encoding the input data D1 to generate the encoded output data E2. For example, known Huffman encoding or other VLC encoding methods are often employed to compress various types of data. Moreover, Arithmetic coding, or Range coding, are becoming increasingly popular for compressing input data, but are quite inefficient in situations where:
(i) a frequency table for the input data D1 is not already known by an encoder which is operable to encode the input data D1 to generate corresponding encoded output data E2, and by a decoder which is operable to decode the encoded output data E2; and
(ii) the amount of input data is relatively small, for example in a situation where the input data D1 is being communicated in small data segments or data chunks, wherein each data segment or data chunk is accompanied by a corresponding frequency table.
As aforesaid, such inefficiency arises because delivery of one or more frequency tables consumes considerable data space, unless each frequency table can be selected using relatively few identification parameters from a list of possible frequency tables, for example a list which the decoder has stored locally thereat. Moreover, it is also less probable to find suitable frequency tables than suitable code tables from such a list. Often, the input data D1 to be encoded can also vary locally, for example it is transformed during transmission through a communication network to conform to spatially local data standards for the communication network.
There are known methods available for delivering code tables or frequency tables in association with communicating encoded data content derived from symbols. Most of the known methods employ a direct delivery of a Huffman tree or frequencies of the symbols. Such known methods are not satisfactory, because they require considerable information to be delivered from an encoder to a corresponding decoder. Moreover, there are also known methods of delivering lengths of code table symbols, for example as employed in a known Intel IPP-library, which has been contemporarily deprecated; there is employed a method of compressing a code table, namely by way of “HuffLenCodeTablePack”, and decoding it back again, namely by way of “HuffLenCodeTableUnpack”; however, this method is not satisfactory and sometimes even increases the size of data during encoding processes. Moreover, the method also requires that there are 256 symbols, and that all the symbols from 0 to 255 have a non-zero length for their code words. Methods that deliver code tables are still clearly amongst the most efficient delivery mechanisms that are currently available for prefix codes that are generated, for example, by Huffman encoding techniques. When a Huffman tree is delivered from an encoder to a corresponding decoder, the generated code symbols are always the same in the encoder and in the decoder. When only a frequency table is delivered, the same algorithm has to be used in the encoder and in the decoder for actual Huffman tree generation from the frequency table, if the Huffman tree is needed, to enable decoding of the symbols in a proper manner at the decoder. If the lengths of code table symbols are delivered, then the same method of transforming lengths of symbols to a frequency table is also needed in the encoder and in the decoder to enable decoding of the symbols in a proper manner.
Communicating lengths of symbols from the encoder to the decoder is not a practical method of delivering frequencies for Arithmetic coding and Range coding, because these coding methods are designed to support more accurate frequency tables than are enabled merely by communicating the lengths of code symbols. Lengths of code symbols can also be used in Arithmetic coding and Range coding. However, these methods then do not offer benefits compared to Huffman coding, for example, if no adaptive update of tables is executed later for future data. Delivery of information which is indicative of probabilities usually offers more optimal coding results with Range coding or Arithmetic coding in contradistinction to Huffman coding. The probabilities of the symbols can be calculated by dividing frequencies of occurrence of the symbols by a sum of the symbol frequencies of occurrence, namely by the total number of symbols. A delivery of such probabilities is beneficially made by using scaled probability values. Scaled probability values can be calculated by multiplying original symbol probability values with an integer number that is advantageously a power of two value, namely 2^n wherein n is an integer, and by then rounding the result to a nearest integer value. The sum of these scaled probabilities as integer numbers is equalized to be the same as the multiplier value. An escape code symbol is also beneficially created for symbols that otherwise are not allocated their own non-zero scaled probability value. This means that those symbols that need an escape code have a probability that is smaller than what can be presented with the selected multiplier value. It is also possible to create scaled probabilities without employing an escape code, by way of two different mechanisms. In a first mechanism, the multiplier value can be increased and then the new probability values can be calculated. In a second mechanism, those scaled probability values for available symbols that are equal to zero are upgraded to be equal to one.
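By way of illustration only, the computation of scaled probability values with an escape code symbol can be sketched as follows in Python. The function name scaled_probabilities, the parameter n_bits, and the policy of absorbing the rounding difference in the currently most probable symbol are assumptions made for this sketch; they are not taken from the foregoing description. The sketch also assumes that the alphabet is small relative to the multiplier value.

```python
from collections import Counter

def scaled_probabilities(data, n_bits=8):
    """Return ({symbol: scaled probability}, escape value).

    The returned values, including the escape value, sum to 2**n_bits.
    Hypothetical helper: names and the equalization policy are assumptions.
    """
    if not data:
        raise ValueError("no symbols to scale")
    multiplier = 1 << n_bits              # power-of-two multiplier, 2^n
    counts = Counter(data)                # frequencies of occurrence
    total = sum(counts.values())          # total number of symbols
    # Integer form of round(count / total * multiplier), rounding halves up.
    scaled = {s: (2 * c * multiplier + total) // (2 * total)
              for s, c in counts.items()}
    # Symbols rounded down to zero are represented via the escape symbol.
    kept = {s: v for s, v in scaled.items() if v > 0}
    escape = 1 if len(kept) < len(scaled) else 0
    # Equalize the sum to the multiplier, one unit at a time (assumed policy).
    diff = multiplier - (sum(kept.values()) + escape)
    while diff != 0:
        step = 1 if diff > 0 else -1
        target = max(kept, key=kept.get)
        kept[target] += step
        diff -= step
    return kept, escape

table, escape = scaled_probabilities("a" * 1000 + "b" * 20 + "c", n_bits=4)
print(table, escape)
```

In this usage example, with a multiplier of 16, the symbols "b" and "c" are too improbable to receive their own non-zero scaled value, so they share the escape symbol, and the sum of all values equals the multiplier.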
This probability value upgrade requires that the increase of the probability values is compensated by decreasing the probability values of the other symbols. This is done so as to make the sum of probabilities exactly the same as the multiplier value. This procedure makes the probability values less well optimized than would otherwise be possible, but escape symbols are not needed, and in some cases it might still be the optimal coding solution. Lengths of symbols or probability values define a rough estimate of a frequency table that can be used for methods that employ variable length coding symbols, for example Huffman coding, Range coding, Arithmetic coding and any other variable length coding methods. It will be appreciated that the scaled probability table can be used directly as the rough estimate of symbol frequencies when they are needed, whereas lengths of symbols need first to be converted to the rough estimate of symbol frequencies before utilization. The conversion from lengths of symbols to a frequency table will be shown later in connection with the data encoding and table delivery.
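By way of illustration only, the alternative without an escape code symbol, in which zero-valued scaled probabilities are upgraded to one and the increase is compensated in other symbols, can be sketched as follows in Python. The function name scale_without_escape and the choice to take the compensation from the currently most probable symbol are assumptions made for this sketch.

```python
def scale_without_escape(counts, n_bits=8):
    """Scale {symbol: frequency} to integers summing to 2**n_bits, no escape.

    Hypothetical helper: the compensation policy is an assumption.
    """
    multiplier = 1 << n_bits
    total = sum(counts.values())
    # Integer form of round(count / total * multiplier), rounding halves up.
    scaled = {s: (2 * c * multiplier + total) // (2 * total)
              for s, c in counts.items()}
    # Upgrade every zero-valued scaled probability to one.
    for s in scaled:
        if scaled[s] == 0:
            scaled[s] = 1
    # Compensate so that the sum is exactly the multiplier value.
    diff = multiplier - sum(scaled.values())
    while diff != 0:
        step = 1 if diff > 0 else -1
        target = max(scaled, key=scaled.get)
        if step < 0 and scaled[target] == 1:
            raise ValueError("multiplier too small for this alphabet")
        scaled[target] += step
        diff -= step
    return scaled

print(scale_without_escape({"a": 1000, "b": 20, "c": 1}, n_bits=4))
```

In this usage example, the symbols "b" and "c" each receive a value of one, and the value of "a" is reduced so that the sum remains 16; the table is thereby slightly less accurate, but no escape symbol is required.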
Many known practical methods of encoding data do not utilize optimized code tables at all, namely they utilize fixed code tables for encoding data to generate corresponding encoded data, and fixed code tables for subsequently decoding the encoded data. Sometimes tables are updated with adaptive methods based on the delivered symbols. In certain known methods, there are sometimes utilized a couple of different code tables, alternatively frequency tables, for encoding data in an encoder and correspondingly decoding the encoded data in a decoder, wherein an index defining a selected code table, probability table, or frequency table, is delivered as information from the encoder to the decoder. In certain methods, there are employed separate tables for luminance and color channels, for inter- and intra-blocks, or for different kinds of data; however, the separate tables are communicated in an inefficient manner; for example, reference is herewith made to a following Internet web-site (Wikipedia): http://en.wikipedia.org/wiki/Huffman_coding. During decompression, using Huffman-based methods, a Huffman tree must be reconstructed. In a simplest case, where character frequencies are relatively predictable, the tree is susceptible to being reconstructed, and even statistically adjusted on each compression cycle, and thus reused every time, at an expense of at least some measure of compression efficiency; alternatively, Huffman-tree information must be sent a priori, namely, beforehand.
A simple approach of prepending frequency counts relating to symbols that are coded into an output stream of compressed data has a major disadvantage of increasing a data volume in the compressed data by at least several kilobytes (kB) in practice, so such a simple approach has little practical use. If the data is compressed using canonical encoding, the compression model can be precisely reconstructed with just B·2^B bits of information, wherein B is the number of bits per symbol; for example, with 8 bits per symbol it requires 8 × 2^8 = 2048 bits, namely 2 kbit (256 bytes).
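By way of illustration only, reconstruction of a canonical code table from code lengths alone can be sketched as follows in Python. The function name canonical_codes and the tie-breaking order, namely shorter codes first and then symbols in ascending order, are assumptions made for this sketch; any fixed ordering is suitable provided that the encoder and the decoder employ the same one.

```python
def canonical_codes(lengths):
    """Map {symbol: code length in bits (> 0)} to {symbol: bit string}.

    Hypothetical helper: codes are assigned in order of (length, symbol).
    """
    codes = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)      # append zeros when the length grows
        codes[sym] = f"{code:0{length}b}"
        code += 1                         # next code of the same length
        prev_len = length
    return codes

print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3}))

# Side information when one B-bit length is sent for each of 2**B symbols.
B = 8
print(B * 2 ** B, "bits =", B * 2 ** B // 8, "bytes")
```

Because the code words follow deterministically from the lengths, only the lengths need to be communicated as side information.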
Another method is simply to prepend the Huffman tree, bit by bit, to the compressed output stream. For example, assuming that the value of 0 represents a parent node and 1 a leaf node, whenever the latter is encountered, a tree-building routine simply reads a next 8 bits to determine a character value of that particular leaf. Such a process continues recursively until a last leaf node is reached; at that point, the Huffman tree will thus be faithfully reconstructed, for example at a decoder. A data overhead arising from using such a method ranges from roughly 2 to 320 bytes, assuming an 8-bit alphabet.
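By way of illustration only, the bit-by-bit tree delivery described above can be sketched as follows in Python. The representation of the tree, namely an integer byte value for a leaf and a two-element tuple for a parent node, and the function names write_tree and read_tree are assumptions made for this sketch. Bits are held as a text string of "0" and "1" characters for readability.

```python
def write_tree(node, bits):
    """Append the serialized tree to the list `bits` (pre-order traversal)."""
    if isinstance(node, tuple):           # parent node: emit 0, then children
        bits.append("0")
        write_tree(node[0], bits)
        write_tree(node[1], bits)
    else:                                 # leaf: emit 1, then the 8-bit value
        bits.append("1" + format(node, "08b"))

def read_tree(bits, pos=0):
    """Rebuild the tree from the bit string; return (tree, next position)."""
    if bits[pos] == "1":                  # leaf: read the next 8 bits
        return int(bits[pos + 1:pos + 9], 2), pos + 9
    left, pos = read_tree(bits, pos + 1)  # parent node: recurse twice
    right, pos = read_tree(bits, pos)
    return (left, right), pos

tree = (97, (98, 99))                     # leaves hold byte values
out = []
write_tree(tree, out)
serialized = "".join(out)
rebuilt, used = read_tree(serialized)
print(len(serialized), rebuilt == tree)
```

In this representation, a tree with n leaves has n − 1 parent nodes and therefore occupies 10n − 1 bits, namely 19 bits for two leaves and 2559 bits, just under 320 bytes, for 256 leaves, which is consistent with the aforementioned overhead range.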
In order to elucidate further known methods of encoding data and corresponding methods of decoding encoded data, Huffman decoding will next be described in overview. It will be appreciated that any other methods, for example Range decoding or Arithmetic decoding, can also be utilized instead of Huffman decoding. Before commencing compression of a data file, a compressor, in an encoder, has to determine codes to be employed when executing the compression.
When Huffman coding is employed, before starting compression of a given data file including symbols to generate corresponding encoded output data, an encoder has to determine codes that are to be used to represent the given data. Conveniently, the codes are based on probabilities, namely frequencies of occurrence, of the symbols in the given data file. However, the frequencies, probabilities, or lengths of symbols have to be recorded, for example as side information, namely as supplementary information, in the encoded output data, so that any Huffman decoder will be able to decode the encoded output data to generate corresponding decoded data. Conveniently, the frequencies of occurrence or lengths of symbols are integers, or probabilities which can be expressed as scaled integers; such integers included in the supplementary information often add merely a few hundred bytes to the encoded output data. Optionally, it is also possible to write variable length codes themselves to the encoded output data, but this may in certain circumstances be awkward, because the codes may have mutually different sizes. Alternatively, it is feasible to write the Huffman tree to the encoded output data, but this requires more data to be communicated than merely communicating frequencies of occurrence of the symbols in the given data.
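By way of illustration only, determination of Huffman code lengths from the frequencies of occurrence of symbols can be sketched as follows in Python, using the standard heapq module. The function name huffman_code_lengths and the tie-breaking rule are assumptions made for this sketch; a different tie-breaking rule can yield different but equally optimal lengths.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for the symbols in `data`."""
    counts = Counter(data)
    if len(counts) == 1:                  # a lone symbol still needs one bit
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, unique tie-breaker, symbols in the subtree).
    heap = [(c, i, [s]) for i, (s, c) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)   # two least frequent subtrees
        c2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # merging adds one bit to each code
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths

print(huffman_code_lengths("aaaabbc"))
```

The resulting lengths are small integers, and are therefore an example of the kind of side information that can be recorded in the encoded output data.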
During operation, the decoder must be provided with information regarding what is at a start of an encoded compressed file received at the decoder for being decoded. From data extracted from the encoded compressed file, for example from a start thereof, the decoder is operable to construct an alphabet of a Huffman tree. After the Huffman tree has been constructed in the decoder, the decoder is then able to decode a remainder of the file, using the Huffman tree as a decoding tool. The decoder employs a relatively simple decoding algorithm which includes the following steps:
(a) start at a root of the Huffman tree, and then read a first bit of the encoded output data to be decoded using the Huffman tree;
(b) if the first bit is a “1”, then follow a top edge of the Huffman tree; if the first bit is a “0”, then follow a bottom edge of the Huffman tree;
(c) read a second bit of the encoded output data, and then employ the second bit in a manner akin to the step (b) towards “leaves” of the Huffman tree, and so on until a “leaf” of the Huffman tree is eventually reached, whereat an original uncompressed symbol is to be found, often an associated ASCII code; that code is then output from the decoder; and
(d) steps (a) to (c) are repeated, starting again at the root of the Huffman tree for each subsequent symbol, until the encoded output data has been decoded.
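By way of illustration only, the decoding steps above can be sketched as follows in Python. The representation of the tree, namely a two-element tuple for each internal node in which index 0 holds the child reached by a "0" bit and index 1 holds the child reached by a "1" bit, with a text string for each leaf symbol, and the function name huffman_decode are assumptions made for this sketch. The sketch assumes a tree having at least two leaves.

```python
def huffman_decode(bits, root):
    """Decode a string of "0"/"1" characters using the given Huffman tree."""
    out = []
    node = root                           # start at the root of the tree
    for bit in bits:
        node = node[int(bit)]             # follow the edge selected by the bit
        if not isinstance(node, tuple):   # a leaf has been reached
            out.append(node)              # output the original symbol
            node = root                   # restart at the root
    if node is not root:
        raise ValueError("encoded data ends in the middle of a code word")
    return "".join(out)

# Tree for the code a = 0, b = 10, c = 11.
tree = ("a", ("b", "c"))
print(huffman_decode("010110", tree))
```

The walk returns to the root after every decoded symbol, so that no separator between code words is required in the encoded output data.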
Known contemporary Huffman encoding is beneficial to employ when an encoded string is large in size relative to a code table employed to generate the string. Moreover, such contemporary Huffman coding is beneficial to employ when a code table is defined a priori for both an encoder and a corresponding decoder. There is thus a need for alternative encoding methods which address the aforementioned limitations associated with known approaches to encoding and decoding data, for example the aforesaid Huffman encoding and decoding methods.