Data compression is widely used in the processes of transmitting and storing data to increase the transmission speed and reduce the storage requirements for the data. Data compression is a process that reduces the number of bits used to represent information in a data file or data stream. An example of a common data compression process is LZ77, in which a data sequence is compressed by indicating portions of the data sequence that have previously occurred in the sequence, as shown with respect to the coding method 100 depicted in FIG. 1.
As shown in FIG. 1, a plurality of characters forming a data sequence 102 is depicted in their respective input positions 104. Also depicted is the code 106 corresponding to the order in which the data sequence 102 is compressed. In the coding method 100, it is assumed that the data sequence 102 is inputted in the input position order shown therein. As such, the character “A” in the first input position is inputted first and the character “B” in the second input position is inputted second, and so forth.
The character “A” in the first input position is coded as is because there have been no previously inputted or coded characters. The characters that are coded as is are considered literals 108. When the character “B” in the second input position is inputted, it is compared with the previously inputted characters. In the example shown in FIG. 1, the character “B” is also coded as is because there are no preceding matching characters, and is thus also a literal 108. However, when the characters “A” and “B” in the third and fourth positions are inputted and compared with the previously inputted characters, a match 110 is determined with the characters in the first and second input positions. In this case, a match offset 120 and a match length 122 are coded instead of the characters “A” and “B”.
The match offset 120 identifies the location of the match 110 (repetitive characters) and the match length 122 identifies the length of the match 110. In the coding method 100, the match offset 120 for the characters “A B” in the third and fourth input positions is “2”, which specifies the difference between the current position (position of the next symbol to be processed) and the input position 104 of the first character “A” in the match 110, and the match length 122 is “2” because the match 110 includes two characters “A” and “B”.
In addition, when the character “C” in the fifth position is input, it is coded as is because there have been no previously matching characters inputted. Similarly, the characters “D” and “E” in the sixth and seventh positions, respectively, are coded as is for the same reasons. However, when the characters “A B C D E” are input, they are determined to match the characters in the third to the seventh input positions 104. As such, the match 110 is coded with a match offset 120 of “5”, which identifies the match 110 as being located 5 positions back from the current position and the match length 122 is “5” because there are five characters in the match 110. The coding method 100 repeats this process to code the remaining characters in the data sequence 102.
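The walkthrough above can be sketched in code. The following is a minimal, illustrative encoder (not the coding method 100 itself): it performs a greedy search over all earlier positions for the longest preceding match and, following the offset convention described above, records the offset as the difference between the current position and the input position of the first matching character. The function name and token format are assumptions for illustration.

```python
def lz77_encode(data):
    """Greedy LZ77-style encoder sketch: emits literal tokens and
    (offset, length) match tokens, where offset is the distance from the
    current position back to the start of the earlier match."""
    out = []   # tokens: ('lit', char) or ('match', offset, length)
    pos = 0
    while pos < len(data):
        best_len, best_off = 0, 0
        # Search every earlier start position for the longest match.
        for start in range(pos):
            length = 0
            while (pos + length < len(data)
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - start
        if best_len >= 2:   # code a match only when 2+ characters repeat
            out.append(('match', best_off, best_len))
            pos += best_len
        else:
            out.append(('lit', data[pos]))
            pos += 1
    return out
```

Applied to the sequence of FIG. 1 ("A B A B C D E A B C D E"), this sketch reproduces the coding order described above: two literals, a match with offset 2 and length 2, three literals, and a match with offset 5 and length 5.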
When a decoder decompresses the data sequence containing the match offsets and the match lengths, the decoder is able to identify the characters referenced by the matches 110 because the decoder has already received and decoded the characters in the earlier input positions 104.
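A corresponding decoder sketch, assuming the same illustrative token format of literals and (offset, length) pairs, shows why every back-reference is resolvable: each match copies characters from output that has already been decoded, one character at a time, which also handles matches that overlap the current position.

```python
def lz77_decode(tokens):
    """Decoder sketch: rebuilds the data sequence from literal tokens and
    (offset, length) match tokens by copying from already-decoded output."""
    out = []
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:
            _, offset, length = tok
            for _ in range(length):
                # Copy from `offset` positions back; copying one character
                # at a time permits overlapping matches (offset < length).
                out.append(out[-offset])
    return ''.join(out)
```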
Numerous variations of the LZ77 process have been widely implemented. Examples of these variations include LZ78, LZW, LZSS, and LZMA, as well as other variants thereof.
Another manner in which data is compressed is through context modeling, in which a probability distribution that models the frequency with which each symbol is expected to occur in each context is associated with each context. By way of example, in a binary stream of zeros and ones, the context is the value of the previous bit. In this example, there are two contexts, one context corresponding to the previous bit being a zero and another context corresponding to the previous bit being a one. Thus, for instance, whenever the previous bit is a zero, there will be a probability distribution on the value of the next bit being a zero or a one.
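The two-context binary example above can be sketched as follows. This is an illustrative frequency-counting model (a practical coder would typically update its counts adaptively as symbols arrive); the function name is an assumption for illustration.

```python
from collections import Counter

def order1_binary_model(bits):
    """Context-modeling sketch: for each 1-bit context (the previous bit),
    estimate a probability distribution over the value of the next bit
    from observed frequencies."""
    counts = {0: Counter(), 1: Counter()}
    for prev, cur in zip(bits, bits[1:]):
        counts[prev][cur] += 1
    model = {}
    for ctx, c in counts.items():
        total = sum(c.values())
        # None when a context was never observed in the stream
        model[ctx] = {b: c[b] / total for b in (0, 1)} if total else None
    return model
```

For instance, in the stream 0 0 0 1 0 0 1 1, a zero is followed by a zero three times out of five, so the zero context assigns probability 0.6 to the next bit being a zero.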
Under context modeling, compression is generally achieved by assigning shorter codes (or codewords) to data symbols that are more likely to occur and longer codes to data symbols that are less likely to occur. In order to achieve this compression, the data symbols are modeled and probabilities are assigned to the likelihoods of the data symbol values occurring. By way of example, in an English text sequence, if the context, comprised of the immediately preceding symbol, is "q", it is more probable that the next letter will be a "u" as compared with a "z". As such, compression may be enhanced by assigning a shorter code to the letter "u" as compared to the letter "z". Arithmetic coding is a well-known technique for carrying out such a codeword assignment that is especially compatible with context-dependent probability assignment. More particularly, in arithmetic coding, a sequence of probabilities assigned to symbols is sequentially mapped into a decodable binary codeword whose length is close to the sum of the base-2 logarithms of the inverses of the assigned probabilities.
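The codeword-length property stated above can be illustrated numerically. The sketch below computes the sum of the base-2 logarithms of the inverse probabilities for a sequence of assigned probabilities, which is the length (in bits) an arithmetic coder approaches, to within a small constant overhead; it is not an arithmetic coder itself.

```python
import math

def ideal_code_length(probabilities):
    """Sum of log2(1/p) over the probabilities assigned to a sequence of
    symbols: the codeword length an arithmetic coder closely approaches."""
    return sum(math.log2(1.0 / p) for p in probabilities)
```

For example, a sequence of three symbols assigned probabilities 0.5, 0.25, and 0.25 yields an ideal code length of 1 + 2 + 2 = 5 bits, whereas a highly probable symbol such as a "u" following a "q" with probability 0.9 contributes only about 0.15 bits.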
Although suitable for most data compression operations, improvements upon these processes to further compress data would be beneficial to thereby further reduce bandwidth requirements and storage capacities for the compressed data.