As computing advances, ever more data is generated, and the growing volume of data creates a significant problem for both storage and transmission. As used herein, the term “data” broadly refers to an electronic representation of information, such as, for example, source code of software, documents, video files, audio files, graphic files (e.g., bitmaps), etc. One common solution for handling large volumes of data is data compression. For instance, data may be compressed before being stored in order to save storage space. Likewise, data may be compressed before being transmitted in order to reduce the network bandwidth used in transmission.
One conventional data compression technique is adaptive data compression using prediction by partial matching (PPM) models. The compression algorithm generally works by collecting statistics of input symbols that have been seen in an input stream of data, then relating the statistics to the last several symbols seen. The maximum number of symbols that can be used in an attempted match is the order of the model. For instance, an order three model may use up to the last three input symbols to try to find a prediction for a current input symbol.
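The notion of model order can be illustrated with a minimal Python sketch. It is a simplification and not a full PPM coder: it gathers statistics in a separate pass rather than adaptively, and it omits the arithmetic coder. The names `build_stats` and `predict` are hypothetical and chosen only for this illustration.

```python
from collections import Counter, defaultdict


def build_stats(data: bytes, order: int = 3):
    """For every context of length 0..order, count the symbols that followed it."""
    stats = defaultdict(Counter)
    for i, sym in enumerate(data):
        # The context of length k is the k symbols immediately before position i.
        for k in range(min(order, i) + 1):
            stats[data[i - k:i]][sym] += 1
    return stats


def predict(stats, history: bytes, order: int = 3):
    """Return (context, counts) for the longest context of `history` seen so far."""
    for k in range(min(order, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        if ctx in stats:
            return ctx, stats[ctx]
    return b"", Counter()
```

With `order=3`, `predict` first tries the last three symbols of the history, then the last two, then the last one, and finally the empty context, which corresponds to the "up to the last three input symbols" behavior described above.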
In one conventional implementation of a PPM model, the statistics are stored in a collection of fixed-size, static tables, where each table contains a large collection of links and counts, one link and one count for each possible input symbol. For example, in a model used on an input stream of 8-bit bytes, each table may hold 256 links and 256 counts and, in some cases, an additional link to a preceding context and an additional count for a special-purpose pseudo-code used to signal that the current input symbol has not been seen in this context before. For short contexts, the above approach is generally acceptable in terms of memory usage because the tables tend to fill up. However, for longer contexts, most of the links and counts in the tables are likely never used. As such, much of the space allocated to the tables is wasted.
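The memory behavior of such fixed-size tables can be made concrete with a short Python sketch. It is only an assumed layout for illustration: the class `FixedContextTable`, its field names, and the helper functions are hypothetical, and the escape count is allocated but not updated because no coding is performed here.

```python
from collections import defaultdict

ALPHABET = 256  # one slot per possible 8-bit input symbol


class FixedContextTable:
    """One statically sized table: a link and a count for every symbol."""

    def __init__(self, parent=None):
        self.links = [None] * ALPHABET   # link to the table of the extended context
        self.counts = [0] * ALPHABET     # how often each symbol followed this context
        self.parent = parent             # link to the preceding (shorter) context
        self.escape_count = 0            # count for the "not seen here" pseudo-code

    def used_slots(self) -> int:
        return sum(1 for c in self.counts if c)


def build_tables(data: bytes, order: int):
    """Allocate one full table per context of length 0..order that occurs in data."""
    tables = {b"": FixedContextTable()}
    for i, sym in enumerate(data):
        for k in range(min(order, i) + 1):
            ctx = data[i - k:i]
            table = tables.get(ctx)
            if table is None:
                # The shorter context ctx[1:] was handled in the previous iteration.
                table = tables[ctx] = FixedContextTable(parent=tables[ctx[1:]])
                tables[ctx[:-1]].links[ctx[-1]] = table
            table.counts[sym] += 1
    return tables


def occupancy_by_order(tables):
    """Fraction of count slots actually used, grouped by context length."""
    used, total = defaultdict(int), defaultdict(int)
    for ctx, table in tables.items():
        used[len(ctx)] += table.used_slots()
        total[len(ctx)] += ALPHABET
    return {k: used[k] / total[k] for k in sorted(total)}
```

Calling `occupancy_by_order(build_tables(data, 3))` on typical input tends to show the used fraction dropping as the context length grows, even though every table reserves all 256 links and 256 counts.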
In addition to wasting space, the above conventional compression technique also tends to be slow for longer contexts because, on average, over half of the entries in a used table are read whenever an input symbol is coded. The counts for about half of the entries are added together to arrive at the starting point of the range spanned by the found input symbol. Such computation may take a long time, thus leading to slower compression for longer contexts.
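The cost of locating the range for a symbol can be sketched as a linear scan over the count table. The following Python fragment is an illustrative assumption about how such a lookup might be written, not a description of any particular implementation; `range_start` is a hypothetical name.

```python
def range_start(counts, sym):
    """Return (low, reads): the start of sym's range and the entries read to find it."""
    low = 0
    reads = 0
    for s in range(sym):      # every entry before sym must be visited
        low += counts[s]
        reads += 1
    return low, reads


def average_reads(alphabet_size=256):
    """Mean number of entries read when every symbol value is equally likely."""
    counts = [1] * alphabet_size
    return sum(range_start(counts, s)[1] for s in range(alphabet_size)) / alphabet_size
```

The scan visits every slot below the coded symbol regardless of whether its count is zero, so a sparsely filled table for a long context costs as much to scan as a full one.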