1. The Field of the Invention
The present invention relates to compression technology. More specifically, the present invention relates to methods, systems and computer program products for performing compression of program binaries using an advanced form of sequential correlation.
2. Background and Relevant Art
Computing systems have revolutionized the way people work and play. Original computing systems were rather monolithic, stand-alone mainframe computing systems often occupying entire rooms despite their relatively low processing and memory capabilities by modern standards. Currently, however, a wide variety of computing systems are available that are often even more powerful than their much larger mainframe ancestors. For example, a computing system may include a desktop computer, a laptop computer, a Personal Digital Assistant (PDA), a mobile telephone, or any other system in which machine-readable instructions or “binaries” may be executed by one or more processors. Computers may even be networked together to allow information to be exchanged electronically, even over large distances, as when using the Internet.
Despite monumental advances in computing technology, computing systems still have limited memory resources and network bandwidth that will vary depending on the computing system. In order to preserve memory resources and network bandwidth, compression technology is often employed to reduce the size of files (or any other data segments such as programs or software modules) with minimal, if any, loss in information. While there are many varying compression technologies, all compression technologies reduce the size of a data segment by taking advantage of redundancies in the file. By reducing the size of the file, the memory needed to store the file and the bandwidth needed to transmit the file are both reduced. The power requirements for processing compressed files are also often reduced which is especially relevant to low power environments such as mobile devices.
Text often compresses well because the semantic and syntactic rules that structure the text introduce a high degree of redundancy. Patterns can be detected in such text that allow one to make reasonable guesses as to the text that follows based on the text that was just read. Skilled human readers with sufficient reading comprehension skills can, for example, often reasonably predict how a sentence will be completed before even reading the entire sentence. Such prediction would not be possible if the text were simply a random sequence of arbitrary text characters following no syntactic or semantic rules.
Due to the predictability of text, text is said to have a high degree of local sequential correlation. That is, a human, and even a computer, can make reasonable predictions as to what text will follow, based on the immediately preceding text. One compression technology that takes advantage of the high degree of local sequential correlation in text is called Prediction by Partial Matching compression or “PPM” compression for short.
PPM compression, and its numerous variants and improvements, are well-known to one of ordinary skill in the art and thus will not be described herein in detail. However, the fundamentals of PPM compression are now described for context. A PPM engine receives an input data-stream to be compressed. As one might expect, the PPM compression process involves sophisticated mathematical manipulations. Accordingly, throughout this summary description of PPM compression, certain mathematical nomenclature is used to describe such mathematical manipulations. While the nomenclature is typically known to one of ordinary skill in the art, the nomenclature will be described in detail for clarity.
Let the input data-stream to be compressed be denoted as x ∈ A^N, where x is a sequence of N symbols from an alphabet A. In the case of a string of English text, for example, the alphabet A might include the characters that are used in English text. x is a specific string of characters from the designated alphabet. In the following discussion, a specific example string of “shareware” is often referred to, although a typical string of text may be many thousands or even millions of characters long. In the example of the string “shareware”, N would be nine since there are nine characters in the string “shareware”.
A “context of order K” is defined as a sequence of K consecutive symbols of x. Let the i-th symbol of the string x be denoted as x_i. A context of order K for the symbol x_i is the sequence of symbols x_(i−K) through x_(i−1). For example, for the example string “shareware”, a context of order four of the character “w” is “hare”. P_C(s) denotes the probability that a symbol s follows a context C, where s belongs to the alphabet A.
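The definition above can be illustrated with a short sketch (not part of the original disclosure; the function name is hypothetical): extracting the order-K context of a symbol simply means taking the K symbols that immediately precede it.

```python
def context_of_order(x, i, K):
    """Return the order-K context of symbol x[i]: the K symbols
    x[i-K] through x[i-1] that immediately precede it."""
    return x[i - K:i]

x = "shareware"
# The character "w" is x[5]; its order-four context is the
# four preceding characters "hare".
assert context_of_order(x, 5, 4) == "hare"
```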
While both compressing and decompressing, PPM builds a model of the input data-stream that aims at estimating the probability that a certain symbol occurs after a certain context. PPM encodes a symbol using an amount of information that depends on the probability that the symbol appears after its current context of a certain order: the more probable the symbol, the fewer bits the encoding requires. The maximum referenced context order is a fixed constant.
The PPM model has entries for each unique context that have occurred in the processed portion of the input data-stream. For example, suppose that the string “shareware” is processed with a maximum context of two. The order two contexts would be “sh”, “ha”, “ar, “re”, “ew”, and “wa”. The order one contexts would be “s”, “h”, “a”, “r”, “e” and “w”. An order zero context would also be present for processing purposes and may be considered the null set “”.
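As a quick check on the contexts just listed, the distinct contexts of each order can be enumerated by sliding a window over the processed string. This is a minimal illustrative sketch, not from the original text.

```python
x = "shareware"
# Every order-two context is a two-character window over x;
# every order-one context is a single preceding character.
order_two = {x[i - 2:i] for i in range(2, len(x))}
order_one = {x[i - 1:i] for i in range(1, len(x))}

assert order_two == {"sh", "ha", "ar", "re", "ew", "wa"}
assert order_one == {"s", "h", "a", "r", "e", "w"}
```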
The model counts the symbols that have occurred after each context in the processed data-stream. If a character has never been encountered before following a particular context, the count of an escape entry (represented by the character “ε”) is incremented by one. For the example string “shareware” with a maximum context of order two, the symbol “a” for the context “sh” would have a count of one. In other words, the symbol “a” followed the two-character string “sh” only once. The escape count for the context “sh” would also be one since, when the “a” was encountered after processing the text “sha”, the “a” had never before occurred in the string following the characters “sh”, and since no symbol other than “a” ever follows the two-character string “sh” in the example string “shareware”. The following Table 1 illustrates the entries that would result from pure PPM after having processed the string “shareware” with a maximum context of two.
TABLE 1

Context Order | Context | Following Symbol(s)          | Count                          | Escape (ε) Count
0             | “”      | “s”, “h”, “a”, “r”, “e”, “w” | 1, 1, 2, 2, 2, 1, respectively | 6
1             | “s”     | “h”                          | 1                              | 1
1             | “h”     | “a”                          | 1                              | 1
1             | “a”     | “r”                          | 2                              | 1
1             | “r”     | “e”                          | 2                              | 1
1             | “e”     | “w”                          | 1                              | 1
1             | “w”     | “a”                          | 1                              | 1
2             | “sh”    | “a”                          | 1                              | 1
2             | “ha”    | “r”                          | 1                              | 1
2             | “ar”    | “e”                          | 2                              | 1
2             | “re”    | “w”                          | 1                              | 1
2             | “ew”    | “a”                          | 1                              | 1
2             | “wa”    | “r”                          | 1                              | 1
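The counting procedure that produces Table 1 can be sketched as follows. This is an illustrative reconstruction, not the original implementation; it assumes the escape count of a context equals the number of distinct symbols seen after that context, which matches the counts in Table 1.

```python
from collections import defaultdict

def build_ppm_counts(x, max_order=2):
    """Count the symbols following each context of order 0..max_order,
    incrementing the escape entry "ε" whenever a symbol is seen
    after a context for the first time."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, s in enumerate(x):
        for K in range(max_order + 1):
            if i - K < 0:
                continue  # not enough preceding symbols for this order
            ctx = x[i - K:i]
            if counts[(K, ctx)][s] == 0:
                counts[(K, ctx)]["ε"] += 1  # first time s follows ctx
            counts[(K, ctx)][s] += 1
    return counts

counts = build_ppm_counts("shareware")
assert counts[(0, "")]["a"] == 2    # "a" occurs twice overall
assert counts[(0, "")]["ε"] == 6    # six distinct symbols seen
assert counts[(2, "ar")]["e"] == 2  # "e" follows "ar" twice
assert counts[(2, "sh")]["ε"] == 1  # only "a" ever follows "sh"
```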
If the string “shareware” were to continue, then a new entry would be added whenever a new symbol follows a particular context. If the context is not yet at the maximum context order, then a new context is created. Each new entry is initialized with a count field for the new symbol that created the context, as well as a count field for the escape character ε for the new context.
Note that each context of order one or two has an escape count of one since only one distinct character follows each of those contexts. Accordingly, only one previously unencountered character was encountered for each context. Had the string been “sharewares”, however, the escape count for context “re” would have been two, since two different characters, “w” and “s”, would have followed the context “re”.
For a symbol that has occurred in the processed data-stream, the probability of the symbol occurring given a particular context is the count of the symbol following the context divided by the total count of all entries following the context (including the escape entry). For example, the context of order zero, the null set “”, has a total count of fifteen, including six for the escape character. The count for the symbol “r” following the context of order zero is two. Accordingly, the probability of the symbol “r” occurring given a context of order zero of “” is 2/15, which may be expressed as P_0(r) = 2/15.
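The probability computation in this paragraph can be reproduced directly from the order-zero counts in Table 1. This is an illustrative sketch, not from the original text.

```python
from fractions import Fraction

# Order-zero counts for "shareware" (from Table 1): six symbol
# entries plus an escape count of six, fifteen in total.
order0 = {"s": 1, "h": 1, "a": 2, "r": 2, "e": 2, "w": 1, "ε": 6}

def prob(counts, symbol):
    """Probability of a symbol in a context: its count divided by
    the total count of all entries (including the escape entry)."""
    return Fraction(counts[symbol], sum(counts.values()))

assert sum(order0.values()) == 15
assert prob(order0, "r") == Fraction(2, 15)  # P_0(r) = 2/15
```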
The current order is incremented by one if the current order is not yet the maximum allowable order, and if an already existing symbol was found after an existing context. The current order is decremented by one if a new symbol was found after an existing context.
A special context of order −1 contains all the symbols in the alphabet A that have not yet been encountered at that point in processing the input data-stream. From this context, the probability of occurrence is uniform for all symbols that belong to the alphabet A. For example, if there are 100 symbols in the alphabet A, then the probability of any given character occurring before processing of the input data-stream is exactly one percent.
PPM uses arithmetic coding to encode predicted and escape symbols according to the probabilities of their occurrence after a certain context. The arithmetic coder converts an input data-stream of arbitrary length into a single rational number from zero to one, which range may be expressed as [0, 1}. In this description, a range from a number a (inclusive) to a number b (exclusive) is denoted as [a, b}. A square bracket “[” or “]” indicates that the range extends to and includes the adjacent number. A curly brace “{” or “}” indicates that the range extends to, but does not include, the adjacent number.
Now described is how the symbol “s” concatenated to “shareware” would be arithmetically coded, assuming that the previous symbols have already been encoded and the current range when coding the final “s” is reset to [0, 1}. The longest (order two) context for “s” is “re”. According to the current PPM model, the probability of “w” occurring after context “re” is the same as the probability of an escape character “ε” occurring after context “re”. In the notation previously set forth, P_re(w) = P_re(ε) = ½. The arithmetic coder thus divides the range [0, 1} into two subranges [0, 0.5} and [0.5, 1}, representing “w” and “ε”, respectively.
Since “s” has not previously been recorded after the context “re”, the order two context is decremented to the corresponding order one context for “s”. In this case, that order one context is “e”. Also, the escape symbol ε is emitted by limiting the output range to [0.5, 1}. Since P_e(w) = P_e(ε) = ½, the arithmetic coder further divides the range into [0.5, 0.75} and [0.75, 1} for “w” and “ε”, respectively. Since “s” has never occurred after the order one context “e” either, the current context is decremented by one to the order zero context “”. Also, the escape character ε is emitted by shrinking the range to [0.75, 1}.
The order zero context “” contains seven entries, namely the six symbols (including “s”) plus the escape character ε. Accordingly, the range is divided into seven subranges, one corresponding to each entry. The size of each subrange is proportional to the probability of occurrence of the corresponding entry. Accordingly, the ranges may be assigned for the order zero context “” as shown in the following Table 2.
TABLE 2

Symbol | Probability | Range
“s”    | 1/15        | [0.75, 0.7667}
“h”    | 1/15        | [0.7667, 0.7833}
“a”    | 2/15        | [0.7833, 0.8167}
“r”    | 2/15        | [0.8167, 0.85}
“e”    | 2/15        | [0.85, 0.8833}
“w”    | 1/15        | [0.8833, 0.9}
“ε”    | 6/15        | [0.9, 1}
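The successive range reductions described above (two escapes followed by the symbol “s”) can be reproduced with exact arithmetic. This is an illustrative sketch under the probabilities stated in the text; the helper name is hypothetical.

```python
from fractions import Fraction

def narrow(low, high, p_lo, p_hi):
    """Shrink [low, high} to the subrange of a symbol whose cumulative
    probability interval within the current range is [p_lo, p_hi}."""
    width = high - low
    return low + p_lo * width, low + p_hi * width

low, high = Fraction(0), Fraction(1)
# Escape from order-two context "re": ε occupies the upper half.
low, high = narrow(low, high, Fraction(1, 2), Fraction(1))
# Escape from order-one context "e": ε again occupies the upper half.
low, high = narrow(low, high, Fraction(1, 2), Fraction(1))
# "s" in order-zero context "": the first 1/15 of the range (Table 2).
low, high = narrow(low, high, Fraction(0), Fraction(1, 15))

# Final range [0.75, 0.7667}, i.e. [3/4, 23/30} exactly.
assert (low, high) == (Fraction(3, 4), Fraction(23, 30))
```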
To encode “s”, the arithmetic coder reduces the range to [0.75, 0.7667}. If “s” were the last symbol to be encoded, the arithmetic coder would output, as the result of the compression, any number in the range [0.75, 0.7667}. Given the starting PPM model, any number within [0.75, 0.7667} uniquely identifies the symbol “s” at the decoder. After processing a symbol, the PPM model is updated. For example, referring to Table 1, having processed “s” after context “re” results in a new entry of “s” after context “re”. Similarly, having processed “s” after context “e” results in a new entry of “s” after context “e”. Having processed “s” after context “” results in the count for “s” for the context “” being incremented by one. In sum, Table 1 would be altered as shown in Table 3.
TABLE 3

Context Order | Context | Following Symbol(s)          | Count                           | Escape (ε) Count
0             | “”      | “s”, “h”, “a”, “r”, “e”, “w” | 2*, 1, 2, 2, 2, 1, respectively | 6
1             | “s”     | “h”                          | 1                               | 1
1             | “h”     | “a”                          | 1                               | 1
1             | “a”     | “r”                          | 2                               | 1
1             | “r”     | “e”                          | 2                               | 1
1             | “e”     | “w”, “s”*                    | 1, 1*, respectively             | 1
1             | “w”     | “a”                          | 1                               | 1
2             | “sh”    | “a”                          | 1                               | 1
2             | “ha”    | “r”                          | 1                               | 1
2             | “ar”    | “e”                          | 2                               | 1
2             | “re”    | “w”, “s”*                    | 1, 1*, respectively             | 1
2             | “ew”    | “a”                          | 1                               | 1
2             | “wa”    | “r”                          | 1                               | 1

(The asterisk symbol * identifies portions where Table 3 has changed from Table 1)
The arithmetic coder iteratively reduces its operating range until the leading digits of the high and low bounds are equal. Then, the leading digit may be transmitted. In the above example where the range was [0.75, 0.7667}, the digit 7 may be transmitted and the updated range becomes [0.5, 0.6667}. This process is often referred to as “normalization” and allows compression of files of any length on limited-precision arithmetic coders. The PPM compression process is fully invertible by the PPM decompression process. In other words, the decoder need only repeat the coding process described above to arrive at the correct string of text. The only caution is that the version of PPM used to compress must be the same version of PPM used to decompress. If the versions differ, even slightly, decompression would almost certainly be unsuccessful.
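The normalization step can be sketched as follows: whenever the low and high bounds share a leading decimal digit, that digit is emitted and the range is rescaled by ten. This is an illustrative sketch, not the original implementation.

```python
from fractions import Fraction

def normalize(low, high):
    """Emit the leading decimal digits shared by low and high,
    rescaling the remaining range by ten after each digit."""
    digits = []
    while int(low * 10) == int(high * 10):
        d = int(low * 10)
        digits.append(d)
        low = low * 10 - d
        high = high * 10 - d
    return digits, low, high

# Normalizing the range [0.75, 0.7667} from the example:
# the shared digit 7 is emitted and the range becomes [0.5, 0.6667}.
digits, low, high = normalize(Fraction(3, 4), Fraction(23, 30))
assert digits == [7]
assert (low, high) == (Fraction(1, 2), Fraction(2, 3))
```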
A number of PPM variants have been developed as improvements to the original PPM algorithm described above. One difference is how the probabilities of the escape symbol ε are calculated. For example, one variant fixes the count of the escape symbol ε at one for any given context. Another increments the escape count by only one half when encountering a previously unencountered symbol for a given context. Other variants calculate escape probabilities using heuristics that account for the number of symbols that have occurred in a given context. One standard improvement to PPM called “exclusion” provides that only the context in which the symbol is found, as well as higher order contexts, are updated in the PPM model.
PPM uses local sequential correlation to perform its predictions. Hereinafter, unless specifically limited, “PPM” refers to the original PPM described in some detail above, along with all of its variants and improvements including those referred to above, as well as other variants derived from the original PPM compression technology. Due to its heavy emphasis on local sequential correlation, and due to the local sequential correlation inherent in text that follows specific syntactic and semantic rules, PPM is heavily used to compress text files.
PPM is also used to compress program binaries. As used herein, “program binaries” means a sequence of machine-level executable instructions. As is apparent from the above description of PPM, PPM exclusively exploits the localized correlations in one-dimensional neighborhoods of the input data-stream. For example, PPM uses contexts that immediately precede the currently evaluated symbol.
Program binaries are structured somewhat differently than text. Text may be conceived of as a stream of conversation flowing in one direction. Program binaries may be conceived of as a vertical list of instructions. Program binaries have some degree of local sequential correlation within a single instruction. This local sequential correlation is referred to herein as “horizontal correlation”. However, program binaries also have correlation between similar fields in different instructions. That correlation is referred to herein as “vertical correlation”. PPM is well-suited to taking advantage of the horizontal correlation, but is not well-suited to taking advantage of the vertical correlation present in a sequential list of program binaries.
In order to conserve memory resources and network bandwidth, it is desirable to compress files that are to be stored and transmitted as much as possible. This is true of executable files and other sequences of program binaries. Accordingly, what is desired are methods, systems, and computer program products for performing more efficient compression of program binaries than is allowed by compression mechanisms that take advantage of horizontal correlation alone.