Compression reduces the size of a set of data by representing it in an equivalent but more compact form. Data compression refers to the process of reducing the amount of data needed to represent information. Data compression techniques reduce the cost of storing and transmitting information and are used in many applications, ranging from simple file size reduction to speech and video encoding.
The most commonly used compression methods are either dictionary-based or statistical. Statistical methods combine entropy coding with modeling techniques and are typically used for compressing executable code. Each input symbol in a sequence of input symbols is represented by a variable-length code, producing a stream of codes that has fewer bits than the sequence of input symbols. Each input symbol has a probability value associated with its frequency of occurrence in the sequence of input symbols. To reduce the number of bits, most statistical compression methods encode the symbols with the highest probability of occurrence using the codes with the fewest bits.
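The assignment of short codes to probable symbols can be illustrated with Huffman coding, one well-known variable-length coding scheme. The following is a minimal sketch only; the text above does not name a specific coder, and the function name huffman_code_lengths and the sample string are assumptions chosen for illustration.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code built from symbol counts."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in the code tree}).
    heap = [(weight, i, {sym: 0}) for i, (sym, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol in them moves one level deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

# Illustrative input (an assumption): 'a' occurs 8 times, 'b' 4, 'c' 2, 'd' 1.
data = "aaaaaaaabbbbccd"
lengths = huffman_code_lengths(data)
total_bits = sum(lengths[sym] for sym in data)
fixed_bits = 2 * len(data)  # a fixed 2-bit code for a 4-symbol alphabet

print(lengths)                 # {'d': 3, 'c': 3, 'b': 2, 'a': 1}
print(total_bits, fixed_bits)  # 25 30
```

In this sample, the most frequent symbol receives a 1-bit code and the two rarest symbols receive 3-bit codes, so the encoded stream takes 25 bits instead of the 30 bits a fixed-length code would need.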
Typically, statistical compression methods include a model and a coder. The model includes statistical information obtained from the sequence of input symbols. In the simplest case, such as a Markov model, the model provides probability values for the input symbols based on their frequency of appearance in the sequence of input symbols. The coder produces an encoded sequence of codes from the sequence of input symbols and the probability values provided by the model.
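The division of labor between model and coder can be sketched as follows. This is an illustrative sketch under stated assumptions, not a description of any particular compressor: the class FrequencyModel, the add-one smoothing, and the sample sequence are all hypothetical. The coder is stood in for by the ideal code length of -log2(p) bits per symbol, which is the cost an arithmetic coder approaches.

```python
import math
from collections import defaultdict

class FrequencyModel:
    """Adaptive context model: estimates P(symbol | previous `order` symbols) from running counts."""

    def __init__(self, alphabet, order=0):
        self.alphabet = list(alphabet)
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def probability(self, context, symbol):
        seen = self.counts[context]
        # Add-one smoothing (an assumption) so unseen symbols never get probability zero.
        total = sum(seen.values()) + len(self.alphabet)
        return (seen[symbol] + 1) / total

    def update(self, context, symbol):
        self.counts[context][symbol] += 1

def ideal_code_length(sequence, model):
    """Total bits an ideal coder would spend given the model's probabilities."""
    bits = 0.0
    for i, symbol in enumerate(sequence):
        context = tuple(sequence[max(0, i - model.order):i])
        bits += -math.log2(model.probability(context, symbol))
        model.update(context, symbol)  # the decoder can repeat this update, staying in sync
    return bits

sequence = "abababababababab"
bits_order0 = ideal_code_length(sequence, FrequencyModel("ab", order=0))
bits_order1 = ideal_code_length(sequence, FrequencyModel("ab", order=1))
print(round(bits_order0, 2), round(bits_order1, 2))
```

With order=0 the model uses plain symbol frequencies, while with order=1 it conditions on the previous symbol in the manner of a first-order Markov model. On this strictly alternating sequence the order-1 model predicts each symbol far more confidently, so the same coder spends fewer bits.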
Executable code is a linear sequence of instructions. For a given machine architecture, an instruction has a specific format, generally including three fields: an opcode, an addressing mode and an operand. Statistical compression of executable code differs from statistical compression of regular data because of statistical dependencies specific to the structures of executable code. Statistical dependencies exist between the fields of an instruction, which is called intra-instruction correlation. Moreover, there are also strong statistical dependencies between instructions, called inter-instruction correlation, because a machine language is characterized by its syntax, semantics and modularity. The intra-instruction and inter-instruction correlations are tangled with each other in complicated ways.
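Intra-instruction correlation can be made concrete by measuring how much knowing one field reduces the uncertainty of another. The sketch below is purely illustrative: the Instruction tuple, the mnemonics, and the toy program are invented for the example and do not correspond to any real instruction set or measured data.

```python
import math
from collections import Counter, namedtuple

# Hypothetical three-field instruction format: opcode, addressing mode, operand.
Instruction = namedtuple("Instruction", ["opcode", "mode", "operand"])

def entropy(items):
    """Empirical Shannon entropy of a list of hashable items, in bits per item."""
    counts = Counter(items)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    """H(Y | X) = H(X, Y) - H(X) for a list of (x, y) pairs."""
    return entropy(pairs) - entropy([x for x, _ in pairs])

# Toy program (invented for illustration only).
program = [
    Instruction("mov", "reg", 1), Instruction("add", "imm", 4),
    Instruction("mov", "reg", 2), Instruction("jmp", "rel", 8),
    Instruction("mov", "reg", 3), Instruction("add", "imm", 1),
    Instruction("cmp", "imm", 0), Instruction("jmp", "rel", -6),
]

# Uncertainty of the addressing mode on its own ...
h_mode = entropy([ins.mode for ins in program])
# ... and once the opcode of the same instruction is known.
h_mode_given_opcode = conditional_entropy([(ins.opcode, ins.mode) for ins in program])
print(round(h_mode, 3), round(h_mode_given_opcode, 3))
```

In this toy program each opcode always appears with the same addressing mode, so the conditional entropy drops to zero. The same conditional_entropy function could be applied to pairs of consecutive opcodes to estimate inter-instruction correlation.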
Typically, to exploit the statistical dependencies between instructions and achieve a high compression ratio, the opcodes are rigidly and mechanically separated from the rest of the executable program. However, this separation is suboptimal because it prevents the exploitation of some intra-instruction correlations. On the other hand, compressing an executable program as a single sequence without extracting the opcodes obscures the inter-instruction correlation, which also compromises compression performance.
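The two layouts contrasted above can be sketched as follows. This is a minimal illustration; the function names split_streams and single_stream and the three-instruction toy program are assumptions, not part of any described system.

```python
def split_streams(program):
    """Separate a program into an opcode stream and a stream of the remaining fields."""
    opcodes = [opcode for opcode, _, _ in program]
    rest = [(mode, operand) for _, mode, operand in program]
    return opcodes, rest

def single_stream(program):
    """Flatten the program into one sequence, field by field, without extracting opcodes."""
    return [field for instruction in program for field in instruction]

# Toy program of (opcode, addressing mode, operand) tuples, invented for illustration.
program = [("mov", "reg", 1), ("add", "imm", 4), ("jmp", "rel", 8)]

opcodes, rest = split_streams(program)
flat = single_stream(program)

# Separated: consecutive opcodes are neighbors, but each opcode has lost its own mode.
print(opcodes)  # ['mov', 'add', 'jmp']
print(rest)     # [('reg', 1), ('imm', 4), ('rel', 8)]
# Single sequence: each opcode sits next to its own mode, but consecutive opcodes
# are three positions apart.
print(flat)     # ['mov', 'reg', 1, 'add', 'imm', 4, 'jmp', 'rel', 8]
```

A sequential model with a short context sees opcode-to-opcode dependencies only in the separated layout and opcode-to-mode dependencies only in the single-sequence layout, which is the trade-off described above.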
Both intra-instruction and inter-instruction correlations may be exploited by combining the opcode of an instruction with its addressing mode, treating the pair as an extended opcode, and then separating and compressing the sequence of extended opcodes. However, this alternative is also problematic because it artificially creates a sequential coupling between the opcode of the current instruction and the addressing mode of the previous instruction, even though these two entities are only very weakly correlated. As a result, sequential compression of the extended opcodes does not achieve a high compression ratio.
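The coupling described above can be seen by listing what a first-order sequential model would condition on when it processes extended opcodes. The sketch below is illustrative only; the function names and the toy program are assumptions.

```python
def extended_opcode_stream(program):
    """Combine each instruction's opcode and addressing mode into one extended opcode."""
    return [(opcode, mode) for opcode, mode, _ in program]

def order1_contexts(stream):
    """Pair each symbol after the first with its predecessor, as (context, symbol)."""
    return list(zip(stream, stream[1:]))

# Toy program of (opcode, addressing mode, operand) tuples, invented for illustration.
program = [("mov", "reg", 1), ("add", "imm", 4), ("jmp", "rel", 8)]

stream = extended_opcode_stream(program)
pairs = order1_contexts(stream)

for context, symbol in pairs:
    # The context carries the previous instruction's addressing mode along with its
    # opcode, so the current opcode is predicted from both.
    print("context:", context, "-> current:", symbol)
```

Each context contains the previous addressing mode whether or not it says anything useful about the current opcode. Spreading the statistics of each opcode over several contexts in this way is one plausible reading of why the extended-opcode approach compresses poorly.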