1. References
The following papers provide useful background information, for which they are incorporated herein by reference in their entirety, and are selectively referred to in the remainder of this disclosure by their accompanying reference numbers in triangular brackets (e.g., <3> for the third numbered paper by Ishiura et al.):
<1> L. Benini, A. Macii, E. Macii, and M. Poncino. Selective Instruction Compression for Memory Energy Reduction in Embedded Systems. IEEE/ACM Proc. of International Symposium on Low Power Electronics and Design (ISLPED'99), pages 206–211, 1999.
<2> IBM. CodePack PowerPC Code Compression Utility User's Manual. Version 3.0, 1998.
<3> N. Ishiura and M. Yamaguchi. Instruction Code Compression for Application Specific VLIW Processors Based on Automatic Field Partitioning. Proceedings of the Workshop on Synthesis and System Integration of Mixed Technologies, pages 105–109, 1998.
<4> C. Lefurgy, P. Bird, I. Cheng, and T. Mudge. Code Density Using Compression Techniques. Proceedings of the Annual International Symposium on MicroArchitecture, pages 194–203, December 1997.
<5> C. Lefurgy and T. Mudge. Code Compression for DSP. CSE-TR-380-98, University of Michigan, November 1998.
<6> C. Lefurgy, E. Piccininni, and T. Mudge. Reducing Code Size with Run-time Decompression. Proceedings of the International Symposium of High-Performance Computer Architecture, January 2000.
<7> S. Y. Liao, S. Devadas, and K. Keutzer. Code Density Optimization for Embedded DSP Processors Using Data Compression Techniques. Proceedings of the Chapel Hill Conference on Advanced Research in VLSI, pages 393–399, 1995.
<8> T. Okuma, H. Tomiyama, A. Inoue, E. Fajar, and H. Yasuura. Instruction Encoding Techniques for Area Minimization of Instruction ROM. International Symposium on System Synthesis, pages 125–130, December 1998.
<9> A. Wolfe and A. Chanin. Executing Compressed Programs on an Embedded RISC Architecture. Proceedings of the International Symposium on Microarchitecture, pages 81–91, December 1992.
<10> Y. Yoshida, B.-Y. Song, H. Okuhata, and T. Onoye. An Object Code Compression Approach to Embedded Processors. Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), ACM, pages 265–268, August 1997.
2. Introduction
Ever-decreasing feature sizes in silicon technology, following Moore's Law, have imposed severe constraints on designers. Even though higher integration densities would allow smaller die sizes for a constant number of transistors per die, die sizes have in reality increased rapidly as well, driven by the demand for more complex applications that require more processing power and larger memories. Secondary effects of this trend include significantly increased power dissipation per area and signal integrity problems. Diverse techniques at various levels of abstraction are deployed to cope with these problems.
Code compression is an old art that has been around since the early days of microprocessors. When the instruction code of a processor can be compressed significantly, memory usage, and with it chip area, can be reduced noticeably, which helps to solve some of the above-mentioned problems. However, code compression has not had a significant impact, mainly because it focused on memory size reduction only. Once the additional hardware necessary for decompression was taken into consideration, this overhead was in many cases not justified.
Recent research has investigated ways to extend the benefits of code compression: rather than aiming only to minimize memory usage, it has examined how far code compression can contribute to increasing the performance of a system or even reducing its power consumption. The key to these extended benefits lies in techniques that place the decompression hardware as close as possible to the location where the instruction code is used, i.e., the processor. With this approach, many system parts such as buses, cache hierarchies, and main memory can all benefit from compressed instruction code through higher effective bandwidths (bus, memory system).
The problem with applying these techniques, however, is the significantly increased complexity of the decompression hardware, which has to decompress instructions on the fly. As discussed later in this disclosure, this requires carefully designed hardware. In return, a properly designed system using code compression can boost performance, reduce memory usage, and decrease power consumption.
3. Related Work
In the following we review the most closely related work and afterwards point out the differences and advantages of our approach.
Wolfe and Chanin developed the Compressed Code RISC Processor (CCRP), which was the first system to use cache misses to trigger decompression <9>. Their decompression engine is designed as part of the cache refill hardware.
The instructions in each L1 cache block are Huffman encoded separately so that each block can be individually decompressed without requiring decompression of other blocks in advance. As Huffman codes are variable length codes, decoding is not as fast as with dictionary methods. Since the fixed-length cache blocks are compressed to variable-length blocks, an index table is required to map native cache-miss addresses to compressed code addresses.
This requires the decompression engine to conduct one more level of lookup to find the data. The authors report a 73% compression ratio on the MIPS architecture.
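The index-table lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CCRP hardware layout: the 32-byte block size, the back-to-back storage of compressed blocks, and the names `build_index_table` and `compressed_address` are all hypothetical.

```python
BLOCK_SIZE = 32  # assumed native cache-block size in bytes (illustrative only)


def build_index_table(compressed_block_sizes):
    """Return the start offset of every compressed block.

    compressed_block_sizes[i] is the byte length of native block i after
    compression; blocks are assumed to be stored back to back.
    """
    table = []
    offset = 0
    for size in compressed_block_sizes:
        table.append(offset)
        offset += size
    return table


def compressed_address(table, native_miss_address):
    """Map a native cache-miss address to the start of its compressed block."""
    block_number = native_miss_address // BLOCK_SIZE
    return table[block_number]


# Four native blocks whose compressed sizes differ (made-up values).
sizes = [20, 27, 18, 25]
table = build_index_table(sizes)
```

The extra lookup level mentioned in the text corresponds to the `table[block_number]` access: it must complete before the compressed bytes can be fetched.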
CodePack is used in IBM's embedded PowerPC systems <2>. Their scheme resembles CCRP in that it is part of the memory system. The CPU is unaware of compression, and a LAT-like device maps between the native and compressed address spaces. The decompression engine accepts L1-cache miss addresses, retrieves the corresponding compressed bytes from main memory, decompresses them, and returns native PowerPC instructions to the L1-cache.
CodePack achieves a 60% compression ratio on PowerPC. IBM reports that the performance of compressed code is within 10% of native programs, sometimes with a speedup. A speedup is possible because CodePack implements prefetching behavior that the underlying processor does not have.
Software decompression is also possible, simplifying the hardware design and allowing the decompression to be selected at run time. The hardware is simplified because the decompression software uses the arithmetic unit in the processor core rather than separate specialized logic structures. Lefurgy et al. <6> proposed two hardware mechanisms to support software decompression. First, an L1 cache miss triggers a cache miss exception that runs the decompression program. Second, a privileged instruction used by the decompression program stores decompressed instructions directly into the instruction cache. The decompression software is not compressed and resides in a region of memory that does not cause a decompression exception. Another technique that can be carried out purely in software is a dictionary method proposed by Liao et al. <7>, in which mini-subroutines are introduced to replace frequently appearing code fragments.
Ishiura and Yamaguchi <3> proposed a compression scheme for VLIW processors based on automated field partitioning.
They keep the size of the decompression tables small by producing codes for sub-fields of instructions. Benini et al. <1> limit the dictionary size by selectively compressing instructions. Lefurgy et al. also proposed a dictionary scheme used in their DSP compression work <5>. Okuma et al. <8> proposed an interesting encoding technique that takes into account fields within instructions. Yoshida et al. <10> proposed a logarithmic-based compression scheme which can result in power reduction as well.
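To illustrate the general idea of keeping a dictionary small by compressing only selected instructions, a minimal sketch follows. It is a generic illustration and not the algorithm of any of the cited papers; the tagged-tuple stream format and the function names are assumptions made for this example.

```python
from collections import Counter


def build_dictionary(instructions, max_entries):
    """Keep only the most frequent instruction words (bounded dictionary)."""
    counts = Counter(instructions)
    return [word for word, _ in counts.most_common(max_entries)]


def compress(instructions, dictionary):
    """Replace dictionary hits by an index; leave all other words uncompressed."""
    index_of = {word: i for i, word in enumerate(dictionary)}
    stream = []
    for word in instructions:
        if word in index_of:
            stream.append(("idx", index_of[word]))
        else:
            stream.append(("raw", word))
    return stream


def decompress(stream, dictionary):
    """Invert compress(): look up indices, pass raw words through."""
    return [dictionary[value] if tag == "idx" else value for tag, value in stream]


# Made-up instruction words; 0x10 and 0x20 are the two most frequent.
program = [0x10, 0x20, 0x10, 0x30, 0x10, 0x20]
dictionary = build_dictionary(program, max_entries=2)
stream = compress(program, dictionary)
```

In a real encoding the `idx`/`raw` tag would be an escape bit or prefix in the bit stream; the tuple form is used here only to keep the sketch readable.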
C. Code Compression Basics
The following describes basic techniques and concepts that are crucial for code compression.
1. Random Access
Random access is an important concept in code compression. As opposed to compressing whole files (e.g., images), code compression must make it possible to decompress a single code section out of the whole code at a given time. In other words, it must be possible to randomly access, i.e., decompress, those code sections. Random access is necessary because the control flow of software programs is non-sequential. Decompressing the whole code at once is technically not interesting, since decompressing the whole code as a single stream requires at least as much memory as is needed by the uncompressed program. Thus, a non-random-access code compression technique does not reduce system memory usage.
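The random access requirement can be sketched as follows: each section is compressed on its own, so any one of them can be decompressed without touching the others. The sketch uses Python's general-purpose `zlib` compressor purely as a stand-in (it is not the compression scheme of this disclosure), and the fixed 64-byte section size and function names are assumptions.

```python
import zlib


def compress_sections(code: bytes, section_size: int):
    """Compress each fixed-size section independently of all others."""
    return [
        zlib.compress(code[i:i + section_size])
        for i in range(0, len(code), section_size)
    ]


def fetch_section(sections, index):
    """Decompress a single section; no other section is read."""
    return zlib.decompress(sections[index])


# Stand-in "program": 1024 bytes of made-up content, split into 64-byte sections.
code = bytes(range(256)) * 4
sections = compress_sections(code, 64)
```

Only one decompressed section has to be held in memory at a time, which is what allows the memory savings that whole-program decompression cannot provide.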
2. Granularity in Code Compression
The random access characteristic described above requires decomposing the whole code into sections such that each section can be decompressed on its own. Because of the decompression history, decompression can only start at the boundaries of these sections. There are various possibilities for these sections:
a) Basic Block
A basic block, a sequence of code that is always executed completely from the beginning straight to the end, is the most obvious granularity implied by the random access characteristic. A basic block typically contains many assembly instructions. In this sense a basic block has a reasonable size for providing a good compression ratio. The disadvantage of using basic blocks is their great variance in size, which can range from a single assembly instruction to hundreds of assembly instructions. For the technical implementation of a decompression mechanism this means a great variance in decompression time and causes some non-deterministic behavior as far as system execution time is concerned. Related to this problem is the absolute decompression time: assuming a reasonable hardware effort, it is impossible to decompress a basic block within one system clock cycle (assuming a speed-optimized system) due to the average size of a basic block. However, depending on the architecture (see also II-D), fast decompression may be required that guarantees decompression in a few or even just one clock cycle.
b) Instruction
The smallest technically feasible entity to which code compression can be applied is a single instruction. The size of a single instruction makes it possible to decompress it within a single clock cycle. It is therefore very beneficial for the so-called post-cache architecture (see also II-D). Owing to the small size, however, compression ratios are significantly worse than with basic block-based approaches. The complexity of the decompression hardware depends on the instruction format:
Notes: The granularity of a code section is discussed later in this disclosure. “Decompression history” refers to the state of the decompression mechanism.
(1) Non-fixed Instruction Sizes
A non-fixed instruction size imposes various constraints on the compression scheme: in a dictionary-based compression approach, symbols of varying size may waste bits or, alternatively, require many dictionaries, each holding symbols of the same size, which makes for a complex hardware scheme. When the compressed instruction stream is decompressed, instructions of various sizes are generated. It is then the task of the hardware to assemble these instructions into complete words (for example, 32 bits) that can be sent to the processor. Recognizing the uncompressed instruction sizes and assembling the words is a very hardware-intensive and latency-consuming task.
The example platform discussed herein, which implements the disclosed techniques, is based on Tensilica's Xtensa processor, which has instruction word sizes of 24 and 16 bits.
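The word assembly task can be modeled in software as a bit accumulator, as in the sketch below. It is only a behavioral illustration, not the disclosed hardware: the 32-bit word size follows the example in the text, while the bit order (first instruction in the most significant bits) and the name `assemble_words` are assumptions.

```python
WORD_BITS = 32  # word size delivered to the processor (example value from the text)


def assemble_words(instructions):
    """Pack (value, bit_width) instructions into 32-bit words.

    Assumed ordering: earlier instructions occupy more significant bits.
    Returns the complete words plus the leftover bits and their count.
    """
    words = []
    acc = 0       # bit accumulator holding not-yet-emitted bits
    acc_bits = 0  # number of valid bits currently in acc
    for value, width in instructions:
        acc = (acc << width) | (value & ((1 << width) - 1))
        acc_bits += width
        while acc_bits >= WORD_BITS:
            acc_bits -= WORD_BITS
            words.append((acc >> acc_bits) & 0xFFFFFFFF)
            acc &= (1 << acc_bits) - 1  # keep only the leftover bits
    return words, acc, acc_bits


# A 24-bit, a 16-bit, and another 24-bit instruction (made-up encodings).
decoded = [(0xABCDEF, 24), (0x1234, 16), (0x567890, 24)]
words, leftover, leftover_bits = assemble_words(decoded)
```

Even in this simplified form, each step needs the width of the instruction just decoded before the next word boundary is known, which hints at why the task is latency-critical in hardware.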
(2) Fixed Instruction Sizes
Fixed instruction sizes do not exhibit the above-mentioned problems and hardware overhead.
3. Indexing
Indexing in code compression is a problem that arises through random access: indexing must provide the address of a jump target in the compressed space. That is because the code preceding the jump target is not being decompressed.
Hence, the jump target's address is unknown. Since compression ratios of certain code parts cannot be assumed to be constant, the jump target addresses cannot be computed either. Wolfe and Chanin <9> proposed using a table that maps uncompressed block positions (addresses) into compressed block positions. The main drawback of this method is that as the block size decreases, the overhead of storing the table increases. Another approach is to leave branches untouched during the compression phase and then patch the offsets to point to compressed space <4>. We use a similar approach here, only we compress branches as well.
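The offset patching idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: sizes are in bytes, branch offsets are taken relative to the branch's own address, the compressed size of an instruction does not change when its offset is patched, and the function names are hypothetical.

```python
def address_map(uncompressed_sizes, compressed_sizes):
    """Map each instruction's uncompressed address (PC) to its compressed address (CPC)."""
    mapping = {}
    pc = 0
    cpc = 0
    for u_size, c_size in zip(uncompressed_sizes, compressed_sizes):
        mapping[pc] = cpc
        pc += u_size
        cpc += c_size
    return mapping


def patch_branch_offset(mapping, branch_pc, offset):
    """Translate a PC-relative branch offset into compressed space."""
    target_pc = branch_pc + offset
    return mapping[target_pc] - mapping[branch_pc]


# Five instructions with made-up uncompressed and compressed byte sizes.
mapping = address_map([3, 3, 2, 3, 3], [1, 2, 1, 2, 1])
forward = patch_branch_offset(mapping, branch_pc=0, offset=11)
backward = patch_branch_offset(mapping, branch_pc=8, offset=-5)
```

When branches are themselves compressed, patching an offset can change the branch's compressed size and therefore later addresses, so a practical scheme would have to iterate or reserve space; the sketch ignores this.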
4. Basic Architectures
This section gives some basic principles of architectural issues for code decompression. FIG. 1 illustrates the basic principle used by many code decompression techniques: the instruction code is placed in the instruction memory, from where it is fetched by the decompression hardware. After the code is decompressed, it is passed to the CPU.
There are different issues and alternatives:
a) Memory Hierarchy
There can be a memory hierarchy in between, such as an L1 cache and an L2 cache. Performance issues and memory size issues largely depend on where exactly the decompression unit is placed.
b) Bus System
The communication infrastructure, such as buses, might profit from compressed code being transferred, too.
Effective bandwidths can increase. Again, the impact will largely depend on where the decompression unit is placed (see also Section II-E).
c) Post-cache and Pre-cache Architectures
In order to evaluate the advantages and disadvantages of what we call the pre-cache and post-cache architectures, we conducted simulations before starting the implementation. Specifically, in this section we measure the toggles on the bus as a metric that relates to effective bus bandwidth.
The architectures are shown in FIG. 2. In the pre-cache architecture, the decompression engine is placed between main memory and the instruction cache. In the post-cache architecture, the same engine is located between the instruction cache (referred to below by the shorter term I-cache) and the processor. In the post-cache architecture, both data buses profit from the compressed instruction code, since the instructions are only decompressed before they are fed into the CPU, whereas in the pre-cache architecture only DataBus 2 profits from the compressed code. In order to discuss various effects, we conducted diverse experiments, from which we selected the application trick. We calculated the number of bit toggles when running the application on both target architectures. The number of bit toggles is related to the effective bandwidth (and to other metrics such as power consumption). The results for trick are shown in FIG. 3, which consists of three partial figures. The top one shows the number of bit toggles for DataBus 1. For DataBus 1 we show only those bit toggles that refer to cache hits.
Thus we can see how the number of hit-related toggles on DataBus 1 increases as the number of toggles on DataBus 2 (misses) decreases. The toggles on DataBus 2 are shown in the mid figure whereas the charts in the bottom figure show the sum of both. The parameter on the x-axis of all figures we have used is the cache size (given in bytes).
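The bit toggle metric used in these experiments can be sketched as follows, assuming that a toggle is any bit that differs between two consecutive words driven on the bus. The details of the simulator used for FIG. 3 are not given here, so this is only an illustration of the metric, and `count_toggles` is a hypothetical name.

```python
def count_toggles(words):
    """Count bit toggles between consecutive words transferred on a bus."""
    toggles = 0
    for previous, current in zip(words, words[1:]):
        # XOR marks the bits that changed; count the set bits.
        toggles += bin(previous ^ current).count("1")
    return toggles


# Made-up bus trace: 0x0 -> 0xF flips 4 bits, 0xF -> 0xF none, 0xF -> 0x5 two.
trace = [0x0, 0xF, 0xF, 0x5]
total = count_toggles(trace)
```

Transferring compressed code puts fewer words on the bus for the same instruction sequence, which is one way the toggle count can drop.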
FIG. 2. “Pre-cache” and “post-cache” architectures for usage in code compression in a system with a multi-layered memory hierarchy
Each of those figures comprises three graphs: one shows the case with no instruction compression at all, one refers to the post-cache architecture, and the third to the pre-cache architecture. Starting with the top figure in FIG. 3, we can observe that the number of bit toggles increases with increasing cache size. All three architectures finally arrive at a point of saturation, i.e., a point where the number of bit toggles does not increase any more because the number of cache hits has reached its maximum. The two most interesting observations here are:
a) The “saturation point” is reached earlier in case of the post-cache architecture (i.e. 512 bytes) as opposed to 1024 bytes in case of the pre-cache architecture and no compression. In other words, we have effectively a larger cache. That actually means that we can afford to have a cache that is only half the size of the original cache without any loss of performance solely through locating the decompression engine where it is placed in the post-cache architecture. We can also decide to keep the same cache size. Then we can gain performance. If we do not need the increased performance then we can trade this performance increase against energy/power by slowing down the clock frequency, for example.
b) The toggle count is lowest for post-cache at a given I-cache size for reasonable sizes (a “reasonable” cache size is one where we have reached what we called the saturation point above; it provides a good compromise between cache size and number of cache misses). Thus, post-cache seems the most energy efficient for DataBus 1.
The mid figure in FIG. 3 shows the number of toggles on DataBus 2. DataBus 2 carries all instructions that caused a cache miss. Here we can observe:
a) For all I-cache sizes, the number of toggles is smaller for the post-cache architecture than for the pre-cache architecture. This is because of the larger effective cache size (as discussed above), which causes fewer cache misses and hence less traffic (which relates to bit toggles) on DataBus 2.
(Please note, regarding the top figure in FIG. 3, that the graphs for the no compression and pre-cache architectures almost overlap and show up as only one graph.)
b) Whereas the pre-cache architecture had no advantage over no compression on DataBus 1, it does have an advantage here on DataBus 2, since compressed instructions are transferred on this bus.
Now, the question is how large the overall number of bit toggles related to instruction code is on DataBus 1 and DataBus 2 together. The bottom chart in FIG. 3 gives the answer. In all reasonable I-cache configurations, the post-cache architecture gives the lowest number of bit toggles, while the pre-cache architecture is better than or almost equal to no compression in all cases. Please note that an I-cache size of 128 bytes does not represent a “reasonable” size, since it would offer too low a performance.
We note that some modern processors have a built-in L1 cache. However, our decompression engine can be placed between an L1 and L2 cache in such cases.
D. Obstacles in Code Compression
We present some important problems when designing a code compression scheme that works in either a post-cache architecture, or an architecture that does not incorporate a cache.
1. Inability to Deduce Program Flow from the Program Counter
There are cases where it is impossible to find out whether the CPU has executed a branch or not, due to pipeline effects. Consider the following case:
        bnez a5, L1
        sub  a2, a3, a4
        addi a3, a3, 1
        and  a2, a2, a3
    L1: or   a1, a2, a3
By observing the program counter values coming from the CPU it is impossible to know whether the branch is taken or not because all instructions after the bnez instruction are requested anyway due to pipeline effects. An external decompression engine will not know whether these instructions are really executed or not. This is a problem because the decompression engine may take some action due to these instructions. If for example, a call instruction appears instead of the addi instruction, the decompression engine may insert its address in the call stack.
2. Branch/jump Instructions
FIG. 3. Trick application. Top: toggles on DataBus 1. Mid: toggles on DataBus 2. Bottom: sum of toggles
Handling branches, jumps, calls etc. in code compression can be a major challenge. Unless the code compression scheme provides a complete mapping for any uncompressed address to its corresponding compressed address, it is necessary to provide a mechanism to detect potential branch targets. If we assume that all potential branch targets are known in the program, then it is possible to devise a scheme that only provides a mapping from uncompressed branch target addresses to their corresponding compressed addresses.
However, due to the existence of jump-to-register or call-to-register instructions found in many instruction sets, it is impossible to derive all targets from the executable alone. Often these jump-to-register instructions load their register values from a jump table, which can be located in the executable and used to retrieve the potential targets. In some cases, though, the target address is the result of arithmetic operations performed at run time, making the detection of the potential targets very hard if not impossible. Our experience with executables has shown that certain C language constructs, such as switch statements, generate such code. We have not been able to solve such cases even by closely trying to follow the program flow in the executable, let alone by writing software to accomplish this. We believe this is a problem that has been overlooked in previous work in code compression.
3. Code Alignment
The following problem is a general one that occurs with virtually any instruction set architecture. It concerns code placement in the compressed space and its alignment. First, the assumptions and circumstances under which this case occurs are stated; then the problem and possible solutions are discussed. Note that if the unknown jump targets problem is solved, then it is possible to align all jump targets to word boundaries and thereby solve this problem. If, however, in the general case, any instruction is a potential target, the code placement problem makes this constraint almost impossible to satisfy.
Assumptions:
a) A jump occurs.
b) The jump target in compressed space and the jump target in uncompressed space point to different locations within a word. This is very likely, since the CPC (program pointer in the compressed space) advances more slowly due to compression. It should be mentioned that, for other reasons (decoding etc.), both PC and CPC are aligned to byte boundaries.
c) The processor expects to receive a full word whenever fetching takes place, even when, for example, not all bytes of this full word are used to assemble the next valid instruction (note that an instruction can be smaller than the word size).
The problem occurs because, in compressed space, the jump leads to an address at a boundary such that decompression starting from this boundary will not deliver a full word without accessing the next word. In other words, in order to deliver a full word to the processor, the next word has to be accessed. This, however, requires another fetch, which needs at least one more cycle. Since the CPU cannot be stalled, other means have to be taken to prevent this case in the first place. The condition for this case is:
$f(\mathit{bs}(\mathit{jumptarget};\, n)) < \mathit{wordlength} \qquad (1)$
Here, f(y) is a function that returns the number of bits in uncompressed space of a compressed bit sequence y. bs(a; b) is the bit sequence in compressed space starting at the a-th position and ending at the b-th position. jumptarget is the bit position in compressed space to which the jump points, whereas n is the last bit of the compressed word that contains the jump target.
Note that this problem does not occur when an instruction that spans two words in the compressed space is reached sequentially, i.e., by following the preceding instruction rather than through a jump. In that case, the compression history assures that a full word will be delivered, even though it might contain only part of an instruction. This case is not different from conventional execution and will typically be handled by the processor hardware.
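Condition (1) can be illustrated with the following sketch. It assumes that the comparison in (1) is "less than", as the surrounding prose suggests, that the word length is 32 bits, and that a codeword running past bit n cannot be decoded without the next compressed word. The codeword table and the function names are hypothetical.

```python
WORD_LENGTH = 32  # assumed word size in bits


def uncompressed_bits_available(codewords, jump_target, n):
    """Model of f(bs(jump_target; n)).

    codewords: list of (start_bit, compressed_len, uncompressed_len).
    Only codewords lying entirely inside bits jump_target..n are counted,
    because a codeword that runs past bit n needs the next compressed word.
    """
    total = 0
    for start, compressed_len, uncompressed_len in codewords:
        end = start + compressed_len - 1
        if start >= jump_target and end <= n:
            total += uncompressed_len
    return total


def needs_extra_fetch(codewords, jump_target, n):
    """True when condition (1) holds, i.e., a full word cannot be delivered."""
    return uncompressed_bits_available(codewords, jump_target, n) < WORD_LENGTH


# Made-up layout: four codewords, the last one crossing the word boundary at bit 31.
codewords = [(0, 10, 24), (10, 8, 16), (18, 9, 24), (27, 12, 24)]
```

With this layout, a jump to bit 0 yields 64 uncompressed bits from the first compressed word, whereas a jump to bit 18 yields only 24 bits, so the second case would require the additional fetch described above.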