1. Technical Field of the Invention
This invention is related to an apparatus for executing one-cycle decompressions of compressed and uncompressed software instructions. More specifically, the invention processes compressed software instructions using a decompression tree and a dictionary, and the decompressed software instructions are forwarded to a processor without requiring any processor stalls.
2. Description of the Related Art
The following papers provide useful background information on the indicated topics, all of which relate to the invention and are incorporated herein by reference:
T. C. Bell, J. G. Cleary and I. H. Witten, Text Compression, Prentice Hall, N.J. (1990).
L. Benini, A. Macii, E. Macii and M. Poncino, Selective Instruction Compression for Memory Energy Reduction in Embedded Systems, IEEE/ACM Proc. of International Symposium on Low Power Electronics and Design (ISLPED '99), pgs. 206-11 (1999).
IBM CodePack PowerPC Code Compression Utility User's Manual, Version 3.0 (1998).
N. Ishiura and M. Yamaguchi, Instruction Code Compression for Application Specific VLIW Processors Based on Automatic Field Partitioning, Proc. of the Workshop on Synthesis and System Integration of Mixed Technologies, pgs. 105-9 (1998).
C. Lefurgy and T. Mudge, Code Compression for DSP, CSE-TR-380-98, University of Michigan (November 1998).
C. Lefurgy, E. Piccininni and T. Mudge, Reducing Code Size with Run-time Decompression, Proc. of the International Symposium on High-Performance Computer Architecture (January 2000).
S. Y. Liao, S. Devadas and K. Keutzer, Code Density Optimization for Embedded DSP Processors Using Data Compression Techniques, Proceedings of the Chapel Hill Conference on Advanced Research in VLSI, pgs. 393-99 (1995).
T. Okuma, H. Tomiyama, A. Inoue, E. Fajar and H. Yasuura, Instruction Encoding Techniques for Area Minimization of Instruction ROM, International Symposium on System Synthesis, pgs. 125-30 (December 1998).
A. Wolfe and A. Chanin, Executing Compressed Programs on an Embedded RISC Architecture, Proc. of the International Symposium on Microarchitecture, pgs. 81-91 (December 1992).
Y. Yoshida, B.-Y. Song, H. Okuhata and T. Onoye, An Object Code Compression Approach to Embedded Processors, Proc. of the International Symposium on Low Power Electronics and Design (ISLPED), ACM: 265-268 (August 1997).
There will now be provided a discussion of various topics to provide a proper foundation for understanding the invention.
Designers in the high-volume, cost-sensitive embedded systems market face many design challenges, as the market demands single-chip solutions that meet constraints such as lower system cost, higher functionality, higher performance and lower power consumption for increasingly complex applications.
Code compression/decompression has emerged as an important field to address parts of these problems in embedded system designs. Typically, a system that runs compressed code decompresses the code as it is needed by the CPU. The decompression can be performed outside the CPU (e.g., between the L2 cache and main memory or between the L2 cache and CPU) or inside the CPU as part of the instruction fetch unit or decoding unit. In any case, only parts of the whole code are decompressed at a time (i.e., on an instruction level or block level) in order to minimize the memory requirements. The code is compressed as a step following code compilation and is then burned into flash memory or ROM, from where it is loaded during system reset into the main memory or the L2 cache. This implies that the software is fixed and cannot be changed at run time. Note that this characteristic prohibits the system from using any technique that relies on self-modifying code. If, however, the software application does change, then the code has to be compressed again, and the flash memory or ROM has to be burned again. Typically, this is not a problem for embedded systems that do not allow the user to make software changes, e.g., cellular telephones.
The primary goal in code compression has traditionally been to reduce the instruction memory size, although newer approaches have shown that it can even lead to performance increases of an embedded system. As used herein, the terms “compression” or “decompression” denote a technology that runs code that has been compressed before the system runs and where code is being decompressed when the system runs, i.e., on-the-fly.
Application areas for code compression/decompression are embedded systems in the areas of personal wireless communication (e.g., cellular telephones), personal computing (e.g., PDAs), personal entertainment (e.g., MPEGIII players) and other embedded systems where memory size and performance are constraints.
Wolfe and Chanin developed the Compressed Code RISC Processor (CCRP), which was the first system to use cache-misses to trigger decompression. Their decompression engine is designed as part of the cache refill hardware. The instructions in each L1 cache block are Huffman encoded separately so that each block can be individually decompressed without requiring decompression of other blocks in advance. As Huffman codes are variable length codes, decoding is not as fast as with dictionary methods. Since the fixed-length cache blocks are compressed to variable-length blocks, an index table is required to map native cache-miss addresses to compressed code addresses. This requires the decompression engine to conduct one more level of lookup to find the data.
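The bit-serial nature of variable-length decoding described above can be illustrated with a minimal sketch. It is not the CCRP decoder; the tree layout (nested pairs with integer leaves), the function name `decode_block`, and the sample code table are assumptions made purely for illustration.

```python
def decode_block(bits, tree, count):
    """Decode `count` symbols from a string of '0'/'1' characters.

    `tree` is a hypothetical prefix-code tree: an internal node is a
    (zero_child, one_child) pair and a leaf is an integer symbol.
    Returns the decoded symbols and the number of bits consumed.
    """
    out = []
    pos = 0
    for _ in range(count):
        node = tree
        # One tree step per input bit: the start of the next code word is
        # unknown until the current one has been fully decoded.
        while not isinstance(node, int):
            node = node[int(bits[pos])]
            pos += 1
        out.append(node)
    return out, pos


# Hypothetical code table: 0xA -> "0", 0xB -> "10", 0xC -> "11"
EXAMPLE_TREE = (0xA, (0xB, 0xC))
symbols, used = decode_block("010110", EXAMPLE_TREE, 4)
```

Because each code word's length is only known after it has been decoded, the loop cannot simply index into the bit stream the way a fixed-width dictionary lookup can, which is the speed disadvantage noted above.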
CodePack is used in IBM's embedded PowerPC systems. Its scheme resembles CCRP in that it is part of the memory system. The CPU is unaware of compression, and a device similar to a Line Address Table (LAT) maps between the native and compressed address spaces. The decompression engine accepts the L1 cache miss addresses, retrieves the corresponding compressed bytes from main memory, decompresses them and returns native PowerPC instructions to the L1 cache. CodePack achieves a 60% compression ratio on a PowerPC. IBM reports that the performance of compressed code is within 10% of that of native programs, sometimes with a speedup. A speedup is possible because CodePack implements pre-fetching behavior that the underlying processor does not have.
Software decompression is also possible, simplifying the hardware design and allowing the decompression to be selected at run-time. The hardware is simplified because the decompression software uses the arithmetic unit in the processor core, rather than having separate and highly specialized logic structures. Lefurgy et al. proposed two hardware mechanisms to support software decompression. First, an L1 cache miss triggers a cache miss exception that runs the decompression program. Second, a privileged instruction used by the decompression program stores decompressed instructions directly into the instruction cache. The decompression software is not compressed and is stored in a region of memory that does not cause a decompression exception. Liao et al. proposed a dictionary method that can be accomplished purely in software, where mini-subroutines are introduced to replace frequently appearing code fragments.
Ishiura and Yamaguchi proposed a compression scheme for VLIW processors based on automated field partitioning. The compression scheme keeps the size of the decompression tables small by producing codes for sub-fields of instructions. Benini et al. limit the dictionary size by selectively compressing instructions. Lefurgy et al. also proposed a dictionary scheme used in their DSP compression work. Okuma et al. proposed an interesting encoding technique that takes into account fields within instructions. Yoshida et al. proposed a logarithmic-based compression scheme that can result in power reduction as well.
An important problem associated with code compression is how to locate branch/jump/call targets in compressed code since the original offsets will not point to “correct” locations after compression. A compressed program must be decompressed in sections determined by the flow of control through the program; decompressing the entire program before execution begins would defeat the purpose of code compression. This means that there should exist several points in the program where decompression can start without having decompressed any previous parts of the binary code. The frequency of these resynchronization points determines the block size, which is a uniquely decompressible piece of code. Some methods do not have constant block sizes, while others enforce a constant block size that can be anything from a single instruction, to a cache line, or even a page. It is important to understand that the block size refers to the uncompressed data and not to the compressed. In the invention, the block size is equal to the size of one instruction and is variable-size in the general case (since native instructions can be variable-size).
The decompressor is responsible for locating the compressed data for a requested instruction. It must map the native address of a requested instruction to the address where the compressed data is located. Techniques for implementing this mapping are described below.
The most widely used method to solve the indexing problem is to build a table that maps native addresses (addresses of instructions in the original code) to addresses in compressed code. Wolfe and Chanin were the first to propose such a method in which a Line Address Table (LAT) was used to map the address of cache lines to their address in the compressed code. Since Wolfe and Chanin's design used cache lines as their decompressible units, they only need to store addresses of beginnings of compressed lines and not branch targets. Storing a full pointer for each cache line would result in a large LAT, hence they proposed a method to compact pointers by storing one base address and a number of compressed cache line offsets in each LAT entry. To get the compressed address of a cache line, several offset fields must be added to the base address. These offsets are interpreted as byte-aligned offsets from the base address since all compressed blocks (cache lines) are byte-aligned to simplify decoding.
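The compacted LAT lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Wolfe and Chanin's actual table format: the entry layout, the constants `LINE_BYTES` and `LINES_PER_ENTRY`, and all names are hypothetical, and each offset field is assumed to hold the compressed byte length of one line.

```python
from dataclasses import dataclass
from typing import List

LINE_BYTES = 32        # assumed native cache-line size in bytes
LINES_PER_ENTRY = 4    # assumed number of lines covered by one LAT entry


@dataclass
class LatEntry:
    base: int            # byte address of the first compressed line in the group
    lengths: List[int]   # compressed byte length of each line in the group


def compressed_address(lat: List[LatEntry], native_addr: int) -> int:
    """Map a native cache-miss address to the address of its compressed line."""
    line = native_addr // LINE_BYTES
    entry = lat[line // LINES_PER_ENTRY]
    slot = line % LINES_PER_ENTRY
    # Add the lengths of all compressed lines that precede the requested one.
    return entry.base + sum(entry.lengths[:slot])
```

Only one full pointer is stored per group of lines; the price is that up to `LINES_PER_ENTRY - 1` additions are needed to reach the last line of a group.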
A drawback of the LAT approach is that the table introduces a significant amount of overhead for small blocks. Increasing the number of offsets per LAT entry can help, but makes the address calculation slower.
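The trade-off can be made concrete with a rough estimate of table size relative to compressed code size. The field widths (a 32-bit base and 8-bit offsets) and the default 0.6 compression ratio are assumptions chosen only for illustration; the function name `lat_overhead` is hypothetical.

```python
def lat_overhead(block_bytes, offsets_per_entry,
                 base_bits=32, offset_bits=8, ratio=0.6):
    """Estimate LAT size as a fraction of the compressed code it indexes.

    All default field widths and the compression ratio are illustrative
    assumptions, not figures taken from any particular design.
    """
    entry_bits = base_bits + offsets_per_entry * offset_bits
    compressed_bits = offsets_per_entry * block_bytes * 8 * ratio
    return entry_bits / compressed_bits
```

Under these assumptions, shrinking the block from a 32-byte cache line to a 4-byte instruction raises the relative overhead sharply, and packing more offsets into each entry lowers it again at the cost of more additions per lookup.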
Another technique is to use an equation to calculate addresses in the compressed space. The easiest way to achieve this is to have a constant block size and to ensure that all blocks are compressed to the same compressed size. This is a fixed-to-fixed (F-F) coding scheme and is generally not very efficient in terms of compression ratio. The most straightforward way to do this is to collect statistics of the bit patterns in fixed-size blocks and re-encode them with fewer bits, depending on how many distinct bit patterns occur. For example, if the block size is 32 bits but only 2^24 distinct patterns appear, then only 24 bits are needed to encode these blocks. However, a table that maps these bit patterns back to their original 32-bit values is required, and its size makes this method impractical. This can be improved by splitting the original 32-bit block into sub-blocks and having a significantly smaller table for each sub-block.
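A fixed-to-fixed scheme of this kind can be sketched in a few lines. The sketch below is illustrative only: the function names, the sorted-pattern table and the 4-byte block size are assumptions, and the compressed stream is kept as a list of indices rather than packed bits.

```python
def build_table(blocks):
    """Collect the distinct block patterns and the index width they need."""
    patterns = sorted(set(blocks))
    # Bits needed to number every distinct pattern (at least 1).
    width = max(1, (len(patterns) - 1).bit_length())
    index_of = {p: i for i, p in enumerate(patterns)}
    return patterns, index_of, width


def compress(blocks):
    """Replace every block with its fixed-width table index."""
    patterns, index_of, width = build_table(blocks)
    return [index_of[b] for b in blocks], patterns, width


def compressed_bit_address(native_addr, width, block_bytes=4):
    """Address equation: every block occupies exactly `width` bits."""
    return (native_addr // block_bytes) * width


def decompress_at(codes, patterns, native_addr, block_bytes=4):
    """Fetch one block directly by native address, with no address table."""
    return patterns[codes[native_addr // block_bytes]]
```

The pattern table is the cost of this scheme: with 2^24 distinct 32-bit patterns it would need 2^24 entries of 32 bits each, which is why splitting blocks into sub-blocks with separate, much smaller tables is attractive.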