1. Field of the Invention
The present invention relates to the field of integrated circuit design, specifically to the use of optimized instruction sets within a pipelined central processing unit (CPU) or user-customizable microprocessor.
2. Description of Related Technology
RISC (or reduced instruction set computer) processors are well known in the computing arts. RISC processors generally have the fundamental characteristic of utilizing a substantially reduced instruction set as compared to non-RISC (commonly known as “CISC”) processors. Typically, RISC processor machine instructions are not all micro-coded, but rather may be executed immediately without decoding, thereby affording significant economies in terms of processing speed. This “streamlined” instruction handling capability furthermore allows greater simplicity in the design of the processor (as compared to non-RISC devices), thereby allowing smaller silicon and reduced cost of fabrication.
RISC processors are also typically characterized by (i) load/store memory architecture (i.e., only the load and store instructions have access to memory; other instructions operate via internal registers within the processor); (ii) unity of processor and compiler; and (iii) pipelining. RISC processors typically enjoy relatively high performance due to their simplicity of design. But RISC processors also sometimes suffer from a larger program size as compared to CISC processors. Since a RISC processor employs a reduced number of instructions and more limited addressing modes as compared to a CISC processor, it is common for a program targeted for a RISC processor to be larger than one targeted for a CISC processor. This is a consequence of requiring additional instructions to perform the same function. For example, a RISC processor lacking a multiply-accumulate instruction with auto-indexing addressing modes will require two instructions to implement the multiply-accumulate and four instructions to implement the addressing modes. A CISC processor that includes this instruction will perform the operation in a single instruction.
Compressing the instruction set of a RISC processor (particularly in the case of embedded applications such as consumer electronics) is important for reasons relating to memory and die space limitations, as well as others. A number of varying compression techniques have been proposed to reduce instruction set size. However, increased instruction set compression carries an accompanying performance loss, due to any number of factors including (i) the need to “decompress” the instructions before execution; (ii) reduced clock frequency through the extra hardware added to support the code compression; and (iii) an increased number of unused clock cycles by the compressed instructions (e.g., unused delay slots that would otherwise be used by the non-compressed analogs of the compressed instructions). The common metric for measuring compression is compression ratio (CR), which is defined generally by the formula:

CR = compressed size / original size  (Eqn. 1)

In the typical RISC device, the designer may generally choose from either 16-bit or 32-bit instructions. Sixteen-bit instructions are more restricted in functionality than 32-bit instructions, which generally causes the number of instructions executed in the 16-bit instruction programs to increase comparatively. However, 16-bit instructions may generally also be fetched more efficiently, thereby creating a trade-off with the increased execution time of the greater number of 16-bit instructions. The THUMB® approach of ARM is an example of such a 16-bit instruction set. Programs compiled for THUMB are roughly 30% smaller (as compared to programs compiled for the standard ARM processor instruction set), but also run on the order of 15%-20% slower on systems with 32-bit buses and no wait states. See, e.g., “Evaluation of a High Performance Code Compression Method”, C. Lefurgy et al., University of Michigan (1999).
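By way of illustration only, Eqn. 1 may be applied to the THUMB figures cited above. The following minimal sketch assumes a hypothetical 100,000-byte native program whose compressed form is 30% smaller; the function name and the byte counts are assumptions chosen for the example and do not appear in the cited reference.

```python
def compression_ratio(compressed_size: int, original_size: int) -> float:
    """Eqn. 1: CR = compressed size / original size."""
    if original_size <= 0:
        raise ValueError("original size must be positive")
    return compressed_size / original_size


# Hypothetical sizes in bytes: a program that is 30% smaller once compressed.
original_bytes = 100_000
compressed_bytes = 70_000

cr = compression_ratio(compressed_bytes, original_bytes)
print(f"CR = {cr:.2f}")
```

Under Eqn. 1, a program that shrinks by roughly 30% therefore corresponds to a compression ratio of roughly 0.70.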
Another prior art approach, sometimes referred to as “CCRP”, is to specifically modify the instruction cache to run compressed programs. At the time of program compiling, the cache code lines are compressed using any number of different coding schemes (e.g., 48-bit segments compressed using Huffman codes) and stored in the processor main memory. At run-time, the cache lines are fetched from the main memory, decompressed, and loaded into the instruction cache. Accordingly, using this scheme, instructions fetched from the instruction cache are in decompressed form, and have the same addresses as those in the original program. This approach has the benefit of obviating almost all modification to the core in support of the compression. However, one significant drawback relates to cache “misses” (i.e., requests to read from memory which cannot be satisfied from cache, and for which the main memory must be accessed); such missed instructions do not reside at the same address in memory, and accordingly the processor must correlate or map missed instruction cache addresses to the main memory addresses where the compressed code is stored.
A somewhat similar technique known as “Codepack” (employed primarily by IBM Corporation) utilizes a separate logic mapping unit to map between the native instruction set addresses and the compressed instruction set addresses. The decompression unit accepts L1-cache (i.e., primary or on-chip cache) miss addresses, and subsequently retrieves the corresponding compressed instruction bytes from the main memory. The retrieved compressed bytes are decompressed, and the resulting native (decompressed) instructions are subsequently provided to the L1-cache for use thereafter. It has been reported that such techniques may result in a compression level of 60% with a performance degradation on the order of 10% of the native program.
In “Codepack”, each 32-bit instruction word is first divided into 16-bit high and low half-words. These two half-words are then translated into a variable-length bit codeword (e.g., length ranging from 2 to 11 bits). Codepack uses two separate dictionaries for the aforementioned translation to the variable bit codewords, since half-words may have different values, and also distribution frequencies (i.e., some much more likely than others). Furthermore, the more commonly used half-word values are assigned shorter codewords. The codewords are divided into 2 sections; the first being a 2- or 3-bit tag indicative of codeword size, and a second section for indexing the translation dictionaries. The value “0” occurring in the lower half-word is encoded using only a 2-bit tag (no low index bits) based on its comparatively high frequency of occurrence. The dictionaries are fixed at time of program loading, and may be adapted for use with specific programs. Half-words not present within the dictionaries are annotated with a 3-bit marker to identify them as not being compressed bytes.
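The half-word encoding described above may be sketched as follows. This is a simplified illustration and not the actual Codepack bit layout: the tag values, the index widths in `TAG_CLASSES`, the escape format (3-bit marker followed by the raw 16-bit half-word), and all function names are assumptions chosen so that compressed codewords fall within the 2 to 11 bit range noted above.

```python
from collections import Counter

# Hypothetical prefix-free tags: (tag bits, number of dictionary index bits).
TAG_CLASSES = [("01", 4), ("100", 6), ("101", 8)]  # 6-, 9- and 11-bit codewords
ZERO_TAG = "00"      # low half-word equal to 0: tag only, no index bits
ESCAPE_TAG = "111"   # half-word absent from the dictionary: stored raw


def build_dictionary(halfwords):
    """Rank half-word values by frequency; most frequent get the lowest index."""
    capacity = sum(1 << width for _, width in TAG_CLASSES)
    ranked = [value for value, _ in Counter(halfwords).most_common(capacity)]
    return {value: index for index, value in enumerate(ranked)}


def encode_halfword(value, dictionary, is_low):
    """Return the codeword for one 16-bit half-word as a string of bits."""
    if is_low and value == 0:
        return ZERO_TAG
    index = dictionary.get(value)
    if index is not None:
        base = 0
        for tag, width in TAG_CLASSES:
            if index < base + (1 << width):
                return tag + format(index - base, f"0{width}b")
            base += 1 << width
    return ESCAPE_TAG + format(value, "016b")


def encode_instruction(word, high_dict, low_dict):
    """Split a 32-bit word into half-words and encode each with its own dictionary."""
    high, low = (word >> 16) & 0xFFFF, word & 0xFFFF
    return (encode_halfword(high, high_dict, is_low=False)
            + encode_halfword(low, low_dict, is_low=True))


# Usage with a hypothetical four-instruction program.
program = [0x38000000, 0x38000004, 0x7C0802A6, 0x38000000]
high_dict = build_dictionary(w >> 16 for w in program)
low_dict = build_dictionary(w & 0xFFFF for w in program if w & 0xFFFF)
bits = encode_instruction(0x38000000, high_dict, low_dict)
```

In this sketch the most frequent high half-word receives the shortest indexed codeword, and a zero low half-word costs only the 2-bit tag, so the 32-bit word `0x38000000` is represented in 8 bits.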
Each group of 16 instructions is combined into a so-called “compression block.” The whole compression block is fetched and decompressed if a requested instruction cache line (8 instructions) is present within the block. Note that the compressed instructions are stored at different memory locations from the non-compressed native instructions. The instruction address from the cache miss is mapped to the corresponding compressed instruction address by a correlation table which is created during the compression of the instructions; each index is 32 bits long in the IBM processor. Furthermore, each entry in the correlation table maps one compression group, each compression group comprising two compression “blocks” (32 instructions in total). The first compression block of 16 instructions is specified as a byte offset into the compressed memory, while the second block is specified using a different offset from the first block.
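The address mapping described above may be illustrated with the following minimal sketch. It assumes 4-byte native instructions, miss addresses measured from the start of the native program, and a correlation table entry represented as a pair whose second element is interpreted as an offset relative to the first block. The table contents, that interpretation, and all names are hypothetical and are not taken from the IBM implementation.

```python
INSTRUCTION_BYTES = 4
BLOCK_INSTRUCTIONS = 16                               # one compression block
BLOCKS_PER_GROUP = 2                                  # one table entry per group
BLOCK_BYTES = BLOCK_INSTRUCTIONS * INSTRUCTION_BYTES  # 64 native bytes
GROUP_BYTES = BLOCK_BYTES * BLOCKS_PER_GROUP          # 128 native bytes


def locate_compressed_block(miss_address, correlation_table):
    """Map a native miss address to the byte offset of its compressed block.

    Each table entry is (first_block_offset, second_block_delta); the second
    value is assumed here to be relative to the first block's offset.
    """
    group_index = miss_address // GROUP_BYTES
    block_in_group = (miss_address % GROUP_BYTES) // BLOCK_BYTES
    first_block_offset, second_block_delta = correlation_table[group_index]
    if block_in_group == 0:
        return first_block_offset
    return first_block_offset + second_block_delta


# Hypothetical table covering two compression groups (64 native instructions).
table = [(0, 40), (75, 38)]
offset = locate_compressed_block(0xC4, table)
```

Here the miss address `0xC4` falls in the second block of the second group, so the lookup adds the second-block offset to the group's first-block offset.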
While good overall performance (i.e., high compression ratio and low percentage loss of performance) may be achieved, one of the more significant drawbacks of both the CCRP and “Codepack” prior art methods described above is the increased complexity associated with compressed instruction decode. This complexity manifests itself in, among other things, the need for additional decompression, address mapping, and translation logic (as well as codeword “dictionaries” in the case of Codepack), thereby increasing the complexity and cost of any RISC design incorporating these features.
Based on the foregoing, there is a need for an improved method and apparatus for compressing the instruction sets of RISC (and other) processors, while not significantly sacrificing the benefits inherent in the design of the processor. Such improved apparatus and method would ideally result in significant compression of the base instruction set, require no processor mode switching between pure 32-bit instruction mode execution and pure 16-bit instruction mode execution, utilize comparatively simple instruction decode logic, allow for normal operation of existing exceptions/interrupts and call/return functions within the core, add no restrictions to program address range, and result in only a very small loss (i.e., on the order of 10% or less) of overall performance resulting from reduced clock frequency and other factors.