While a CPU designer strives for generality, every application program ever produced spends most of its time in a very small portion of the code that comprises its executable file image. This is true of general programs for personal computer use, of programs for embedded computers and even for gaming platforms such as XBOX® made by MICROSOFT® Corporation of Redmond, Wash. Analysis shows that the top two or three basic blocks in the executable file image generally account for well over 80% of the total execution count.
An appealing prospect for a more efficient execution of the program is to optimize the top-running basic blocks with specialized processor instructions with the same semantic of the original sequence of general purpose software instructions but with a much more efficient implementation. Speed-ups reported in the literature range from a factor of two to a factor of six, and in some cases even larger to the tens and over. Our own experience leads us to consider a factor of three as the conservative estimate of the expected speed-up.
The CPU of modem processors implements a well-documented, fixed set of processor instructions. The processor instructions are chosen to capture the largest possible set of application requirements, in the most compact form possible. The CPU is normally realized in fixed logic in such a way that it is impossible to add any new processor instruction once the chip has been produced. On the other hand, Field-Programmable Gate Arrays (FPGA) are an alternative way to implement a CPU that does allow for later extensions and modifications, even after the chip has been deployed in the field.
It is also possible to implement the CPU with fixed logic, but with a dynamically changeable way to interconnect the internal components of the CPU. This approach can lead to new types of processors which we refer to as a “dynamically extensible processors”. These processors combine the advantages of fixed logic (reduced size, higher clock rate) with the ability to add processor extensions to the base processor instruction set.
Practical use of an extensible processor should ideally make use of the extended instructions in application programs. Ordinarily, a programmer makes use of an assembler or higher-level language compiler to write the application program. This path can require re-generating a new assembler and a new compiler for each new processor instruction. While certainly possible, this is a rather time-consuming operation. It is also fraught with limitations and dangers. If the program was in fact written in assembler, it must be rewritten. Only high-level language programs can automatically take advantage of the new instructions, provided the compiler is modified to take advantage of them. Furthermore, if we do not have the sources for the compiler, the compiler may be impossible to modify. Compilers are large and complicated programs, so it is very likely that subtle errors will be introduced. Finally, we may not have the sources for the application program or for some crucial library it makes heavy use of.
Existing tools and operating systems are designed for a fixed processor instruction set and are not able to address the needs of a dynamically extensible processor. For example, the XTENSA® processor family manufactured by TENSILICA® Corporation of Santa Clara, Calif., is supported by a standard toolset in the following way: The system designer uses special tools to define one or more processor instructions for a new processor, starting from a base processor design provided by the manufacturer. The main purpose of this tool is to help create a Verilog code for the new extended processor. The tool automatically generates a new compiler and linker based on the manual definition of the new processor instruction. Notice that this procedure is static; it requires the creation of a new chip as well as a new toolset before the application program can be compiled and optimized for it.
The following reference, a copy of which is placed on file with the United States Patent and Trademark Office, provides additional background on the design of customized and extensible processors: Clark, Blome, Chu, Mahlke, Biles, and Flautner, “An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors.” The Bibliography section refers to various other papers by the same and other authors that are also generally relevant to the work described herein.