The present invention relates generally to microprocessors which are fabricated on a semiconductor chip or die using very large scale integrated (VLSI) circuit technology, and more particularly to such microprocessors whose complexity is rendered adaptive.
For the sake of general background information, which is well known in the art, and to define terms used in this specification, it is deemed worthwhile to review some basic concepts of microprocessors. In its simplest form, a microprocessor consists of an arithmetic logic unit (ALU), which performs arithmetic and logic operations on data supplied to it; a register unit comprising a plurality of registers in which data are stored temporarily during the execution of a program (the program consisting of a series of instructions); and a control unit which supplies timing and control signals to transfer data to and from the microprocessor, to execute programmed instructions, and to perform other operations. Buses are used to transfer information both within the microprocessor and between the microprocessor and external devices, and typically include an address bus, by which the encoded address of a memory location whose data contents are to be accessed is sent to an external memory; a data bus, for transferring data or instruction codes into the microprocessor and transferring computational or operational results out of it; and a control bus, for coordinating the operations of the microprocessor and its communication with external devices. Multiple ALUs, as well as other components, may be employed in a single microprocessor, as is the case, for example, with present-day superscalar microprocessors.
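The cooperation of ALU, register unit, and control unit described above may be illustrated by the following sketch. The three-field instruction tuples, the operation names, and the register names are invented for illustration only and do not correspond to any real instruction set.

```python
# Minimal illustrative model of the fetch-decode-execute cycle sketched
# above. Instruction format (op, dst, src) and register names are
# hypothetical, chosen only to show the division of labor among units.

REGISTERS = {"r0": 0, "r1": 0}           # register unit: temporary storage

def alu(op, a, b):
    """Arithmetic logic unit: performs arithmetic/logic on supplied data."""
    return {"add": a + b, "sub": a - b, "and": a & b}[op]

def run(program, memory):
    """Control unit: fetches each instruction, decodes it, and sequences
    the register transfers and ALU operations (bus traffic is implicit)."""
    pc = 0                                # program counter addresses memory
    while pc < len(program):
        op, dst, src = program[pc]        # fetch and decode
        if op == "load":                  # data bus: memory -> register
            REGISTERS[dst] = memory[src]
        else:                             # ALU operation on two registers
            REGISTERS[dst] = alu(op, REGISTERS[dst], REGISTERS[src])
        pc += 1                           # advance to the next instruction
    return REGISTERS

# Load two values from memory and add them: r0 ends up holding 3 + 4 = 7.
result = run([("load", "r0", 0), ("load", "r1", 1), ("add", "r0", "r1")],
             memory=[3, 4])
```

The model deliberately omits pipelining, interrupts, and bus arbitration; it shows only the basic division of labor among the three units named above.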
Microprocessors evolved from chip sets, originally designed to be interfaced to work together, to single-chip devices generally designed to handle a wide variety of different and diverse applications. From the earliest designs' individualized approach, in which virtually each transistor was laid out and optimized to its environment, the design has evolved to a hierarchical approach in which the processor is composed of a multiplicity of modules, each of which is composed of a plurality of cells. Given the microprocessor, a microcomputer is produced by connecting the microprocessor to a memory unit, as well as to input and output units by which the microprocessor responds to and affects its environment. Alternatively, memory and other components may be fabricated on the same chip as the microprocessor itself. The microprocessor and associated support circuits constitute the central processing unit (CPU) of the microcomputer, which addresses the desired memory location, fetches the program instruction stored there, and executes the instruction. Cache memory, or simply cache, is high-speed local memory which is used to increase the execution speed of a program (i.e., the throughput) by storing a duplicate of the contents of a portion of the main memory. A cache controller is used to automatically or selectively update main memory with the contents of the cache, according to the nature of the controller. By pre-fetching one or more instructions through use of a bus interface unit (BIU) and storing them in an instruction queue while the CPU is executing instructions, throughput can be increased markedly.
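The notion of a cache as a fast duplicate of a portion of main memory can be made concrete with a toy direct-mapped cache. The line count and cycle costs below are illustrative assumptions, not figures for any actual device.

```python
# Toy direct-mapped cache holding duplicates of main-memory contents.
# CACHE_LINES and the cycle counts are illustrative assumptions.

CACHE_LINES = 4
HIT_CYCLES, MISS_CYCLES = 1, 20

cache = {}                      # line index -> (address tag, data copy)

def read(memory, addr):
    """Return (data, cycles): fast on a cache hit, slow on a miss that
    must fetch from main memory and fill the cache line."""
    line = addr % CACHE_LINES                       # direct mapping
    if line in cache and cache[line][0] == addr:
        return cache[line][1], HIT_CYCLES           # hit: local duplicate
    cache[line] = (addr, memory[addr])              # miss: fill the line
    return memory[addr], MISS_CYCLES

memory = list(range(100, 200))
_, first = read(memory, 5)      # first access misses: 20 cycles
_, second = read(memory, 5)     # repeated access hits: 1 cycle
```

Repeated accesses to the same location are served from the local duplicate, which is precisely how a cache raises throughput for programs with locality of reference.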
System architecture is established according to register and instruction format. In the Harvard architecture and certain reduced instruction set computer (RISC) variations of that architecture, separate program and data memories may be provided, which can also improve performance of the device by allowing instructions to be fetched in advance and decoded at the same time that data is being fetched and operated on by the CPU. Use of off-chip memory improves the flexibility of a microprocessor and makes eminent sense because of the broad diversity of applications the device is designed to handle. A special form of microprocessor which generally includes on-chip memory, as well as possibly other peripherals, is the microcontroller, which tends to be more application-specific, at least within a somewhat narrow range of applications.
Current microprocessor designs set the functionality and clock rate of the chip at design time based on the configuration that achieves the best overall performance over a range of target applications. The result may be poor performance when running applications whose requirements are not well-matched to the particular hardware organization chosen.
Computer architects generally strive to design microarchitectures whose hardware complexity achieves an optimal balance between instructions per cycle (IPC) and clock rate, such that performance is maximized for the range of target applications. Features such as wide issue windows and large first-level (L1) caches can produce high IPC for many applications, but if a clock-speed degradation accompanies the implementation of these large structures, the result may be lower performance for those applications whose IPC does not improve appreciably. The latter applications may display improved performance with a less aggressive microarchitecture emphasizing high clock rate over high IPC. It follows that, for a given set of target applications, there may be countless combinations of features, leading to different clock speeds and IPC values, which achieve almost identical mean performance.
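The tradeoff described above reduces to the identity that delivered performance is proportional to IPC multiplied by clock rate. The two hypothetical design points below use invented numbers, chosen only to show how different IPC/clock combinations can yield essentially identical performance.

```python
# Performance ~ IPC x clock rate. Both design points below are
# hypothetical; the numbers are illustrative, not measured data.

def perf_mips(ipc, clock_mhz):
    """Millions of instructions executed per second."""
    return ipc * clock_mhz

high_ipc_design = perf_mips(ipc=2.0, clock_mhz=180)   # wide issue, slower clock
high_clk_design = perf_mips(ipc=1.2, clock_mhz=300)   # streamlined, faster clock
# Both evaluate to 360 MIPS: very different designs, identical mean performance.
```

Which design wins on a particular application then depends on whether that application's achievable IPC, cache behavior, and working set favor one tradeoff point over the other.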
For example, the Digital Equipment Corporation Alpha 21164 microprocessor (see J. Edmondson et al., "Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor," Digital Technical Journal, 7(1):119-135, Special Issue 1995), referred to herein as the 21164, and the Hewlett-Packard (HP) PA-8000 CPU (see A. Kumar, "The HP PA-8000 RISC CPU," IEEE Micro, 17(2):27-32, March 1997), referred to herein as the PA-8000, achieve almost identical SPECfp95 baseline performance results (see Microprocessor Report, 11(9):23, Jul. 14, 1997). Yet each takes a very different approach to do so. The 21164 achieves a clock rate roughly three times that of the PA-8000 through a streamlined in-order design and small (8 KB) L1 caches, as well as aggressive implementation technology and circuit design. The PA-8000, on the other hand, provides a 56-entry out-of-order instruction window and multi-megabyte L1 caches, which may be implemented off-chip.
While other factors certainly play a role in performance results, both of these implementations may suffer severe performance degradation on applications whose characteristics are not well-matched to the IPC/clock-rate tradeoff point chosen in the design of key hardware structures. For example, applications with frequently accessed, megabyte-sized data structures that do not fit in the on-chip cache hierarchy may perform less well on the 21164, which is then forced to run frequently at the speed of the board-level cache, than on the PA-8000 with its multi-megabyte L1 Dcache, despite the latter's lower clock speed. Conversely, applications with small working sets and little exploitable instruction-level parallelism (ILP) may effectively waste the large data cache (Dcache) and instruction window of the PA-8000, and run more efficiently on the faster 21164.
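The cache mismatch described above can be quantified with the standard average-memory-access-time relation, AMAT = hit time + miss rate x miss penalty. All numbers below are illustrative assumptions, not measured figures for the 21164 or the PA-8000.

```python
# AMAT = hit time + miss rate x miss penalty, in nanoseconds.
# The hit times, miss rates, and penalty below are hypothetical values
# chosen only to illustrate the small-fast vs. large-slow cache tradeoff.

def amat_ns(hit_ns, miss_rate, penalty_ns):
    """Average memory access time in nanoseconds."""
    return hit_ns + miss_rate * penalty_ns

# Application with a megabyte-sized, frequently accessed working set:
small_fast = amat_ns(hit_ns=2.0, miss_rate=0.30, penalty_ns=50.0)  # thrashes
large_slow = amat_ns(hit_ns=5.0, miss_rate=0.01, penalty_ns=50.0)  # fits
# 17.0 ns vs. 5.5 ns: the larger, slower cache wins for this workload.
```

For an application with a small working set, the miss rates of the two caches converge and the faster hit time wins instead, which is exactly the reversal described in the paragraph above.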
Thus, diversity of hardware requirements from application to application forces microarchitects to implement hardware solutions that perform well overall, but which may compromise individual application performance. Worse, diversity may exist within a single application, where it has been found, for example, that the amount of ILP varies during execution by up to a factor of three (see D. Wall, "Limits of instruction-level parallelism," Technical Report 93/6, Digital Western Research Laboratory, November 1993). Hence, even implementations that are well-matched to the overall requirements of a given application may still exhibit suboptimal performance at various points of execution.
Configurable architectures have been proposed (see, e.g., A. DeHon et al., "MATRIX: A reconfigurable computing device with configurable instruction distribution," Hot Chips IX Symposium, August 1997) to replace fixed hardware structures with reconfigurable ones, so as to allow the hardware to adapt dynamically at runtime to the needs of the particular application. In general, however, these approaches are intrusive and may lead to decreased clock rate and increased latency, either of which may outweigh the performance benefits of dynamic configuration. Accordingly, configurable architectures are currently relegated to specialized applications and have yet to prove effective for general-purpose use.
It would be desirable to provide the flexibility of a non-intrusive or low-intrusive, evolutionary approach to implementing configurability within conventional microprocessors. It is a principal objective of the present invention to do so, and this is achieved, according to the invention, by means of a device and method referred to from time to time herein as a Complexity-Adaptive Processor™, or CAP™, microprocessor device.