1. Field of the Invention
The present invention relates to computer hardware and more specifically relates to power consumption in a microchip.
2. Description of the Prior Art
As computer speed and computational power increase with advancing technologies, computing devices consume more power and emit more heat. This power problem is especially apparent in general purpose computers, where the computer architecture is designed to solve generic problems. General purpose superscalar computers are typically optimized for random instruction sequences that can contain uncorrelated and non-repetitive instructions that load and store from one cycle to the next, where each cycle requires unique address translations and cache directory searches.
A general purpose computer architecture, such as a reduced instruction set computer (RISC), designed to solve generic problems performs its functions well. However, it does not consume power efficiently. For example, in a RISC-based computer, during the execution of special scientific applications that involve tight loops, many components in the computer are not actively used, but nevertheless consume power and emit heat.
In one example of a tight loop, a central processing unit (CPU) holds all the instructions of the loop in its internal registers, so it does not need to fetch any additional instructions and needs only to fetch the operands on which it operates.
A primary problem in scientific computing is the long execution of tight loops, such as a DAXBY floating point multiply add loop. In such an operation, the utilization of all required units is very near 100% for long periods (milliseconds). The heat generated can be greater than can be absorbed by the thermal constant of a silicon chip.
An example of DAXBY 100 is illustrated in FIG. 1. The example illustrates a tight loop of five instructions:
    LFDU: Load Float Double with Update (operand 1)
    LFDU': Load Float Double with Update (operand 2)
    FMADD: Float Multiply Add
    STFDU: Store Float Double with Update
    BC: Branch Conditional
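For orientation, the five-instruction loop above corresponds roughly to the following source-level loop. This is a minimal sketch only: the function name daxpy_sketch, the parameter names, and the choice to store the result back into the second operand array are assumptions made for illustration and are not taken from FIG. 1.

```c
#include <stddef.h>

/* Hypothetical illustration: y[i] = a * x[i] + y[i] for i in [0, n).
 * Each iteration needs two operand loads (x[i] and y[i]), one
 * multiply-add, one store, and one loop branch, which is the same
 * instruction mix as the LFDU, LFDU', FMADD, STFDU, BC sequence. */
void daxpy_sketch(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];   /* load, load, multiply-add, store */
    }                             /* conditional branch back to the top */
}
```

Because every iteration uses the load, floating point, store, and branch resources, a loop of this shape keeps all of those units busy on each pass.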
In a RISC-based computer, when this loop is executed, instructions are fetched from an instruction cache (Icache) and operands are fetched from a data cache (Dcache). The address of operands is stored in a register during the execution cycle and the result of calculation is stored in a register. The result in the register is read a few cycles later and sent to the Dcache, from where it is written back to the memory. The registers are mostly used for timing purposes during the execution cycle and separate memory access from the actual computation.
Clock gating of unused components, such as effective address generation and register file reads and writes, is of no value in the loop case, because all of these functions are required every cycle.
Peak Dcache power can be avoided by banking the cache into 16 or more double-wide (DW) interleaved banks (4 KB each for a 64 KB L1 Dcache) as shown in FIG. 2. Such an arrangement reduces Dcache power at 100% load and 100% store utilization by a factor of 16, because only the bank containing the required data is read or written.
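The bank selection implied by such an interleaved arrangement can be sketched as follows. This is an illustration under stated assumptions, not the arrangement of FIG. 2: it assumes that one DW unit is 8 bytes and that consecutive DW units map to consecutive banks using the low-order address bits. The names BANK_COUNT, DW_BYTES, BANK_BYTES, and dcache_bank are hypothetical.

```c
#include <stdint.h>

/* Assumed geometry: 16 banks of 4 KB each, giving a 64 KB Dcache.
 * DW_BYTES = 8 is an assumption about the interleave granularity. */
enum { BANK_COUNT = 16, DW_BYTES = 8, BANK_BYTES = 4096 };

/* Return the index of the single bank that holds the DW unit at addr.
 * Only this bank would need to be activated for the access; the other
 * BANK_COUNT - 1 banks can stay idle. */
unsigned dcache_bank(uint64_t addr)
{
    return (unsigned)((addr / DW_BYTES) % BANK_COUNT);
}
```

Under these assumptions, a unit-stride stream of 8-byte operands visits banks 0 through 15 in turn, so each access touches one sixteenth of the array, consistent with the factor of 16 stated above.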
However, for the functional units in a tight loop case, where utilization of all units is almost 100%, the power and power density can be too high and would greatly limit the operating frequency of the processor core. The excessive power consumption and heating cause severe cooling and reliability problems.
Therefore, there is a need for a system to reduce power consumption for loop codes.