1. Field of the Invention
The present invention generally relates to computer systems and processors. More particularly, the present invention relates to a system and method for reducing the power consumed in a floating point unit of a processor by minimizing the number of bits participating in the floating point calculations performed in a loop.
2. Description of the Prior Art
Power conservation is an increasing concern in both computer system and processor design. The components of the processor, such as the logic gate transistors, buses and registers, generate heat as they conduct current during computer operations. The dramatic increase in the number of components on a processor chip has exacerbated the problems associated with heat generation, as more components yield more heat during operation.
There have been several attempts in the prior art to alleviate processor power consumption problems. One method is simply to operate the processor at a lower power level and clock frequency. Another solution has been to create modes within the processor that deactivate system power to components in a computer system when they are not in use. Such processors include power-down circuitry that controls the power delivered to the functional units of the processor, and power to an individual unit is cut when it is determined that the unit is not needed during the current operational cycle. However, this approach adds to the manufacturing cost of the processor, and activating and deactivating the units creates significant overhead that degrades the overall performance of the processor.
Furthermore, in specific computer programs, a large iterative sequence can reuse the same series of components such that the components can become overheated and damaged from execution of the iterative program. The constant use of a particular series of processor components is especially acute in scientific computing that utilizes tight loop computing, such as a DAXPY floating point multiply add loop. In a prior art 64-bit Floating Point Multiply Adder (FPMADD) shown in FIG. 1, the utilization of the FPMADD approaches 100% since the entire FPMADD unit is used each cycle. At very high clock frequencies, and especially when dynamic logic is employed, the power generated by a 64-bit FPMADD unit will be very high, possibly 3 to 5 times that of the floating point unit, and the power density of the multiply add unit can approach 3 to 5 times the maximum allowable of about 1 Watt/mm².
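For illustration only, the DAXPY loop referred to above can be sketched in a few lines of Python. The function name `daxpy` and the list-based interface are assumptions made for this sketch and are not taken from the prior art described; the point is simply that every iteration performs one multiply and one add, which is why a multiply add unit is kept busy on each cycle of such a loop.

```python
def daxpy(a, x, y):
    """Return a*x + y element by element (illustrative DAXPY kernel).

    a is a scalar; x and y are equal-length sequences of floats.
    Each loop iteration performs exactly one multiply followed by one add.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    result = []
    for xi, yi in zip(x, y):
        # One multiply-add per element: the operation a hardware
        # multiply add unit would execute on every iteration.
        result.append(a * xi + yi)
    return result


if __name__ == "__main__":
    print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

The sketch models only the arithmetic of the loop; it does not model the rounding behavior, timing, or power characteristics of any particular hardware unit.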
One feature provided in state of the art processors is the availability of floating point operations. In early designs, because of processor design complexity, such features were provided via a separate co-processor. In modern processors, such floating point functionality is provided in the main processor in a floating point unit, and most modern processors clock the floating point circuitry even when no floating point operations are being executed and no floating point registers are in use. The floating point unit and processor are actuated by micro-code instructions that direct the loading and storing of floating point values.
Furthermore, most modern processors implement floating point instructions in hardware, wherein a floating point instruction often requires multiple cycles of execution, and therefore a pipeline structure can be implemented to allow overlapped execution, particularly of the floating point instructions. In such a pipelined implementation, a floating point instruction can be accepted every cycle and a result produced every cycle. Any miscalculations or blockages in the pipelined instructions create “stalls,” which in turn decrease the throughput of the pipeline and lower the performance of the processor. During a floating point computation, it is often necessary to store away intermediate results. This is done through the use of a “floating point store” instruction that stores the value in a specified floating point register to a specified storage address. In a micro-architecture that has in-order single instruction issue and completion, it is desirable to execute the store instruction in the pipeline along with the other floating point instructions to simplify control and minimize usage of chip components.
Pipelining floating point store instructions, however, presents several problems. A floating point store instruction may require only one cycle of execution, yet executing floating point stores in the same pipeline with other floating point arithmetic instructions increases the latency of the store, and the throughput of the pipeline is threatened by the occurrence of stall cycles. It is thus desirable to minimize the occurrence of stall cycles within the floating point pipeline. One source of stall cycles occurs when an instruction is data dependent on a previous instruction in the pipeline. Traditionally, the dependent instruction is stalled at the top of the pipeline until the data can be wrapped from the bottom of the pipeline into the input register, but stalling the instruction at the top of the pipeline blocks other instructions from entering the pipeline.
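The cost of such data-dependent stalls can be illustrated with a simplified cycle-count model, given here in Python. The model is an assumption made for illustration and does not describe any particular processor: it supposes an in-order, single-issue pipeline of a hypothetical fixed `depth`, no forwarding paths, and a dependent instruction that may enter the pipeline only once its predecessor has left the bottom. The function name `total_cycles` and its parameters are likewise hypothetical.

```python
def total_cycles(depends_on_prev, depth):
    """Count cycles to complete a sequence in a simplified in-order pipeline.

    depends_on_prev[i] is True when instruction i needs the result of
    instruction i-1. depth is the (assumed) number of pipeline stages.
    """
    next_issue = 0      # earliest cycle the next instruction may enter
    prev_issue = None   # cycle at which the previous instruction entered
    last_done = 0       # cycle at which the latest instruction completed
    for dependent in depends_on_prev:
        issue = next_issue
        if dependent and prev_issue is not None:
            # Stall at the top of the pipeline until the previous result
            # has emerged from the bottom and can be wrapped to the input.
            issue = max(issue, prev_issue + depth)
        prev_issue = issue
        last_done = issue + depth
        # Single issue: later instructions are blocked behind this one.
        next_issue = issue + 1
    return last_done


if __name__ == "__main__":
    independent = [False, False, False, False]
    dependent = [False, True, True, True]
    print(total_cycles(independent, depth=4))
    print(total_cycles(dependent, depth=4))
```

Under these assumptions, four independent instructions in a four-stage pipeline overlap almost fully, whereas a chain of four dependent instructions executes essentially serially, since each stalled instruction also blocks everything behind it.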
Another problem with pipelining occurs with floating point store instructions that are executed in dedicated load/store execution units. The control sequencing for dispatching instructions to, and completing instructions from, an additional floating point load/store unit is complex, and additional read ports to the floating point register array are required. In order to eliminate stall cycles when using a separate load/store unit, data forwarding paths are required that forward data from the floating point execution unit to the load/store unit. These paths are long, limit the cycle time of the processor, and require additional chip components and chip area for implementation.
It would therefore be advantageous to provide a system and method that can reduce the power consumed in a tight loop of floating point calculations through propagation of data that minimizes component usage during successive iterations. Such a system and method should be robust and should not require significant overhead in processor manufacture or operation. Nor should the system and method unnecessarily operate the circuitry of the processor or a co-processor in assisting the floating point unit in the iterative calculations. It is thus to the provision of such a system and method that the present invention is primarily directed.