Most modern micro-processors implement floating point instructions in hardware. Since a floating point instruction often requires multiple cycles of execution, a pipeline structure is implemented to allow overlapped execution of these instructions. The pipeline implementation allows an instruction to be accepted every cycle and a result to be produced every cycle. Any blockage in the pipeline creates stalls, which in turn decrease the throughput of the pipeline and lower the performance of the micro-processor.
During a floating point computation it is often necessary to store intermediate results. This is done with a floating point store instruction, which stores a specified floating point register to a specified storage address. In a micro-architecture which has in-order single instruction issue and completion, it is desirable to execute the store instruction in the pipeline along with the other floating point instructions to simplify control and minimize area.
Pipelining floating point store instructions presents a problem. A floating point store instruction may only require one cycle of execution, so executing floating point stores in the same pipeline with other floating point arithmetic instructions increases the latency of the store. The throughput of a pipeline is threatened by the occurrence of stall cycles. Therefore, it is desirable to minimize the occurrence of stall cycles. One source of stall cycles is an instruction that is data dependent on a previous instruction still in the pipeline. Traditionally, the dependent instruction is stalled at the top of the pipeline until the data can be wrapped from the bottom of the pipeline into the input register. However, stalling the instruction at the top of the pipeline blocks other instructions from entering the pipeline. This is a problem addressed by our invention.
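The cost of stalling at the top of the pipeline can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the pipeline depth of three, the instruction and register names, and the simplified timing rule that a result can be read by an instruction entering the pipeline `depth` cycles after its producer entered. It does not model any particular processor.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str                 # hypothetical mnemonic, for readability only
    dest: Optional[str]       # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]     # registers read


def issue_cycles(program, depth):
    """Return the cycle in which each instruction enters the pipeline.

    In-order, single issue: an instruction waits at the top of the pipeline
    until all of its source registers are readable, and every later
    instruction waits behind it.
    """
    ready = {}       # register -> first cycle in which its value can be read
    cycles = []
    next_free = 0    # earliest cycle in which the top of the pipeline is free
    for ins in program:
        operands_ready = max((ready.get(r, 0) for r in ins.srcs), default=0)
        issue = max(next_free, operands_ready)
        cycles.append(issue)
        if ins.dest is not None:
            # Assumed timing: result wraps to the input register after `depth` cycles.
            ready[ins.dest] = issue + depth
        next_free = issue + 1
    return cycles


def stall_cycles(program, depth):
    """Total cycles in which no instruction could enter the pipeline."""
    cycles = issue_cycles(program, depth)
    return cycles[-1] - (len(program) - 1) if cycles else 0


# Hypothetical three-instruction sequence: the store depends on the add,
# while the multiply is independent of both.
program = [
    Instr("fadd", "f2", ("f0", "f1")),
    Instr("fstore", None, ("f2",)),
    Instr("fmul", "f5", ("f3", "f4")),
]

print(issue_cycles(program, depth=3))   # [0, 3, 4]
print(stall_cycles(program, depth=3))   # 2
```

Under these assumptions the store waits two cycles for the result of the add, and the independent multiply, which could otherwise have entered in cycle 2, is held behind it until cycle 4.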
Floating point store instructions have typically been executed in dedicated load/store execution units. There are many disadvantages associated with using a separate load/store unit. There is an additional area cost for the additional unit. The control sequencing of dispatching to and completing from the additional unit is more complex. Additional read ports to the floating point register array are required, thereby increasing its size. In order to eliminate stall cycles using a separate load/store unit, data forwarding paths are required between the floating point execution unit and the load/store unit. These paths may be long and potentially limit the cycle time of the processor. In addition, the forwarding paths potentially contribute to a less than optimal wiring solution for the chip.