This invention relates to computer systems that execute floating point instructions, and more particularly, to a method and system for processing mixed precision floating point store instructions in a floating point pipeline to increase performance and robustness of data dependency resolution.
Most modern microprocessors implement floating point instructions in hardware. Because a floating point instruction often requires multiple cycles of execution, a pipeline structure is implemented to allow overlapped execution of these instructions. The pipeline implementation allows an instruction to be accepted every cycle and a result to be produced every cycle. Any blockage in the pipeline creates stalls, which in turn decrease the throughput of the pipeline and lower the performance of the microprocessor.
During floating point computation it is often necessary to store away intermediate results. This is done through the use of a floating point store instruction, which stores the contents of a specified floating point register to a specified storage address. In a micro-architecture with in-order single or multiple instruction issue and completion, it is desirable to execute the store instruction in the pipeline along with the other floating point instructions to simplify control and to minimize area.
Pipelining floating point store instructions presents a problem. A floating point store instruction may require only one cycle of execution, so executing stores in the same pipeline with other floating point arithmetic instructions increases the latency of the store. The throughput of a pipeline is threatened by the occurrence of stall cycles; therefore, it is desirable to minimize them. One source of stall cycles is an instruction that is data dependent on a previous instruction in the pipeline. Traditionally, such an instruction is stalled at the top of the pipeline until the data can be wrapped from the bottom of the pipeline into the input register. However, stalling the instruction at the top of the pipeline blocks other instructions from entering the pipeline.
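The stall-until-wrap behavior described above can be sketched as follows. This is an illustrative model only; the pipeline depth and the assumption that the wrap path delivers data exactly when the producer reaches the bottom of the pipeline are simplifications, not the actual hardware parameters.

```python
# Illustrative sketch: a data-dependent instruction waits at the top of the
# pipeline until the producer's result can be wrapped from the bottom of the
# pipeline into the input register.

PIPE_DEPTH = 4  # assumed number of pipeline stages (illustrative)

def issue_cycles(producer_issue_cycle, consumer_ready_cycle):
    """Return (actual issue cycle, stall cycles) for a dependent consumer.

    The producer's result reaches the bottom of the pipeline PIPE_DEPTH
    cycles after it issues; only then can the dependent consumer pick the
    result up via the wrap path and enter the pipeline.
    """
    result_available = producer_issue_cycle + PIPE_DEPTH
    actual_issue = max(consumer_ready_cycle, result_available)
    stall = actual_issue - consumer_ready_cycle
    return actual_issue, stall
```

With these assumed numbers, a consumer ready one cycle behind its producer stalls for three cycles at the pipeline entrance, blocking every instruction behind it for that duration.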
Floating point store instructions have typically been executed in dedicated load/store execution units. There are many disadvantages associated with using a separate load/store unit. There is the area cost of the additional unit. The control sequencing of dispatching and completing instructions from the additional unit is more complex. Additional read ports to the floating point register array are required, thereby increasing its size. In order to eliminate stall cycles using a separate load/store unit, data forwarding paths are required from the floating point execution unit to the load/store unit. These paths may be long and can potentially limit the cycle time of the processor. In addition, the forwarding paths potentially contribute to a less than optimal wiring solution for the chip.
A floating point unit typically consists of several pipeline stages, such as multiple pipeline stages for arithmetic computation (e.g., addition and multiplication), a normalization stage, and a rounding stage. Each pipeline stage may contain a separate instruction, and the stages are connected in an ordered manner. As an instruction enters the pipeline, the necessary input data operands are accessed and put into the first stage of the pipeline. The instruction advances from stage to stage within the pipeline as permitted. An instruction is considered to “stall” within the pipeline when forward progress is not allowed. An instruction is not permitted to advance to a new stage when the successive pipeline stage contains a previous instruction that itself cannot advance. Furthermore, an instruction cannot begin to execute until its data is available, and the data is not available when an earlier instruction in the pipeline has yet to update it. This is referred to as a data dependency. For this reason, the dependent instruction will “stall” at the entrance to the pipeline until it receives the updated data.
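The stage-by-stage advance/stall mechanics described above can be modeled with a small simulation. The stage names, the number of stages, and the per-instruction latencies below are illustrative assumptions, not the actual floating point unit design.

```python
# Minimal sketch of in-order pipeline advance/stall behavior: an instruction
# moves forward only when the next stage has drained, and back-pressure from
# an occupied stage stalls everything behind it.

STAGES = ["EX1", "EX2", "NORM", "ROUND"]  # assumed stage names

def simulate(program):
    """program: list of (name, cycles_in_current_stage) tuples, in issue order.
    Returns a per-cycle trace of which instruction occupies each stage."""
    pipe = {s: None for s in STAGES}
    queue = list(program)
    trace = []
    while queue or any(pipe.values()):
        pipe["ROUND"] = None  # result written back; final stage drains
        # Walk from the back of the pipeline forward so each instruction
        # advances only if its successor stage has just freed up.
        for i in range(len(STAGES) - 2, -1, -1):
            cur, nxt = STAGES[i], STAGES[i + 1]
            instr = pipe[cur]
            if instr is None:
                continue
            name, remaining = instr
            if remaining > 1:             # still executing in this stage
                pipe[cur] = (name, remaining - 1)
            elif pipe[nxt] is None:       # successor free: advance one stage
                pipe[cur], pipe[nxt] = None, (name, 1)
            # else: stall in place, back-pressuring earlier stages
        if queue and pipe[STAGES[0]] is None:
            pipe[STAGES[0]] = queue.pop(0)
        trace.append({s: (pipe[s][0] if pipe[s] else "-") for s in STAGES})
    return trace

# A two-cycle multiply followed by dependent-free add and store:
trace = simulate([("fmul", 2), ("fadd", 1), ("fstore", 1)])
```

In this run the `fadd` is held out of the pipeline for one cycle because the multi-cycle `fmul` still occupies the first stage, illustrating how a blocked stage propagates stalls backward to the pipeline entrance.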
It would be desirable to be able to allow a store to issue into a pipeline before the data associated with the store is available. This would result in a smaller elapsed time between, for example, an arithmetic instruction and a subsequent store instruction that is dependent on data from the arithmetic instruction. This has been performed for architectures that treat the register file as bits of memory without any format conversion (e.g., IBM zSeries); the method applies only when there is no data type mismatch in the pipeline. See, for example, U.S. Pat. No. 6,049,860 to Krygowski et al., of common assignment herewith. It would be desirable to extend the concept of “store folding” as disclosed in the referenced patent to architectures where the register file is treated as a given format (e.g., IBM PowerPC).
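The latency benefit of issuing the store before its data is available can be sketched with a simple cycle count. The pipeline depth and the one-cycle issue separation below are illustrative assumptions used only to show the shape of the savings, not measured figures for any of the machines named above.

```python
# Hedged sketch of the motivation for "store folding": the store issues
# immediately behind the producing arithmetic instruction and picks up its
# data inside the pipeline, rather than stalling at issue until the result
# has been wrapped back to the input register.

PIPE_DEPTH = 4  # assumed pipeline depth (illustrative)

def elapsed_without_folding(arith_issue=0):
    # The store waits at the top of the pipeline until the arithmetic
    # result reaches the bottom, then traverses the pipeline itself.
    store_issue = arith_issue + PIPE_DEPTH
    return store_issue + PIPE_DEPTH

def elapsed_with_folding(arith_issue=0):
    # The store issues the very next cycle; the data is forwarded to it
    # within the pipeline once the producer's result is ready.
    store_issue = arith_issue + 1
    return store_issue + PIPE_DEPTH
```

Under these assumptions the folded store completes three cycles earlier, and, just as importantly, it vacates the pipeline entrance so that later instructions are not blocked behind it.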