Current microprocessors typically include a large number of data flow structures, such as functional units for executing various types of instructions. Certain types of instructions, such as floating point instructions, require several processing steps to execute. Thus, each data flow structure typically comprises a number of individual stages, each of which is designed to perform one of the processing steps required to execute the instruction. Typically, each stage performs its operation in one clock cycle. Thus, if a data flow structure comprises four stages, then it will require four clock cycles to execute an entire instruction.
In order to increase the throughput of the data flow structure, pipelining techniques are used to allow the data flow structure to execute more than one instruction at a time. Specifically, the data flow structure is designed so that with each clock cycle, data required to execute a new instruction is passed to the first stage, while data from the previous instruction is simultaneously passed from the first stage to the second stage. A similar operation is performed for the remaining stages, so that with each clock cycle, data from one instruction is being passed into the data flow structure while processed data from a prior instruction is being passed from the data flow structure to other resources on the microprocessor.
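The stage-to-stage handoff described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the function name `simulate_pipeline`, the instruction labels, and the use of `None` to mark an empty stage are hypothetical choices made for illustration, and each stage is assumed to take exactly one clock cycle.

```python
from typing import List, Optional


def simulate_pipeline(instructions: List[str],
                      num_stages: int = 4) -> List[List[Optional[str]]]:
    """Return the occupant of every stage at each clock cycle.

    Hypothetical model: one instruction may enter per cycle, and every
    stage passes its data to the next stage on each cycle.
    """
    stages: List[Optional[str]] = [None] * num_stages
    pending = list(instructions)
    history: List[List[Optional[str]]] = []
    while True:
        # A new instruction (if any remain) enters the first stage.
        incoming = pending.pop(0) if pending else None
        # All stages shift by one; the last stage's data leaves the structure.
        stages = [incoming] + stages[:-1]
        if all(slot is None for slot in stages):
            break  # nothing left in flight
        history.append(list(stages))
    return history


if __name__ == "__main__":
    for cycle, row in enumerate(simulate_pipeline(["A", "B", "C"]), start=1):
        print(cycle, row)
```

In this model a single instruction occupies the structure for four cycles, while three back-to-back instructions finish in six cycles rather than twelve, which is the throughput gain attributed to pipelining.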
While pipelining significantly increases throughput, conventional data flow structures still suffer from several drawbacks that limit their overall speed. For example, conventional data flow structures are designed with a fixed number of stages. Instruction data must be processed through each stage before it is output onto the data bus. Not all instructions, however, actually require processing by each stage in the structure for proper execution. For example, in a typical fused multiply-add floating point instruction unit, such as that used in the exemplary IBM POWER PC architecture, there are four stages, i.e., the multiply stage, the shift stage, the add stage and the normalization/round stage. For some instructions, such as multiply or divide, processing by all four stages is required. However, for simple floating point additions, only the add and normalization/round stages are required. Nevertheless, because the number of stages is fixed, execution of an add instruction still requires four clock cycles.
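The cost of a fixed stage count can be made concrete with a short calculation. The sketch below assumes the four stages named above and one clock cycle per stage; the mnemonics `fmul` and `fadd`, the `REQUIRED` table, and the helper names are hypothetical labels introduced only for this illustration.

```python
# Stage names follow the four stages listed in the text.
STAGES = ("multiply", "shift", "add", "normalize_round")

# Hypothetical table of which stages each operation actually needs.
REQUIRED = {
    "fmul": set(STAGES),                 # needs all four stages
    "fadd": {"add", "normalize_round"},  # needs only the last two
}


def fixed_latency(op: str) -> int:
    """Cycles spent in a fixed-length structure: always one per stage."""
    if op not in REQUIRED:
        raise KeyError(op)
    return len(STAGES)


def needed_stages(op: str) -> int:
    """Number of stages whose processing the operation actually requires."""
    return len(REQUIRED[op])


def idle_cycles(op: str) -> int:
    """Cycles spent passing through stages that do no useful work."""
    return fixed_latency(op) - needed_stages(op)


if __name__ == "__main__":
    for op in REQUIRED:
        print(op, fixed_latency(op), needed_stages(op), idle_cycles(op))
```

Under these assumptions a simple add spends two of its four cycles in stages it does not need, whereas a multiply uses every stage.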
Another drawback of conventional data flow structures is that each stage consumes a full clock cycle to process data, because the stages operate only at clock cycle intervals. However, different bit patterns of data require different amounts of time to process. Some "best case" bit patterns can be processed in much less time than a clock cycle. But because the stages operate synchronously, the data flow structure is unable to take advantage of these favorable data patterns.
Accordingly, it is an object of the present invention to overcome the above-described problems and to provide further improvements and advantages which will become apparent in view of the following disclosure.