Although computers using RISC architecture are relatively new, their design philosophy and construction are already well known. For a discussion of the general background of RISC, see MIPS R2000 RISC Architecture, G. Kane, 1987, Prentice Hall, N.J.
One RISC design goal is to achieve an average instruction execution rate of one instruction per clock cycle. One way to reduce the average number of cycles required to execute an instruction without increasing the length of the clock cycle is to overlap the execution of multiple instructions. This overlapping of instructions is called "pipelining". Instruction pipelines work by dividing the execution of each instruction into several discrete portions and then executing a different portion of each of several instructions simultaneously. For example, the execution of an instruction might be subdivided into four portions as shown in FIG. 1.
As shown, each instruction is divided into four stages, labelled F, A, M, and W, and four clock cycles are required to execute each instruction. In this example, the mnemonic purposes of the chosen initials are unimportant and are consequently not discussed. By creating an instruction pipeline, a certain amount of operational parallelism is achieved. Parallelism here describes the fact that four instructions are being executed simultaneously. Although each instruction still requires four clock cycles to execute, if a four-stage instruction pipeline is created, a new instruction can be initiated at each clock cycle. When there is an instruction executing in each stage of the pipeline, the pipeline is said to be full. Once the pipeline is full, the effective rate of instruction completion is one instruction per clock cycle. Notice in FIG. 1 that after clock cycle three, when the pipeline becomes filled, each successive clock cycle denotes the completion of one instruction. Because each instruction is broken down into four smaller and simpler parts, the individual parts can be executed faster, allowing the clock cycle to be shortened. As the issue rate for new instructions depends upon the clock cycle, a shorter clock cycle leads to a higher instruction issue rate. Maintaining this rate requires that the pipeline be kept full and that nothing delays the advance of instructions through the pipeline.
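The filling of a four-stage pipeline described above can be illustrated with a minimal sketch. The stage labels F, A, M, and W are taken from FIG. 1. The function names, the zero-based cycle numbering, and the assumption of an ideal pipeline in which no instruction ever stalls are illustrative only and are not part of the original description.

```python
STAGES = ["F", "A", "M", "W"]  # stage labels from FIG. 1


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return schedule[i][c]: the stage instruction i occupies in cycle c, or None.

    Assumes an ideal pipeline: one new instruction is issued per cycle
    and no instruction ever stalls.
    """
    total_cycles = num_instructions + len(stages) - 1
    schedule = []
    for i in range(num_instructions):
        row = [None] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in cycle i + s
        schedule.append(row)
    return schedule


def completion_cycles(num_instructions, stages=STAGES):
    """Zero-based cycle in which each instruction finishes its last stage."""
    return [i + len(stages) - 1 for i in range(num_instructions)]


def render(schedule):
    """Format the schedule as one text row per instruction."""
    return "\n".join(
        f"I{i}: " + " ".join(stage or "." for stage in row)
        for i, row in enumerate(schedule)
    )
```

Under these assumptions, N instructions complete in N + 3 cycles rather than the 4N cycles required without overlap, and once the first instruction has completed, one further instruction completes in every subsequent cycle.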
Problems develop in pipelined computers when the number of cycles required to complete each instruction is not identical. Different instruction latencies (the time required to perform an instruction) result in competition for the same computer resources (memory, buses, etc.) and effectively stall the pipeline. Indeed, variations in instruction latency can negate the advantages of an instruction pipeline.
These effects are particularly severe when floating point arithmetic operations are performed. Depending upon both the type of operation and the size of the operands, the time required to execute a floating point arithmetic operation ranges from one to nineteen clock cycles in a typical computer. These differing operational latencies result in competition for the pipeline's resources, which causes the pipeline to stall and results in fewer than one instruction per cycle being completed. "Stall" is used herein to describe the condition of the pipeline when a given instruction cannot be completely executed within its standard latency. Consequently, other instructions must be held up or "stalled" until the given instruction clears the pipeline.
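The effect of differing latencies on the completion rate can be illustrated with a simplified sketch. It assumes a single execution resource shared in program order and a standard latency of one cycle, and it ignores pipeline fill and drain. The function names and the sample latencies are hypothetical; only the range of one to nineteen cycles comes from the text above.

```python
def finish_cycles(latencies):
    """Cycle count at which each instruction clears a single shared
    execution resource, when instructions are processed in order."""
    finished = []
    cycle = 0
    for latency in latencies:
        cycle += latency  # later instructions wait until this one clears
        finished.append(cycle)
    return finished


def stall_cycles(latencies, standard_latency=1):
    """Cycles lost because instructions exceed the standard latency."""
    return sum(max(0, latency - standard_latency) for latency in latencies)


def instructions_per_cycle(latencies):
    """Average completion rate over the whole instruction sequence."""
    return len(latencies) / finish_cycles(latencies)[-1]
```

For example, a sequence with latencies of 1, 1, 19, and 1 cycles occupies the shared resource for 22 cycles, 18 of which are stall cycles, so the completion rate falls to 4/22 of an instruction per cycle under this model.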
In addition to resource competition, stalls can be caused by a phenomenon known as data dependency. Frequently, successful execution of the current instruction requires results generated by a previous instruction. For example, if two Add instructions are executed consecutively and the second Add uses the result of the first Add as one of its operands, the second Add cannot be executed until the result of the first Add is available. Any condition that delays the execution of the first instruction will therefore delay the next and successive instructions. In the example given, the operands for the second Add are needed in an early stage of the pipeline. As the result of the first Add is not usually available until a later stage of the pipeline, the second Add must usually be stalled until the result of the first Add is available as an operand. Floating point operations are particularly vulnerable to this type of stall as they generally take longer to execute. Additionally, a double-precision floating point operation requires a total of up to 128 bits of data (two 64-bit operands). If the size of the data path into or out of the floating point coprocessor is restricted, there will be a delay or stall to allow for the additional clock cycles necessary to transfer the data into or out of the floating point unit.
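The two sources of delay described above can be sketched as follows. The stage positions (operands read in stage 1, result produced in stage 3), the absence of any result forwarding, the register names, and the assumption that the data path moves one bus-width of data per clock cycle are all hypothetical choices made for illustration; the text does not specify them.

```python
def issue_cycles(program, read_stage=1, result_stage=3):
    """Cycle in which each instruction enters the pipeline, issued in order.

    program is a list of (dest, src1, src2, ...) register names.
    An instruction reads its operands in stage read_stage, and its result
    can first be read in the cycle after it leaves stage result_stage.
    """
    ready = {}      # register -> first cycle in which its value can be read
    issued = []
    next_free = 0   # earliest cycle in which the next instruction may issue
    for dest, *sources in program:
        start = next_free
        for src in sources:
            if src in ready:
                # Delay issue until the operand is readable in read_stage.
                start = max(start, ready[src] - read_stage)
        issued.append(start)
        ready[dest] = start + result_stage + 1
        next_free = start + 1  # later instructions queue behind a stalled one
    return issued


def transfer_cycles(total_bits, path_width_bits):
    """Cycles needed to move operands over a data path of limited width."""
    return -(-total_bits // path_width_bits)  # ceiling division
```

With these assumed stage positions, two consecutive Adds in which the second uses the result of the first, such as `[("r3", "r1", "r2"), ("r5", "r3", "r4")]`, issue in cycles 0 and 3, a stall of two cycles, whereas two independent Adds issue in cycles 0 and 1. Similarly, transferring the 128 bits of two double-precision operands takes two cycles over a 64-bit path and four cycles over a 32-bit path.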
Due to the large variations in the time needed to complete different floating point operations and the impact of stalls caused by data dependency, the operation of the floating point coprocessor plays a major role in determining overall pipeline efficiency.
There is therefore a need for an instruction pipeline which is constructed and controlled so as to minimize instruction latency and minimize or prevent pipeline stalls of any type when performing floating point arithmetic operations, thereby improving overall computer performance.