1. Field of the Invention
The present invention is related to the field of electronic data processing, and more particularly to a system and method of executing instructions.
2. Background Information
The current trend in microprocessors is to provide maximum speed by exploiting instruction-level parallelism (ILP), both to hide long-latency operations such as memory accesses and to execute multiple instructions at once. Currently, the primary mechanism for doing this is an out-of-order superscalar processor. Such an approach typically uses renaming registers, reservation stations, and reorder buffers (ROBs) to hide latency and, as such, tends to rely on multiple slow, area-intensive, and expensive content-addressable memories (CAMs). In addition, such an approach requires accurate global timing and global communication between the various structures across the entire chip. These constraints are likely to become problematic as technology advances to higher and higher clock rates. In fact, it will eventually become physically impossible to send signals from one side of the die to the other in a single clock cycle.
Counterflow processors provide a competitive alternative to the superscalar approach. Counterflow processors use highly localized communication to resolve scheduling issues and data dependencies.
Sproull et al. first described the counterflow principle in an article entitled "The Counterflow Pipeline Processor Architecture" (R. F. Sproull, I. E. Sutherland and C. E. Molnar, IEEE Design and Test of Computers, Vol. 11, No. 3, pp. 48-59, Fall 1994). Sproull et al. described an asynchronous processor which offered a simple design methodology with many useful properties (including local control and local message passing). These concepts were used by Janik and Lu in the design of a synchronous processor (K. J. Janik and S. Lu, "Synchronous Implementation of a Counterflow Pipeline Processor," Proceedings of the 1996 International Symposium on Circuits and Systems, May 1996).
The basic counterflow processor includes two pipelines flowing in opposite directions. One pipeline (the instruction pipeline or IPipe) carries the instructions up from the fetch or dispatch unit. The other pipeline (the result pipeline or RPipe) carries the operands or results of previously executed instructions down toward the dispatch unit. As an instruction and an operand pass, they "inspect" each other. The instruction checks the operands stored in the result pipeline to see if it needs any of the values. If it does, the instruction takes the operand and carries it along as it proceeds up the instruction pipeline waiting to execute. Meanwhile, the operands in the result pipeline check the instruction's destination to see if the instruction is going to update their value. If so, the operands hold an old copy of the result and they invalidate themselves.
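The mutual inspection described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit described by Sproull et al. or by Janik and Lu: the names `Result`, `Instruction`, and `inspect` are hypothetical, registers are modeled as strings, and the sketch assumes an instruction takes a value only for a source operand it is still missing.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Result:
    """One result-pipeline entry (hypothetical model): register, value, valid flag."""
    reg: str
    value: int
    valid: bool = True


@dataclass
class Instruction:
    """One instruction-pipeline entry (hypothetical model).

    `sources` maps each source register to its captured value,
    or None while the operand has not been captured yet.
    """
    dest: str
    sources: Dict[str, Optional[int]] = field(default_factory=dict)

    def ready(self) -> bool:
        # True once every source operand has been captured.
        return all(v is not None for v in self.sources.values())


def inspect(instr: Instruction, result: Result) -> None:
    """Apply both checks when an instruction and a result pass each other."""
    if not result.valid:
        return  # an invalidated entry carries nothing useful
    # Check 1: the instruction takes any operand it still needs.
    if result.reg in instr.sources and instr.sources[result.reg] is None:
        instr.sources[result.reg] = result.value
    # Check 2: the result is stale if the instruction will overwrite its
    # register, so it invalidates itself.
    if result.reg == instr.dest:
        result.valid = False
```

For example, an instruction with destination `r3` and sources `r1` and `r2` that passes a valid `r1` entry captures its value and leaves the entry valid, while a passing `r3` entry is invalidated. Performing the capture before the invalidation lets an instruction that both reads and writes the same register still pick up the old value.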
If an instruction reaches its corresponding execution unit launch stage and has all of its operands, it is sent off to the execution sidepanels. If, however, it has not received its operands by this stage, it must stall, possibly stalling the instructions following it in the pipeline. Once the instruction has been sent off for execution, it proceeds up the pipeline. The execution sidepanels are clocked at the same rate as the instructions themselves. Therefore, an instruction's values are always at the same stage as the launching instruction. Upon reaching the associated recover stage, the result of the computation is loaded back into the instruction. The exception to this is the case where the execution unit has a variable latency, such as a memory execution unit. In this case, if the result has not yet been computed, the instruction has to stall at the recover stage until the result is ready.
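The launch and recover rules above amount to a small per-stage decision, which might be sketched as follows. The `Action` enumeration, the `stage_action` function, and all of its parameters are hypothetical names introduced for illustration; the sketch assumes each instruction has a single launch stage and a single recover stage.

```python
from enum import Enum


class Action(Enum):
    ADVANCE = "advance"   # move up the instruction pipeline
    LAUNCH = "launch"     # send operands to the execution sidepanel
    RECOVER = "recover"   # load the computed result back into the instruction
    STALL = "stall"       # hold this stage (and possibly those behind it)


def stage_action(stage: int, launch_stage: int, recover_stage: int,
                 operands_ready: bool, launched: bool,
                 result_ready: bool) -> Action:
    """Decide what an instruction does at its current stage."""
    if stage == launch_stage and not launched:
        # Launch only with all operands in hand; otherwise stall.
        return Action.LAUNCH if operands_ready else Action.STALL
    if stage == recover_stage and launched:
        # A variable-latency unit (e.g. memory) may not be done yet.
        return Action.RECOVER if result_ready else Action.STALL
    return Action.ADVANCE
```

With a fixed-latency sidepanel clocked in step with the instruction pipeline, `result_ready` would always be true at the recover stage, so only the launch stage could stall in this model.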
At any point after the instruction has retrieved a result from the execution unit, it monitors the result pipeline for an open slot. A slot is considered open if it was invalidated by a previous instruction or if it has not yet been filled. When an open slot is found, the result is sent down the result pipeline. Once the result is placed in the pipeline, the instruction will not send the result again.
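The open-slot rule can be illustrated with a short sketch. The names `Slot`, `CompletedInstruction`, `is_open`, and `try_send` are hypothetical, and the result pipeline is modeled simply as a list with one optional entry per stage; `None` stands for a slot that has never been filled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """A filled result-pipeline slot (hypothetical model)."""
    reg: str
    value: int
    valid: bool = True


@dataclass
class CompletedInstruction:
    """An instruction that has already recovered its result."""
    dest: str
    result: int
    sent: bool = False


def is_open(slot: Optional[Slot]) -> bool:
    # Open means never filled (None) or invalidated by an earlier instruction.
    return slot is None or not slot.valid


def try_send(instr: CompletedInstruction,
             rpipe: List[Optional[Slot]], stage: int) -> bool:
    """Place the result in the slot at `stage` if it is open.

    Returns True if the result was placed. The result is sent at most once.
    """
    if instr.sent or not is_open(rpipe[stage]):
        return False
    rpipe[stage] = Slot(instr.dest, instr.result)
    instr.sent = True
    return True
```

In this model the instruction would call `try_send` at each stage it passes until the call succeeds; the `sent` flag then prevents the same result from being inserted a second time.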
The local interchange of information and the simple design of a counterflow pipeline (CFP) support longer pipelines and increased processor throughput. Processors such as those described by Sproull and Janik do, however, suffer a number of performance problems. Janik et al. describe some of these problems and a possible solution in "Advances to the Counterflow Pipeline Microarchitecture," presented at High-Performance Computer Architecture-3 in February, 1997. That article describes a Virtual Register Processor (VRP). The VRP moves the register file of the CFP processor to the bottom of the pipelines. This configuration eliminates the startup costs associated with the CFP processors, allows for a revalidate scheme that is far less expensive than a full flush on branch misprediction, and allows instructions to be removed from the instruction pipe when they are completed. In addition, because the register file is placed at the bottom of the pipeline, operands no longer need to travel down the result pipeline, which reduces competition for available slots in the result pipeline.
Unfortunately, allowing instructions to retire out of order eliminates the possibility of precise interrupts. To counter this, Janik et al. describe the use of a reorder buffer (ROB) in combination with the VRP. In place of the register tags, all data values have a ROB tag associated with them that indicates the instruction that has generated or will generate the value. Each data value also includes a valid bit indicating whether the result has been generated yet. These tags are stored in the register file. The ROB also makes recovery from a mispredicted branch much easier.
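The tagging scheme can be pictured with a small sketch of a register file whose entries carry a ROB tag and a valid bit. The names `RegEntry`, `allocate`, and `write_back` are hypothetical, and the tag-match check on write-back is an assumption made for this illustration rather than a detail taken from Janik et al.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RegEntry:
    """Register-file entry (hypothetical model)."""
    rob_tag: int                 # ROB entry of the producing instruction
    valid: bool = False          # set once the result has been generated
    value: Optional[int] = None  # meaningful only when valid is True


def allocate(regfile: Dict[str, RegEntry], reg: str, rob_tag: int) -> None:
    """Record that the instruction in ROB entry `rob_tag` will produce `reg`."""
    regfile[reg] = RegEntry(rob_tag)


def write_back(regfile: Dict[str, RegEntry], reg: str,
               rob_tag: int, value: int) -> bool:
    """Store a result only if `rob_tag` is still the latest producer of `reg`.

    Assumption of this sketch: a result from an older producer is ignored.
    """
    entry = regfile.get(reg)
    if entry is None or entry.rob_tag != rob_tag:
        return False
    entry.value = value
    entry.valid = True
    return True
```

A consumer reading `reg` in this model either gets the value directly (valid bit set) or learns from the tag which ROB entry it must wait on.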
The fundamental problem with the VRP approach is that the instruction pipeline is allowed to stall and can quickly clog the instruction flow. In addition, the VRP architecture, like the CFP processor architectures described above, is limited to launching only one instruction per clock cycle. What is needed is an architecture which provides the benefits of the CFP processor and VRP but which prevents or reduces instruction stalling. In addition, what is needed is a system and method for extending these counterflow architectures such that more than one instruction can be launched per clock cycle.