1. Field of the Invention
The present invention is related to the field of electronic data processing, and more particularly to a system and method of executing instructions.
2. Background Information
The current trend in microprocessors is to provide maximum speed by exploiting instruction-level parallelism (ILP), both to hide long-latency operations such as memory accesses and to execute multiple instructions at once. Currently the primary mechanism for doing this is an out-of-order superscalar processor. Such an approach typically uses renaming registers, reservation stations, and reorder buffers (ROBs) to hide latency and, as such, tends to rely on multiple slow, area-intensive and expensive content addressable memories (CAMs). In addition, such an approach requires accurate global timing and global communication between the various structures across the entire chip. These constraints are likely to become problematic as technology advances to higher and higher clock rates. In fact, it will eventually become physically impossible to send signals from one side of the die to the other in a single clock cycle.
Counterflow processors provide a competitive alternative to the superscalar approach. Counterflow processors use highly localized communication to resolve scheduling issues and data dependencies.
Sproull et al. first described the counterflow principle in an article entitled “The Counterflow Pipeline Processor Architecture” published in IEEE Design and Test of Computers in Fall 1994 (see R. F. Sproull, I. E. Sutherland and C. E. Molnar, “The Counterflow Pipeline Processor Architecture,” IEEE Design and Test of Computers, pp. 48-59, Vol. 11, No. 3, Fall 1994). Sproull described an asynchronous processor which offered a simple design methodology with many useful properties (including local control and local message passing). These concepts were used by Janik and Lu in the design of a synchronous processor (K. J. Janik and S. Lu, “Synchronous Implementation of a Counterflow Pipeline Processor,” Proceedings of the 1996 International Symposium on Circuits and Systems, May 1996).
The basic counterflow processor includes two pipelines flowing in opposite directions from one another. One pipeline (the instruction pipeline or IPipe) carries the instructions up from the fetch or dispatch unit. The other pipeline (the result pipeline or RPipe) carries the operands or results of previously executed instructions down toward the dispatch unit. As an instruction and an operand pass, they “inspect” each other. The instruction checks the operands stored in the result pipeline to see if it needs any of the values. If it does, the instruction takes the operand and carries it along as it proceeds up the instruction pipeline waiting to execute. Meanwhile, the operands in the result pipeline check the instruction's destination to see if the instruction is going to update their value. If this occurs, the operands have an old copy of the result and they invalidate themselves.
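The mutual inspection described above may be illustrated by the following minimal Python sketch. It is an illustration only and not a description of any particular implementation; the names Instruction, Result and inspect, and the use of register names as strings, are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Result:
    """A value travelling down the result pipeline (hypothetical model)."""
    register: str
    value: int
    valid: bool = True


@dataclass
class Instruction:
    """An instruction travelling up the instruction pipeline (hypothetical model)."""
    sources: List[str]
    destination: str
    operands: Dict[str, int] = field(default_factory=dict)


def inspect(instr: Instruction, result: Result) -> None:
    """Apply both checks when an instruction and a result pass each other."""
    if not result.valid:
        return
    # The instruction takes any value it needs and does not yet hold.
    if result.register in instr.sources and result.register not in instr.operands:
        instr.operands[result.register] = result.value
    # The result invalidates itself if the instruction will update its register.
    if result.register == instr.destination:
        result.valid = False
```

In this sketch an instruction that both reads and writes the same register captures the old value before that value invalidates itself, which is one plausible ordering of the two checks.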
If an instruction reaches its corresponding execution unit launch stage and has all of its operands, it is sent off to the execution sidepanels. If, however, it has not received its operands by this stage, it must stall, possibly stalling the instructions following it in the pipeline. Once the instruction has been sent off for execution, it proceeds up the pipeline. The execution sidepanels are clocked at the same rate as the instructions themselves. Therefore, an instruction's values are always at the same stage as the launching instruction. Upon reaching the associated recover stage, the result of the computation is loaded back into the instruction. The exception to this is the case where the execution unit has a variable latency, such as a memory execution unit. In this case, if the result has not yet been computed, the instruction has to stall at the recovery stage until the result is ready.
At any point after the instruction has retrieved a result from the execution unit, it monitors the result pipeline for an open slot. A slot is considered open if it was invalidated by a previous instruction or if it has not yet been filled with anything. When an open slot is found, the result is sent down the result pipeline. Once the result is placed in the pipeline, the instruction will not send the result again.
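The open-slot rule above may be sketched as follows. This is a minimal illustration under stated assumptions: the representation of a slot as either None (never filled) or a (value, valid) pair, and the names is_open, PendingResult and try_send, are hypothetical and are not taken from the described processors.

```python
from typing import List, Optional, Tuple

# Assumed representation: None if the slot was never filled, else (value, valid).
Slot = Optional[Tuple[int, bool]]


def is_open(slot: Slot) -> bool:
    """A slot is open if it was never filled or if its contents were invalidated."""
    return slot is None or not slot[1]


class PendingResult:
    """A result held by an instruction until it can enter the result pipeline."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.sent = False

    def try_send(self, slots: List[Slot], index: int) -> bool:
        """Place the result in slots[index] if that slot is open; send at most once."""
        if self.sent or not is_open(slots[index]):
            return False
        slots[index] = (self.value, True)
        self.sent = True
        return True
```

The sent flag models the statement that, once the result is placed in the pipeline, the instruction will not send it again.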
The local interchange of information and the simple design of a counterflow pipeline (CFP) support longer pipelines and increased processor throughput. Processors like those described by Sproull and Janik do, however, suffer a number of performance problems. Janik et al. describe some of these problems and a possible solution in “Advances to the Counterflow Pipeline Microarchitecture,” presented at High-Performance Computer Architecture-3 in February, 1997. That article describes a Virtual Register Processor (VRP). The VRP moves the register file of the CFP processor to the bottom of the pipelines. This configuration eliminates the startup costs associated with the CFP processors, allows for a revalidate scheme that is far less expensive than a full flush on branch misprediction, and allows instructions to be removed from the instruction pipe when they are completed. In addition, by placing the register file at the bottom of the pipeline, operands no longer need to travel down the result pipeline, creating less competition for available slots in the result pipeline.
Unfortunately, allowing instructions to retire out of order eliminates the possibility of precise interrupts. To counter this Janik et al. describe the use of a reorder buffer (ROB) in combination with the VRP. In place of the register tags, all data values have a ROB tag associated with them that indicates the instruction that has generated or will generate the value. Each data value also includes a valid bit indicating whether the result has been generated yet. These tags are stored in the register file. The ROB also makes recovery from a mispredicted branch much easier.
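The tagging scheme described above, in which each data value carries a ROB tag and a valid bit, may be sketched as follows. The sketch is illustrative only. The names TaggedValue, rename_destination and write_result are hypothetical, and the rule that a result is accepted only when its tag matches the stored tag is an assumption of this example rather than a statement from the cited article.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TaggedValue:
    """Register file entry: a ROB tag, a data value and a valid bit."""
    rob_tag: int         # identifies the instruction that generates the value
    value: int = 0
    valid: bool = False  # False until the result has been generated


def rename_destination(register_file: Dict[str, TaggedValue],
                       register: str, rob_tag: int) -> None:
    """Record that the instruction identified by rob_tag will produce this register."""
    register_file[register] = TaggedValue(rob_tag=rob_tag)


def write_result(register_file: Dict[str, TaggedValue],
                 register: str, rob_tag: int, value: int) -> bool:
    """Store a result only if it comes from the most recent producer (assumed rule)."""
    entry = register_file.get(register)
    if entry is None or entry.rob_tag != rob_tag:
        return False
    entry.value = value
    entry.valid = True
    return True
```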
The fundamental problem with the VRP approach is that the instruction pipeline is allowed to stall and can quickly clog the instruction flow. In addition, the VRP architecture, like the CFP processor architectures described above, is limited to only launching one instruction per clock cycle. What is needed is an architecture which provides the benefits of the CFP processor and VRP but which prevents or reduces instruction stalling. In addition, what is needed is a system and method for extending these counterflow architectures such that more than one instruction can be launched per clock cycle.
According to one aspect of the present invention, what is described is a system and method of executing instructions within a counterflow pipeline processor. The counterflow pipeline processor includes an instruction pipeline, a data pipeline, a reorder buffer and a plurality of execution units. An instruction and one or more operands issue into the instruction pipeline, and a determination is made at one of the execution units whether the instruction is ready for execution. If so, the operands are loaded into the execution unit and the instruction executes. The execution unit is monitored for a result and, when the result arrives, it is stored into the result pipeline. If the instruction reaches the end of the pipeline without executing, it wraps around and is sent down the instruction pipeline again.
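The wrap-around behavior of this aspect may be illustrated by the following simplified Python sketch, in which an instruction that is not ready at its execution unit keeps moving rather than stalling. The function name circulate, its parameters, and the modeling of readiness as a single cycle number are assumptions made for illustration. The sketch tracks one instruction only and omits the reorder buffer.

```python
from typing import Optional, Tuple


def circulate(num_stages: int, launch_stage: int, ready_at_cycle: int,
              max_cycles: int = 100) -> Optional[Tuple[int, int]]:
    """Advance one instruction a stage per cycle, wrapping at the end of the pipeline.

    Returns (cycle, laps) at the moment the instruction launches into its
    execution unit, or None if it does not launch within max_cycles.
    """
    stage = 0
    laps = 0
    for cycle in range(max_cycles):
        # Launch only at the stage of the matching execution unit, once ready.
        if stage == launch_stage and cycle >= ready_at_cycle:
            return cycle, laps
        stage += 1
        if stage == num_stages:
            # End of the pipeline reached without executing: wrap around.
            stage = 0
            laps += 1
    return None
```

For example, with four stages and a launch stage of 2, an instruction whose operands are ready from cycle 0 launches on its first pass, while one whose operands become ready at cycle 3 passes the launch stage once, wraps around, and launches on its second pass.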
According to another aspect of the present invention, what is described is a processor and a computer system built using the processor. The processor includes an instruction pipeline having a plurality of stages, a result pipeline having a plurality of stages, an execution unit connected to the instruction pipeline and the result pipeline and a reorder buffer. The reorder buffer supplies instructions and operands to the instruction pipeline and receives results from the result pipeline. The instruction pipeline and the result pipeline wrap around the reorder buffer to create counter rotating queues. The execution unit includes an operand input and a result output, wherein the operand input receives an operand from the instruction pipeline. The execution unit transmits a result to the result output as a function of the operand received by the operand input.
According to yet another aspect of the present invention, what is described is a processor having an instruction pipeline, a result pipeline, first and second execution units and first and second reorder buffers. The first and second execution units are connected to first and second stages, respectively, of the instruction pipeline and the result pipeline. The first reorder buffer supplies instructions and operands to the first stage of the instruction pipeline and receives results from the first stage of the result pipeline. The second reorder buffer supplies instructions and operands to the second stage of the instruction pipeline and receives results from the second stage of the result pipeline.
According to yet another aspect of the present invention, what is described is a computer system having memory and a processor, wherein the processor is capable of executing a plurality of instructions, including a first instruction. The processor comprises a plurality of instruction pipelines, a plurality of result pipelines and a plurality of reorder buffers. Each reorder buffer receives instructions from one instruction pipeline and issues instructions to a second instruction pipeline. In addition, each reorder buffer receives data from one result pipeline and issues data to a second result pipeline. Each reorder buffer includes a register file having a plurality of registers, each register having a data entry and a tag field, and a register alias table having a plurality of register alias table entries, wherein each register alias table entry includes a pipeline field and a register field, wherein the pipeline field shows which instruction pipeline the first instruction was dispatched into and wherein the register field shows the register into which the first instruction will write its result.
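The register file and register alias table of this aspect may be pictured with the following data-structure sketch. The field names follow the text above. The class names, the keying of the table by an architectural register name, and the helper methods record_dispatch and locate are assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Register:
    """Register file entry having a data entry and a tag field."""
    data: int = 0
    tag: Optional[int] = None


@dataclass
class RatEntry:
    """Register alias table entry."""
    pipeline: int  # instruction pipeline the instruction was dispatched into
    register: int  # register into which the instruction will write its result


@dataclass
class ReorderBufferSketch:
    """One reorder buffer with its register file and register alias table."""
    register_file: List[Register]
    rat: Dict[str, RatEntry] = field(default_factory=dict)

    def record_dispatch(self, arch_reg: str, pipeline: int, register: int) -> None:
        """Note where the producer of arch_reg went and where it will write."""
        self.rat[arch_reg] = RatEntry(pipeline=pipeline, register=register)

    def locate(self, arch_reg: str) -> Optional[RatEntry]:
        """Return the alias entry for arch_reg, or None if none is recorded."""
        return self.rat.get(arch_reg)
```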
According to yet another aspect of the present invention, what is described is a method of executing more than one thread at a time. A first and a second reorder buffer are provided. First instructions and first operands associated with the first thread from the first reorder buffer are read and executed, with the result stored in the first reorder buffer, wherein storing the result includes marking the result with a tag associating the result with the first thread. Second instructions and second operands associated with the second thread from the second reorder buffer are read and executed, with the result stored in the second reorder buffer, wherein storing the result includes marking the result with a tag associating the result with the second thread.
According to yet another aspect of the present invention, what is described is a method of recovering from incorrect speculations in a counterflow pipeline processing system having an instruction pipeline and a data pipeline, both of which feed back into a reorder buffer. A mispredicted branch having a first instruction is detected and all instructions occurring after the mispredicted branch are invalidated in the reorder buffer. If the first instruction is in the instruction pipeline and can execute, the instruction is executed and the results associated with that instruction are invalidated when they reach the reorder buffer. If the instruction reaches the end of the instruction pipeline, it is deleted.
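The recovery method of this aspect may be sketched as follows. The sketch is a simplified illustration under stated assumptions: reorder buffer entries are indexed in program order without circular wrap-around, and the class and method names (RobValidity, squash_after, accept_result, reissue_at_end) are hypothetical.

```python
from typing import Dict, List


class RobValidity:
    """Tracks which reorder buffer entries remain valid after a misprediction."""

    def __init__(self, size: int) -> None:
        self.valid: List[bool] = [True] * size
        self.results: Dict[int, int] = {}

    def squash_after(self, branch_index: int) -> None:
        """Invalidate every entry occurring after the mispredicted branch."""
        for i in range(branch_index + 1, len(self.valid)):
            self.valid[i] = False

    def accept_result(self, index: int, value: int) -> bool:
        """Store a result arriving from the result pipeline unless it was squashed."""
        if not self.valid[index]:
            return False  # the result is invalidated on reaching the reorder buffer
        self.results[index] = value
        return True

    def reissue_at_end(self, index: int) -> bool:
        """At the end of the instruction pipeline: reissue if valid, else delete."""
        return self.valid[index]
```

In this sketch nothing inside the pipelines is flushed. A squashed instruction that can still execute does so, and its result is simply discarded when it returns to the reorder buffer.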
According to yet another aspect of the present invention, what is described is a method of controlling data speculation. An instruction is provided and an operand associated with the instruction is obtained. A check is made as to whether the operand is valid and whether the operand is a speculative value, and the operand is marked accordingly. The instruction is then executed in order to generate a result as a function of the operand. If the operand was a speculative value, a nonspeculative value for the operand is obtained and compared against the speculative value and, if the speculative value was correct, the result is saved.
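The data speculation check of this aspect may be sketched as follows. The function name execute_with_speculation and its parameters are hypothetical. The text above does not state what is done when the speculative value proves incorrect, so the sketch merely returns None in that case to indicate that the result is not saved.

```python
from typing import Callable, Optional


def execute_with_speculation(operation: Callable[[int], int],
                             operand: int,
                             is_speculative: bool,
                             fetch_nonspeculative: Callable[[], int]) -> Optional[int]:
    """Execute on the operand; save the result only if any speculation was correct."""
    result = operation(operand)
    if not is_speculative:
        return result
    # Obtain the nonspeculative value and compare it against the speculative one.
    actual = fetch_nonspeculative()
    if actual == operand:
        return result
    return None  # speculation was wrong: the result is not saved (assumed handling)
```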