1. Field of the Invention
The invention relates to the field of computer processor architecture. More particularly, the present invention relates to the field of distributed control in a computer processor.
2. Description of the Related Art
The processing speed of a computer with a processor having a single execution unit is governed primarily by two factors. The first factor is the instruction cycle time. The second factor is the parallelism or complexity of the architecture/microarchitecture and its implementation, i.e., the total amount of computation the execution unit can accomplish with each instruction. Hence, the effective processing speed of such a processor can be boosted by decreasing the cycle time, increasing the parallelism of the microarchitecture, or both.
However, the two factors described above often compete with each other. Optimizing one factor may compromise the other. For example, if the architecture is enhanced to include an instruction for multiplying two-dimensional arrays of floating-point numbers, the cycle time of such an instruction will increase. Conversely, in order to minimize the average cycle time, the instruction set should include only simple instructions. While pipelining can mitigate the increase in cycle time associated with complex instructions, pipelining also results in increased instruction latency.
In the quest for higher performance, various processor architectures have been explored, including data flow processors with multiple execution units capable of simultaneously executing multiple instructions. Commercial implementations include LSI Logic's "Lightning" and Hyundai Electronics' "Thunder" SPARC processors. The basic architecture for a data flow processor evolved from the observation that although conventional software instructions are generally written and executed in a serial program order, in practice operand interdependencies of instructions are rarely at 100%, i.e., it is rare that every instruction must wait until all previous instructions have been executed. Rather, instruction operand interdependencies typically are less than 50%. Therefore, it is possible for several independent instructions to be processed simultaneously by separate execution units in a single cycle. Further, the instruction operand interdependencies can be reduced by using parallelism-oriented software programming languages and compiler techniques.
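As a rough illustration of the observation above, the following Python sketch assigns each instruction of a short serial program to the earliest cycle in which its source operands are available. It is only a simplified model, not the scheduling method of any particular processor: the names `Instr` and `earliest_cycles` are hypothetical, and the sketch assumes an unlimited number of execution units, a latency of one cycle per instruction, and that only true (read-after-write) operand interdependencies matter.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str     # label used only for reporting
    dest: str     # destination register
    srcs: tuple   # source registers


def earliest_cycles(program):
    """Map each instruction name to the earliest cycle it could execute in.

    Assumptions (not from the source text): unlimited execution units,
    one-cycle latency, and only read-after-write dependencies are tracked.
    """
    produced_in = {}  # register -> cycle in which its latest writer executes
    cycles = {}
    for ins in program:  # walk the instructions in serial program order
        # A source never written by the program is available in cycle 0;
        # otherwise the consumer runs one cycle after its producer.
        start = max((produced_in.get(s, -1) + 1 for s in ins.srcs), default=0)
        cycles[ins.name] = start
        produced_in[ins.dest] = start
    return cycles


program = [
    Instr("i0", "r1", ("r8", "r9")),
    Instr("i1", "r2", ("r1",)),        # depends on i0
    Instr("i2", "r3", ("r8",)),        # independent of i0 and i1
    Instr("i3", "r4", ("r2", "r3")),   # depends on i1 and i2
]
print(earliest_cycles(program))  # {'i0': 0, 'i1': 1, 'i2': 0, 'i3': 2}
```

In this example, i0 and i2 have no operand interdependency and could be processed by separate execution units in the same cycle, so the four instructions need three cycles rather than four.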
FIG. 1 is a block diagram showing a conventional data flow processor architecture having multiple execution units. Data flow processor 100 includes an instruction cache memory 110, a prefetch buffer 120, a centralized instruction scheduler 130 with a speculative register file 135, a plurality of execution units 141, 142, . . . 149, and a register file/memory 190. Prefetch buffer 120 is coupled between instruction cache memory 110 and centralized instruction scheduler 130. Register file/memory 190 is coupled to centralized instruction scheduler 130. Each of the plurality of execution units 141, 142, . . . 149 is coupled to centralized instruction scheduler 130.
As each instruction is processed by data flow processor 100, the instruction acquires one of four statuses: pending, issued, completed, and retired. An instruction is pending when it has been fetched from instruction cache 110 and is stored in prefetch buffer 120 while waiting for an available execution unit and/or for one or more source operands. In the issued state, the instruction has been issued by centralized instruction scheduler 130 to an execution unit, e.g., unit 141, and the source operand(s) of the instruction are available. Next, the instruction is completed, i.e., executed by execution unit 141. Finally, the instruction is retired. When the instruction is retired, the appropriate execution unit, e.g., unit 141, is responsible for returning status and result values of the retired instruction to speculative register file 135 of centralized instruction scheduler 130 so as to update the corresponding instruction status of processor 100. In turn, centralized instruction scheduler 130 returns resources such as execution unit 141 to the free pool of execution units and transfers the destination operand value of the retired instruction to register file 190.
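The four-status life cycle described above can be sketched as a small state machine. The Python below is a toy model under stated assumptions, not the circuitry of processor 100: the class `CentralScheduler`, its method names, and its dictionaries are hypothetical stand-ins for centralized instruction scheduler 130, speculative register file 135, and register file 190, and source-operand availability is not modeled.

```python
from enum import Enum


class Status(Enum):
    PENDING = 1
    ISSUED = 2
    COMPLETED = 3
    RETIRED = 4


class CentralScheduler:
    """Toy model of a centralized scheduler; all names are hypothetical."""

    def __init__(self, unit_ids):
        self.free_units = list(unit_ids)  # pool of idle execution units
        self.status = {}                  # instruction name -> Status
        self.unit_of = {}                 # instruction name -> unit running it
        self.results = {}                 # instruction name -> (dest, value)
        self.speculative = {}             # stands in for speculative register file 135
        self.register_file = {}           # stands in for register file 190

    def _advance(self, name, expected, new):
        # Enforce the order pending -> issued -> completed -> retired.
        if self.status.get(name) is not expected:
            raise ValueError(f"{name} is not {expected.name.lower()}")
        self.status[name] = new

    def fetch(self, name):
        self.status[name] = Status.PENDING

    def issue(self, name):
        if not self.free_units:
            return False                  # no execution unit available; stays pending
        self._advance(name, Status.PENDING, Status.ISSUED)
        self.unit_of[name] = self.free_units.pop(0)
        return True

    def complete(self, name, dest, value):
        self._advance(name, Status.ISSUED, Status.COMPLETED)
        self.results[name] = (dest, value)

    def retire(self, name):
        self._advance(name, Status.COMPLETED, Status.RETIRED)
        dest, value = self.results.pop(name)
        self.speculative[dest] = value                  # unit reports its result
        self.free_units.append(self.unit_of.pop(name))  # unit rejoins the free pool
        self.register_file[dest] = self.speculative.pop(dest)  # commit the value


sched = CentralScheduler([141, 142])
sched.fetch("add")
sched.issue("add")
sched.complete("add", "r1", 7)
sched.retire("add")
print(sched.register_file)  # {'r1': 7}
```

Even in this toy form, every status change and every execution unit is tracked by the single scheduler object, which mirrors the centralized bookkeeping described for scheduler 130.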
Since centralized instruction scheduler 130 has multiple execution units 141, 142, . . . 149, theoretically, data flow processor 100 can have a maximum of N instructions issued, where N is equal to the total number of execution units multiplied by the number of stages in each execution unit. Hence, with an ideal set of instructions, where there are no operand interdependencies, N issued instructions can also be executed simultaneously. In practice, there will usually be some operand interdependencies between the N issued instructions, and the execution of some of the issued instructions will have to be deferred while awaiting source operand values that become available only after older instructions have completed. Speculative register file 135 serves as a temporary depository for the operand values necessary for satisfying operand interdependencies between instructions issued to the separate execution units. Register file 135 is different from a programmer-visible or permanent register file. For example, a programmer-visible file may contain temporary register values that may be invalidated when the corresponding register-setting instruction is aborted during execution. Instead, register file 135 is transparent or invisible to the programmer and is used as a private pool of registers accessible by centralized instruction scheduler 130. As such, a moderate amount of concurrent processing can take place, thereby enabling data flow processor 100 to accomplish a high instruction throughput relative to the conventional processor with a single execution unit.
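The theoretical limit N stated above is a simple product, shown here as a short Python sketch. The function name `max_issued` and the example figures (nine execution units with four stages each) are hypothetical and chosen only for illustration; the source does not give stage counts.

```python
def max_issued(num_units: int, stages_per_unit: int) -> int:
    """Upper bound N on simultaneously issued instructions.

    N = (number of execution units) x (stages in each execution unit),
    assuming every unit has the same number of stages.
    """
    if num_units < 0 or stages_per_unit < 0:
        raise ValueError("counts must be non-negative")
    return num_units * stages_per_unit


# Hypothetical figures: nine execution units, four stages each.
print(max_issued(9, 4))  # 36
```

The bound is reached only when no issued instruction depends on another; any operand interdependency reduces the number that can actually execute simultaneously.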
Unfortunately, this moderate increase in processing speed is accompanied by an exponential increase in the control circuitry complexity of processor 100 for scheduling every instruction and keeping track of the status of all execution units 141, 142, . . . 149. This control and operand information maintained by centralized instruction scheduler 130 increases significantly as the number of execution units 141, 142, . . . 149 is increased and/or the number of pending instructions in prefetch buffer 120 is increased. As such, conventional data flow processor 100 is severely limited by the capability of its single instruction scheduler 130 to maintain this high volume of control and data information. In addition, the scalability of processor 100 is limited by the increased circuit complexity of centralized instruction scheduler 130 required for controlling execution units 141, 142, . . . 149.