The invention has particular relevance to the problem of efficiently utilizing multiple execution units in a general purpose computer with separate instruction and data caches, a so-called Harvard architecture machine. The computer may be of either the RISC (reduced instruction set computer) type or the CISC (complex instruction set computer) type, but a RISC instruction set will potentially yield a higher number of instructions in parallel because its register-oriented operands are few compared to the general operands in CISC machines, which usually operate with either 16 or 32 bit operand addresses.
Current trends in the development of increasing parallelism in computers go in two main directions. One applies multiple general purpose processors; the other uses single processors that exploit the inherent parallelism of programs through internal parallelism, with simultaneous execution of multiple instructions.
The main difference between the two directions is the way the parallelism is visible to the programmer. In the multiple processor case, the coarse grained parallelism (CGP) must be utilized by programming parallel algorithms to run on the parallel machine. In the single processor case, the fine grained parallelism (FGP) that exists in any program will be utilized without intervention from the programmer.
Any form of parallelism requires housekeeping to keep the data shared by the different computing resources in the machine consistent. In the CGP case this housekeeping is done mainly by software, with a shared memory system as the communication channel between processors. The memory hierarchy in modern computer systems contains one or more caches per processor, and for efficiency reasons the contents of the caches in a CGP system must be kept coherent by a coherency mechanism, so that the caches need not be flushed for every communication transfer.
In the FGP case the data consistency is obtained by shared registers between the processing elements. For this purpose a multiported register file with scoreboarding is one solution for making the coherency problem invisible to the program. In a CGP machine each processor is a general purpose computer with its own instruction fetch and issue mechanisms, and it executes a general computer program fetched from its own channel to the memory system and with its own cache.
An FGP machine has one common instruction issue unit for a number of execution units. The state of the art is an instruction issue unit, or dispatch unit, where the instructions are fetched in sequence from the instruction cache. Referring to FIG. 2, the instruction is decoded and dispatched in sequence to one or more special functional units. When the latency of a special functional unit is larger than the dispatch time, there might be parallel operations going on.
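The overlap described above can be illustrated with a minimal timing sketch. It assumes one instruction is dispatched per cycle, in order, and that a free functional unit is always available. The function names and the latency values are hypothetical and serve only as illustration.

```python
def overlapped_completion(latencies):
    """Cycle at which the last operation finishes when one instruction is
    dispatched per cycle, in order (assumes a free unit is always available)."""
    if not latencies:
        return 0
    # Instruction i is dispatched in cycle i and occupies its unit for lat cycles.
    return max(i + lat for i, lat in enumerate(latencies))


def serial_completion(latencies):
    """Cycle count when each operation must finish before the next one starts."""
    return sum(latencies)


# Hypothetical functional-unit latencies, in cycles, for four instructions.
example_latencies = [1, 3, 1, 2]
print(overlapped_completion(example_latencies), serial_completion(example_latencies))
```

Whenever a latency exceeds the one-cycle dispatch time, the overlapped figure is lower than the serial one, which is the parallelism the paragraph refers to.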
The ability of the instruction cache to issue more than one instruction per clock cycle, operating on multiple operands, would classify the computer as a multiple instruction multiple data (MIMD) machine. Different instructions require different resources within the computer. A straightforward parallel fetch of multiple general instructions from the instruction cache to special functional units makes the control of the different execution units quite complicated.
Referring to FIG. 1, full crossbar connectivity can enable instructions to be dispatched in parallel to special functional units. The crossbar network is placed after the decode units. When more than two instructions are to be treated in parallel, the complexity grows fast: the size of the crossbar network needed grows as the square of the number of instructions fetched in parallel. Furthermore, the extra delay added by the crossbar function at this point comes as a direct extension of the processor's cycle time.
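The quadratic growth can be made concrete with a short sketch. It assumes a full crossbar with as many functional-unit ports as decode slots, so that every decode output can reach every unit. The function name is hypothetical.

```python
def crossbar_crosspoints(n_parallel):
    """Crosspoints in a full n-by-n crossbar.

    Assumption: one functional-unit port per decode slot, so each of the
    n decode outputs needs a switch to each of the n unit inputs.
    """
    if n_parallel < 1:
        raise ValueError("at least one instruction slot is required")
    return n_parallel * n_parallel


# Doubling the number of instructions fetched in parallel quadruples the network.
for n in (1, 2, 4, 8):
    print(n, crossbar_crosspoints(n))
```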
When each execution unit is general purpose and can execute all the different processor instructions, the crossbar network is unnecessary. This scheme is depicted in FIG. 4. This simpler approach comes at the expense of hardware resources that are poorly utilized.
Even if the problem of routing the different instructions to their respective execution units were solved, other obstacles that inhibit efficient execution of the parallel instructions would remain. These are: a) dependencies between operands belonging to different instructions, b) data dependent program flow because of conditional branches, and c) lack of a free, relevant execution unit.
Regarding a), the relevant data dependencies are the ones that belong to the fraction of the program that resides in the processor (at different stages of computation) at any given time. For such dependencies there are two main solutions. Either the processor must keep track of the right sequence of the instructions to which the dependent operands belong and introduce extra cycles to finish the computation in correct logical order, or the compiler must produce instructions in such an order that no conflicts arise that would cause the processor to perform an irrelevant computation, i.e. one using wrong operands. Although no preferred solution to this problem will be discussed in this context, one commonly preferred hardware solution is usually referred to as "scoreboarding".
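The hardware approach can be sketched in a much simplified form. The example below is not a full scoreboard: it only checks, for a window of instructions considered for simultaneous in-order issue, whether an instruction reads or writes a register that an earlier instruction in the same window writes. The Instr type, the register names and the function name are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def issuable_prefix(window: List[Instr]) -> int:
    """Number of leading instructions that can issue together, in order.

    Issue stops at the first instruction that reads (read-after-write) or
    writes (write-after-write) a register still pending from an earlier
    instruction in the same window.
    """
    pending_writes = set()
    count = 0
    for ins in window:
        reads_pending = any(src in pending_writes for src in ins.srcs)
        if reads_pending or ins.dest in pending_writes:
            break  # the processor would insert extra cycles here
        pending_writes.add(ins.dest)
        count += 1
    return count


window = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r5", "r6")),
    Instr("r7", ("r1", "r4")),  # depends on both earlier results
    Instr("r8", ("r9",)),
]
print(issuable_prefix(window))
```

In this window the third instruction needs the results of the first two, so only two instructions can be issued together. A compiler following the second approach would instead try to reorder the instructions so that independent ones are adjacent.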
Regarding b), a high frequency of branches, usually in the 15-20% range, is experienced in all programs. A conditional branch instruction is normally executed in the following sequence. The instruction is fetched, decoded and dispatched. While the instruction awaits execution, dispatching may or may not continue. Once the branch condition is resolved and the branch is taken, the processor must calculate the new target address and continue execution from there. Conditional branch instructions may be devastating for performance because the processor must await their execution before it continues fetching and dispatching other instructions.
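The performance impact can be estimated with a simple model in which every branch adds a fixed number of stall cycles. Only the branch frequency range comes from the text above. The base cycles per instruction and the stall length are hypothetical values chosen for illustration, and the model assumes that every branch stalls the processor.

```python
def effective_cpi(base_cpi, branch_fraction, stall_cycles):
    """Average cycles per instruction when each branch stalls the processor.

    base_cpi:        cycles per instruction without branch stalls (assumed)
    branch_fraction: share of instructions that are branches, 0.0 to 1.0
    stall_cycles:    cycles lost per branch while it is resolved (assumed)
    """
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must lie between 0 and 1")
    return base_cpi + branch_fraction * stall_cycles


# Hypothetical machine: two instructions per cycle at best, three stall cycles per branch.
for fraction in (0.15, 0.20):
    print(fraction, effective_cpi(0.5, fraction, 3))
```

Under these assumed values the stall term is as large as, or larger than, the base term, which shows how quickly branch stalls can erase the gain from parallel issue.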
Regarding c), the number of independent execution units limits the number of instructions that may be executed in parallel. Since any fraction of a program usually contains a mix of different instructions, the different execution units may be of different types. A general single instruction processor is usually a collection of several specialized units with some shared resources like instruction and data buses and registers. A major point is to balance the numbers of the different execution units against the instruction mix found in most relevant computer programs. The suggested solution is to have more than one integer ALU, more than one unit for floating point operations, and more than one branch unit when the total number of execution units exceeds 5-6.
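The effect of the unit mix can be sketched as follows. The example assumes in-order dispatch within one cycle, so that dispatch stops at the first instruction that finds no free unit of its type. The unit counts, the instruction group and the function name are hypothetical.

```python
from collections import Counter


def dispatched_in_cycle(group, units):
    """Number of instructions from group dispatched in a single cycle.

    group: instruction types in program order, e.g. ["int", "fp"]
    units: mapping from instruction type to number of execution units
    Assumption: in-order dispatch that stops at the first instruction
    with no free unit of the required type.
    """
    free = Counter(units)
    dispatched = 0
    for kind in group:
        if free[kind] == 0:
            break  # lack of a free, relevant execution unit
        free[kind] -= 1
        dispatched += 1
    return dispatched


group = ["int", "fp", "int", "int", "branch"]
print(dispatched_in_cycle(group, {"int": 2, "fp": 1, "branch": 1}))
print(dispatched_in_cycle(group, {"int": 3, "fp": 1, "branch": 1}))
```

With two integer units the third integer instruction blocks the group. Adding one integer unit to match this particular mix lets the whole group be dispatched.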