An important goal in processor design is to increase the speed with which a processor executes a program. One way to achieve this goal is to execute more than one operation at the same time. This approach is generally referred to as parallelism, and sometimes as "instruction level parallelism."
Some processors support parallelism by simultaneously issuing operations to independent functional units. The functional units can execute independent operations at the same time. Two types of processors with this capability are superscalar and Very Long Instruction Word (VLIW) machines. The term "Superscalar" usually refers to a machine that can issue multiple operations per clock cycle and includes special hardware to ensure that these operations are not dependent on each other. The term "VLIW" usually refers to machines that rely on the compiler to generate instructions having multiple operations that can be issued simultaneously. Both types of machines can issue multiple operations to independent functional units at the same time. The difference between the two is that the VLIW machine does not make any decision as to whether the operations in an instruction can be issued simultaneously, but instead, relies on the compiler to place independent operations in each instruction.
Another way to support parallelism is to use pipelining. Pipelining is a technique in which the execution of an operation is partitioned into a series of independent, sequential steps called pipeline segments. A typical pipeline might have the following stages: 1) fetch operation; 2) decode operation; 3) read all registers; 4) execute the operation; and 5) write the results to target registers. Pipelining is a form of instruction level parallelism because more than one operation can be processed in a pipeline at a given point in time. Note that even if a processor allows only one operation per instruction, it can still take advantage of parallel processing using a pipelined functional unit to process each instruction. Many modern computers use multiple, pipelined functional units, and therefore use both forms of parallelism outlined above.
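The overlap that pipelining provides can be illustrated with a minimal sketch. It assumes an ideal five-stage pipeline with no stalls, in which one operation enters per cycle; the names `STAGES`, `pipeline_timeline`, and `total_cycles` are hypothetical and chosen only for illustration.

```python
# Stage names follow the five stages listed above (abbreviated).
STAGES = ["fetch", "decode", "read", "execute", "write"]


def pipeline_timeline(num_ops, stages=STAGES):
    """Map each cycle to the (operation index, stage) pairs active in it.

    Assumes an ideal pipeline: one operation enters per cycle, no stalls.
    """
    timeline = {}
    for op in range(num_ops):
        for offset, stage in enumerate(stages):
            timeline.setdefault(op + offset, []).append((op, stage))
    return timeline


def total_cycles(num_ops, num_stages=len(STAGES)):
    """Cycles needed to drain num_ops operations through the ideal pipeline."""
    return num_stages + num_ops - 1 if num_ops else 0


if __name__ == "__main__":
    for cycle, active in sorted(pipeline_timeline(4).items()):
        print(cycle, active)
    print("total cycles:", total_cycles(4))
```

Under these assumptions, four operations finish in eight cycles rather than the twenty that strictly sequential processing would need, because up to four operations occupy different stages in the same cycle.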
An important restriction to parallelism is the interdependency between operations in the program. Generally, a second operation is dependent on a first operation if it uses the result of the first operation. When an operation is dependent on another, the processor cannot completely process the operation until the results of the operation on which it depends are available. Conversely, if an operation is not dependent on another operation, or the dependencies of the operation are satisfied, then the processor can execute the operation, even if it is not executed in the precise order specified in the program. To maximize parallelism, software, hardware, or some combination of both can take a program and optimize it so that the processor's capacity to execute operations at the same time is utilized as much as possible. One way to maximize parallelism using software is to have the compiler analyze dependencies among operations and schedule operations such that independent operations are processed simultaneously. An example of this approach is a compiler for a VLIW machine that can place multiple operations in a single instruction. Processors can also include special hardware to dynamically check for dependencies and schedule operations to execute in parallel.
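A compile-time dependency analysis of this kind might be sketched as follows. This is a simplified illustration, not a description of any particular compiler: it considers only the case where an operation reads a register written by an earlier operation, it assumes each register is written at most once, and it ignores limits on functional units. The `Op` tuple and both function names are hypothetical.

```python
from collections import namedtuple

# Hypothetical operation record: a name, one destination register, source registers.
Op = namedtuple("Op", ["name", "dest", "srcs"])


def depends_on(later, earlier):
    """True if `later` reads the register that `earlier` writes."""
    return earlier.dest is not None and earlier.dest in later.srcs


def schedule_levels(ops):
    """Group operations so that every group contains only mutually independent ones.

    An operation is placed one level after the latest operation it depends on.
    """
    level = []
    for i, op in enumerate(ops):
        producers = [level[j] for j in range(i) if depends_on(op, ops[j])]
        level.append(1 + max(producers) if producers else 0)
    groups = {}
    for op, lv in zip(ops, level):
        groups.setdefault(lv, []).append(op.name)
    return [groups[lv] for lv in sorted(groups)]


if __name__ == "__main__":
    program = [
        Op("a", "r1", ("r0",)),
        Op("b", "r2", ("r0",)),
        Op("c", "r3", ("r1", "r2")),  # uses the results of a and b
        Op("d", "r4", ("r0",)),
    ]
    print(schedule_levels(program))
```

Each inner list corresponds to operations that a VLIW compiler could, in principle, place in the same instruction.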
When the hardware dynamically schedules operations, it must ensure that dependencies between these operations are properly enforced. One way to enforce these dependencies is to use hardware called a "scoreboard" or interlock hardware. The scoreboard or interlock hardware allows the programmer to assume that all operations occur within a single cycle. However, since many operations take more than one cycle, the scoreboard or interlock hardware must detect the dependency of subsequent operations on the result of a previous, incomplete operation. Specifically, if an operation takes more than one cycle to complete and the subsequent operation is dependent upon the previous operation, then the processing system must protect the dependency by stalling the issuance of the subsequent operation until the result of the previous, incomplete operation is obtained.
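The stalling behavior described above can be modeled with a short sketch. It is a software approximation under stated assumptions (in-order issue, at most one operation issued per cycle, and a per-register record of when a pending result becomes readable); it does not reproduce any specific scoreboard design, and the function name and data layout are hypothetical.

```python
def issue_with_interlock(program, latency):
    """Return the cycle in which each operation issues under a simple interlock.

    program: list of (name, dest_register, source_registers) tuples.
    latency: dict mapping operation name to its latency in cycles.
    """
    ready_at = {}      # register -> first cycle in which its result may be read
    cycle = 0          # earliest cycle in which the next operation may issue
    issue_cycles = []
    for name, dest, srcs in program:
        # Stall until every source register holds a completed result.
        start = max([cycle] + [ready_at.get(reg, 0) for reg in srcs])
        issue_cycles.append(start)
        ready_at[dest] = start + latency[name]
        cycle = start + 1
    return issue_cycles


if __name__ == "__main__":
    program = [
        ("op2", "rc", ("ra", "rb")),
        ("op3", "re", ("rc", "rd")),  # depends on the result of op2
    ]
    print(issue_with_interlock(program, {"op2": 3, "op3": 1}))
    print(issue_with_interlock(program, {"op2": 4, "op3": 1}))
```

In this model the dependent operation is simply held back for however long the producing operation takes, so the same program remains correct when the latency changes; the cost is the checking logic itself.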
An example of a known scoreboard or interlock hardware approach is employed in the M32/100 microprocessor (Mitsubishi Electric Corporation). The M32/100 microprocessor has a five stage instruction execution pipeline and uses a hardware interlock mechanism with scoreboard registers to preserve the precedence of an instruction stream. More particularly, the M32/100 has multiple scoreboard registers which correspond to the pipeline stages. A bit is set in the scoreboard register which corresponds to the resource (register) receiving the result of the current instruction. The bit is shifted to the next stage scoreboard register synchronized with the flow of the instruction in the pipeline. When the execution of the instruction is completed, the bit is shifted out of the scoreboard registers. For example, if an instruction with a memory operand reaches the A-stage of the pipeline and tries to fetch the contents of a register or memory with a set scoreboard bit in at least one scoreboard register, the instruction pipeline stalls. The M32/100 microprocessor is more fully described in Yoshida et al., "A Strategy for Avoiding Pipeline Interlock Delays in a Microprocessor," 1990 IEEE International Conference on Computer Design: VLSI in Computers and Processors, Cambridge, Mass. (1990).
Another example of a scoreboard or interlock hardware approach uses tag and ready bit information with each register entry. The ready bit indicates whether the data in a register entry is valid, while the tag indicates the version of the register. Incorrect execution is prevented because operations whose operands are not ready are sent to a wait area until their operands are ready. As a result, instructions which are not ready do not necessarily stall the machine. This approach is more fully described in Uvieghara et al., "An Experimental Single-Chip Data Flow CPU," IEEE Journal of Solid-State Circuits, Vol. 27, No. 1, January 1992.
The special precautions (which prevent the issuance of subsequent dependent operations before the results of the previous, incomplete operations are obtained) can also be designed into the program. Namely, a class of machines exists which has user visible latencies which may coincide with actual hardware latencies. For these machines, programs can be written which guard against premature use of data by inserting a requisite number of operations between an operation and a use of its result. These inserted operations are placed in what are referred to as user visible delay slots. The number of user visible delay slots between an operation and a use of its result depends on the latency. A processing system which executes a program having user visible delay slots need not use complex scoreboard or interlock hardware to determine if a dependency exists.
User visible delay slots are programmed into code by a programmer or a compiler. The programmer or compiler uses detailed knowledge about resource availability and operation timing of the processing system (e.g., hardware latencies) to schedule the operations, inserting no-operations (NOOPs) when nothing else will work. Examples of user visible delay slots are memory load delay slots, branch delay slots and arithmetic delay slots. A load delay slot, for example, defines a number of cycles subsequent to a load operation during which the processing system cannot assume that the load has been completed.
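As a rough illustration of how a code generator might fill delay slots, the sketch below pads a straight-line program with NOOPs so that no operation is issued before its operands are assumed to be ready. It is deliberately simplified: it assumes one operation per cycle and only ever inserts NOOPs, whereas a real compiler would first try to move useful independent operations into the slots. The function name and tuple layout are hypothetical.

```python
NOOP = ("noop", None, ())


def fill_delay_slots(program, assumed_latency):
    """Insert NOOPs so each operation issues after its operands are assumed ready.

    program: list of (name, dest_register, source_registers) tuples.
    assumed_latency: dict mapping operation name to the latency the
                     programmer or compiler assumes, in cycles.
    The list index of each scheduled operation is its issue cycle.
    """
    scheduled = []
    ready_at = {}  # register -> cycle in which its result is assumed ready
    for name, dest, srcs in program:
        earliest = max([ready_at.get(reg, 0) for reg in srcs], default=0)
        while len(scheduled) < earliest:
            scheduled.append(NOOP)  # nothing useful available: pad the slot
        ready_at[dest] = len(scheduled) + assumed_latency[name]
        scheduled.append((name, dest, srcs))
    return scheduled


if __name__ == "__main__":
    program = [
        ("op2", "rc", ("ra", "rb")),
        ("op3", "re", ("rc", "rd")),
    ]
    for cycle, op in enumerate(fill_delay_slots(program, {"op2": 3, "op3": 1})):
        print(cycle, op[0])
```

With an assumed latency of three cycles for op2, two delay slots are produced between op2 and op3. The assumed latency is baked into the emitted code, which is the source of the compatibility problem discussed in the text.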
An advantage of programming with user visible delay slots is that the processing system does not need the complex scoreboard or interlock hardware which is conventionally used to test the processor's registers for potential dependencies after the processor issues each operation. Scoreboard and interlock hardware is complex because it has to check for many dependencies at a high speed within each clock cycle. This special hardware is not only costly, but also tends to increase cycle time.
Although programs with user visible delay slots avoid the need for complex scoreboard or interlock hardware, such programs have a compatibility problem. A major disadvantage of programming with user visible delay slots is that the hardware latencies of the processing system must be known in order for the operations to be properly scheduled. As a result, a processing system which relies on user visible delay slots to prevent out-of-order execution will not be able to correctly execute programs created or compiled for a processing system having different hardware latencies.
Generally speaking, latency is defined as the number of clock cycles between the time an input operand is ready for use by a hardware function and the time that a resultant operand from that function is ready for use by a subsequent hardware function. An assumed latency is the number of cycles which the programmer assumes a processor (which is to execute the program) needs to calculate the result of an operation. A hardware latency is the actual latency of the processor. A processing system typically has a number of processors which have fixed hardware latencies associated with each processor or fixed hardware latencies associated with each operation.
An operation is a command encoded in a processor instruction stream which describes an action which the processor is required to complete. An operation cannot issue until the availability of necessary input operands can be guaranteed. Typically, the necessary inputs and outputs are specified by register names. Furthermore, an instruction is a collection of one or more operations which are assumed by a programmer to be issued within a single cycle.
As technology develops, processors become faster and more powerful. As a result, hardware latencies are always changing. A program which is created or compiled with user visible delay slots for execution by a specific processor will likely not execute properly on processors having different latencies, even if the processors are from the same family of processors. Thus, programming with user visible delay slots is not effective when the latencies of a processing system which executes the program differ from the latencies which were fixed in the program when created or compiled. Accordingly, the conventional uses of user visible delay slots fail to provide compatibility with processing systems having hardware latencies differing from those assumed in the program.
In any case, it is not commercially feasible to recompile a program every time a next generation processor, of a family of processors, is developed. For a number of reasons, vendors and users of programs want as few versions of a program as possible. One major reason is that every new recompile would have to be completely retested which would be not only very time consuming but also expensive. Another reason is that vendors do not want to inventory and support many different versions of a program.
Table 1 illustrates a portion of a program which utilizes user visible delay slots which assume a fixed latency in a conventional manner. Assume that the program shown in Table 1 is compiled for a first type of processor having a hardware latency of three cycles for the operation op2. When compiling source code or writing assembly language programs, delay slot(s) are preferably filled with meaningful operations which could not possibly be dependent on the result of the previous operation. However, if no such operation is available, then a no-operation (NOOP) may be placed in the delay slot. In the example shown in Table 1, two delay slots were coded in the program between operation op2 and the user of its result (i.e., operation op3). The notations ra, rb, rc, rd and re refer to registers in the processing system.
Also assume that subsequent to the program being compiled for the first type of processor, a second type of processor is developed having a hardware latency of four cycles for the operation op2. Although the program executes properly on the first processor, the program will likely not operate correctly on the second processor. In particular, when the second processor begins execution of operation op2 in cycle (0), the result cannot be guaranteed to be returned into register rc until cycle (4). However, in cycle (3), the subsequent operation op3, which is dependent on the value in register rc, begins execution. Accordingly, executing the program with the second processor would yield an incorrect result. As a result, programs using user visible delay slots and compiled for a specific processor cannot be guaranteed to execute properly on earlier or later processor generations.
TABLE 1

CYCLE    OPERATION
(0)      rc = op2(ra,rb)
(1)      op
(2)      op
(3)      re = op3(rc,rd)
(4)
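The failure described for the schedule of Table 1 can be reproduced with a small check. This is an illustrative model only: it assumes a processor with no interlock hardware that issues one operation per cycle exactly as scheduled, and it treats the two filler operations as having no register operands. The names `TABLE_1` and `premature_reads` are hypothetical.

```python
# The schedule of Table 1; the list index is the cycle number.
TABLE_1 = [
    ("op2", "rc", ("ra", "rb")),
    ("op", None, ()),
    ("op", None, ()),
    ("op3", "re", ("rc", "rd")),
]


def premature_reads(schedule, hardware_latency):
    """List (cycle, operation, register) for every read of an unfinished result.

    hardware_latency: dict mapping operation name to its actual latency;
                      operations not listed are assumed to take one cycle.
    """
    ready_at = {}  # register -> cycle in which its result is actually ready
    hazards = []
    for cycle, (name, dest, srcs) in enumerate(schedule):
        for reg in srcs:
            if ready_at.get(reg, 0) > cycle:
                hazards.append((cycle, name, reg))
        if dest is not None:
            ready_at[dest] = cycle + hardware_latency.get(name, 1)
    return hazards


if __name__ == "__main__":
    print(premature_reads(TABLE_1, {"op2": 3}))  # first processor
    print(premature_reads(TABLE_1, {"op2": 4}))  # second processor
```

With a three-cycle op2 no hazard is reported, while with a four-cycle op2 the read of rc by op3 in cycle (3) is flagged, matching the scenario described above.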
One way to avoid this problem is to have the code generator (e.g., compiler) explicitly encode the dependency distance with each instruction. This technique is utilized in the experimental Horizon supercomputing system, which has a shared-memory Multiple Instruction stream, Multiple Data stream (MIMD) computer architecture. The instruction set of the Horizon supercomputing system includes a lookahead field with every instruction. The lookahead field contains a value which is used to control instruction overlap. This value is guaranteed by the code generator to be less than or equal to the minimum distance to the next instruction that depends on the current instruction. That is, the value in the lookahead field indicates the number of additional instructions that may be issued before the current instruction is completed. For example, if the hardware latencies vary from one to eight cycles, then a three (3) bit lookahead field would be added to every instruction.
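The lookahead idea can be sketched as follows. This is an interpretation for illustration, not the Horizon code generator: it takes "distance" to mean the number of instructions that sit between an instruction and its first dependent, it assumes one operation per instruction, and it caps the value at the largest number the field can hold. Both function names are hypothetical.

```python
def lookahead_bits(max_latency):
    """Bits needed for a field holding values 0 .. max_latency - 1."""
    return max(1, (max_latency - 1).bit_length())


def lookahead_values(program, field_bits):
    """Compute a lookahead value for each instruction.

    program: list of (name, dest_register, source_registers) tuples.
    The value is the count of following instructions that may issue before
    the first instruction that reads this instruction's result, capped at
    the largest value the field can encode.
    """
    cap = (1 << field_bits) - 1
    values = []
    for i, (_, dest, _) in enumerate(program):
        value = cap  # no dependent instruction found: allow maximum overlap
        if dest is not None:
            for j in range(i + 1, len(program)):
                if dest in program[j][2]:
                    value = min(cap, j - i - 1)
                    break
        values.append(value)
    return values


if __name__ == "__main__":
    program = [
        ("op2", "rc", ("ra", "rb")),
        ("op", None, ()),
        ("op", None, ()),
        ("op3", "re", ("rc", "rd")),
    ]
    bits = lookahead_bits(8)
    print(bits, lookahead_values(program, bits))
```

Because the value expresses a dependency distance rather than a cycle count, the hardware can decide how long to wait; the instruction stream does not fix a particular latency.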
A disadvantage of the Horizon supercomputing system is that the value in the lookahead field applies to all three operations within the instruction, thereby forcing the value in the lookahead field to the worst case (smallest) value within the instruction. The experimental Horizon supercomputing system is more fully described in Kuehn and Smith, "The Horizon Supercomputing System: Architecture and Software," Supercomputing '88 (IEEE), November 1988; Draper, "Compiling on Horizon," Supercomputing '88 (IEEE), November 1988; and Thistle and Smith, "A Processor Architecture for Horizon," Supercomputing '88 (IEEE), November 1988.
Another approach is to allow the compiler to make latency assumptions to optimize a program and then communicate these assumptions to the processor. A supercomputer, referred to as the Cydra™ 5 supercomputer, is a heterogeneous multiprocessor system having a single numeric processor and one to six interactive processors sharing a common virtual memory system. The supercomputer has a directed dataflow architecture which requires that the latency of every operation be known. The programmer uses a "virtual time" memory latency assumption which perceives each memory request as taking the same amount of time. However, in reality the data may be ready sooner or later than expected. If the access time expected by the compiler is consistently less than the actual access time, the processor spends a significant fraction of its time in a frozen or stalled state. If the expected access time is consistently greater than the actual access time, the length of the schedules generated at compile time are unnecessarily dilated.
The Cydra™ 5 adjusts its nominal memory access time using a programmable memory latency register which contains the assumed memory latency value that the compiler used when scheduling the currently executing code. The memory system uses the value in the memory latency register to decide whether the data from memory is early or late and, consequently, whether the data should be buffered or the processor stalled. If the result is tardy, the processor will stall execution of the subsequent instruction until the result is ready. The memory latency register is used to provide flexibility in a machine when the value or best value of memory latency is unknown.
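The early-or-late decision can be expressed as a minimal sketch. It is a simplified model of the behavior described above, not the actual memory system logic; the function name, its parameters, and the cycle-count return value are assumptions made for illustration.

```python
def handle_memory_result(issue_cycle, arrival_cycle, memory_latency_register):
    """Decide whether returned data is buffered or the processor is stalled.

    issue_cycle: cycle in which the memory request was issued.
    arrival_cycle: cycle in which the data actually returns from memory.
    memory_latency_register: latency the compiler assumed when scheduling.
    Returns (action, cycles): how long the data is held in a buffer, or how
    long the processor stalls waiting for it.
    """
    expected_cycle = issue_cycle + memory_latency_register
    if arrival_cycle <= expected_cycle:
        # Early or on time: hold the data until the schedule expects it.
        return ("buffer", expected_cycle - arrival_cycle)
    # Late: freeze the processor until the data arrives.
    return ("stall", arrival_cycle - expected_cycle)


if __name__ == "__main__":
    print(handle_memory_result(10, 12, 5))  # data arrives early
    print(handle_memory_result(10, 18, 5))  # data arrives late
```

A larger register value trades longer schedules for fewer stalls, and a smaller value does the opposite, which is the tuning choice the paragraph above describes.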
One limitation of this approach is that it improves performance only with respect to memory access. Another limitation is that if the memory latency register is changed to a larger value while there are some outstanding memory requests, the Cydra™ 5 may no longer execute properly. The Cydra™ 5 supercomputer is more fully described in Rau et al., "The Cydra™ 5 Stride-insensitive Memory System," 1989 International Conference on Parallel Processing, 1989.
While these optimizations in code generation techniques can improve performance using simpler dependency checking hardware than register scoreboards or interlock mechanisms, they tend to cause compatibility problems and can even produce incorrect results.