1. Field of the Invention
The present invention relates to the field of computers, and more specifically to techniques for pipeline control in processors.
2. Description of the Related Art
The ever-growing requirement for high performance computers demands that computer hardware architecture maximize software performance. One technique for improving computer performance is to increase the number of instructions executed per clock cycle by the processor (xe2x80x9cprocessor throughputxe2x80x9d). Processor throughput may be increased by pipelining, where the processor is divided into separate processing stages (collectively termed a xe2x80x9cpipelinexe2x80x9d). Whereas early computer processors executed only one instruction at a time, pipelined designs allow a new instruction to begin execution before a previous instruction is finished executing, increasing the rate at which instructions can be executed.
Two common types of pipeline operand management that can be implemented in a processor are xe2x80x9cprotectedxe2x80x9d and xe2x80x9cunprotectedxe2x80x9d. They each have an effect on the portability of code and implementation difficulty. Historically, general-purpose microprocessors have implemented protected pipelines, because the original microprocessors were not pipelined and executed code serially. For code compatibility, later designs were required to maintain this xe2x80x9cserial execution modelxe2x80x9d and implement protected pipelines. Digital signal processors, on the other hand, generally did not migrate code from one generation to the next, so they tend to implement unprotected pipelines.
Software for protected pipelines is written as if each instruction is executed in its entirety before the next one is started. This means that hardware must detect dependencies between instructions in order to correctly execute them in parallel. Interrupts (e.g., page faults or divide-by-zero) and unknown latencies (e.g., cache misses or data-dependent instruction timing) are handled through hardware interlocks that track register targets and register sources. In this scheme, once an instruction is issued, it can write its results at anytime thereafter, because all subsequent instructions expect to see its result, and no subsequent instructions expect to see the prior value of that result register.
The unprotected pipeline approach moves complexity out of the hardware and into the software by requiring that the code understand how instructions are executed in parallel and how long each instruction takes to produce a result, and also that the code handle interrupts and unknown latencies. Thus, rather than tracking dependencies between instructions, the hardware simply assumes there are not any. This saves a significant amount of hardware and exposes the operation of the pipeline to the programmer.
Efficient software can be written for an unprotected pipeline. If a register-writing instruction requires more than one clock cycle to complete, a register may be read after a write to it is initiated, but before that write actually occurs. Since the hardware is not checking dependencies between instructions as in protected pipelines, it can intentionally read xe2x80x9cstalexe2x80x9d values from the register file before they are updated. The ability to read xe2x80x9cstalexe2x80x9d values provides two benefits. First, the program gets to see a slightly larger register set for a short period of time, and second, the scheduling of operands in loops can be made simpler (which reduces code size). Multiple Assignment Code is code that relies on reading these xe2x80x9cstalexe2x80x9d values from registers.
Multiple Assignment Code
Multiple assignment code can increase the apparent size of a register file and decrease the size of iterative code loops. The following example in TABLE 1 shows a typical case where a value is loaded into register R0 and then added to another value, contained in register R1. (In all of the following examples, the LOAD instructions require a two cycle latency period to write their result into the register file, and all other instructions require one cycle to write their result.)
FIG. 1A illustrates a protected pipeline stalling an instruction until a particular value is available. A protected pipeline would detect that register R0 is not available until cycle 3 (because of the latency) and stall the ADD until it is available 70. FIG. 1B illustrates an unprotected pipeline using a xe2x80x9cstale valuexe2x80x9d. An unprotected pipeline relies on the code itself to properly schedule operand accesses. In this case, because there is no cycle (caused by a non-operation or other instruction) between the LOAD and ADD, the value of R0 before the end of the latency period, the xe2x80x9cstale valuexe2x80x9d, will be used by the ADD instruction 72.
A protected pipeline experiences inefficiency by introducing stalls, and an unprotected pipeline experiences inefficiency by including non-operation instructions. If an unprotected pipeline is to appear to operate as a protected pipeline does, it must assume outstanding register results make registers xe2x80x9coff limitsxe2x80x9d until the result has been written. Accordingly, any LOAD requires one register to be allocated the following cycle simply to keep track of where the result is written; for loads with more than one additional cycle of latency and multiple pipelined loads, more than one such register is required. This is not a big problem with two cycle latency loads, but if the latency increases, then so does the number of registers required.
LOAD instructions on the Texas Instruments TMS320C62XX family of processors require five cycles of latency, or at most four registers for scheduling the return of these results. With a register file containing 32 entries, the incremental advantage is to make the register file appear to be about 13 percent larger (four more registers). Future designs, moreover, are likely to increase operation latencies, and not just for LOAD instructions. It is probable that all operations will require more than one cycle of latency, and LOADs could require as many as a dozen. It should be obvious that, without support for a multiple assignment code, a register file would have to be unreasonably large just to schedule operands. Accordingly, what is needed is an apparatus and method that supports multiple assignment code to control a processor pipeline and allows for pipeline stalls and interruptions.
The present invention is an apparatus and method that supports multiple assignment code, comprising a register that may be assigned multiple values, instructions that receive a value and dispatch a result, a stall queue that stores the results of an instruction during a pipeline stall, and an interrupt queue that stores the state of the register when an interrupt is taken. The value assigned to a register may be available for processing only after a latency period has passed. The value received by an instruction from a register is the most recent value assigned to the register for which the latency period has passed. The result of an instruction that requires an N unit latency period appears in a register file before the Nth following group of instructions are dispatched, and after the group of instructions immediately preceding the present one. The stall queue eliminates the need for global stalls in the context of various pipeline implementations. The interrupt queue allows for interruptibility.
The present invention further comprises a mode where the software identifies that it does not intend to exploit multiple assignment code.