The present invention relates generally to processor design and more specifically to techniques for caching the values of instruction operands stored in memory.
Various techniques in the field of computer architecture have been developed for increasing processor performance beyond what can be achieved solely via process or circuit design improvements. One such technique is pipelining. Pipelining was extensively examined in "The Architecture of Pipelined Computers," by Peter M. Kogge (McGraw-Hill, 1981). J. L. Hennessy and D. A. Patterson provide a contemporary discussion of pipelining in chapter 6 of "Computer Architecture, A Quantitative Approach" (Morgan Kaufmann, 1990).
Pipeline processors decompose the execution of instructions into multiple successive stages, such as fetch, decode, and execute. Each stage of execution is designed to perform its work within the processor's basic machine cycle. Hardware is dedicated to performing the work defined by each stage. As the number of stages is increased, while keeping the work done by the instruction constant, the processor is said to be more heavily pipelined. Each instruction progresses from stage to stage, ideally with another instruction progressing in lockstep only one stage behind. Thus, there can be as many instructions in execution as there are pipeline stages. The major attribute of a pipelined processor is that a throughput of one instruction per cycle can be obtained, even though each instruction, viewed in isolation, requires as many cycles to complete as there are pipeline stages.
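The relationship between per-instruction latency and overall throughput described above can be illustrated with a short calculation. The following Python sketch assumes an ideal, hazard-free pipeline in which one instruction enters per cycle; the function names are hypothetical and are chosen only for illustration.

```python
def pipeline_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed to complete all instructions in an ideal pipeline.

    Assumes no hazards: the first instruction needs num_stages cycles,
    and each later instruction completes one cycle after its predecessor.
    """
    if num_instructions <= 0:
        return 0
    return num_stages + (num_instructions - 1)


def throughput(num_instructions: int, num_stages: int) -> float:
    """Average instructions completed per cycle over the whole run."""
    cycles = pipeline_cycles(num_instructions, num_stages)
    return num_instructions / cycles if cycles else 0.0


if __name__ == "__main__":
    # A single instruction still takes one cycle per stage.
    print(pipeline_cycles(1, 3))
    # Over a long run, throughput approaches one instruction per cycle.
    print(throughput(1000, 3))
```

Under these assumptions the latency of any one instruction stays equal to the number of stages, while the average throughput approaches one instruction per cycle as the instruction count grows.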
The ability to increase throughput via pipelining is limited by situations called pipeline hazards. Hazards may be caused by, among other things, data dependencies that arise due to the overlapping stages of instruction processing inherent in the pipeline technique. One type of data dependency that frequently arises is associated with an instruction that retrieves an operand from memory into a register. Later instructions that have progressed to a pipeline stage in which an operation using the value stored in that register is to be performed, and instructions depending on the results of operations that use that value, must be stalled until the operand is retrieved from memory. That is, they must wait until the physical memory address of the operand has been determined and the operand has been retrieved, either from a unified or data cache or, in the worst case from a performance point of view, from main memory.
In other words, the inter-stage advance of instructions might have to be stalled until the required operand is retrieved. Otherwise, improper operation would result. To prevent such incorrect behavior, "interlock" logic is added to detect this hazard and invoke the pipeline stall. While the pipeline is stalled, there are stages in the pipeline that are not doing any useful work. Since this absence of work propagates from stage to stage, the term pipeline bubble is also used to describe this condition. The throughput of the processor suffers whenever such bubbles occur.
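The effect of the interlock on a load followed by a dependent instruction can be sketched with a simple model. The Python code below is an illustrative assumption, not a description of any particular processor: it issues at most one instruction per cycle in program order, treats a register as usable a fixed number of cycles after the producing instruction issues, and counts each cycle lost to a stall as a bubble. The names Instr and run_in_order and the latency values are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]        # register written, if any
    srcs: Tuple[str, ...]      # registers read
    is_load: bool = False      # True if the value comes from memory


def run_in_order(program: List[Instr],
                 load_latency: int = 3,
                 alu_latency: int = 1) -> Tuple[int, int]:
    """Return (total issue cycles, bubble cycles) for in-order issue."""
    ready_at: Dict[str, int] = {}  # register -> first cycle it can be read
    cycle = 0
    bubbles = 0
    for instr in program:
        # Interlock: wait until every source register is available.
        earliest = max([ready_at.get(r, 0) for r in instr.srcs], default=0)
        if earliest > cycle:
            bubbles += earliest - cycle
            cycle = earliest
        latency = load_latency if instr.is_load else alu_latency
        if instr.dest is not None:
            ready_at[instr.dest] = cycle + latency
        cycle += 1
    return cycle, bubbles


if __name__ == "__main__":
    program = [
        Instr("r1", ("r0",), is_load=True),  # load r1 from memory
        Instr("r2", ("r1", "r3")),           # uses the loaded value
        Instr("r4", ("r5", "r6")),           # independent of the load
    ]
    print(run_in_order(program))
```

In this model the independent third instruction is also delayed, because in-order issue forces it to wait behind the stalled second instruction.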
Many potential stalls resulting from data hazards can be avoided if a program is compiled initially, or later recompiled, using an optimizing compiler that rearranges program instructions in a manner that is custom tailored to the microarchitecture of the processor. Such optimizing compilers are relatively new, have restricted availability, and do not benefit programs that are already in the field. Rearranging instructions using an optimizing compiler is referred to as static instruction scheduling. The Intel Pentium™ processor is an example of a processor that relies on static instruction scheduling to achieve its full promised performance.
In contrast to static techniques, dynamic instruction scheduling techniques act to rearrange the program instructions at the time the program is running. Dynamic scheduling does not require the use of an optimizing compiler and thus benefits all programs, both new and existing. One dynamic instruction scheduling technique involves the use of largely autonomous execution units that can queue up operations and execute them out of order. One such system is described in U.S. Pat. No. 5,226,126, PROCESSOR HAVING PLURALITY OF FUNCTIONAL UNITS FOR ORDERLY RETIRING OUTSTANDING OPERATIONS BASED UPON ITS ASSOCIATED TAGS, to McFarland et al., issued Jul. 6, 1993, which is assigned to the assignee of the present invention, and hereby incorporated by reference for all purposes.
In the processor described in U.S. Pat. No. 5,226,126, a decoder issues operations simultaneously to all of the execution units, each of which queues up only the operations that require its services. During each cycle, each execution unit can service any operation from its queue that is not subject to an interlock, i.e. it can execute operations out of order. Thus, despite the fact that some operations queued in the execution units may be subject to an interlock due to data (and other types of) hazards, there is a greater chance during any given cycle that an execution unit has useful work to perform.
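The benefit of servicing any queued operation that is not interlocked can be sketched as follows. This Python model is a simplified illustration under stated assumptions and does not reproduce the design of U.S. Pat. No. 5,226,126: it models a single execution unit with one queue, assumes each register is written by at most one queued operation (so only true data dependencies arise), and issues at most one operation per cycle. The names Op, is_interlocked, and run_queue are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Op(NamedTuple):
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    latency: int = 1           # cycles until dest can be read


def is_interlocked(op: Op, earlier: List[Op],
                   ready_at: Dict[str, int], cycle: int) -> bool:
    """True if a source is not yet available or not yet produced."""
    for src in op.srcs:
        if any(e.dest == src for e in earlier):   # producer still queued
            return True
        if ready_at.get(src, 0) > cycle:          # result still in flight
            return True
    return False


def run_queue(ops: List[Op], out_of_order: bool) -> Tuple[List[str], int]:
    """Return (issue order, idle cycles) for one execution unit."""
    queue = list(ops)
    ready_at: Dict[str, int] = {}
    issued: List[str] = []
    cycle = 0
    idle = 0
    while queue:
        # In-order issue may only consider the oldest queued operation.
        candidates = range(len(queue)) if out_of_order else range(1)
        chosen = next((i for i in candidates
                       if not is_interlocked(queue[i], queue[:i],
                                             ready_at, cycle)), None)
        if chosen is None:
            idle += 1
        else:
            op = queue.pop(chosen)
            if op.dest is not None:
                ready_at[op.dest] = cycle + op.latency
            issued.append(op.name)
        cycle += 1
    return issued, idle


if __name__ == "__main__":
    ops = [
        Op("load", "r1", ("r0",), latency=3),
        Op("add", "r2", ("r1", "r3")),
        Op("sub", "r4", ("r5", "r6")),
        Op("or", "r7", ("r4", "r2")),
    ]
    print(run_queue(ops, out_of_order=False))
    print(run_queue(ops, out_of_order=True))
```

In this example the out-of-order unit issues the independent "sub" operation while "add" waits on the load, which reduces the number of idle cycles relative to strict in-order issue but does not eliminate them.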
Out-of-order execution tends to localize the effects of dependencies to a single execution unit. Because the execution units are loosely coupled and execute independently, a stall that affects only one execution unit can be effectively absorbed when that unit is later able to proceed past another execution unit that is held up due to a different dependency. If out-of-order execution were not used, many of the execution units would often be unnecessarily idle. Out-of-order execution results in the execution units doing useful work most of the time.
However, even in processors with loosely coupled execution units capable of out-of-order execution (such as the processor described in U.S. Pat. No. 5,226,126) there is a limit on the number of operations permitted to be outstanding at any time. Thus, given the significant frequency of operations depending (directly or indirectly) on the value of operands read from memory, there is a limit on the degree to which the execution units in even such processors can be kept busy.