1. Technical Field
The present invention generally relates to computer processing systems and, in particular, to methods and apparatus for reordering and renaming memory references in a multiprocessor computer system.
2. Background Description
Contemporary high-performance processors rely on superscalar, superpipelining, and/or very long instruction word (VLIW) techniques for exploiting instruction-level parallelism in programs (i.e., for executing more than one instruction at a time). In general, these processors contain multiple functional units, execute a sequential stream of instructions, are able to fetch from memory more than one instruction per cycle, and are able to dispatch for execution more than one instruction per cycle subject to dependencies and availability of resources.
The pool of instructions from which the processor selects those that are dispatched at a given point in time is enlarged by the use of out-of-order execution. Out-of-order execution is a technique by which the operations in a sequential stream of instructions are reordered so that operations appearing later are executed earlier, if the resources required by the later appearing operations are free. Thus, out-of-order execution reduces the overall execution time of a program by exploiting the availability of the multiple functional units and using resources that would otherwise be idle. Reordering the execution of operations requires reordering the results produced by those operations, so that the functional behavior of the program is the same as what would be obtained if the instructions were executed in their original sequential order.
In the case of memory-related operations, a memory store operation stores a datum in memory. A later memory load operation may read this datum from memory, load the datum into a processor register and, as is frequently the case, start a sequence of operations that depend on the datum. When directly bypassing such values from the store operation to a subsequent load operation, a slow main memory access may be substituted by a faster register-to-register access. In addition to using idle resources, the bypassing of such values reduces the critical path (i.e., the sequence of operations which determines the minimum execution time possible for a given code fragment) and reduces the number of memory operations which must be processed by the memory system. An additional performance improvement can be achieved by speculatively executing store operations out-of-order. Another benefit is the ability to reorder multiple store and load references to the same memory location by using a technique referred to as "renaming of memory locations".
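The bypassing of a stored value to a later load can be illustrated with a minimal sketch. The class name `ForwardingStoreQueue`, its methods, and the use of a dictionary to stand in for main memory are assumptions made for illustration only and do not describe any particular hardware.

```python
class ForwardingStoreQueue:
    """Toy store queue: pending stores are kept in program order."""

    def __init__(self, memory):
        self.memory = memory      # dict: address -> value (stands in for main memory)
        self.pending = []         # list of (address, value), oldest first

    def store(self, address, value):
        # The store is held in the queue instead of going straight to memory.
        self.pending.append((address, value))

    def load(self, address):
        # Search youngest-to-oldest so the load sees the most recent pending store.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value, "bypassed"
        # No pending store matches: fall back to the (slow) memory access.
        return self.memory.get(address, 0), "memory"


memory = {0x100: 7}
queue = ForwardingStoreQueue(memory)
queue.store(0x200, 42)
bypass_result = queue.load(0x200)   # served from the queue, no memory access
memory_result = queue.load(0x100)   # no pending store, served from memory
```

In this sketch the load of address 0x200 never touches memory, which is the source of both the shorter critical path and the reduced memory traffic described above.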
In general, there are two basic approaches to implementing out-of-order execution and reordering of results: dynamic reordering and static reordering. In dynamic reordering, the instructions are analyzed at execution time, and the instructions and results are reordered in hardware. In static reordering, a compiler/programmer analyzes and reorders the instructions and the results produced by those instructions when the program is generated, thus the reordering tasks are accomplished through software. These two approaches can be jointly implemented.
To ensure that such operations are performed correctly, there must exist a mechanism to undo speculative memory references. Furthermore, store operations to the same address must be presented to the main memory in original program order.
In a multiprocessor environment, additional restrictions are posed on the ordering of memory operations. To achieve predictable and repeatable computation of programs in a multiprocessor environment, a requirement of 'sequential consistency' is described in the article by L. Lamport, "How to Make a Multiprocessor that Correctly Executes Multiprocess Programs", IEEE Transactions on Computers, C-28(9), pp. 690-91 (September 1979). The article by Lamport defines a multiprocessor system as sequentially consistent if "the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program". For static speculative execution, the order of the original logical program text is authoritative, not the reordered program text, and the compiler and hardware implementation must collaborate to generate an execution equivalent to that original order.
To achieve proper performance while simplifying coherence protocols between multiple processors in a computer system, several relaxations of the above described strictly sequential consistent order are possible. The types of re-ordering which are allowable depend on the memory consistency model guaranteed by a particular implementation. An overview of current and proposed consistency models and their characteristics is provided in the article by S. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial", Technical Report 9512, Dept. of Electrical and Computer Engineering, Rice University, Houston, Tex. (September 1995).
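Lamport's definition can be made concrete by enumerating every sequential order that preserves each processor's program order. The sketch below is illustrative only; the two programs `P1` and `P2`, the location names `X` and `Y`, and the helper functions are hypothetical and are not taken from the code fragments of this description.

```python
from itertools import combinations


def interleavings(prog_a, prog_b):
    """Yield every merge of two programs that keeps each program's own order."""
    total = len(prog_a) + len(prog_b)
    for slots in combinations(range(total), len(prog_a)):
        a, b = iter(prog_a), iter(prog_b)
        yield [next(a) if i in slots else next(b) for i in range(total)]


def run(sequence):
    """Execute one global order against a single shared memory."""
    memory = {"X": 0, "Y": 0}
    loaded = {}
    for processor, operation, location in sequence:
        if operation == "store":
            memory[location] = 1
        else:  # "load"
            loaded[processor] = memory[location]
    return loaded["P1"], loaded["P2"]


# Hypothetical programs: each processor stores 1 to one location,
# then loads the location written by the other processor.
P1 = [("P1", "store", "X"), ("P1", "load", "Y")]
P2 = [("P2", "store", "Y"), ("P2", "load", "X")]

# Set of (value loaded by P1, value loaded by P2) over all legal orders.
outcomes = {run(sequence) for sequence in interleavings(P1, P2)}
```

Under these assumptions no sequentially consistent order lets both processors load 0, so an implementation that produces that outcome is not sequentially consistent.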
Typically, these requirements impose additional restrictions on the order of multiple store operations (even those store operations referring to different addresses), and of load and store operations (executed by the same processor or different processors, and referring to the same address or different addresses) with respect to each other.
While these requirements guarantee the correct operation of programs designed to work in the context of such coherency protocols, they impose limitations on the order of operation as instructions are executed in the processor. To achieve high performance while adhering to processor consistency models, a processor must be able to reorder memory operations internally and bypass results between them, but present the memory operations to the memory system in-order.
Accordingly, it would be desirable and highly advantageous to have support for the following features in a high performance memory interface of a multiprocessor computer system implementing out-of-order execution, so as to provide maximum scheduling freedom:
1. The ability to execute store operations out-of-order, but retire them to memory in-order.
2. The ability to speculatively perform store operations, coupled with the ability to undo such store operations transparently (i.e., without influencing the correctness of program execution in a multiprocessor system).
3. The ability to hold multiple store result values for the same memory address, and resolve load references to these values, while at the same time retiring store values in original program order to the memory system.
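The three features listed above can be sketched together in a small software model. The class `RenamingStoreBuffer`, the use of a program-order sequence number `seq`, and the assumption that every store older than a load has already executed when the load is resolved are simplifications introduced here for illustration.

```python
import bisect


class RenamingStoreBuffer:
    """Toy buffer: stores execute in any order, retire in program order."""

    def __init__(self):
        self.entries = []      # (seq, address, value), kept sorted by seq
        self.memory = {}       # stands in for the memory system
        self.retire_log = []   # order in which stores reached memory

    def execute_store(self, seq, address, value):
        # Out-of-order execution: insert by program-order sequence number.
        bisect.insort(self.entries, (seq, address, value))

    def load(self, seq, address):
        # Resolve to the youngest buffered store that precedes the load.
        for store_seq, store_address, value in reversed(self.entries):
            if store_seq < seq and store_address == address:
                return value
        return self.memory.get(address, 0)

    def squash(self, from_seq):
        # Undo speculative stores at or after from_seq; memory is untouched.
        self.entries = [e for e in self.entries if e[0] < from_seq]

    def retire_all(self):
        # In-order retirement: entries are already sorted by seq.
        for seq, address, value in self.entries:
            self.memory[address] = value
            self.retire_log.append(seq)
        self.entries = []


buf = RenamingStoreBuffer()
buf.execute_store(3, 0x10, 30)   # the younger store executes first
buf.execute_store(1, 0x10, 10)
early = buf.load(2, 0x10)        # sees the store with seq 1
late = buf.load(4, 0x10)         # sees the store with seq 3
buf.execute_store(5, 0x20, 99)   # speculative store, later found to be wrong
buf.squash(5)
buf.retire_all()
```

Two values for address 0x10 are live at once, each load is matched to the correct one, and the squashed store never becomes visible in memory.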
Some example code sequences will now be given to illustrate the performance impact of implementing the above features in a processor supporting out-of-order execution of store operations.
With respect to the ability to execute store operations out-of-order with respect to each other, consider the following in-order code fragment, where the latency of a multiply (MUL) operation is three cycles, and that of all other operations is 1 cycle.
The preceding code fragment will execute on a single issue out-of-order processor without out-of-order execution of store operations in 5 cycles as follows:
With the capability to perform store operations out-of-order, the processor requires only four cycles using the following schedule:
With respect to control-speculative execution of store operations, consider the following code fragment:
If we assume that the branch is predicted to not be taken most of the time, and branch resolution requires 3 cycles from a compare operation to a branch operation using the condition, then the above code requires 5 cycles to execute even if the branch is correctly predicted to not be taken. This is because the store operation cannot be executed speculatively once the branch has been predicted as not taken, since store operations cannot be undone.
In contrast, the store operation could be performed speculatively in a memory system supporting the ability to undo store operations transparently. In such a case, if the branch is predicted correctly, then the above code fragment can execute in 3 cycles on a single issue out-of-order processor.
Finally, to execute store operations to the same address out-of-order and correctly resolve references, consider the following code fragment:
Similar sequences of load and store operations routinely occur in the presence of function calls, where frequently retrieved parameters corresponding thereto are stored on the stack. For an article describing the serializing effects of stack references, see "The Limits of Instruction Level Parallelism in SPEC95 Applications", Postiff et al., International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII) Workshop on Interaction between Compilers and Computer Architectures (INTERACT-3), October 1998.
Consider the execution of the preceding code fragment with a latency of 5 cycles for the divide instruction (DIV) on a single issue out-of-order processor without the capability to rename store memory references to the same address. The resulting schedule will require 10 cycles to execute:
With renaming of memory locations and in-order retirement of stored values, the code can be executed in 7 cycles. In the following schedule, multiple live names are indicated in square brackets which denote the order of retirement to the memory subsystem.
The examples above are not atypical; the necessity to perform store operations in-order with respect to both branches and other store operations degrades performance fairly severely by forcing the sequential execution of operations that could otherwise be executed in parallel. However, such a serialization can be avoided (that is, the load operation can be performed earlier than the store operation) as long as actual processor execution can be decoupled from the sequence of data values presented to main memory and other processors in a multiprocessor environment. Thus, some store operations are performed earlier than other store operations, and speculatively with respect to unresolved branches; load operations can reference such values out-of-order. Moreover, if load references are renamed correctly with respect to multiple store operations to the same address, any operation that depends on the datum loaded out-of-order can also be performed out-of-order.
A brief description of the operation of memory requests in a multiprocessor system will now be given with reference to FIGS. 1-3. FIG. 1 is a block diagram illustrating a simplified multiprocessor computing system 100 with private caches, to which the present invention may be applied. FIG. 2 is a flow diagram illustrating the actions taken upon receiving a memory access request in a multiprocessor environment having a memory hierarchy with private caches, according to the prior art. FIG. 3 is a flow diagram illustrating the actions taken by a processor 106 upon receiving a cache cross-interrogate request from a processor 102, according to the prior art. It is to be appreciated that the method of FIG. 3 is performed by processor 106 in response to step 224 of FIG. 2.
Referring to FIG. 1, multiprocessor computing system 100 includes a central processing unit (hereinafter "processor" or "CPU") 102 operatively coupled to a private cache 104, and a processor 106 operatively coupled to a private cache 108.
A private cache is a cache that services the memory requests of a single processor. A shared cache services the memory requests of multiple processors. Caches can also be shared with respect to some processors, but private with respect to one or more other processors.
The processor 102 and private cache 104 comprise a node 103 of system 100. The processor 106 and private cache 108 comprise a node 107 of system 100.
It is to be appreciated that more than two nodes may be present in a multiprocessor computer system. Moreover, while private cache 104 and private cache 108 each imply a single cache unit, they are intended to also include a cache hierarchy that includes a plurality of caches (e.g., cache hierarchy 104 and cache hierarchy 108). Nonetheless, a single cache is implied in the remainder of this document for ease of understanding. It is to be further appreciated that a cache may be a shared cache with respect to some processors, and a private cache with respect to other processors. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations of the elements of the invention.
Node 103, node 107, a main memory 110, a main memory 112, and input/output (I/O) devices 114 are all operatively coupled to each other through system bus 116. I/O 114 collectively refers to I/O adapters (e.g., video card) and the I/O devices (e.g., monitor) operatively coupled thereto. Main memory 110 and main memory 112 are shared between node 103 and node 107. It is to be appreciated that more elaborate interconnection structures may be employed in place of system bus 116.
Referring to FIG. 2, processor 106 maintains control of a plurality of memory locations. Processor 102 accesses one of the plurality of memory locations according to the method of FIG. 2.
The cache controller of processor 102 receives a request for data from processor 102 (step 210). It is then determined whether the request refers to a memory location in private cache 104 of processor 102 (step 212). If the request does not refer to a memory location in private cache 104 of processor 102, then the method proceeds to step 220.
However, if the request refers to a location in private cache 104 of processor 102, then it is determined whether the entry in private cache 104 corresponding to the location has the required permissions (e.g., if a write request has been issued, whether the cache entry is in exclusive ownership mode which allows the write request to proceed, or in shared ownership mode, which only allows read requests to be processed) (step 214). If the entry does not have the required permissions, then the method proceeds to step 220. However, if the entry has the required permissions, then the request is satisfied from private cache 104 (step 218), and the method is terminated.
At step 220, it is determined, via a cross-interrogate from processor 102 to processor 106, whether the location is resident in private cache 108 of processor 106. If the location is not resident in private cache 108 of processor 106, then the method proceeds to step 232. However, if the location is resident in private cache 108 of processor 106, then the location is requested with the appropriate permissions (via a cross-interrogate), and the method proceeds to step 234 (step 224).
At step 232, the memory location is fetched from main memory, and the method proceeds to step 234. At step 234, the memory location (which was either fetched at step 232 or received (in response to the cross-interrogate) at step 224) is stored in private cache 104, the request is satisfied, and the method is terminated.
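The flow of FIG. 2 (steps 210 through 234) can be sketched as a single function. The dictionary-based caches, the `"exclusive"` and `"shared"` state strings, and the treatment of the remote copy after a cross-interrogate (invalidated on a write request, downgraded on a read request) are assumptions made for this illustration; the writing of new data is omitted.

```python
def handle_request(address, is_write, local_cache, remote_cache, main_memory):
    """Serve one request from the local processor; return (value, source)."""
    entry = local_cache.get(address)                                # step 212
    if entry is not None:
        # Step 214: writes need exclusive ownership, reads accept either mode.
        if entry["state"] == "exclusive" or not is_write:
            return entry["value"], "local"                          # step 218

    remote_entry = remote_cache.get(address)                        # step 220
    if remote_entry is not None:
        value, source = remote_entry["value"], "cross-interrogate"  # step 224
        if is_write:
            del remote_cache[address]          # assumption: remote copy invalidated
        else:
            remote_entry["state"] = "shared"   # assumption: remote copy downgraded
    else:
        value, source = main_memory[address], "memory"              # step 232

    # Step 234: install the location locally with the requested permissions.
    state = "exclusive" if is_write else "shared"
    local_cache[address] = {"value": value, "state": state}
    return value, source


local_cache = {}
remote_cache = {0x40: {"value": 5, "state": "exclusive"}}
main_memory = {0x40: 1, 0x80: 9}

first = handle_request(0x40, False, local_cache, remote_cache, main_memory)
second = handle_request(0x40, False, local_cache, remote_cache, main_memory)
third = handle_request(0x80, False, local_cache, remote_cache, main_memory)
fourth = handle_request(0x40, True, local_cache, remote_cache, main_memory)
```

The fourth request shows the permission check of step 214: the location is resident locally, but only in shared mode, so the write request still requires a cross-interrogate.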
Referring to FIG. 3, when processor 106 receives a cross interrogate request from processor 102 (step 310), processor 106 searches a private cache directory to determine whether the request refers to a location in private cache 108 of processor 106 (step 312). If the request does not refer to a location in private cache 108 of processor 106, then the method is terminated (step 314). However, if the request refers to a location in private cache 108 of processor 106, then private cache 108 returns the location to the requesting processor (i.e., processor 102) and the method is terminated (step 316).
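The responding side of FIG. 3 is simpler and can be sketched as follows. The function name, the use of a set of resident addresses as the private cache directory, and the `None` return value for a miss are assumptions for illustration.

```python
def handle_cross_interrogate(address, cache_directory, private_cache):
    """Return the requested location's contents, or None if it is not resident."""
    if address not in cache_directory:     # step 312: directory lookup
        return None                        # step 314: nothing to return
    return private_cache[address]          # step 316: return location to requester


private_cache_108 = {0x40: 5}
directory_108 = set(private_cache_108)     # addresses currently resident
hit = handle_cross_interrogate(0x40, directory_108, private_cache_108)
miss = handle_cross_interrogate(0x80, directory_108, private_cache_108)
```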
3. Problems With the State of the Art
A description of the prior art and deficiencies associated therewith will now be given. For example, an address resolution buffer which supports out-of-order execution of memory operations and memory renaming is described by M. Franklin and G. Sohi, in "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References", IEEE Transactions on Computers, Vol. 45, No. 5, May 1996. At least one problem with this approach is that it does not address multiprocessor issues, since it is limited to uniprocessor implementations.
U.S. Pat. No. 5,911,057, entitled "Superscalar Microprocessor Having Combined Register and Memory Renaming Circuits, Systems, and Methods", issued on Jun. 8, 1999, the disclosure of which is incorporated herein by reference, describes an architecture for renaming memory and register operands in uniform fashion. Memory coherence is based upon snooping memory requests. While this approach is sufficient for the in-order execution of memory operations in a multiprocessor computing system, out-of-order operation may generate incorrect results in a multiprocessor system. U.S. Pat. No. 5,838,941, entitled "Out-of-order Superscalar Microprocessor With a Renaming Device that Maps Instructions From Memory to Registers", issued on Nov. 17, 1998, the disclosure of which is incorporated herein by reference, describes symbolic renaming of memory references. At least one problem with this approach is that it does not address multiprocessor issues, since it is limited to uniprocessor implementations.
U.S. Pat. No. 5,872,990 (hereinafter the "'990 Patent"), entitled "Reordering of Memory Reference Operations and Conflict Resolution via Rollback in a Multiprocessing Environment", issued on Feb. 16, 1999, assigned to the assignee herein, the disclosure of which is incorporated herein by reference, uses a checkpointing and rollback scheme to implement strong memory consistency in multiprocessing systems with shared caches. While shared cache architectures offer a simpler execution model for multiprocessor systems, their scalability is limited by the number of cache ports and a number of physical factors (wire length, wire density, chip size, and so forth). Further, the '990 Patent does not address the issues of a system with private caches.
U.S. Pat. No. 5,832,205 (hereinafter the "'205 Patent"), entitled "Memory Controller for a Microprocessor for Detecting a Failure of Speculation on the Physical Nature of a Component Being Addressed", issued on Nov. 3, 1998, the disclosure of which is incorporated herein by reference, describes a gated store buffer architecture for use in a uniprocessor system. This gated store buffer is used to allow rollback of the architecture state in the presence of error conditions. However, the claimed architecture does not support memory renaming. Further, as stated above, its operation is limited to uniprocessor computer systems.
FIG. 4 is a block diagram illustrating a gated store buffer 410 according to the prior art, namely, U.S. Pat. No. 5,832,205. The store buffer 410 consists of a queue 412 with three pointers, i.e., a head pointer 414, a gate pointer 416, and a tail pointer 418. The head pointer 414 indicates the starting point of entries in the store buffer 410. Memory data stored in the store buffer 410 between the head pointer 414 and the gate pointer 416 are committed, and form a part of the logical memory image. The gate pointer 416 marks the beginning of uncommitted memory data. The end of such data is marked by the tail pointer 418. Uncommitted data can either be committed in the case of successful operation, or discarded when a rollback occurs.
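The three-pointer queue of FIG. 4 can be modeled in a few lines. In this sketch, which is an illustration and not the implementation of the '205 Patent, the head pointer is index 0 of a list, the tail pointer is the length of the list, and `gate` is the index of the first uncommitted entry. The `drain` method, which writes committed entries to memory, is an added assumption.

```python
class GatedStoreBuffer:
    """Toy gated store buffer: head = index 0, tail = len(queue)."""

    def __init__(self):
        self.queue = []   # entries between head and tail, oldest first
        self.gate = 0     # entries before this index are committed

    def store(self, address, value):
        # New data enters at the tail and is uncommitted.
        self.queue.append((address, value))

    def commit(self):
        # Successful operation: move the gate pointer up to the tail pointer.
        self.gate = len(self.queue)

    def rollback(self):
        # Discard uncommitted data: move the tail pointer back to the gate.
        del self.queue[self.gate:]

    def drain(self, memory):
        # Assumption: committed entries (head to gate) are written out in order.
        for address, value in self.queue[:self.gate]:
            memory[address] = value
        del self.queue[:self.gate]
        self.gate = 0


buffer_410 = GatedStoreBuffer()
buffer_410.store(0x10, "a")
buffer_410.commit()              # 0x10 is now part of the logical memory image
buffer_410.store(0x20, "b")      # uncommitted
buffer_410.rollback()            # 0x20 is discarded
memory_image = {}
buffer_410.drain(memory_image)
```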
The architected processor state is modified in conjunction with the commit or rollback of the store buffer architecture. Operation of the store buffer architecture is limited to a particular code generation strategy based on static scheduling at runtime (including a binary translation step) using a described underlying Very Long Instruction Word (VLIW) architecture ("morph host") with support for processor state commitment or rollback. After a rollback, operation is restarted using in-order execution.
While gated store buffers offer desirable properties to achieve high performance implementations, their use has not been possible in a multiprocessor (MP) environment since typical gated store buffer implementations can result in incorrect operation and/or deadlock situations in an MP environment.
The problem surfaces during actions which are to be taken on a cross-interrogate from a requesting processor. When the location requested by a cross-interrogate is found in a gated store buffer, the following actions are possible:
1. Supply the data found in the gated store buffer. This can result in an incorrect value supplied to the other processor if the data in the gated store buffer is later discarded, e.g., due to incorrect speculation. This violates the requirement of transparent execution of incorrectly speculated operations.
2. Ignore the data in the gated store buffer. This violates memory consistency requirements and may result in incorrect operations.
3. Wait for the data in the gated store buffer to be resolved before sending a response to a cross-interrogate request, i.e., either committed to the memory state or revoked from the store buffer. This can result in a deadlock situation.
Consider the following examples, which illustrate the danger of incorrect program execution. The first example is provided to illustrate incorrect operation when a value returned from the store buffer is later revoked.
For the first example, consider the following code fragment corresponding to an in-order sequence in a program, the fragment assuring that the value in location (r8) never contains the value 0:
Out-of-order execution may generate the following out-of-order sequence:
Consider a case where r4 contains the value 0, and a cross-interrogate request from processor 2 is answered with the value stored by the gated store instruction (instruction 2). Then, instruction 3 on processor 2 receives the value deposited by the gated store buffer (instruction 2 on processor 1). Even if that store instruction is later revoked (instruction 6), the value has already been incorrectly loaded by processor 2 and is used for further processing, leading to incorrect results.
The second example illustrates the incorrect operation of programs in the presence of gated store buffers when data in the gated store buffer is ignored and the original value is supplied on a cross-interrogate. The second example also illustrates how a deadlock situation can occur if the responses to cross-interrogate requests are delayed until data in the gated store buffers have been resolved.
For the second example, consider the following code fragments corresponding to in-order programs executing on two processors. In the second example, it is presumed that register r8 holds the same memory address on both processors. Also, register r9 holds the same value on both processors. Furthermore, registers r8 and r9 refer to distinct, non-overlapping memory locations, and data memory is initialized to contain 0 in both locations:
The programs are based on a well-known test case for coherent memory implementation. To execute correctly, register r4 must contain the value "1" on at least one processor after execution.
Now, consider a program which has been reordered to achieve a better instruction schedule. The program uses the capabilities provided by a gated store buffer to ensure that store results from instruction 2 are retired to memory before the store results from instruction 4.
In an implementation ignoring data in gated store buffers and supplying a previous data value, the load operations at instruction 4′ on both processors will not be supplied with the data values deposited by the gated store operations at instruction 3′. Rather, the pre-initialized data value of 0 will be supplied to both load operations at instruction 4′. This corresponds to an incorrect execution of the multiprocessor program.
The third implementation choice, delaying answers to cross-interrogate requests that reference data in the store buffer until that data is committed, leads to a deadlock when both processors wait for the results from cross-interrogate requests to resolve the load operations at instruction 4′:
Thus, having demonstrated the inadequacy of gated store buffers and other prior art approaches to reordering and renaming memory references in a multiprocessor environment, it can be seen that there is a need for a better architecture and/or methodology for reordering and renaming memory references in a multiprocessor computer system.
The problems stated above, as well as other related problems of the prior art, are solved by the present invention, a method and apparatus for reordering memory operations in a processor.
In a first aspect of the invention, there is provided a method for reordering and renaming memory references in a multiprocessor computer system having at least a first and a second processor. The first processor has a first private cache and a first buffer, and the second processor has a second private cache and a second buffer. The method includes the steps of, for each of a plurality of gated store requests received by the first processor to store a datum, exclusively acquiring a cache line that contains the datum by the first private cache, and storing the datum in the first buffer. Upon the first buffer receiving a load request from the first processor to load a particular datum, the particular datum is provided to the first processor from among the data stored in the first buffer based on an in-order sequence of load and store operations. Upon the first cache receiving a load request from the second cache for a given datum, an error condition is indicated and a current state of at least one of the processors is reset to an earlier state when the load request for the given datum corresponds to the data stored in the first buffer.
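The steps of the first aspect can be sketched in software as follows. This is a simplified illustration under stated assumptions: the class `FirstNode`, the 64-byte line size, the single checkpoint taken at construction, the discarding of the whole buffer on reset, and the return of the architected memory value to the requester are all hypothetical choices made here, not details given by the description.

```python
class FirstNode:
    """Toy model of the first processor, its private cache, and its buffer."""

    LINE_SIZE = 64   # assumed cache line size in bytes

    def __init__(self, memory):
        self.memory = memory               # architected memory state
        self.owned_lines = set()           # lines held exclusively by the cache
        self.buffer = []                   # (seq, address, value) gated stores
        self.state = {"pc": 0}             # current processor state
        self.checkpoint = dict(self.state) # earlier state to return to

    def gated_store(self, seq, address, value):
        # Exclusively acquire the containing cache line, then buffer the datum.
        self.owned_lines.add(address // self.LINE_SIZE)
        self.buffer.append((seq, address, value))

    def load(self, seq, address):
        # Provide the datum of the latest store that precedes the load in order.
        older = [(s, v) for s, a, v in self.buffer if a == address and s < seq]
        if older:
            return max(older)[1]
        return self.memory.get(address, 0)

    def remote_load(self, address):
        # Load request arriving from the second cache.
        if any(a == address for _, a, _ in self.buffer):
            # Error condition: reset to the earlier state, drop buffered data.
            self.buffer.clear()
            self.owned_lines.clear()
            self.state = dict(self.checkpoint)
            return "rollback", self.memory.get(address, 0)
        return "ok", self.memory.get(address, 0)


memory = {0x100: 1}
node = FirstNode(memory)
node.gated_store(1, 0x100, 5)
node.gated_store(3, 0x100, 7)
mid_load = node.load(2, 0x100)        # sees the store with seq 1
last_load = node.load(4, 0x100)       # sees the store with seq 3
node.state["pc"] = 12                 # speculative progress
unrelated = node.remote_load(0x200)   # no conflict with buffered data
conflict = node.remote_load(0x100)    # conflicts with uncommitted data
```

Because the conflicting request triggers a reset instead of exposing, ignoring, or waiting on the buffered data, none of the three problematic responses to a cross-interrogate arises in this model.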
In a second aspect of the invention, the method further includes the step of committing at least some of the data in the first buffer to an architected memory state of the computer system, prior to the indicating step, to remove the at least some of the data from the first buffer. The indicating step is performed only when the given datum in the first buffer is not committed.
In a third aspect of the invention, the committing step commits a specified datum to the architected memory state of the computer system when a gated store request corresponding to the specified datum is in-order with respect to all instructions that precede the gated store request.
In a fourth aspect of the invention, the resetting step includes the step of discarding at least some of the data in the first buffer.
In a fifth aspect of the invention, the resetting step includes the step of discarding the given datum from the first buffer and all data stored thereafter.
In a sixth aspect of the invention, the method further includes the step of releasing the cache line when operations referring to the cache line have completed execution in-order.
In a seventh aspect of the invention, the method further includes the step of releasing the cache line, when the datum contained within the cache line is committed to an architected memory state of the computer system in-order or when the datum is discarded from the first buffer.
In an eighth aspect of the invention, the earlier state corresponds to an operation immediately preceding the gated store request that stored the given datum in the first buffer.
In a ninth aspect of the invention, the method further includes the step of generating a snapshot of the earlier state.
In a tenth aspect of the invention, the generating step includes one of the steps of copying contents of registers corresponding to the earlier state, and maintaining a record of incremental state changes from at least one state preceding the earlier state up to the earlier state.
In an eleventh aspect of the invention, the method further includes the step of storing a snapshot of the earlier state in the first buffer.
In a twelfth aspect of the invention, the method further includes the step of storing a snapshot of the earlier state in one of the first processor, the second processor, and a storage device external thereto. A timestamp corresponding to the snapshot of the earlier state is stored in the first buffer in association with the given datum.
In a thirteenth aspect of the invention, the resetting step includes the step of searching for the timestamp in the first buffer to identify the snapshot from among a plurality of snapshots stored in one of the first processor, the second processor, and the storage device external thereto.
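The snapshot and timestamp mechanism of the ninth through thirteenth aspects, combined with the discarding step of the fifth aspect, can be sketched as follows. The class `CheckpointedBuffer`, the integer clock used as a timestamp, and the choice to copy the full register contents on every gated store are assumptions made for illustration.

```python
class CheckpointedBuffer:
    """Toy buffer whose entries carry the timestamp of an external snapshot."""

    def __init__(self):
        self.registers = {}    # current register contents
        self.snapshots = {}    # timestamp -> register copy, held outside the buffer
        self.entries = []      # (timestamp, address, value), oldest first
        self.clock = 0

    def gated_store(self, address, value):
        self.clock += 1
        # Snapshot of the state immediately preceding this gated store.
        self.snapshots[self.clock] = dict(self.registers)
        self.entries.append((self.clock, address, value))

    def reset_for(self, address):
        # Search the buffer for the given datum to find its timestamp.
        for index, (timestamp, entry_address, _) in enumerate(self.entries):
            if entry_address == address:
                self.registers = self.snapshots[timestamp]
                # Discard the given datum and all data stored thereafter.
                del self.entries[index:]
                self.snapshots = {t: s for t, s in self.snapshots.items()
                                  if t < timestamp}
                return timestamp
        return None


cb = CheckpointedBuffer()
cb.registers["r4"] = 0
cb.gated_store(0x10, 1)          # timestamp 1, snapshot holds r4 = 0
cb.registers["r4"] = 5
cb.gated_store(0x20, 2)          # timestamp 2, snapshot holds r4 = 5
cb.registers["r4"] = 9
cb.gated_store(0x30, 3)          # timestamp 3, snapshot holds r4 = 9
restored = cb.reset_for(0x20)    # conflict on the datum stored at 0x20
```

Maintaining a record of incremental state changes, the alternative named in the tenth aspect, would replace the full copy in `gated_store` and is not shown.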
In a fourteenth aspect of the invention, the method further includes the step of processing the store and load requests in-order and suspending the steps involving the first buffer, upon performing a predetermined number of resetting steps.
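The fallback of the fourteenth aspect can be sketched as a simple counter. The class `SpeculationGovernor`, the parameter `max_resets` standing for the predetermined number, and the helper `perform_store` are hypothetical names used only for this illustration.

```python
class SpeculationGovernor:
    """Counts resetting steps and suspends use of the buffer at a limit."""

    def __init__(self, max_resets):
        self.max_resets = max_resets   # the predetermined number (assumed parameter)
        self.resets = 0

    def record_reset(self):
        self.resets += 1

    def use_buffer(self):
        return self.resets < self.max_resets


def perform_store(governor, store_buffer, memory, address, value):
    if governor.use_buffer():
        store_buffer.append((address, value))   # speculative, buffered path
        return "buffered"
    memory[address] = value                     # in-order path, buffer bypassed
    return "in-order"


governor = SpeculationGovernor(max_resets=2)
store_buffer, memory = [], {}
modes = [perform_store(governor, store_buffer, memory, 0x10, 1)]
governor.record_reset()
modes.append(perform_store(governor, store_buffer, memory, 0x10, 2))
governor.record_reset()
modes.append(perform_store(governor, store_buffer, memory, 0x10, 3))
```

A limit of this kind bounds the work lost to repeated resets when two processors contend for the same locations.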
These and other aspects, features and advantages of the present invention will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.