1. Field of the Invention
The present invention relates to the design of multiprocessor systems. More specifically, the present invention relates to a method and an apparatus for facilitating speculative load operations and/or speculative store operations in a multiprocessor system.
2. Related Art
In order to achieve high rates of computational performance, computer system designers are beginning to employ multiple processors that operate in parallel to perform a single computational task. One common multiprocessor design includes a number of processors 151-154 coupled to level one (L1) caches 161-164 that share a single level two (L2) cache 180 and a memory 183 (see FIG. 1A). During operation, if a processor 151 accesses a data item that is not present in local L1 cache 161, the system attempts to retrieve the data item from L2 cache 180. If the data item is not present in L2 cache 180, the system first retrieves the data item from memory 183 into L2 cache 180, and then from L2 cache 180 into L1 cache 161.
Note that coherence problems can arise if a copy of the same data item exists in more than one L1 cache. In this case, modifications to a first version of a data item in L1 cache 161 may cause the first version to be different than a second version of the data item in L1 cache 162.
In order to prevent such coherency problems, computer systems often provide a coherency protocol that operates across bus 170. A coherency protocol typically ensures that if one copy of a data item is modified in L1 cache 161, other copies of the same data item in L1 caches 162-164, in L2 cache 180 and in memory 183 are updated or invalidated to reflect the modification.
Coherence protocols typically perform invalidations by broadcasting invalidation messages across bus 170. However, as multiprocessor systems increase in performance, such invalidations occur more frequently. Hence, these invalidation messages can potentially tie up bus 170, and can thereby degrade overall system performance.
In order to remedy this problem, some designers have begun to explore the possibility of maintaining directory information within L2 cache 180. This directory information specifies which L1 caches contain copies of specific data items. This allows the system to send invalidation information to only the L1 caches that contain the data item instead of sending a broadcast message to all L1 caches. (This type of system presumes that there exist separate communication pathways for invalidation messages to each of the L1 caches 161-164, unlike the example illustrated in FIG. 1A, which uses a single shared bus 170 to communicate with L1 caches 161-164.)
As multiprocessor systems continue to increase in performance, it is becoming increasingly difficult to support memory models that significantly restrict the ordering of load and store operations. One commonly used memory model is the "Total Store Order" (TSO) memory model. Under the TSO memory model, loads and stores from a given processor typically execute in program order, except that loads can overtake previous stores. More specifically, under the TSO memory model: loads cannot overtake previous loads; stores cannot overtake previous stores; and stores cannot overtake previous loads. However, loads can overtake previous stores. This allows previous stores to take place in a lazy fashion while the system performs subsequent loads.
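The four TSO ordering rules listed above can be summarized as a single predicate. The following is a minimal illustrative sketch, not part of the described apparatus; the names LOAD, STORE and may_overtake are hypothetical, and the sketch considers only the ordering between two operations issued by the same processor, ignoring any data dependence between operations directed to the same address.

```python
# Illustrative encoding of the TSO ordering rules (names are hypothetical).
LOAD, STORE = "load", "store"

def may_overtake(later_op: str, earlier_op: str) -> bool:
    """Return True if later_op may take effect before earlier_op under TSO.

    Both operations are assumed to come from the same processor, with
    earlier_op preceding later_op in program order.
    """
    # The only permitted reordering is a load overtaking a previous store.
    return later_op == LOAD and earlier_op == STORE
```

For example, may_overtake(LOAD, STORE) is true, while the other three combinations of load and store are false.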
Unfortunately, placing these restrictions on the ordering of load and store operations can seriously degrade multiprocessor performance, because the multiprocessor system often has to wait for previous memory operations to complete before executing subsequent memory operations.
A less restrictive memory model is "release consistency," in which the only restriction is that processors see a consistent view of shared data whenever a critical region is exited. This memory model is less restrictive than TSO and can lead to better multiprocessor performance. Unfortunately, many existing legacy applications make use of restrictive memory models, such as TSO.
Hence, in order to run these legacy applications, what is needed is a method and an apparatus for facilitating efficient parallel execution of programs under a restrictive memory model, such as the TSO memory model.
One embodiment of the present invention provides a system that facilitates speculative load operations in a multiprocessor system. The system operates by maintaining a record of speculative load operations that have completed at a processor in the multiprocessor system, wherein a speculative load operation is a load operation that is speculatively initiated before a preceding load operation has returned. Next, the system receives an invalidation signal at an L1 cache that is coupled to the processor, wherein the invalidation signal indicates that a specific line in the L1 cache is to be invalidated. In response to this invalidation signal, the system examines the record of speculative load operations to determine if there exists a matching speculative load operation that is completed and is directed to the same location in the L1 cache that the invalidation signal is directed to. If there exists a matching speculative load operation, the system replays the matching speculative load operation so that the matching speculative load operation takes place after an event that caused the invalidation signal completes.
In one embodiment of the present invention, the record of speculative load operations includes a plurality of banks, wherein each bank contains speculative load operations directed to a specific bank of the L2 cache.
In one embodiment of the present invention, the record of speculative load operations maintains set and way information for entries in the L1 cache that contain results of speculative load operations.
In one embodiment of the present invention, the invalidation signal is received as a result of a cache coherency protocol operation.
In one embodiment of the present invention, the invalidation signal is received as a result of a store operation associated with the specific line in the L1 cache.
In one embodiment of the present invention, the invalidation signal is received as a result of an invalidation of a corresponding line in the L2 cache.
In one embodiment of the present invention, the record of speculative load operations includes an indicator for each speculative load operation. This indicator specifies whether the speculative load operation has completed.
In one embodiment of the present invention, maintaining the record of speculative load operations involves updating the record whenever a new speculative load operation completes.
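The record of speculative load operations described above, with its set and way information and its per-load completion indicator, can be modeled in software roughly as follows. This is a simplified sketch under stated assumptions: the class names SpecLoad and SpecLoadRecord, the load_id field, and the use of a simple list are hypothetical and are chosen only to illustrate the matching and replay behavior; an actual implementation would be a hardware structure.

```python
from dataclasses import dataclass

@dataclass
class SpecLoad:
    load_id: int            # hypothetical identifier for the load
    set_index: int          # L1 set holding the loaded line
    way: int                # L1 way holding the loaded line
    completed: bool = False # indicator: has the speculative load completed?

class SpecLoadRecord:
    """Per-processor record of speculative loads (illustrative model)."""

    def __init__(self):
        self.entries = []

    def issue(self, load_id, set_index, way):
        # A load is entered when it is speculatively initiated.
        self.entries.append(SpecLoad(load_id, set_index, way))

    def complete(self, load_id):
        # The record is updated whenever a speculative load completes.
        for entry in self.entries:
            if entry.load_id == load_id:
                entry.completed = True

    def on_invalidate(self, set_index, way):
        """Return ids of completed speculative loads that must be replayed."""
        to_replay = []
        for entry in self.entries:
            if entry.completed and (entry.set_index, entry.way) == (set_index, way):
                to_replay.append(entry.load_id)
                # A replayed load has to execute again, so it is no longer complete.
                entry.completed = False
        return to_replay
```

In this sketch, an invalidation directed to a given set and way returns only those loads that have already completed at that location; loads that are still outstanding are left alone because they will observe the updated data when they return.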
In one embodiment of the present invention, the system receives a replay signal at the processor from the L2 cache, wherein the replay signal identifies a specific set and way location. In response to this replay signal, the system replays any speculative load operation that has completed and is directed to the specific set and way location. Note that the system performs this replay without performing a corresponding invalidation.
In one embodiment of the present invention, the multiprocessor system implements a total store ordering (TSO) memory model in which loads can overtake previous stores, loads cannot overtake previous loads, stores cannot overtake previous loads, and stores cannot overtake previous stores.
Another embodiment of the present invention provides a system that facilitates speculative load operations in a multiprocessor system. This system operates by maintaining a record at an L2 cache of speculative load operations that have returned data values through the L2 cache to associated L1 caches, wherein a speculative load operation is a load operation that is speculatively initiated before a preceding load operation has returned. In response to receiving an invalidation event, the system invalidates a target line in the L2 cache. The system also performs a lookup in the record to identify affected L1 caches that are associated with speculative load operations that may be affected by the invalidation of the target line in the L2 cache. Next, the system sends replay commands to the affected L1 caches in order to replay the affected speculative load operations, so that the affected speculative load operations take place after invalidation of the target line in the L2 cache.
In one embodiment of the present invention, maintaining the record involves receiving a load miss operation from an L1 cache at the L2 cache, wherein the load miss operation contains information specifying whether there exists a speculative load operation that has returned for an L1 cache location associated with the load miss operation. If there exists such a speculative load operation, the system updates the record to indicate that the L1 cache is associated with the speculative load operation.
In a variation on this embodiment, the load miss operation identifies the L1 cache location associated with the load miss operation, and updating the record involves recording the L1 cache location in the record, thereby enabling a subsequent replay command to include the L1 cache location. If the load miss operation is not speculative, the system updates the record to indicate that an associated entry in the L1 cache is not associated with a returned speculative load operation.
In one embodiment of the present invention, if replay commands are sent to one or more L1 caches for an L2 cache line, the system updates the record to indicate that the L2 cache line is no longer associated with returned speculative load operations.
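The L2-side bookkeeping described above can be sketched as a map from L2 cache lines to the L1 caches that have received speculative load data for them. This is an illustrative model only: the class name L2SpecLoadDirectory, the method names, and the dictionary layout are hypothetical, and the speculative_returned flag stands in for the information carried by the load miss operation.

```python
class L2SpecLoadDirectory:
    """Illustrative L2-side record of returned speculative loads."""

    def __init__(self):
        # l2_line -> {l1_cache_id: (set_index, way)}
        self.spec = {}

    def on_load_miss(self, l2_line, l1_id, l1_location, speculative_returned):
        holders = self.spec.setdefault(l2_line, {})
        if speculative_returned:
            # Remember the L1 location so a later replay command can name it.
            holders[l1_id] = l1_location
        else:
            # Non-speculative miss: the L1 entry is not tied to a speculative load.
            holders.pop(l1_id, None)

    def on_invalidate(self, l2_line):
        """Return (l1_id, l1_location) replay commands for the invalidated line."""
        # Removing the entry models the update that marks the line as no
        # longer associated with returned speculative loads.
        holders = self.spec.pop(l2_line, {})
        return sorted(holders.items())
```

Because the entry is cleared once replay commands are generated, a second invalidation of the same L2 line produces no further replay commands unless new speculative loads have been recorded in the meantime.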
In one embodiment of the present invention, the L2 cache includes a reverse directory including entries for lines in L1 caches, wherein each entry identifies an associated entry in the L2 cache. In a variation on this embodiment, the reverse directory includes a fixed entry corresponding to each entry in each of the L1 caches. In a variation on this embodiment, each entry in the reverse directory includes information specifying a location of a corresponding entry in the L2 cache.
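A reverse directory with one fixed entry per L1 cache entry can be pictured as a three-dimensional table indexed by L1 cache, set and way. The sketch below is a software analogy under stated assumptions: the class name ReverseDirectory, the representation of an L2 location as a tuple, and the linear scan in holders_of are all hypothetical simplifications, since a hardware structure would perform this lookup in parallel.

```python
class ReverseDirectory:
    """Illustrative reverse directory: one fixed slot per L1 cache entry."""

    def __init__(self, num_l1, l1_sets, l1_ways):
        # slots[l1_id][set][way] holds the L2 location backing that L1 entry,
        # or None if the L1 entry holds no line.
        self.slots = [[[None] * l1_ways for _ in range(l1_sets)]
                      for _ in range(num_l1)]

    def record_fill(self, l1_id, l1_set, l1_way, l2_location):
        self.slots[l1_id][l1_set][l1_way] = l2_location

    def holders_of(self, l2_location):
        """Return every (l1_id, set, way) whose slot points at l2_location."""
        found = []
        for l1_id, sets in enumerate(self.slots):
            for l1_set, ways in enumerate(sets):
                for l1_way, loc in enumerate(ways):
                    if loc == l2_location:
                        found.append((l1_id, l1_set, l1_way))
        return found
```

The result of holders_of identifies exactly the L1 entries to which invalidation or replay information would need to be sent for a given L2 entry.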
One embodiment of the present invention provides a system for facilitating speculative store operations in a multiprocessor system. This system operates by maintaining a record of speculative store operations that are in process at an L2 cache in the multiprocessor system, wherein a speculative store operation is a store operation that is speculatively executed before a preceding store operation has returned. Upon receiving a load operation at the L2 cache from an L1 cache, the system examines the record of speculative store operations to determine if there exists a matching speculative store operation that is directed to the same location that the load operation is directed to. If so, the system ensures that the load operation takes place after the matching speculative store operation completes.
In one embodiment of the present invention, ensuring that the load operation takes place after the matching speculative store operation completes involves sending a retry operation to the processor to cause the processor to retry the load operation at a later time.
In one embodiment of the present invention, ensuring that the load operation takes place after the matching speculative store operation completes involves waiting for the matching speculative store operation to complete before completing the load operation at the L2 cache.
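The check that the L2 cache performs on an incoming load can be sketched as follows. This is a minimal model of the retry variant, under stated assumptions: the class name SpecStoreRecord, the RETRY and PROCEED results, the per-processor queue layout, and the clear_store method are hypothetical, and the exact point at which an entry stops blocking loads is an assumption of the sketch.

```python
from collections import deque

RETRY, PROCEED = "retry", "proceed"

class SpecStoreRecord:
    """Illustrative L2-side record of in-process speculative stores."""

    def __init__(self, num_processors):
        # One queue of speculative store line addresses per processor.
        self.queues = [deque() for _ in range(num_processors)]

    def add_store(self, cpu, line_addr):
        self.queues[cpu].append(line_addr)

    def clear_store(self, cpu, line_addr):
        # Assumption: called once the store no longer needs to hold back loads.
        self.queues[cpu].remove(line_addr)

    def has_match(self, line_addr):
        return any(line_addr in queue for queue in self.queues)

    def on_load(self, line_addr):
        """Decide how to handle a load arriving at the L2 cache from an L1 cache."""
        # A load to a line with a matching speculative store is retried later,
        # so that it takes place after that store.
        return RETRY if self.has_match(line_addr) else PROCEED
```

The waiting variant would differ only in that the L2 cache would hold the load internally until has_match becomes false, rather than returning a retry indication to the processor.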
In one embodiment of the present invention, upon completion of the matching speculative store operation at the L2 cache, the L2 cache allows the load operation to take place and sends invalidation signals to other L1 caches containing lines that are invalidated by the matching speculative store operation.
In one embodiment of the present invention, upon receiving a speculative store operation from a processor at the L2 cache, the system stores the speculative store operation in the record.
In one embodiment of the present invention, upon completion of a store operation at the L2 cache, the system sends an acknowledgement to a source processor that initiated the store operation. Upon receiving a move signal from the source processor in response to the acknowledgement, the system updates the record to indicate that the store operation is no longer speculative.
In a variation on this embodiment, upon receiving the acknowledgement at the source processor, the source processor waits until all preceding store operations complete before sending the move signal to the L2 cache.
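The acknowledgement and move handshake can be illustrated from the source processor's side. The sketch below is a simplified model under stated assumptions: the class name SourceProcessor and its fields are hypothetical, and a preceding store is treated as complete once its own acknowledgement has been received and its move signal has been sent.

```python
class SourceProcessor:
    """Illustrative model of when a processor sends move signals."""

    def __init__(self):
        self.order = []     # store ids in program order
        self.acked = set()  # stores acknowledged by the L2 cache
        self.moved = 0      # count of leading stores already moved

    def issue_store(self, store_id):
        self.order.append(store_id)

    def on_ack(self, store_id):
        """Record an acknowledgement; return stores whose move signal can be sent."""
        self.acked.add(store_id)
        moves = []
        # A move is sent only when every preceding store has been handled,
        # so moves are always released in program order.
        while self.moved < len(self.order) and self.order[self.moved] in self.acked:
            moves.append(self.order[self.moved])
            self.moved += 1
        return moves
```

If stores A, B and C are issued in that order and B is acknowledged first, no move signal is sent until A is also acknowledged, at which point the move signals for A and B are sent together.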
In a variation on this embodiment, upon completion of the store operation at the L2 cache, the system sends invalidation signals to L1 caches containing cache lines that are overwritten by the store operation.
In one embodiment of the present invention, for each processor coupled to the L2 cache, the record of speculative store operations includes a store queue containing speculative store operations.
In one embodiment of the present invention, the L2 cache includes a plurality of banks, and for each L2 bank, the record of speculative store operations includes a store queue for each processor coupled to the L2 cache.
In one embodiment of the present invention, the system receives a read-to-own request for a target cache line in order to perform a given store operation to the target cache line. Upon receiving the read-to-own request, the system examines the record of speculative store operations to determine if there exists a matching speculative store operation that is directed to the target cache line. If so, the system passes the target cache line to the requesting processor in a write-only state, so that the requesting processor is able to perform a write operation (but not a read operation) to the target cache line, thereby avoiding a deadlock condition.
In one embodiment of the present invention, the system receives a store operation at the L2 cache from an L1 cache coupled to a processor in the multiprocessor system. The system then examines the record of speculative store operations to determine if there exists a matching speculative store operation that is directed to the same location that the store operation is directed to. If so, the system drops the store operation.
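The handling of read-to-own requests and incoming store operations against the record of speculative store operations can be sketched as two small decision functions. This is an illustrative sketch only: the function names, the state names WRITE_ONLY and READ_WRITE, and the use of a set of line addresses to stand in for the record are hypothetical.

```python
WRITE_ONLY, READ_WRITE = "write-only", "read-write"

def handle_read_to_own(target_line, spec_store_lines):
    """Return the state in which the target line is passed to the requester."""
    # With a matching speculative store, the line is handed over write-only,
    # so the requester can write but not read it.
    if target_line in spec_store_lines:
        return WRITE_ONLY
    return READ_WRITE

def handle_store(line, spec_store_lines):
    """Return True if the store is accepted, False if it is dropped."""
    # A store that matches a speculative store to the same location is dropped.
    return line not in spec_store_lines
```

For example, with a speculative store pending to line 0x100, a read-to-own request for that line yields the write-only state and a further store to that line is dropped, while requests for other lines proceed normally.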