The present application shares specification text and figures with the following applications, filed concurrently with the present application: application Ser. No. 09/437,185, “Improved Cache State Protocol For Shared Locks in a Multiprocessor System,” application Ser. No. 09/437,187, “High Speed Lock Acquisition Mechanism With Time Parameterized Cache Coherency States,” application Ser. No. 09/437,182, “High Speed Lock Acquisition Mechanism via a “One Shot” Modified State Cache Coherency Protocol,” application Ser. No. 09/437,183, “An Extended Cache Coherency Protocol With a Modified Store Instruction Lock Release Indicator,” and application Ser. No. 09/437,186, “An Extended Cache Coherency Protocol With a Persistent “Lock Acquired” State.”
1. Technical Field
The present invention generally relates to an improved data processing system and in particular to a system and method for improved cache management in a multiprocessor system. Still more particularly, the present invention relates to a system and method using specialized cache states and state sequences to provide improved cache coherency management in a multiprocessor data processing system.
2. Description of the Related Art
In order to enhance performance, state-of-the-art data processing systems often utilize multiple processors which concurrently execute portions of a given task. To further enhance performance, such multiple processor (MP) data processing systems often utilize a multi-level memory hierarchy to reduce the access time required to retrieve data from memory. A MP data processing system may include a number of processors, each with an associated level-one (L1) cache, a number of level-two (L2) caches, and a number of modules of system memory. Typically, the memory hierarchy is arranged such that each L2 cache is accessed by a subset of the L1 caches within the system via a local bus. In turn, each L2 cache and system memory module is coupled to a system bus or interconnect switch, such that an L2 cache within the MP data processing system may access data from any of the system memory modules coupled to the bus or interconnect switch.
Because each of the number of processors within a MP data processing system may modify data, MP data processing systems must employ a protocol to maintain memory coherence. For example, MP data processing systems utilizing PowerPC RISC processors utilize a coherency protocol having four possible states: modified (M), exclusive (E), shared (S), and invalid (I). The MESI state associated with each cache line (i.e., the line state) informs the MP data processing system what memory operations are required to maintain memory coherence following an access to that cache line. Depending upon the type of MP data processing system utilized, a memory protocol may be implemented in different ways. In snoop-bus MP data processing systems, each processor snoops transactions on the bus to determine if cached data has been requested by another processor. Based upon request addresses snooped on the bus, each processor sets the MESI state associated with each line of its cached data. In contrast, within a directory-based MP data processing system, a processor forwards memory requests to a directory at a lower level of memory for coherence ownership arbitration. For example, if a first processor (CPUa) requests data within a memory line that a second processor (CPUb) owns in exclusive state in CPUb's associated L1 cache, CPUa transmits a load request to the system memory module which stores the requested memory line. In response to the load request, the memory directory within the interrogated system memory module loads the requested memory line to CPUa and transmits a cross-interrogation message to CPUb. In response to the cross-interrogation message, CPUb will mark the requested cache line as shared in its associated L1 cache.
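By way of illustration only, the four line states and the snoop-driven transitions described above may be sketched as a small state machine. The function names are hypothetical and the transitions shown are the textbook MESI transitions, not those of any particular processor; data transfer and write-back are omitted.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_snooped_read(state):
    """Next state of a local line when another processor's read is observed."""
    if state in (State.MODIFIED, State.EXCLUSIVE):
        # The line is no longer held by one cache only. A real system would
        # also supply or write back the data when leaving MODIFIED.
        return State.SHARED
    return state  # SHARED stays SHARED, INVALID stays INVALID


def on_snooped_write(state):
    """Next state of a local line when another processor's write is observed."""
    return State.INVALID


def on_local_write(state):
    """Next state after the local processor writes the line.

    From SHARED or INVALID a real system must first gain ownership of the
    line; that bus traffic is not modeled in this sketch.
    """
    return State.MODIFIED


# The CPUa/CPUb example from the text: CPUb holds the line exclusive,
# CPUa reads it, and CPUb's copy becomes shared.
cpu_b_line = on_snooped_read(State.EXCLUSIVE)
```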
Among designers of MP data processing systems, there has been a recent interest in the use of load-reserve and store-conditional instructions which enable atomic accesses to memory from multiple processors while maintaining memory coherence. For example, load-reserve and store-conditional instructions on a single word operand have been implemented in the PowerPC RISC processor instruction set with the LWARX and STWCX instructions, respectively, which will be referenced as LARX and STCX. In MP data processing systems which support LARX and STCX or analogous instructions, each processor within the system includes a reservation register. When a processor executes a LARX to a variable, the processor, known as the requesting processor, loads the contents of the address storing the variable from the requesting processor's associated L1 cache into a register and the address of the memory segment containing the variable into the reservation register. Typically, the reservation address indexes a segment of memory, called a reservation granule, having a data width less than or equal to that of the requesting processor's L1 cache line. The requesting processor is then said to have a reservation with respect to the reservation granule. The processor may then perform atomic updates of the reserved variable utilizing store-conditional instructions.
When a processor executes a STCX to a variable contained in a reservation granule for which the processor has a reservation, the processor stores the contents of a designated register to the variable's address and then clears the reservation. If the processor does not have a reservation for the variable, the instruction fails and the memory store operation is not performed. In general, the processor's reservation is cleared if either a remote processor stores data to the address containing the reserved variable or the reserving processor executes a STCX instruction. Additional background information about load-reserve and store-conditional instructions in a multiprocessor environment may be found, for example, in Sites, et al., U.S. Pat. No. 5,193,167, which is hereby incorporated by reference.
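The reservation behavior described above may be illustrated with a minimal software model. The class name, method names, and the 32-byte granule size are assumptions made for this sketch only; real hardware tracks reservations in registers rather than in a dictionary.

```python
GRANULE_BYTES = 32  # assumed reservation granule size, for illustration only


class ReservationModel:
    """Toy model of load-reserve (LARX) and store-conditional (STCX)."""

    def __init__(self):
        self.memory = {}       # address -> value
        self.reservation = {}  # processor id -> reserved granule index

    def _granule(self, addr):
        return addr // GRANULE_BYTES

    def larx(self, cpu, addr):
        """Load the variable and record a reservation on its granule."""
        self.reservation[cpu] = self._granule(addr)
        return self.memory.get(addr, 0)

    def store(self, cpu, addr, value):
        """Ordinary store; clears other processors' reservations on the granule."""
        self.memory[addr] = value
        granule = self._granule(addr)
        for other in [c for c, g in self.reservation.items()
                      if c != cpu and g == granule]:
            del self.reservation[other]

    def stcx(self, cpu, addr, value):
        """Store only if this processor still holds the reservation.

        The processor's own reservation is cleared whether or not the
        store succeeds.
        """
        held = self.reservation.pop(cpu, None) == self._granule(addr)
        if held:
            self.store(cpu, addr, value)
        return held


# Two processors reserve the same variable; only the first STCX succeeds.
model = ReservationModel()
model.larx("CPUa", 64)
model.larx("CPUb", 64)
first = model.stcx("CPUa", 64, 1)
second = model.stcx("CPUb", 64, 2)
```

In this model the second store-conditional fails because the first processor's successful store cleared the remote reservation, which is the property that makes the update atomic.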
FIG. 3 shows a flowchart of a process to complete a store operation to a cache in a multiprocessor environment, where a lock on the wordline must be acquired. When the store is to be done, the address of the wordline is loaded with a LARX (step 300). A comparison check is performed (step 305) to determine if a lock was acquired for that wordline (step 310). If the lock was acquired, we attempt a store (step 345), described below.
Assuming, however, that the lock was not acquired because it is owned by another processor, the status register for that line is loaded (step 315), and the status of the wordline is checked (step 320) to determine when the lock is released. As long as the lock is not released (step 325), the process loops back to step 315 to keep checking.
When the lock is finally released (step 325), the processor again tries to acquire a lock. The address of the wordline is loaded with a LARX (step 330), and a comparison check is performed (step 335) to determine if a lock was acquired for that wordline. If the lock was acquired, the processor attempts a store (step 345); if not, the processor begins the process over again at step 300.
When the lock is acquired, the store is attempted (step 345). If it is successful (step 350), the lock is released, and the processor resumes its normal programming. If, however, the store is unsuccessful, the lock has been lost, and the process restarts at step 300.
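The control flow of FIG. 3 may be sketched roughly as follows. The three callables stand in for the LARX, ordinary load, and STCX operations on the lock word; their names, the convention that a value of zero means "free," and the FakeLine test double are all assumptions of this sketch and do not appear in the figure.

```python
def locked_store(larx, load, stcx, max_attempts=1000):
    """Follow the FIG. 3 flow; return the number of passes through step 300."""
    for attempt in range(1, max_attempts + 1):
        if larx() != 0:          # steps 300-310: lock is held elsewhere
            while load() != 0:   # steps 315-325: poll until it is released
                pass             # unbounded spin, as in the flowchart
            if larx() != 0:      # steps 330-335: try to acquire again
                continue         # not acquired: start over at step 300
        if stcx(1):              # steps 345-350: attempt the store
            return attempt
        # Store failed: the reservation was lost, so restart at step 300.
    raise RuntimeError("lock not acquired within max_attempts")


class FakeLine:
    """Hypothetical stand-in for a lock word that frees itself after some polls."""

    def __init__(self, busy_polls):
        self.busy_polls = busy_polls
        self.value = 1 if busy_polls else 0
        self.reserved = False

    def larx(self):
        self.reserved = True
        return self.value

    def load(self):
        if self.busy_polls > 0:
            self.busy_polls -= 1
            if self.busy_polls == 0:
                self.value = 0   # the other processor releases the lock
        return self.value

    def stcx(self, value):
        ok, self.reserved = self.reserved, False
        if ok:
            self.value = value
        return ok


line = FakeLine(busy_polls=3)
passes = locked_store(line.larx, line.load, line.stcx)
```

The nested polling loop inside the outer retry loop is what makes the sequence costly: every failed acquisition or failed store sends the processor back to the first load-reserve.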
This process is, of course, very expensive in terms of processor cycles. Because of the embedded loops necessary to make sure that a lock is acquired before the store, a STCX generally consumes about 100 cycles.
Typically, MP data processing systems which include a memory hierarchy track the reservation state of each reservation granule utilizing a reservation protocol similar in operation to the memory coherence protocol discussed above. Such MP data processing systems generally record each processor's reservation at the system memory (main store) level. For example, each main memory module may include a reservation register for each processor that indicates which reservation granule, if any, is reserved by the associated processor. Because processor reservations are maintained at the system memory level, each execution of an instruction which affects the reservation status of a reserved granule requires that a reservation message be transmitted to the system memory module containing the target reservation granule. These reservation messages slow overall MP system performance because of the additional traffic they create on the interconnect switch or system bus and because of delays in determining if a requesting processor may successfully execute a STCX.
Consequently, it would be desirable to provide an improved method and system for memory updates in a MP data processing system in which reservations may be resolved at higher levels within the memory hierarchy, thereby minimizing reservation messaging and enhancing MP data processing system performance.
It is therefore one object of the present invention to provide an improved data processing system.
It is another object of the present invention to provide a system and method for improved cache management in a multiprocessor system.
It is yet another object of the present invention to provide a system and method using specialized cache states and state sequences to provide improved cache coherency management in a multiprocessor data processing system.
The foregoing objects are achieved as is now described.
A multiprocessor data processing system requires careful management to maintain cache coherency. Conventional systems using a MESI approach sacrifice some performance with inefficient lock-acquisition and lock-retention techniques. The disclosed system provides additional cache states, indicator bits, and lock-acquisition routines to improve cache performance. In particular, as multiple processors compete for the same cache line, a significant amount of processor time is lost determining if another processor's cache line lock has been released and attempting to reserve that cache line while it is still owned by the other processor. The preferred embodiment provides an additional cache state which specifically indicates that a processor has released its lock on a cache line after it has performed any necessary modifications.
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.