1. Field of the Invention
This invention relates to multiprocessor computer systems, and more particularly to the implementation of instructions used to achieve synchronization, mutual exclusion, and atomic read write.
2. Background Information
In a multiprocessor computer system the individual processors must write to the shared memory of the system in a synchronized fashion. That is, if more than one processor is attempting to read and then write to a particular region of shared memory, then each processor must complete its read write before another processor begins to write to that memory location. Synchronization is achieved by the processor obtaining a lock on that memory location. The lock is usually achieved by the processor executing a sequence of special assembly language instructions. Assembly language instructions are executed by a processor, and in response to some instructions the processor issues commands to the system. Commands issued to the system may be classified into three types: Requests, Probes, and Responses.
Requests are commands issued by a processor when, as a result of executing a load or store instruction, it must obtain a copy of data. Requests are also used to gain exclusive ownership of a data item (cache line) from the system. Requests include Read (Rd) commands, Read/Modify (RdMod) commands, Change-to-Dirty (CTD) commands, Victim commands, and Evict commands, the latter of which specify removal of a cache line from a respective cache. A RdMod command is a read of a block of memory, coupled with a request for ownership of the block of memory which was read. That is, the RdMod command reads the block and modifies its ownership.
A CTD command is issued by a processor and is executed by the system (often the controller coupled to the memory directory) and obtains ownership of a memory block. After the CTD executes, the processor may then change its cache value of the data by writing to its cache. At some point the cache value may be written back to the memory block, depending on writeback policy.
Probes are commands issued by the system to one or more processors requesting data and/or cache tag status updates. Probes include Forwarded Read (Frd) commands, Forwarded Read Modify (FRdMod) commands and Invalidate (Inval) commands. An Inval command is sent to a processor to invalidate a cache line in that processor's cache. When a processor P issues a request to the system, the system may issue one or more probes (via probe packets) to other processors. For example, if P requests a copy of a cache line (a Rd request), the system sends a Frd probe to a processor having a dirty copy of the data (if any). If P requests exclusive ownership of a cache line (a CTD request or RdMod request), the system sends Inval probes to one or more processors having copies of the cache line.
Moreover, if P requests both a copy of the cache line as well as exclusive ownership of the cache line (a RdMod request) the system sends a forwarded Read Modify command (Fr_RdMod) in a probe to a processor currently storing a “dirty” copy of the line of data in its cache. In this context, a dirty copy of a cache line represents the most up-to-date version of the corresponding cache line or data block.
In response to the Fr_RdMod probe, the dirty copy of the cache line is returned to the initiating processor where the dirty copy is stored in the corresponding cache. The previous cache is invalidated by the system sending Inval Probes to processors holding the previous cache line in their caches. Upon gaining ownership the processor can then write to the valid copy of the data in its cache.
An Inval Probe may be issued by the system to a processor storing a copy of the cache line in its cache, when the cache line is to be updated by another processor.
Responses are commands from the system to processors and/or the Input Output Ports (IOP). The responses carry the data requested by the processor or an acknowledgment corresponding to a request. For Rd and RdMod requests, the responses are Fill and FillMod responses, respectively, each of which carries the requested data. For a CTD request, the response is a CTD-Success (Ack) or CTD-Failure (Nack) response, indicating success or failure of the CTD, whereas for a Victim request, the response is a Victim-Release response.
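The pairing of requests with responses described above can be summarized as a small lookup table. The following is only an illustrative sketch: the names `RESPONSES`, `PROBES`, and `carries_data` are hypothetical and are not taken from any actual implementation.

```python
# Hypothetical summary of the command taxonomy described above.
# Each request type maps to the response(s) the system may return for it.
RESPONSES = {
    "Rd": ("Fill",),                        # read: data is returned
    "RdMod": ("FillMod",),                  # read plus ownership: data is returned
    "CTD": ("CTD-Success", "CTD-Failure"),  # ownership only: Ack or Nack
    "Victim": ("Victim-Release",),
}

# Probes are issued by the system to processors.
PROBES = ("Frd", "FRdMod", "Inval")


def carries_data(response):
    """Return True for responses that carry the requested data."""
    return response in ("Fill", "FillMod")
```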
In a single processor system, when it is desired to read a memory block, and then to write to that memory block, only one processor reads and writes to that memory block. So the read and the write are simply executed in sequence by the assembly language code.
However, in a multiprocessor system having many processors all of which have access to the target memory block, if processor P1 executes a read, then processor P2 may write to that memory block before processor P1 can write to that memory block. Processor P1 then writes to the memory block, and processor P2 data is corrupted. The processors are not properly synchronized.
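The unsynchronized interleaving described above can be reproduced deterministically. The following is a minimal sketch, assuming a hypothetical one-word shared memory and a hand-scripted schedule in which both processors intend to add 1 to block Z; the names `memory`, `p1_reg`, and `p2_reg` are illustrative only.

```python
# Hypothetical model: shared memory is a dict, each processor has a private register.
memory = {"Z": 10}

# Both processors intend to add 1 to block Z, so the correct final value is 12.
p1_reg = memory["Z"]        # P1 reads Z and sees 10
p2_reg = memory["Z"]        # P2 reads Z before P1 has written, and also sees 10
memory["Z"] = p2_reg + 1    # P2 writes 11
memory["Z"] = p1_reg + 1    # P1 writes 11, silently overwriting P2's update

# One of the two increments has been lost.
lost_update = memory["Z"] != 12
```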
Further, each processor has a cache memory. Cache coherency methods must be used in order to establish ownership by a processor of a memory block, and to maintain a cache with the current version of the data in the memory block. The cache line holding the current version of the data of the memory block is referred to as the “dirty” cache line.
An “atomic read write” is a read and then a write to a memory block by a single processor, where other processors of a multiprocessor system are excluded from writing to that memory block between the read and the write.
An atomic read write may be implemented by a pair of instructions in assembly language code, where the second instruction returns a value from which the executing assembly language code can deduce whether the pair of instructions was executed as if the instructions were atomic. The pair of instructions appears atomic if every other operation executed by any processor appears to occur either before or after the pair. Thus, when an instruction pair appears atomic, no other processor has changed the value between the two instructions of the pair.
Atomic test-and-modify instructions appear in the instruction set of most processors. These are used to implement mutual exclusion and synchronization between processes running on a uni- or multi-processor system. The test-and-modify instructions read a memory location and modify it only if it satisfies some predicate. For instance, a test-and-set instruction typically reads a memory location and tests whether the value is “0”. If so, it writes a “1” to the same location and “succeeds”. If not, it leaves the memory location unchanged and “fails”. The instruction is considered atomic since one process must complete its test-and-set sequence before another process is allowed to access the location.
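The test-and-set behavior just described can be illustrated in software. The following is a minimal sketch, assuming a hypothetical `TestAndSetCell` class in which a `threading.Lock` stands in for the atomicity that hardware would provide; it is not a model of any particular processor.

```python
import threading


class TestAndSetCell:
    """Hypothetical model of one memory location supporting test-and-set."""

    def __init__(self):
        self.value = 0
        # The lock stands in for hardware atomicity: the read, the test,
        # and the write below cannot be interleaved with another caller.
        self._hw = threading.Lock()

    def test_and_set(self):
        """Write 1 and succeed if the value is 0; otherwise fail and leave it unchanged."""
        with self._hw:
            if self.value == 0:
                self.value = 1
                return True
            return False


cell = TestAndSetCell()
first = cell.test_and_set()    # location was 0, so this attempt succeeds
second = cell.test_and_set()   # location is now 1, so this attempt fails
```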
In some modern processors, atomicity is simulated using a pair of instructions such as a LOAD_LOCKED and STORE_CONDITIONAL pair, coupled with a mechanism to detect whether or not the execution proceeded atomically. The mechanism is referred to as the “atomicity-violation-detection mechanism”. To perform an atomic test-and-modify operation on a memory location, a LOAD_LOCKED (LD_L) instruction and a STORE_CONDITIONAL (ST_C) instruction are executed in sequence and with the same memory location (the “lock location”) as argument. One or more instructions may occur between the LD_L and ST_C instructions. The atomicity-violation-detection mechanism is activated when the LD_L instruction is executed. The ST_C instruction performs a write to the lock location only if the preceding LD_L succeeds and the atomicity-violation-detection mechanism indicates that atomicity has not been violated.
The design of the atomicity-violation-detection mechanism may vary from processor to processor. Here we consider a typical design. Consider a process running on a processor, executing a LD_L and ST_C sequence in an attempt to acquire a lock. The atomicity-violation-detection mechanism signals a violation when: (1) another process or processor performs a write to the lock variable or to any other address in the same cache block as the lock variable; OR (2) the processor does a context switch while the atomicity-violation-detection mechanism is active, as a LD_L from a first context could permit a ST_C from a second context to go forward.
The LD_L and ST_C instructions are used in sequence: if the contents of the memory location specified by the load locked are changed before the store conditional to the same address occurs, then the store conditional fails. If the processor does a context switch between the two instructions, then the store conditional also fails. The store conditional is defined to return a value to the executing code indicating whether or not the store was successful. Thus, the load locked returns to the executing code the initial value of the contents of the memory location and, in exemplary implementations, the store conditional returns “0” if it succeeds and “1” otherwise.
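The success and failure cases of the LD_L and ST_C pair can be sketched as a small simulation. This is an illustrative model only: the classes `Memory` and `Processor` and their methods are hypothetical, the lock is tracked per address rather than per cache block, and the return values follow the exemplary convention of “0” for success and “1” for failure.

```python
class Memory:
    """Hypothetical shared memory that notifies processors of writes."""

    def __init__(self):
        self.data = {}
        self.watchers = []   # every processor attached to this memory

    def write(self, addr, value, writer):
        self.data[addr] = value
        # A write by another processor resets the lock flag of any
        # processor that has load-locked the same address.
        for p in self.watchers:
            if p is not writer and p.lock_addr == addr:
                p.lock_flag = False


class Processor:
    def __init__(self, mem):
        self.mem = mem
        self.lock_flag = False
        self.lock_addr = None
        mem.watchers.append(self)

    def ld_l(self, addr):
        """Load locked: arm the detection mechanism and return the value."""
        self.lock_addr = addr
        self.lock_flag = True
        return self.mem.data.get(addr, 0)

    def context_switch(self):
        """A context switch while the mechanism is active is a violation."""
        self.lock_flag = False

    def st_c(self, addr, value):
        """Store conditional: return 0 on success, 1 on failure."""
        succeeded = self.lock_flag and self.lock_addr == addr
        self.lock_flag = False
        if succeeded:
            self.mem.write(addr, value, self)
            return 0
        return 1


mem = Memory()
mem.data["lock"] = 0
p1, p2 = Processor(mem), Processor(mem)

old = p1.ld_l("lock")
ok = p1.st_c("lock", 1)          # no interference: the store succeeds

p1.ld_l("lock")
mem.write("lock", 7, p2)         # P2 writes between the LD_L and the ST_C
failed = p1.st_c("lock", 2)      # the store fails and memory is unchanged

p1.ld_l("lock")
p1.context_switch()              # a context switch also forces failure
switched = p1.st_c("lock", 3)
```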
Additionally, it is desired for a first processor to be able to gain ownership of a memory block in order to exclude other processors from modifying the memory block. The first processor gains exclusive use of the memory block to read it, or complete another task which also requires that it exclude other processors. The first processor may use an atomic read write in order to gain exclusive ownership of the memory block.
When the code attempts to do an atomic read write to a memory block Z to which a plurality of processors have access, the code is attempting to read the block and then write to the block without another processor changing the block in between the read and write operations.
In multiprocessor systems with private caches, and a cache coherence mechanism with invalidate-on-write semantics, a processor must typically acquire “ownership” (that is, an exclusive copy) of a cache block in order to write to any byte(s) in the cache block.
As mentioned above, the traditional method for achieving an atomic read write to a memory block uses both a load locked (LD_L) instruction and a store conditional (ST_C) instruction. These two instructions are assembly language instructions executing in code running on a processor, and they execute in sequence, first the LD_L and later the ST_C. However, other instructions may intervene between the LD_L and the ST_C.
Status at the beginning of the LD_L and ST_C sequence is assumed to be: P1 is executing assembly language code which wants to read memory block Z and then to write a new desired value into memory block Z; P2 has the current “dirty” value of memory block Z in its cache.
First, in response to the LD_L assembly language instruction, the processor P1: initiates its atomicity detection mechanism by writing into a Load Address Register the address of the block to be read and also “setting” a Lock Flag; and the processor attempts to read the memory block Z from its cache, and usually generates a cache miss. In response to the cache miss, processor P1 issues a READ system command for memory block Z.
In response to the READ system command for memory block Z, the system, for example the directory, locates the processor P2 whose cache has the current version of the memory block Z, that is the “dirty copy”, if any processors have the dirty copy. In the event that no processors have the dirty copy, the READ Request goes to memory to read the value of block Z.
In the event that the READ Request must go to the dirty cache line, the system sends a Forwarded Read Probe to processor P2 having the dirty copy in its cache. In response to the Forwarded Read Probe, processor P2 delivers the value of the memory block Z. The value in the memory block Z is returned to processor P1 in a Fill message. Also, the system sets indicator bits in the directory indicating that P1 has a copy of the memory block Z. When the Fill message from P2 reaches P1, then P1 updates its cache with the cache line containing the current value of memory block Z. Processor P1 then usually writes the value of memory block Z into a register of P1.
In the event that another processor, for example P27, writes to the memory block Z before the ST_C instruction executes, then the Lock Flag in the Load Address Register of P1 is reset. The Lock Flag in the Load Address Register of P1 is reset as follows: in order to write to memory block Z, the other processor, P27, must first obtain ownership of memory block Z. When ownership is transferred to the other processor, P27, by the directory, then the last ownership processor sends invalidate messages, an Inval Probe, to each processor having a copy of the cache line containing memory block Z. The arrival of the Inval Probe at P1, the processor executing the LD_L instruction, causes P1 to reset its Lock Flag.
Execution of the ST_C is next described. The ST_C first checks the Load Address Register to determine if the Lock Flag is still set. In the event that the Lock Flag is set, the ST_C instruction proceeds. In the event that the Lock Flag is reset, the ST_C fails and returns a failure code to the assembly language code.
Execution in P1 of the ST_C assembly language instruction usually begins with a cache miss. P1 has a cache miss because P1 usually does not have ownership of memory block Z. Processor P1 then issues a system command: an ownership request, that is a CTD (Change to Dirty). The CTD command goes to the system, that is to the directory. The system checks whether or not P1 has a valid copy, that is a most recent copy of memory block Z, which it can do by checking the indicator bits in the directory. In the event that P1 has a most recent copy of memory block Z as shown by the indicator bits in the directory, then the directory changes ownership to P1. Also, the system sends an Inval Probe to P2 in order to invalidate the P2 cache line for memory block Z, and also sends Inval probes to any other processor having a current value of memory block Z in its cache. Also, the successful CTD causes the system to return an ACK Reply to P1 in a Response message indicating that the CTD was successful.
In response to receiving the ACK, P1 checks the Lock Flag in its LAR. If the flag is still set, and if there has not been a context switch in code executing on P1, the ST_C instruction proceeds. Otherwise, if the flag is reset the ST_C fails and returns a failure value to the executing assembly language code. Upon failure, ownership of memory block Z is with P1, but P1 does not write the value of the argument of the ST_C into its cache.
In the event that the Lock Flag is still set, then P1 writes the new value of memory block Z into its cache, which is now the new dirty copy of the data of memory block Z. Also, in response to the CTD ACK, the ST_C returns to assembly language code executing on P1 an indicia of success, usually a “0”. The code can then do a branch test on the returned value. The new value of memory block Z will be written back from the cache of P1 to memory block Z in due course, depending upon the write back policy used by the system.
In the contrary event that P1 does not have a most recent copy of memory block Z because some other processor has intervened (intervening processor) since the READ and changed the value in memory block Z, then an Inval probe is received by P1 from the controller connected to the directory executing the intervening processor ownership request. The lock bit is reset by P1 in response to receipt of the Inval probe. The ST_C checks the lock bit and finds it “reset”, and therefore fails. The ST_C returns to code executing on P1 an indicia of failure, usually a “1”. The code can then do a branch test on the returned value. Usually a branch on failure does a loop to repeat the load locked/store conditional sequence until success is achieved.
Status at the end of a successful LD_L and ST_C sequence is: the P2 cache is invalidated, along with all other caches previously holding valid copies of memory block Z; P1 has the value formerly in the memory block Z (actually the dirty value read from the P2 cache) written into a register; P1 has ownership of memory block Z, and P1 has written its desired new value of memory block Z into its own cache, and this is the new dirty value of memory block Z.
Two system commands were issued to accomplish successful execution of the LD_L and ST_C commands in code running on P1: the cache miss on Read by P1; and, the cache miss on Write by P1.
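The two system commands of the traditional sequence can be made visible with a small simulation of a single memory block Z. This is a sketch under simplifying assumptions: the classes `Directory` and `Processor` are hypothetical, the Forwarded Read, Fill, and Inval traffic is collapsed into direct method calls, and only commands issued by the requesting processor are logged.

```python
class Processor:
    def __init__(self, name, cache_value=None):
        self.name = name
        self.cache_value = cache_value   # None means no valid copy of Z
        self.lock_flag = False

    def inval(self):
        """Inval Probe: drop the cached copy and reset the Lock Flag."""
        self.cache_value = None
        self.lock_flag = False

    def ld_l(self, directory):
        self.lock_flag = True
        return directory.rd(self)        # cache miss on read issues a Rd

    def st_c(self, directory, value):
        if not self.lock_flag:
            return 1                     # flag already reset: fail without a CTD
        acked = directory.ctd(self)      # cache miss on write issues a CTD
        if not acked or not self.lock_flag:
            return 1
        self.cache_value = value         # P1 now holds the new dirty copy
        self.lock_flag = False
        return 0


class Directory:
    """Hypothetical directory for one block, tracking owner and sharers."""

    def __init__(self, owner):
        self.owner = owner
        self.sharers = {owner}
        self.commands = []               # log of system commands received

    def rd(self, p):
        self.commands.append(("Rd", p.name))
        value = self.owner.cache_value   # Forwarded Read to the dirty copy
        self.sharers.add(p)              # indicator bits: p now has a copy
        p.cache_value = value            # Fill
        return value

    def ctd(self, p):
        self.commands.append(("CTD", p.name))
        if p not in self.sharers:
            return False                 # Nack: p no longer has a valid copy
        for other in self.sharers - {p}:
            other.inval()                # Inval Probes to every other holder
        self.sharers = {p}
        self.owner = p
        return True                      # Ack


p2 = Processor("P2", cache_value=5)      # P2 starts with the dirty copy of Z
p1 = Processor("P1")
directory = Directory(owner=p2)

old = p1.ld_l(directory)
status = p1.st_c(directory, old + 1)
```

In this uncontended run the log `directory.commands` holds exactly one Rd and one CTD, both from P1.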
The problem of inter-processor synchronization in a multiprocessor system is described by John Hennessy and David Patterson in their book Computer Architecture a Quantitative Approach, Second Edition, Copyright date 1996, published by Morgan Kaufmann Publishers, Inc., San Francisco, all disclosures of which are incorporated herein by reference, especially at pages 694 through 707.
Also the problem of atomic read/write and inter-processor synchronization in a multiprocessor system is described by David E. Culler and Jaswinder P. Singh in their book “Parallel Computer Architecture”, published by Morgan Kaufmann Publishers, Inc., San Francisco, all disclosures of which are incorporated herein by reference, especially at pages 391-393.
A difficulty with the load locked/store conditional sequence as described herein above is that another processor may write to the memory block Z after P1 does its Read, and before the CTD issued by P1 arrives at the system directory. For example, if two processors are both trying to do an atomic read/write to memory block Z, then each executes its Read, one does its CTD and then the other fails. The failing processor then repeats its load locked/store conditional sequence by branching into a loop, and will take ownership of memory block Z away from the other processor. Each trade of ownership requires two system commands, and the execution of these system commands contributes to undesirable overhead.
There is needed a method for doing an atomic read/write sequence which reduces the number of system commands and so reduces system overhead during contention for a memory block by two or more processors in a multiprocessor computer system.
There are two significant parts to the invention. First, all LD_L instructions that miss in the processor P1 cache generate ownership read requests, that is RdMod requests. Formerly Read requests were generated by the cache miss from a LD_L instruction. Second, a set of constraints is imposed on “memory request messages” to eliminate any livelock problem arising from the RdMod Request.
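The effect of the first part, issuing a RdMod on a LD_L miss, can be sketched with a small simulation of a single memory block. This is an illustrative model only: the classes `Directory` and `Processor` are hypothetical, the forwarded probe and FillMod response are collapsed into direct method calls, and the constraints on memory request messages are deliberately left out.

```python
class Processor:
    def __init__(self, name, cache_value=None):
        self.name = name
        self.cache_value = cache_value   # None means no valid copy
        self.lock_flag = False

    def inval(self):
        """Losing the block resets the Lock Flag."""
        self.cache_value = None
        self.lock_flag = False

    def ld_l(self, directory):
        self.lock_flag = True
        if directory.owner is not self:
            # Cache miss: request the data and ownership in one command.
            return directory.rdmod(self)
        return self.cache_value          # cache hit: no system command

    def st_c(self, directory, value):
        # Ownership was obtained by the LD_L, so no CTD is needed here.
        if self.lock_flag and directory.owner is self:
            self.cache_value = value
            self.lock_flag = False
            return 0
        self.lock_flag = False
        return 1


class Directory:
    """Hypothetical directory for one block, tracking only the owner."""

    def __init__(self, owner):
        self.owner = owner
        self.commands = []               # log of system commands received

    def rdmod(self, p):
        self.commands.append(("RdMod", p.name))
        value = self.owner.cache_value   # forwarded RdMod to the dirty copy
        self.owner.inval()               # previous owner gives up the block
        self.owner = p
        p.cache_value = value            # FillMod
        return value


p2 = Processor("P2", cache_value=5)      # P2 starts with the dirty copy
p1 = Processor("P1")
directory = Directory(owner=p2)

old = p1.ld_l(directory)
status = p1.st_c(directory, old + 1)
```

In this uncontended run the log holds a single RdMod command, since the ST_C finds the block already owned by P1.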
When a processor P2 holding the dirty copy of a memory block X in its cache receives, from the cache-coherence mechanism, a memory ownership request message (usually a forwarded RdMod Probe) originating from some other processor P1 that issued an ownership request for cache block X, processor P2 supplies the requested data and relinquishes ownership if there is no Miss Address File (MAF) entry for this address. However, if an outstanding MAF entry exists for this address at processor P2, then processor P2 relinquishes ownership of memory block X only if and when at least one of the three conditions below is true:
1) P2 has executed more than some pre-determined number of instructions since it executed a LD_L instruction, and the P2 Miss Address File (MAF) has been fully retired, ensuring that no cache miss Requests are pending;
2) Some pre-determined number of cycles has elapsed since P2 executed its most recent LD_L instruction;
3) A ST_C instruction has been successfully retired since P2 executed its most recent LD_L instruction.
Rule 3 requires that the processor wait until the ST_C instruction completes.
Rule 1 and Rule 2 have the processor wait a reasonable time period for the ST_C instruction to execute. However, in the event that for some reason the ST_C instruction never executes, then either Rule 1 or Rule 2 will fire, and the process executing in the processor will go forward. A ST_C instruction may never execute for a number of reasons, for example: the program takes a branch which has no ST_C instruction written; a programming error; . . . etc.
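The three rules can be expressed as a single predicate. The following is a minimal sketch, not the claimed implementation: the `OwnerState` fields, the function `may_relinquish`, and the thresholds `MAX_INSTRS` and `MAX_CYCLES` are all hypothetical, and the threshold values are arbitrary placeholders for the pre-determined numbers. The field `maf_entry_at_probe` records whether a MAF entry for the address was outstanding when the probe arrived, while `maf_fully_retired` reflects the MAF at the time the rules are evaluated.

```python
from dataclasses import dataclass

MAX_INSTRS = 64     # assumed placeholder for the pre-determined instruction count
MAX_CYCLES = 1000   # assumed placeholder for the pre-determined cycle count


@dataclass
class OwnerState:
    """Hypothetical snapshot of the owning processor P2."""
    maf_entry_at_probe: bool        # MAF entry for block X outstanding at probe arrival
    maf_fully_retired: bool         # no cache miss Requests pending now
    instrs_since_ld_l: int          # instructions executed since the LD_L
    cycles_since_ld_l: int          # cycles elapsed since the most recent LD_L
    st_c_retired_since_ld_l: bool   # a ST_C has been successfully retired


def may_relinquish(s):
    """Return True if P2 may give up ownership of block X now."""
    if not s.maf_entry_at_probe:
        return True   # no MAF entry: supply the data and relinquish at once
    rule1 = s.instrs_since_ld_l > MAX_INSTRS and s.maf_fully_retired
    rule2 = s.cycles_since_ld_l >= MAX_CYCLES
    rule3 = s.st_c_retired_since_ld_l
    return rule1 or rule2 or rule3
```

For example, `may_relinquish(OwnerState(True, False, 100, 10, False))` is false under these placeholder thresholds, because the instruction count is exceeded but the MAF is not yet retired and neither of the other rules applies.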
Other and further aspects of the present invention will become apparent during the course of the following description and by reference to the accompanying drawings.