The present invention relates to computer architectures for multiprocessor systems and in particular to an architecture providing improved cache control when coordinating the exclusive use of data among multiple processors.
Computer architectures employing multiple processors working with a common memory are of particular interest in web servers. In such an application, each processor may serve a different web site whose content and programs are shared from the common memory.
In situations like this, each of the processors may need to modify the shared data. For example, in the implementation of a transaction-based reservation system, multiple processors handling reservations for different customers must read and write common data indicating the number of seats available. If the processors are not coordinated in their use of the common data, serious errors can occur. For example, a first processor may read a variable indicating an available airline seat and then set that variable to indicate that the seat has been reserved by the processor's customer. If a second processor reads the same variable before the first processor sets it, it may, based on that read, erroneously set the variable again, with the result that the seat is double booked.
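The double-booking hazard can be made concrete with a small simulation. The following is only an illustrative sketch: the `seat` dictionary and the helper functions are hypothetical names, and the interleaving of the two processors is written out by hand rather than produced by real concurrent hardware.

```python
# Hypothetical illustration: two processors interleave an unsynchronized
# read-then-write on one shared seat variable.
seat = {"reserved_by": None}  # shared variable in common memory


def read_seat():
    """Read the shared variable (None means the seat is available)."""
    return seat["reserved_by"]


def reserve(observed, customer):
    """Act only on the value read earlier, which may be stale by now."""
    if observed is None:
        seat["reserved_by"] = customer
        return True
    return False


# Interleaving: both reads happen before either write.
seen_by_p1 = read_seat()
seen_by_p2 = read_seat()
p1_booked = reserve(seen_by_p1, "customer A")
p2_booked = reserve(seen_by_p2, "customer B")

# Both processors believe they booked the same seat.
double_booked = p1_booked and p2_booked
```

In this interleaving both processors report success, and the second write silently overwrites the first.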
To avoid these problems, it is common to use synchronizing instructions for portions of code (often called critical sections) in which simultaneous access by more than one processor is prohibited.
Synchronizing instructions may be used in two general ways. The first is that the synchronization instruction may provide an atomic function, that is, a function that cannot be interrupted, once begun, by any other processor. Such instructions may perform an atomic read/modify/write sequence as could be used in the above example. A modification of this use is a pair of "bookend" synchronization instructions (such as Load Lock/Store Conditional) that provide a quasi-atomic execution of intervening instructions, in which interruption by other processors cannot be prevented, but can be detected so that the instructions may be repeated until no interruption occurs.
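The quasi-atomic behavior of a bookend pair can be sketched in software. The model below is an assumption made for illustration: the `Memory` class, its version counter, and the `interfere` hook are hypothetical, and real hardware detects interference through its own mechanisms rather than a counter.

```python
class Memory:
    """Toy single-word memory with load-linked / store-conditional."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0  # bumped on every successful store (modeling assumption)

    def load_linked(self):
        # Return the value together with a tag identifying this read.
        return self.value, self.version

    def store_conditional(self, new_value, linked_version):
        # Fail if any store happened since the matching load_linked.
        if self.version != linked_version:
            return False
        self.value = new_value
        self.version += 1
        return True

    def store(self, new_value):
        # Ordinary store, as issued by some other processor.
        self.value = new_value
        self.version += 1


def atomic_decrement(mem, interfere=None):
    """Repeat the bookended sequence until no interruption is detected."""
    attempts = 0
    while True:
        attempts += 1
        value, tag = mem.load_linked()
        if interfere is not None and attempts == 1:
            interfere(mem)  # another processor writes between the bookends
        if mem.store_conditional(value - 1, tag):
            return attempts


seats = Memory(10)
attempts = atomic_decrement(seats, interfere=lambda m: m.store(m.value - 1))
```

The first attempt is interrupted and fails, so the sequence is repeated. Both decrements take effect and neither update is lost.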
The second way is that the synchronizing instruction may be used to take ownership of a lock variable, which must be owned for modification of other shared data. An atomic synchronization instruction is used to check for the availability of the lock (by checking its value) and if it is available, to take ownership of the lock variable (by changing its value).
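A minimal sketch of this second use follows, assuming a test-and-set style instruction as the atomic primitive. The `LockWord` class and the values `FREE` and `HELD` are hypothetical, and atomicity is only modeled here: the sketch runs in a single thread, so nothing can interleave inside `test_and_set`.

```python
FREE, HELD = 0, 1  # assumed encoding of the lock variable


class LockWord:
    """Toy lock variable; test_and_set stands in for an atomic instruction."""

    def __init__(self):
        self.value = FREE

    def test_and_set(self):
        # Modeled as indivisible: return the old value and mark the lock held.
        old = self.value
        self.value = HELD
        return old


def try_acquire(lock):
    """Check availability and take ownership in one step."""
    return lock.test_and_set() == FREE


def release(lock):
    """An ordinary store back to the lock variable releases it."""
    lock.value = FREE


lock = LockWord()
seats_available = 1

p1_acquired = try_acquire(lock)   # P1 takes the lock
p2_acquired = try_acquire(lock)   # P2 finds it held and must retry later
seats_available -= 1              # P1 modifies the protected data
release(lock)

p2_retry = try_acquire(lock)      # P2 now succeeds
p2_booked = seats_available > 0   # and sees that no seat remains
release(lock)
```

Because the protected data is modified only while the lock is owned, the second processor observes the updated seat count and does not double book.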
In the first use of synchronization instructions, the critical section is short and well defined by the synchronization instructions themselves. In the second case, where a lock variable is acquired, the critical section may be arbitrarily long and is not well defined. "Synchronization instruction" as used herein refers broadly to a memory access instruction that permits mutual exclusion operations, that is, the exclusion of concurrent access to the same memory addresses by other processors during the access operations.
Like single processor systems, multiprocessor systems may employ cache memory. Cache memory is typically redundant local memory of limited size that may be accessed much faster than the larger main memory. A cache controller associated with the cache attempts to prefetch data that will be used by an executing program and thus to eliminate the delay required for accessing data in the main memory. The use of cache memory generally recognizes the reality that processor speeds are much faster than memory access speeds.
In multiprocessor systems, sophisticated cache coordination protocols, known in the art, are used to ensure that multiple copies of data from the main memory are properly managed to avoid errors caused by different processors working on their different cache copies of main memory. These protocols may work by means of bus communication between different cache controllers, or by using a single common directory indicating the status of multiple caches and their contents. In these cases, the protocols provide for unique ownership of data when a processor writes to its cache copy through an invalidation of other copies. Alternatively, the protocols may broadcast all processor writes without providing for unique ownership.
In addition to the obvious delays resulting from lock contention, synchronization instructions used in multiprocessor systems can create inefficiencies in the movement of data between main memory and the caches of the multiple processors. For example, after execution of the synchronization instructions necessary to acquire a lock variable by a first processor, and the loading of a cache line holding the lock variable into the cache of the first processor, a second processor may attempt to acquire the same lock. The lock variable is then transferred to the cache of the second processor, where it cannot be acquired because the lock is already owned, and then must be transferred back again to the first processor for release of the lock, and then transferred to the second processor again for the lock to be acquired. As is understood in the art, a cache line is the normal smallest unit of data transfer into a cache from another cache or memory.
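The cost of this back-and-forth movement can be counted in a toy model. The sketch below is illustrative only: the `Line` class and the processor names are hypothetical, and the second function assumes a scheme in which the holder of the lock withholds the cache line until the lock is released.

```python
class Line:
    """Toy cache line holding the lock variable; counts cache-to-cache moves."""

    def __init__(self, owner):
        self.owner = owner
        self.transfers = 0

    def move_to(self, cache):
        # A transfer is counted only when the line actually changes caches.
        if self.owner != cache:
            self.owner = cache
            self.transfers += 1


def transfers_without_deferral():
    line = Line(owner="P1")  # P1 has already acquired the lock
    line.move_to("P2")       # P2 attempts acquisition and fails
    line.move_to("P1")       # P1 needs the line back to release the lock
    line.move_to("P2")       # P2 retries and acquires the lock
    return line.transfers


def transfers_with_deferral():
    line = Line(owner="P1")  # P1 has already acquired the lock
    line.move_to("P1")       # release happens locally, so nothing moves
    line.move_to("P2")       # the withheld request is answered after release
    return line.transfers


without = transfers_without_deferral()
with_deferral = transfers_with_deferral()
```

Under these assumptions the contended hand-off costs three line transfers, against one when the request is held until release.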
One of the present inventors has recognized in a jointly authored prior art paper entitled Efficient Synchronization Primitives For Large-Scale Cache-Coherent Shared-Memory Multiprocessors, published April 1989 in the "Proceedings of the Third Symposium on Architectural Support for Programming Languages and Operating Systems", pgs. 64-75, that many of these problems could be avoided by having the programmer or compiler explicitly identify critical sections. By providing an explicit demarcation of the critical section through special delimiting instructions, a processor holding a lock as part of the execution of a critical section would be empowered to defer requests by other caches for the cache line holding the lock variable until the lock was released. Each processor waiting for the lock, including the deferred processor, would effectively form a queue for that lock providing a more efficient method of sharing access to the common synchronized data.
Unfortunately, such a system requires both a change in architecture and a fundamental rewriting of existing programs and/or compilers in order to indicate the boundaries of the critical sections. While such changes may occur in future generations of programming languages and programs, they do not address the large body of existing programs that might be executed in a multiprocessor system.
The present invention recognizes that with a high degree of reliability, the location and size of a critical section can be inferred without the need for special delineators. Generally, the beginning of the critical section may be inferred by the occurrence of any of a number of pre-existing synchronization instructions. The end of the critical section, while less easy to determine, may be inferred from a second synchronization instruction forming part of a bookend synchronization instruction pair or by the writing to the same memory location accessed by the first synchronization instruction, which is assumed to be a release of a lock variable.
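The inference rule can be sketched as a scan over an instruction trace. This is a simplified software illustration, not the hardware mechanism: the opcode names, the two opcode sets, and the trace format of (opcode, address) pairs are all assumptions, and the sketch always ends the section at the first later write to the synchronization address.

```python
# Assumed opcode names; a real instruction set would supply its own.
SYNC_OPCODES = {"LOAD_LINKED", "TEST_AND_SET", "SWAP"}
WRITE_OPCODES = {"STORE", "STORE_CONDITIONAL"}


def infer_critical_section(trace):
    """Return (start_index, end_index) of the first inferred section, or None.

    The start is the first synchronization instruction seen. The end is the
    next write to the address that instruction accessed, taken either as the
    closing bookend or as the release of a lock variable.
    """
    start = None
    sync_address = None
    for index, (opcode, address) in enumerate(trace):
        if start is None:
            if opcode in SYNC_OPCODES:
                start, sync_address = index, address
        elif opcode in WRITE_OPCODES and address == sync_address:
            return start, index
    return None


lock_trace = [
    ("LOAD", 0x200),
    ("TEST_AND_SET", 0x100),   # inferred start: lock acquired
    ("LOAD", 0x200),
    ("STORE", 0x200),          # write to protected data, not the lock
    ("STORE", 0x100),          # inferred end: assumed lock release
    ("LOAD", 0x300),
]
bookend_trace = [
    ("LOAD_LINKED", 0x100),        # inferred start
    ("ADD", None),
    ("STORE_CONDITIONAL", 0x100),  # inferred end: closing bookend
]

lock_span = infer_critical_section(lock_trace)
bookend_span = infer_critical_section(bookend_trace)
no_span = infer_critical_section([("LOAD", 0x200), ("STORE", 0x200)])
```

No special delimiting instructions appear in either trace. The boundaries are recovered from ordinary instructions alone.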
Specifically then, the present invention provides a method of controlling a cache used in a computer having multiple processors and caches communicating through memory. In a first step of the method, as a program is executed by a first processor, a probable initiation of a critical section in the program is inferred from the standard instructions being executed. In response to this inference, the cache of the first processor is loaded with at least one synchronization variable. Prior to completion of the critical section of the executed program, response to other caches requesting write access to the synchronization variable is delayed.
It is therefore one object of the invention to improve data flow between caches during execution of a critical section of a program, in a way that will work with pre-existing programs (or programs that have been generated by pre-existing compilers) that have not been modified to explicitly delineate the critical sections. The ability to infer, during program execution, probable initiation and termination of a critical section allows intervening cache requests to be delayed, improving global performance of the multiprocessor system.
The beginning of the critical section may be inferred by detecting at least one instruction normally associated with a synchronizing operation, for example, a LOAD-LINKED type instruction.
Thus it is an object of the invention to provide a simple way of hardware detection of the beginning of a critical section.
The delayed cache requests may be specially designated "deferrable" requests and the cache controller may also recognize "undeferrable" requests for write access wherein prior to completion of the critical section, the cache controller responds without delay to other caches requesting undeferrable write access to the synchronization variable.
It is thus another object of the invention to improve cache operation where the critical section is wrongly inferred to have ended and another cache has loaded the data of the "lock variable" that must be re-obtained by the cache of the first processor so that a lock can be released. The undeferrable requests allow the first processor to recover the lock variable so it can release the lock variable without being delayed by the delay mechanism it applied to other processors.
The cache controller may mark a response to an undeferrable request with an instruction to release the synchronization variable back to the first processor at the conclusion of the requesting processor's use of the synchronization variable.
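The handling of the two request types might be sketched as follows. This is a toy model under stated assumptions: the `OwnerCache` class, the dictionary used as a response, and the `return_to` field are hypothetical names chosen for illustration, and the controller is assumed to be inside an inferred critical section when the requests arrive.

```python
class OwnerCache:
    """Toy model of the first processor's cache controller."""

    def __init__(self, name):
        self.name = name
        self.in_critical_section = True  # assume a section was already inferred
        self.holds_line = True
        self.deferred = []               # requesters whose responses are delayed

    def request_write(self, requester, deferrable):
        """Handle another cache's request for write access to the lock line."""
        if self.in_critical_section and deferrable:
            # Deferrable request: delay the response until the section completes.
            self.deferred.append(requester)
            return {"granted": False, "return_to": None}
        # Undeferrable request (or no critical section): respond without delay.
        self.holds_line = False
        # Inside a critical section, mark the response so that the line is
        # released back to this cache when the requester is finished with it.
        return_to = self.name if self.in_critical_section else None
        return {"granted": True, "return_to": return_to}


p1 = OwnerCache("P1")
deferrable_reply = p1.request_write("P2", deferrable=True)
undeferrable_reply = p1.request_write("P3", deferrable=False)
```

The undeferrable request is served at once, while the deferrable requester remains queued and undisturbed.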
Yet another object of the invention is to permit undeferrable requests to override the delay mechanism of the present invention without disrupting the queue formed by deferrable requests among processors.
The cache controller of the first processor may obtain the synchronization variable for a synchronization operation by making a deferrable request to memory or other caches.
It is therefore another object of the invention for all caches to conform to the convention of deferrable and undeferrable requests both in requesting synchronization variables and in response to requests.
The cache controller may provide read access to the synchronization variable to the caches requesting deferrable access while delaying the response to the caches requesting write access.
Thus it is another object of the invention to allow caches that are placed in queue by the delay, to nevertheless resolve the value of lock variables that may be used in their critical sections, without giving them control of the cache for writing.
The critical section of the executed program may modify protected data and its relationship to the synchronization variable may be inferred. A request by a second processor for the synchronization variable, or possibly even the protected data, may trigger a response from the first processor including not only the synchronization variable, but other data associated with the lock, and provided before it is requested by the second processor.
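One way to picture this forwarding is a cache that remembers which addresses it wrote while holding a given lock. The sketch below is an assumption-laden illustration: the `ForwardingCache` class, its `protected_by` table, and the addresses used are hypothetical, and the source does not specify how the association is recorded.

```python
class ForwardingCache:
    """Toy cache that sends protected data along with the lock variable."""

    def __init__(self):
        self.lines = {}         # address -> cached value
        self.protected_by = {}  # lock address -> addresses written under it

    def write_under_lock(self, lock_address, data_address, value):
        # Record the write and infer that this data is protected by the lock.
        self.lines[data_address] = value
        known = self.protected_by.setdefault(lock_address, [])
        if data_address not in known:
            known.append(data_address)

    def respond(self, lock_address):
        """Answer a request for the lock with the lock and its associated data."""
        addresses = [lock_address] + self.protected_by.get(lock_address, [])
        return {a: self.lines[a] for a in addresses if a in self.lines}


p1 = ForwardingCache()
p1.lines[0x100] = 0                      # lock variable, already released
p1.write_under_lock(0x100, 0x200, 41)    # seat count modified under the lock
p1.write_under_lock(0x100, 0x200, 40)    # a second write to the same data

response = p1.respond(0x100)
```

The second processor receives the protected data in the same response as the lock, before it has asked for that data.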
Thus it is another object of the invention to collocate lock data and protected data to be modified so as to provide more rapid execution of critical sections.
A delayed request for access to the synchronization variable may be buffered and, upon completion of the critical section by the first processor, the cache controller may provide the synchronization variable to the processor associated with the first buffered request for the synchronization variable.
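The buffering and hand-off can be sketched with a first-in, first-out queue. This is a simplification for illustration: the `LockLine` class is hypothetical, and a single central queue stands in for whatever buffering the individual cache controllers provide.

```python
from collections import deque


class LockLine:
    """Toy model of the synchronization variable and its buffered requests."""

    def __init__(self, owner):
        self.owner = owner
        self.waiting = deque()  # buffered requests, oldest first

    def request(self, requester):
        """Buffer a request that arrives during the owner's critical section."""
        self.waiting.append(requester)

    def complete_critical_section(self):
        """Hand the variable to the first buffered requester, if there is one."""
        if self.waiting:
            self.owner = self.waiting.popleft()
        return self.owner


line = LockLine(owner="P1")
line.request("P2")
line.request("P3")

# Each completion passes ownership to the next processor in arrival order.
order = [line.complete_critical_section() for _ in range(3)]
```

Ownership passes to the waiting processors in the order their requests arrived, and stays with the last holder once the queue is empty.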
Thus one additional object of the invention is to provide a queuing of processors needing to execute the critical section.
Completion of the critical section may be inferred when the first processor issues a first write instruction to the address of the synchronization variable. That write instruction may, but need not be, a STORE-CONDITIONAL-type instruction.
Thus it is another object of the invention to provide a simple method of inferring the conclusion of a critical section.
Alternatively, the completion of the critical section may be determined by the given processor issuing a second write instruction to the address of the synchronization variable. The second instruction may be a standard store-type instruction.
Thus it is another object of the invention to provide an inferential rule which works both for short critical sections, composed of a single, atomic or quasi-atomic operation, and for long critical sections providing a lock-based modification of many data elements.
Whether the critical section is a short or long form may be determined by consulting a prediction table.
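A prediction table of this kind might be sketched as below. Several details are assumptions, not taken from the source: the table is indexed by the address of the synchronization instruction, unknown entries default to the short form, and an entry is simply overwritten when the observed form differs.

```python
class SectionPredictor:
    """Toy prediction table mapping an instruction address to a section form."""

    def __init__(self, default="short"):
        self.default = default  # assumed form for instructions not yet seen
        self.table = {}

    def predict(self, pc):
        return self.table.get(pc, self.default)

    def update(self, pc, observed_form):
        # Record the form actually observed for this instruction.
        self.table[pc] = observed_form


def is_section_complete(predictor, pc, writes_to_sync_address):
    """Apply the first-write rule (short form) or second-write rule (long form)."""
    needed = 1 if predictor.predict(pc) == "short" else 2
    return writes_to_sync_address >= needed


predictor = SectionPredictor()
pc = 0x4000  # hypothetical address of a synchronization instruction

short_done = is_section_complete(predictor, pc, writes_to_sync_address=1)

predictor.update(pc, "long")  # the section turned out to be lock-based
long_after_one = is_section_complete(predictor, pc, writes_to_sync_address=1)
long_after_two = is_section_complete(predictor, pc, writes_to_sync_address=2)
```

A short section is treated as complete at the first write to the synchronization variable, while a long section waits for the second write, which is taken to be the lock release.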
Thus it is another object of the invention to provide for speculation as to the type of use of the critical section, allowing for flexible implementation of the invention without modification of the program by the programmer or a compiler.
The foregoing objects and advantages may not apply to all embodiments of the invention and are not intended to define the scope of the invention, for which purpose claims are provided. In the following description, reference is made to the accompanying drawings, which form a part hereof, and in which there is shown by way of illustration, a preferred embodiment of the invention. Such embodiment also does not define the scope of the invention and reference must be made therefore to the claims for this purpose.