The present invention relates to instruction support for memory-to-memory atomic copy and memory-to-memory compare/exchange instructions in a multiprocessor computer system. More particularly, for threads employing non-blocking synchronization schemes, the instruction support reduces the time in which processors (and hence applications) are subject to memory access lockouts (spin-locks).
The Pentium Pro® processor, commercially available from Intel Corporation of Santa Clara, Calif., provides support for several register-to-memory instructions: loads, stores and compare-and-exchange. The load and store instructions transfer data from registers within the processor to a public memory location or vice versa. The compare-and-exchange operation compares a value stored in a target address with a value stored in a predetermined pair of registers and, if they are equal, writes the value from a second register pair to the target address. In the Pentium Pro® processor, registers are 32 bits wide; the largest increment of data to be transferred atomically according to these instructions is 64 bits (using the double-word compare-exchange instruction).
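The register-width compare-and-exchange semantics described above can be sketched in portable C. This is an illustration only, not the processor's instruction encoding: it assumes a C11 compiler with `<stdatomic.h>`, and the helper name `cmpxchg64` is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from the source text): atomically write
 * `desired` to *target only if *target still holds `expected`.
 * Returns true when the exchange took place, false otherwise. */
static bool cmpxchg64(_Atomic uint64_t *target,
                      uint64_t expected, uint64_t desired)
{
    /* On failure the library call overwrites the local copy of `expected`
     * with the value actually found; this sketch discards that value. */
    return atomic_compare_exchange_strong(target, &expected, desired);
}
```

A caller typically loads the current value, computes a replacement, and retries the exchange until it succeeds. Note that the operand is limited to 64 bits here, which is the size limit the paragraph above describes.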
The copy (load/store) and compare-and-exchange instructions provide tools for software designers to manage data structures. Of course, software designers routinely manage data structures that are much larger than a single 32-bit register or even a collection of registers. When the set of atomic instructions is extended to data sizes that are much larger than a register width and that approach the size of a processor's cache line, the instructions become memory-to-memory data transfers rather than register-to-memory transfers. Such an instruction requires access to a source address in memory and to a target address in memory. Because the instruction must gain access to two addresses during the course of program execution, extending the copy and compare-and-exchange instructions raises a risk, particularly in multi-threaded environments, that blocking may occur among parallel threads.
A multi-processor computer system (or even a uniprocessor computer system) may include a plurality of "threads," each of which executes program instructions independently from other threads in the system. Each of multiple agents in a system is an independent thread in the system. Additionally, as is known, resources of a single processor may be shared among several execution processes that are independent from each other. Although the processes execute on a single processor, they may be considered separate threads because their execution is independent from each other, much in the same way that execution among two or more processors may be independent. Herein, we use the term "thread" to refer to any logically independent processing agent, regardless of whether the threads are distributed over a single processor (time-multiplexed threading) or multiple processors (space-multiplexed threading) in a computing system.
Blocking may occur when two or more threads attempt to gain ownership of a single data element. Typically, threads engage in administrative protocols to synchronize with each other and to ensure that they use the most current copy of data available to the system. One coherency technique involves locks that are applied to data by a thread while the thread uses the data. If a thread were required to update a data structure, for example, the thread typically must acquire a lock, perform the data update and then release the acquired lock. Other threads that required access to the same data structure would be denied access so long as the lock were applied. The lock renders the first thread's data update operations "atomic" because no other thread can gain access to the locked data until the data structure is completely updated. These locks can lead to significant performance bottlenecks because (1) threads waiting for access to the locked data structure waste CPU cycles until the lock becomes available and, more importantly, (2) threads holding a lock can be interrupted by other processes or stalled by a long-latency operation (e.g., due to a page fault or an interval-timer interrupt, often in the millisecond range). In this circumstance, a first thread that acquired a lock would not make forward progress because it was interrupted, and another thread requiring access to the locked data structure also could not make forward progress because it was denied access (because the interrupted thread holds the sought-after lock). Both threads, the one holding the lock and the one seeking the lock, fail to make progress even though the data structure is not actively being used.
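The acquire, update and release sequence described above can be sketched with a simple spin-lock. This is a minimal illustration assuming C11 atomics; the structure layout and the names `record`, `locked_record` and `locked_update` are hypothetical and do not come from the source.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 32-byte structure, larger than any single register. */
struct record {
    uint32_t words[8];
};

struct locked_record {
    atomic_flag   lock;   /* set while some thread owns the data */
    struct record data;
};

/* Acquire the lock, perform the multi-word update, release the lock. */
static void locked_update(struct locked_record *r, const struct record *src)
{
    /* Waiting threads spin here, wasting CPU cycles until the flag clears. */
    while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
        ;

    /* No other thread following this protocol can see a partial update. */
    r->data = *src;

    atomic_flag_clear_explicit(&r->lock, memory_order_release);
}
```

If the thread inside `locked_update` is interrupted between acquiring and clearing the flag, every other thread calling the function spins for the duration of the interruption, which is the bottleneck described in the paragraph above.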
Non-blocking synchronization ("NBS") programming techniques also provide data coherency guarantees but they permit multiple threads to read and update data simultaneously. NBS techniques assign version stamps to data elements. When a thread operates on data, the thread may read a new version stamp from memory and compare it to an older copy of the version stamp that the thread read prior to its data operation. If the two version stamps are identical, the thread can confirm that the results of its data operation are valid. If not, the data element is assumed to have been updated while the data operations were in progress; the thread typically re-reads the data and the version stamps and retries the data operation.
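The version-stamp check described above can be sketched as follows. This is only one possible arrangement, not a scheme taken from the source: it assumes C11 atomics and a single writer, and it adds an odd/even stamp convention (an assumption of this sketch) so that a reader can also detect an update that is still in progress. The names `versioned`, `nbs_write` and `nbs_read` are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define PAYLOAD_WORDS 4

struct versioned {
    _Atomic uint32_t version;                 /* version stamp */
    _Atomic uint32_t payload[PAYLOAD_WORDS];  /* data element  */
};

/* Writer (single writer assumed): the stamp is odd while the update is in
 * progress and even once the new data is fully published. */
static void nbs_write(struct versioned *v, const uint32_t in[PAYLOAD_WORDS])
{
    uint32_t ver = atomic_load(&v->version);
    atomic_store(&v->version, ver + 1);
    for (int i = 0; i < PAYLOAD_WORDS; i++)
        atomic_store(&v->payload[i], in[i]);
    atomic_store(&v->version, ver + 2);
}

/* Reader: read the stamp, read the data, re-read the stamp. The result is
 * accepted only if both stamps are identical (and even); otherwise the
 * reader re-reads and retries. Returns the stamp that validated the read. */
static uint32_t nbs_read(struct versioned *v, uint32_t out[PAYLOAD_WORDS])
{
    uint32_t before, after;
    do {
        before = atomic_load(&v->version);
        for (int i = 0; i < PAYLOAD_WORDS; i++)
            out[i] = atomic_load(&v->payload[i]);
        after = atomic_load(&v->version);
    } while (before != after || (before & 1u));
    return after;
}
```

The reader never holds a lock, so an interrupted reader cannot stall the writer; the cost is that a read may have to be repeated when it overlaps an update.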
No known processor provides atomic copy or compare-and-exchange instruction support for larger-than-register-size memory-to-memory transfers. No known processor provides such instruction support in the context of an NBS scheme. Accordingly, there is a need in the art for instruction support for memory-to-memory copy and compare-and-exchange instructions that operate on data sizes that approach a cache line size in a processor.