1. Field of the Invention
The present invention relates generally to the field of computer-based memory systems, and, more particularly, to architecture support of atomic transactions for multiprocessor systems.
2. Description of the Related Art
Atomic transactions have been widely used in parallel computing and transaction processing. An atomic transaction generally refers to the execution of multiple operations, such that the multiple operations appear to be executed together without any intervening operations. For example, if a memory address is accessed within an atomic transaction, the memory address should not be modified elsewhere until the atomic transaction completes. Thus, if a processor (or a thread in a multithreading environment) uses an atomic transaction to access a set of memory addresses, the atomic transaction semantics should guarantee that another processor (or another thread) cannot modify any of the memory addresses throughout the execution of the atomic transaction.
Atomic transactions can be implemented at the software level with appropriate architecture support. Modern microprocessors generally provide synchronization instructions, such as test-and-set and compare-and-swap, that make a read-modify-write operation on a memory address atomic (typically with 4-byte or 8-byte granularity). A program or an operating system can use such synchronization instructions to acquire semaphores exclusively, and can then use those semaphores to support atomic transactions.
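The following is a minimal sketch of how a test-and-set primitive can guard a multi-address update with a semaphore. It is illustrative only: the names `Word`, `acquire`, `release`, and `run_workers` are hypothetical, and a `threading.Lock` merely stands in for the atomicity that the hardware instruction would provide.

```python
import threading
import time


class Word:
    """Hypothetical model of one memory word that supports test-and-set."""

    def __init__(self):
        self.value = 0
        self._hw = threading.Lock()  # stand-in for hardware atomicity (assumption)

    def test_and_set(self):
        """Atomically set the word to 1 and return its previous value."""
        with self._hw:
            old = self.value
            self.value = 1
            return old

    def clear(self):
        with self._hw:
            self.value = 0


def acquire(sem):
    # Spin until the previous value was 0, i.e. this caller set the semaphore.
    while sem.test_and_set() == 1:
        time.sleep(0)  # yield to other threads while spinning


def release(sem):
    sem.clear()


def run_workers(n_threads=4, n_iters=100):
    """Each worker updates two locations as one semaphore-protected transaction."""
    sem = Word()
    shared = {"a": 0, "b": 0}

    def worker():
        for _ in range(n_iters):
            acquire(sem)
            shared["a"] += 1  # both updates happen while the semaphore is held
            shared["b"] += 1
            release(sem)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

Because every update to `a` and `b` occurs while the semaphore is held, no other worker can observe or modify one location without the other.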
The PowerPC® architecture, for example, provides load-and-reserve and store-conditional instructions. When a processor performs a load-and-reserve instruction on a memory address, the processor reads data from the memory address to a target register, and creates a reservation for the memory address. When the processor later performs a store-conditional instruction on the memory address, if the corresponding reservation remains effective, the processor writes data from a source register to the memory address. However, the reservation for the memory address may get cleared if the memory address is written by another processor before the store-conditional instruction is performed. In that case, the store-conditional instruction completes without modifying the memory address, and the processor can detect the failure and retry.
The PowerPC® architecture supports the following load-and-reserve and store-conditional instructions: lwarx (load word and reserve indexed) and stwcx. (store word conditional indexed) for 32-bit data, and ldarx (load double word and reserve indexed) and stdcx. (store double word conditional indexed) for 64-bit data. An operating system can use load-and-reserve and store-conditional instructions to implement high-level synchronization functions, such as test-and-set and compare-and-swap, as library primitives. Application programs can use such library primitives, rather than directly using load-and-reserve and store-conditional instructions, to implement atomic transactions as needed.
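The reservation behavior and the library-primitive idea described above can be sketched with a small single-threaded model. This is not PowerPC code: `ReservationMemory` and `compare_and_swap` are hypothetical names, the model tracks one reservation per processor at single-address granularity, and it assumes that a reservation is cleared by another processor's store to the reserved address.

```python
class ReservationMemory:
    """Hypothetical model of load-and-reserve / store-conditional semantics."""

    def __init__(self):
        self.mem = {}           # address -> value
        self.reservations = {}  # processor id -> reserved address

    def load_and_reserve(self, cpu, addr):
        """Read the address and record a reservation for this processor."""
        self.reservations[cpu] = addr
        return self.mem.get(addr, 0)

    def store(self, cpu, addr, value):
        """Ordinary store; clears other processors' reservations on addr."""
        self.mem[addr] = value
        for other, reserved in list(self.reservations.items()):
            if other != cpu and reserved == addr:
                del self.reservations[other]

    def store_conditional(self, cpu, addr, value):
        """Write only if this processor's reservation on addr is still held."""
        if self.reservations.get(cpu) != addr:
            self.reservations.pop(cpu, None)
            return False  # memory is left unmodified
        self.store(cpu, addr, value)
        del self.reservations[cpu]
        return True


def compare_and_swap(m, cpu, addr, expected, new):
    """Library-style primitive built from the two instructions (a sketch)."""
    while True:
        current = m.load_and_reserve(cpu, addr)
        if current != expected:
            m.reservations.pop(cpu, None)  # give up the unused reservation
            return False
        if m.store_conditional(cpu, addr, new):
            return True
        # Reservation was lost in between; retry from the load.
```

For example, if processor 0 reserves an address, processor 1 then stores to it, and processor 0 attempts its store-conditional, the attempt returns `False` and memory keeps processor 1's value. Processor 0 must reload and retry.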
Atomic transactions can also be implemented directly at the architecture level with proper architecture and micro-architecture support, rather than at the software level via semaphores and synchronization instructions as described above. Architecture-level atomic transactions, when properly used, can potentially improve overall performance, due to speculative execution of atomic transactions as well as elimination of semaphore acquisitions. Furthermore, using architecture-level atomic transactions can potentially improve software productivity because programmers may not need to worry about using semaphores to achieve desired atomicity semantics. The Transactional Coherence and Consistency (“TCC”) model, for example, provides a shared-memory model in which atomic transactions are always the basic units of parallel programming and memory consistency. However, supporting atomic transactions architecturally often requires expensive hardware and software enhancements, such as large on-chip buffers for atomic transactions, and software-managed memory regions for on-chip buffer overflows.
A symmetric multiprocessing (“SMP”) system usually employs a snoopy mechanism to ensure cache coherence. When a cache miss occurs, the requesting cache may send a cache request to memory and all its peer caches. When a peer cache receives the cache request, the peer cache performs a cache snoop operation and produces a cache snoop response indicating whether the requested data is found. If the requested data is found in a peer cache, the peer cache can source the data to the requesting cache via a cache-to-cache transfer. The memory is responsible for supplying the requested data if the data cannot be supplied by any peer cache.
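The miss-handling flow described above can be sketched as follows. The `Cache` class and `handle_miss` function are hypothetical, and the model ignores cache states, timing, and bus arbitration; it shows only that a peer cache is preferred as the data source and that memory is the fallback.

```python
class Cache:
    """Hypothetical cache holding address -> data entries."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

    def snoop(self, addr):
        """Snoop response: (found, data) for the requested address."""
        if addr in self.lines:
            return True, self.lines[addr]
        return False, None


def handle_miss(requester, peers, memory, addr):
    """Send the request to all peers; fall back to memory if none has the data."""
    responses = [(peer, *peer.snoop(addr)) for peer in peers]
    for peer, found, data in responses:
        if found:
            requester.lines[addr] = data  # cache-to-cache transfer
            return "cache:" + peer.name
    requester.lines[addr] = memory[addr]  # no peer could supply the data
    return "memory"
```

The return value names the source of the data, which makes the two paths easy to distinguish when experimenting with the model.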
A number of snoopy cache coherence protocols have been proposed. For example, the MESI coherence protocol and its variations have been widely used in SMP systems. As the name suggests, MESI has four cache states: modified (M), exclusive (E), shared (S) and invalid (I). If a cache line is in an invalid state, the data in the cache is not valid. If a cache line is in a shared state, the data in the cache is valid and can also be valid in other caches. The shared state is entered when the data is retrieved from memory or another cache, and the corresponding snoop responses indicate that the data is valid in at least one of the other caches. If a cache line is in an exclusive state, the data in the cache is valid, and cannot be valid in another cache. Furthermore, the data in the cache has not been modified with respect to the data maintained at memory. The exclusive state is entered when the data is retrieved from memory or another cache, and the corresponding snoop responses indicate that the data is not valid in any other cache. If a cache line is in a modified state, the data in the cache is valid and cannot be valid in another cache. Furthermore, the data has been modified as a result of a store operation, and the memory has not been updated.
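The state rules described above can be summarized in a short sketch. The function names are hypothetical, and only the transitions stated in the text are modeled; a store to a shared or invalid line requires a coherence transaction that is outside the scope of this sketch.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def state_after_read_miss(peer_valid_flags):
    """Shared if any snoop response reports a valid copy, otherwise exclusive."""
    return State.SHARED if any(peer_valid_flags) else State.EXCLUSIVE


def state_after_store(state):
    """A store to an exclusive or modified line leaves the line modified."""
    if state in (State.EXCLUSIVE, State.MODIFIED):
        return State.MODIFIED
    raise ValueError("store to a shared or invalid line is not modeled here")


def memory_is_stale(state):
    """Only a modified line differs from the copy maintained at memory."""
    return state is State.MODIFIED


def is_coherent(states):
    """A modified or exclusive copy must be the only valid copy of the line."""
    valid = [s for s in states if s is not State.INVALID]
    sole = [s for s in valid if s in (State.MODIFIED, State.EXCLUSIVE)]
    return not sole or len(valid) == 1
```

Here `is_coherent` takes the states that different caches hold for the same line, so `[State.MODIFIED, State.SHARED]` is rejected while `[State.SHARED, State.SHARED, State.INVALID]` is accepted.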
When a cache miss occurs, if the requested data is found in both the memory and another cache, supplying the data to the requesting cache via a cache-to-cache transfer is often preferred because cache-to-cache transfer latency is usually lower than memory access latency. The IBM® Power4 system, for example, enhances the MESI protocol to allow more cache-to-cache transfers. The Power4 system enables data of a shared cache line to be supplied from one cache to another cache in the same module. In addition, when data of a modified cache line is supplied to another cache, the modified data is not necessarily written to the memory at the same time. Instead, a cache with the modified data can be held responsible for updating the memory when the modified data is eventually replaced from the cache.
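The deferred memory update described above can be illustrated with a small sketch. It is not a description of the actual Power4 protocol: the `Line` class, the `dirty` flag, and the choice that the supplying cache keeps the write-back responsibility are all assumptions made for illustration.

```python
class Line:
    """Hypothetical cache line; dirty means this cache must update memory."""

    def __init__(self, data, dirty=False):
        self.data = data
        self.dirty = dirty


def supply_modified(supplier, requester, addr):
    """Cache-to-cache transfer of modified data without writing memory."""
    line = supplier[addr]
    # Assumption: the supplier stays responsible for the eventual write-back.
    requester[addr] = Line(line.data, dirty=False)


def evict(cache, memory, addr):
    """Replace a line; the responsible cache updates memory at this point."""
    line = cache.pop(addr)
    if line.dirty:
        memory[addr] = line.data
```

In this model, memory keeps its old value after the transfer and is updated only when the cache holding the write-back responsibility replaces the line.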