The present invention relates to computer systems, and more particularly to such systems executing multiple threads.
Computer systems, including multiprocessor (MP) and single-processor systems, may include a plurality of “threads,” each of which executes program instructions independently from other threads. Use of multiple processors allows various tasks or functions, and even multiple applications, to be handled more efficiently and with greater speed. Utilizing multiple threads or processors means that two or more processors or threads can share the same data stored within the system. However, care must be taken to maintain memory ordering when sharing data.
For data consistency purposes, if multiple threads or processors desire to read, modify, and write to a single memory location, the multiple agents should not be allowed to perform operations on the data simultaneously. Further complicating the use of multiple processors is that data is often stored in a cache associated with a processor. Because such caches are typically localized to a specific processor, multiple caches in a multiprocessor computer system can contain different copies of a given data item. Any agent accessing this data should receive a valid or updated (i.e., latest) data value, and data being written from the cache back into memory must be the current data so that cache coherency is maintained.
Memory instruction processing acts in accordance with a target instruction set architecture (ISA) memory order model. For reference, Intel Corporation's two main ISAs, the Intel® architecture (IA-32 or x86) and Intel's ITANIUM® processor family (IPF), have very different memory order models. In IA-32, load (i.e., read) and store (i.e., write) operations must be visible in program order, while in the IPF architecture they generally need not be. Further, while executing multiple threads in a chip multiprocessor (CMP) or other MP system, ordered memory instructions are used in synchronization and communication between different threads.
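As an illustration of how ordered memory instructions support communication between threads, the following is a minimal C++ sketch, not drawn from any particular ISA. The helper name publish_and_consume is hypothetical, and the release store and acquire load stand in for whatever ordered store and load instructions a given architecture provides.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper (illustration only): one thread publishes a payload,
// another consumes it. Returns the value the consumer observed.
int publish_and_consume(int value) {
    int payload = 0;                 // ordinary, non-atomic shared data
    std::atomic<bool> ready{false};  // flag used to communicate between threads
    int observed = -1;

    std::thread producer([&] {
        payload = value;  // (1) write the data
        // (2) ordered store: the write in (1) may not be reordered after it
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&] {
        // Ordered load: reads that follow may not be reordered before it
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;  // sees the value written in (1)
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Without the ordering on the flag, a weakly ordered machine could make the flag visible before the payload, and the consumer could read stale data.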
Multithreaded (MT) software uses different mechanisms to interact and coordinate between different threads. Two common forms of MP synchronization are barriers and semaphore spin-locks. A barrier mechanism helps a program synchronize different threads at predefined points in the program. Typically, each thread either increments or decrements a memory variable in an atomic fashion when it reaches such a point. Every thread then waits for the memory variable to reach a predetermined barrier level. Synchronization is achieved once all threads have completed the updates. When the barrier is reached, all threads can then proceed.
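The barrier mechanism described above can be sketched in C++ as follows. This is a minimal, single-use illustration under stated assumptions: the names CountingBarrier, arrive_and_wait, and run_barrier_demo are hypothetical, and fetch_add stands in for an atomic increment instruction.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical single-use barrier built on one atomic counter.
class CountingBarrier {
public:
    explicit CountingBarrier(int num_threads) : target_(num_threads) {}

    void arrive_and_wait() {
        // Atomically increment the shared variable to announce arrival.
        arrived_.fetch_add(1, std::memory_order_acq_rel);
        // Wait until the variable reaches the predetermined barrier level.
        while (arrived_.load(std::memory_order_acquire) < target_) {
            std::this_thread::yield();
        }
    }

private:
    const int target_;
    std::atomic<int> arrived_{0};
};

// Hypothetical demo: each thread writes its own slot before the barrier and
// reads every slot after it. Returns how many threads saw all the writes.
int run_barrier_demo(int num_threads) {
    CountingBarrier barrier(num_threads);
    std::vector<int> slots(num_threads, 0);
    std::atomic<int> threads_saw_all{0};

    std::vector<std::thread> workers;
    for (int id = 0; id < num_threads; ++id) {
        workers.emplace_back([&, id] {
            slots[id] = 1;              // work done before the barrier
            barrier.arrive_and_wait();  // synchronize with all other threads
            int sum = 0;
            for (int s : slots) sum += s;
            if (sum == num_threads) threads_saw_all.fetch_add(1);
        });
    }
    for (auto& w : workers) w.join();
    return threads_saw_all.load();
}
```

Every waiting thread repeatedly reads the same counter that every arriving thread updates, so the cache line holding it is passed between processors on each arrival.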
A semaphore spin-lock mechanism is used to guarantee mutual exclusion across multiple threads while accessing a shared memory variable or structure (i.e., a shared element). In order to provide a unique and consistent view of the shared element, it is guarded by a lock variable. Every thread needing access to the shared element must acquire the guarding lock (i.e., locking) via an atomic semaphore operation. When a lock is acquired, the remaining threads can only acquire the lock after it is released (i.e., unlocking) by the original requester. By software convention, only the thread that acquired the lock performs operations or updates on the shared element, so mutual exclusion is ensured. Locking is performed by designating a particular value to represent a locked state, and a different value to represent an unlocked state. Each thread seeking to access the shared element acquires the lock by updating the lock variable atomically to the lock value (after possibly checking that the lock has not already been acquired).
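The locking convention described above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions: the names SpinLock, kLocked, kUnlocked, and run_spinlock_demo are hypothetical, and the atomic exchange stands in for whatever atomic semaphore instruction an ISA provides.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Values designated (by assumption) for the two lock states.
constexpr int kUnlocked = 0;
constexpr int kLocked = 1;

// Hypothetical spin-lock guarding a shared element.
class SpinLock {
public:
    void lock() {
        for (;;) {
            // Optional check: wait on plain loads while the lock is held.
            while (state_.load(std::memory_order_relaxed) == kLocked) {
                std::this_thread::yield();
            }
            // Atomically write the locked value. The lock is ours only if
            // the previous value was the unlocked value.
            if (state_.exchange(kLocked, std::memory_order_acquire) == kUnlocked) {
                return;
            }
        }
    }

    void unlock() {
        // Release the lock so another thread can acquire it.
        state_.store(kUnlocked, std::memory_order_release);
    }

private:
    std::atomic<int> state_{kUnlocked};
};

// Hypothetical demo: several threads update one shared counter, each update
// made only while holding the lock. Returns the final counter value.
long run_spinlock_demo(int num_threads, int increments_per_thread) {
    SpinLock lock;
    long shared_counter = 0;  // the shared element guarded by the lock

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++shared_counter;  // only the lock holder touches the element
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return shared_counter;
}
```

Because every update happens between lock() and unlock(), no increments are lost, and a call such as run_spinlock_demo(4, 1000) returns 4000.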
Most ISAs provide specific semaphore instructions to achieve MP synchronization between multiple threads or processors. Among these, an atomic-add is a popular instruction for a barrier synchronization mechanism. However, known barrier synchronization methods and semaphore spin-locks cause inefficiencies. Barrier mechanisms typically require significant traffic, such as inter-processor cache traffic, as the barrier variable moves to different cores of the multiprocessor. Similarly, spin-lock mechanisms require significant traffic between different processor cores. Still further, an atomic-add instruction requires that the shared variable be brought deep into processor cores to perform the add operation, again requiring significant traffic, as well as utilizing processor resources. Accordingly, a need exists for improved manners of synchronization between multiple threads.