The present invention relates to a computer architecture and in particular to a “multicore” architecture providing improved energy efficiency when multiple processor cores are synchronized in the execution of parallel programs sharing common data.
Current microprocessor architectures provide multiple processor cores each providing independent processor functions. Each of the processor cores may execute a different “thread” of a multithreaded program so that the threads may be executed in parallel for improved speed of execution. For example, a computer server for the Internet may use a multicore processor running a multithreaded server program where each separate client-server transaction runs as a separate parallel thread.
During parallel operation, the program threads may need to modify common data shared among the threads. For example, in the implementation of a transaction-based airline reservation system, multiple threads handling reservations for different customers may read and write common data indicating the number of seats available. If the threads are not coordinated in their use of the common data, serious errors can occur. For example, a first thread may need to read a variable indicating that an airline seat is available and then set that variable to indicate that the seat has been reserved by the thread's client. If a second thread reads the same variable before the first thread has set it, the second thread may, based on that read, erroneously set the variable again, with the result that the seat is double booked.
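The double-booking scenario can be illustrated with a minimal C++ sketch. It replays the problematic ordering of reads and writes deterministically on a single thread rather than relying on real thread timing. The names `Seat`, `read_available`, `reserve_if`, and `double_booking_interleaving` are hypothetical and are introduced only for illustration.

```cpp
#include <cassert>

// Hypothetical seat record; all names here are illustrative assumptions.
struct Seat {
    bool available = true;
    int bookings = 0;  // number of clients that believe they hold the seat
};

// Step 1 of a reservation: read the shared variable.
bool read_available(const Seat& seat) { return seat.available; }

// Step 2 of a reservation: act on the earlier read and mark the seat reserved.
void reserve_if(bool saw_available, Seat& seat) {
    if (saw_available) {
        seat.available = false;
        ++seat.bookings;
    }
}

// Replays the uncoordinated ordering: both "threads" read the variable
// before either of them writes it.
int double_booking_interleaving() {
    Seat seat;
    bool first_saw = read_available(seat);   // first thread reads
    bool second_saw = read_available(seat);  // second thread reads too early
    assert(first_saw && second_saw);         // both reads report "available"
    reserve_if(first_saw, seat);             // first thread reserves
    reserve_if(second_saw, seat);            // second thread acts on a stale read
    return seat.bookings;                    // two bookings for one seat
}
```

If each thread's read and write were instead performed without any intervening access by the other thread, the second read would report the seat as unavailable and only one booking would be recorded.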
To avoid these problems, it is common to use synchronizing instructions (“synchronizing primitives”), for example a lock variable, to synchronize threads during portions of their operation (often called critical sections) in which simultaneous execution by more than one thread would be a problem. A lock variable has one value indicating that the lock is owned by a thread and another value indicating that the lock is available. A thread must acquire the lock before executing the critical section and does so by reading the lock variable and, if the lock is not held, writing a value to the lock variable indicating that the lock is now held. When the critical section is complete, the thread again writes to the lock variable, this time a value indicating that the lock is available.
Typically, the instructions used to acquire the lock are “atomic instructions”, that is, instructions that, once begun, cannot be interrupted by any other thread, or quasi-atomic instructions that can be interrupted by another thread but that make such interruption evident to the interrupted thread so that the instructions can be repeated.
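The acquire and release steps described above can be sketched in C++ as follows. This is a minimal illustration, not the mechanism of any particular processor: it assumes the standard `std::atomic<bool>::exchange` operation as the atomic read-and-write, and the names `SpinLock` and `run_counter_demo` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical lock variable: false means available, true means held.
class SpinLock {
public:
    void acquire() {
        // exchange() atomically writes "held" and returns the previous value.
        // A previous value of "held" means another thread owns the lock.
        while (held_.exchange(true, std::memory_order_acquire)) {
            // Check the lock variable repeatedly until it appears available,
            // then retry the atomic exchange.
            while (held_.load(std::memory_order_relaxed)) {
            }
        }
    }

    void release() {
        assert(held_.load(std::memory_order_relaxed));  // must be held here
        held_.store(false, std::memory_order_release);  // mark as available
    }

private:
    std::atomic<bool> held_{false};
};

// Several threads increment a shared counter inside a critical section.
int run_counter_demo(int threads, int increments) {
    SpinLock lock;
    int counter = 0;  // common data protected by the lock
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                lock.acquire();
                ++counter;  // critical section
                lock.release();
            }
        });
    }
    for (auto& worker : workers) worker.join();
    return counter;
}
```

Because the read of the old value and the write of the new value happen in a single atomic operation, two threads cannot both observe the lock as available and both take it. With the lock in place, a call such as `run_counter_demo(4, 1000)` would be expected to return 4000.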
When a thread cannot acquire the lock, the processor executing the thread may “spin”, ceasing forward progress in execution of the thread while it waits for availability of the lock. This waiting process entails periodic checking of the lock variable, which is typically stored in memory. Other synchronization primitives such as barriers and conditions also employ this spinning technique.
Other types of synchronization instructions, for example barriers, may cause the processors to spin at a given execution point in each thread until all processors have reached that execution point. Generally, three types of synchronization primitives (barriers, locks, and conditions) cause one or more processors to spin as they wait for the synchronization process to complete.
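Spinning at a common execution point can be sketched as a simple one-shot barrier. The example below is an illustrative assumption rather than a description of any specific primitive; `SpinBarrier` and `run_barrier_demo` are hypothetical names, and the barrier shown cannot be reused after all threads have passed it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical one-shot barrier: each thread announces its arrival and then
// spins until every participant has arrived.
class SpinBarrier {
public:
    explicit SpinBarrier(int participants) : remaining_(participants) {
        assert(participants > 0);
    }

    void arrive_and_wait() {
        remaining_.fetch_sub(1, std::memory_order_acq_rel);  // announce arrival
        while (remaining_.load(std::memory_order_acquire) > 0) {
            // Spin: no useful work is done while waiting for the others.
        }
    }

private:
    std::atomic<int> remaining_;
};

// Returns how many threads, after passing the barrier, observed that every
// thread had already reached it.
int run_barrier_demo(int threads) {
    SpinBarrier barrier(threads);
    std::atomic<int> arrived{0};
    std::atomic<int> saw_all{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            arrived.fetch_add(1);       // work done before the barrier
            barrier.arrive_and_wait();  // spin until all threads arrive
            if (arrived.load() == threads) saw_all.fetch_add(1);
        });
    }
    for (auto& worker : workers) worker.join();
    return saw_all.load();
}
```

Every thread that leaves `arrive_and_wait` should see that all threads have arrived, so `run_barrier_demo(4)` would be expected to return 4.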
While the processor is spinning, it does not perform useful work but consumes substantial power. Generally, the processor cannot be powered down while spinning, both because its architectural state must be preserved so that it may later resume operation and because it must continue to operate in order to check periodically whether the lock has become available.
One method to avoid wasting processor resources is to suspend the given thread and switch the associated processor to begin executing a different thread, for example, using a futex technique. In order to do this, the operating system or another mechanism must be provided to switch the processor context and wake the sleeping thread when the lock is available. Such an approach requires multiple operating system “calls” that each consume many processor cycles.
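The suspend-and-wake alternative can be sketched with standard C++ facilities. This is a sketch under stated assumptions: it uses `std::mutex` and `std::condition_variable` in place of a raw futex call, and whether those facilities are built on a futex depends on the platform and library. `BlockingLock` and `run_blocking_demo` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical lock whose waiters are suspended instead of spinning.
class BlockingLock {
public:
    void acquire() {
        std::unique_lock<std::mutex> guard(state_mutex_);
        // wait() suspends the calling thread until it is notified and the
        // lock is observed to be available.
        available_.wait(guard, [this] { return !held_; });
        held_ = true;
    }

    void release() {
        {
            std::lock_guard<std::mutex> guard(state_mutex_);
            assert(held_);  // only a holder should release
            held_ = false;
        }
        available_.notify_one();  // wake one sleeping waiter, if any
    }

private:
    std::mutex state_mutex_;             // protects held_
    std::condition_variable available_;  // signaled when the lock is freed
    bool held_ = false;
};

// Several threads increment a shared counter under the blocking lock.
int run_blocking_demo(int threads, int increments) {
    BlockingLock lock;
    int counter = 0;  // common data protected by the lock
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                lock.acquire();
                ++counter;  // critical section
                lock.release();
            }
        });
    }
    for (auto& worker : workers) worker.join();
    return counter;
}
```

A waiting thread here frees its processor for other work, but each suspension and wake-up may pass through the operating system, which is the overhead noted above.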