1. Field of the Invention
The present invention relates to techniques for improving the performance of computer systems. More specifically, the present invention relates to a method and an apparatus to facilitate early release of memory locations during transactional program execution.
2. Related Art
Computer system designers are presently developing mechanisms to support multi-threading within the latest generation of Chip-Multiprocessors (CMPs) as well as more traditional Shared Memory Multiprocessors (SMPs). With proper hardware support, multi-threading can dramatically increase the performance of numerous applications. However, as microprocessor performance continues to increase, the time spent synchronizing between threads (processes) is becoming a large fraction of overall execution time. In fact, as multi-threaded applications begin to use even more threads, this synchronization overhead becomes the dominant factor in limiting application performance.
From a programmer's perspective, synchronization is generally accomplished through the use of locks. A lock is typically acquired before a thread enters a critical section of code, and is released after the thread exits the critical section. If another thread wants to enter a critical section protected by the same lock, it must acquire the same lock. If it is unable to acquire the lock because a preceding thread has grabbed the lock, the thread must wait until the preceding thread releases the lock. (Note that a lock can be implemented in a number of ways, such as through atomic operations or semaphores.)
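By way of illustration only, the acquire and release pattern described above can be sketched as follows. Python's standard `threading.Lock` stands in for whatever lock primitive a given system provides; the names `counter`, `worker`, and `run` are hypothetical and do not appear in the original disclosure.

```python
import threading

counter = 0                        # shared state protected by the lock
counter_lock = threading.Lock()    # one lock guarding the critical section


def worker(iterations):
    """Increment the shared counter, entering the critical section each time."""
    global counter
    for _ in range(iterations):
        # Acquire before entering the critical section; the `with` block
        # releases the lock on exit, even if an exception is raised.
        with counter_lock:
            counter += 1


def run(num_threads=4, iterations=10000):
    """Start the workers, wait for them, and return the final counter value."""
    global counter
    counter = 0
    threads = [
        threading.Thread(target=worker, args=(iterations,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

A thread that reaches the `with` statement while another thread holds the lock waits there until the lock is released, which is the waiting behavior described above.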
Unfortunately, the process of acquiring a lock and the process of releasing a lock are very time-consuming in modern microprocessors. They involve atomic operations, which typically flush the load buffer and store buffer, and can consequently require hundreds, if not thousands, of processor cycles to complete.
Moreover, as multi-threaded applications use more threads, more locks are required. For example, if multiple threads need to access a shared data structure, it is impractical for performance reasons to use a single lock for the entire data structure. Instead, it is preferable to use multiple fine-grained locks to lock small portions of the data structure. This allows multiple threads to operate on different portions of the data structure in parallel. However, it also requires a single thread to acquire and release multiple locks in order to access different portions of the data structure. It also introduces significant software engineering concerns, such as avoiding deadlock.
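The trade-off described above can be sketched, under stated assumptions, with a table that uses one lock per portion ("stripe") of the data structure. The class `StripedCounterTable` and its methods are hypothetical illustrations. The `transfer` method shows a single thread acquiring two locks, and it acquires them in a fixed index order, which is one common way to avoid deadlock.

```python
import threading


class StripedCounterTable:
    """Hypothetical table of counters with one lock per stripe."""

    def __init__(self, num_stripes=8):
        self._locks = [threading.Lock() for _ in range(num_stripes)]
        self._buckets = [dict() for _ in range(num_stripes)]

    def _stripe(self, key):
        # Map a key to the stripe (and lock) that protects it.
        return hash(key) % len(self._locks)

    def increment(self, key):
        # Only one stripe is locked, so threads working on other stripes
        # can proceed in parallel.
        i = self._stripe(key)
        with self._locks[i]:
            self._buckets[i][key] = self._buckets[i].get(key, 0) + 1

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._buckets[i].get(key, 0)

    def transfer(self, src, dst):
        """Move one unit from src to dst, holding every stripe involved."""
        i, j = self._stripe(src), self._stripe(dst)
        needed = sorted({i, j})      # fixed global order avoids deadlock
        for k in needed:
            self._locks[k].acquire()
        try:
            self._buckets[i][src] = self._buckets[i].get(src, 0) - 1
            self._buckets[j][dst] = self._buckets[j].get(dst, 0) + 1
        finally:
            for k in reversed(needed):
                self._locks[k].release()
```

If two threads instead acquired the same pair of locks in opposite orders, each could end up holding one lock while waiting for the other, which is the deadlock concern noted above.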
In some cases, locks are used when they are not required. For example, many applications call “thread-safe” library routines, which use locks to ensure correct operation in multi-threaded applications. Unfortunately, the overhead involved in acquiring and releasing these locks is still incurred, even when the thread-safe library routines are called by a single-threaded application.
Applications typically use locks to ensure mutual exclusion within critical sections of code. However, in many cases threads will not interfere with each other, even if they are allowed to execute a critical section simultaneously. In these cases, mutual exclusion is used to prevent the unlikely case in which threads actually interfere with each other. Consequently, in these cases, the overhead involved in acquiring and releasing locks is largely wasted.
Hence, what is needed is a method and an apparatus that reduces the overhead involved in manipulating locks when accessing critical sections.
One technique to reduce the overhead involved in manipulating locks is to “transactionally” execute a critical section, wherein changes made during the transactional execution are not committed to the architectural state of the processor until the transactional execution completes without encountering an interfering data access from another thread. This technique is described in related U.S. patent application Ser. No. 10/637,168, entitled, “Selectively Monitoring Loads to Support Transactional Program Execution,” by inventors Marc Tremblay, Quinn A. Jacobson and Shailender Chaudhry, filed on 8 Aug. 2003.
Load and store operations are modified so that, during transactional execution, they mark cache lines that are accessed during the transactional execution. This allows the computer system to determine if an interfering data access occurs during the transactional execution. If so, the transactional execution fails, and results of the transactional execution are not committed to the architectural state of the processor. On the other hand, if the transactional execution is successful in executing a sequence of instructions, results of the transactional execution are committed to the architectural state of the processor. Note that committing changes can involve, for example, committing store buffer entries to the memory system by ungating the store buffer.
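The mechanism described above can be modeled in software as a rough sketch. This is a toy model and not the hardware design of the related application: the class `ToyTransaction`, the 64-byte line size, the dictionary used as memory, and the exact rules for which remote accesses count as interfering are all assumptions made for illustration.

```python
class ToyTransaction:
    """Toy software model of transactional execution (hypothetical)."""

    def __init__(self, memory, line_size=64):
        self.memory = memory        # dict: address -> value (committed state)
        self.line_size = line_size  # assumed cache line size in bytes
        self.load_marked = set()    # cache lines read by the transaction
        self.store_marked = set()   # cache lines written by the transaction
        self.store_buffer = {}      # gated stores, invisible until commit
        self.failed = False

    def _line(self, addr):
        return addr // self.line_size

    def load(self, addr):
        # Mark the line, then read our own buffered store if there is one.
        self.load_marked.add(self._line(addr))
        if addr in self.store_buffer:
            return self.store_buffer[addr]
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        # Mark the line and hold the value in the gated store buffer.
        self.store_marked.add(self._line(addr))
        self.store_buffer[addr] = value

    def observe_remote_store(self, addr):
        # Assumption: a remote store interferes with any marked line.
        line = self._line(addr)
        if line in self.load_marked or line in self.store_marked:
            self.failed = True

    def observe_remote_load(self, addr):
        # Assumption: a remote load interferes only with store-marked lines.
        if self._line(addr) in self.store_marked:
            self.failed = True

    def commit(self):
        """Ungate the store buffer on success; discard it on failure."""
        committed = not self.failed
        if committed:
            self.memory.update(self.store_buffer)
        self.store_buffer.clear()
        self.load_marked.clear()
        self.store_marked.clear()
        return committed
```

In this model, a transaction that loads address 0 and then observes a remote store to address 8 fails, because both addresses fall on the same marked line, and none of its buffered stores reach `memory`.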
Unfortunately, existing designs for systems that support transactional execution require the hardware to maintain state information about every memory location accessed by the transaction until the transaction completes. Because the hardware resources needed to maintain such state are necessarily bounded, existing designs are not able to accommodate larger transactions that can potentially access a large number of memory locations. For example, a non-blocking implementation of a dynamically sized data structure (such as a linked list) can potentially need to access a large number of memory locations during a single atomic transaction (for example, to scan down the linked list). Hence, what is needed is a method and an apparatus that reduces the amount of state information that the system needs to keep track of during transactional program execution.
Problems can also arise while marking cache lines. For example, if a large number of cache lines are loaded, a particular cache set is more likely to overflow. Furthermore, marked cache lines cannot easily be moved out of the cache until the transactional execution completes, which also causes performance problems.
Hence, what is needed is a method and an apparatus that reduces the number of cache lines that need to be marked during transactional program execution.
One technique for solving this problem uses variations of a load instruction, which cause a memory location to be loaded with an explicit time-to-live value. A release instruction (effectively) decrements every such memory location's time-to-live value, and those locations whose time-to-live value becomes zero are released from the transaction's read set. This technique is described in the related U.S. patent application entitled “Selectively Unmarking Load-Marked Cache Lines During Transactional Program Execution,” having Ser. No. 10/764,412, and filing date 23 Jan. 2004, now U.S. Pat. No. 7,089,374, issued 8 Aug. 2006. The principal limitation of this interface is that it fits poorly with accepted software engineering practice.
Suppose procedure P loads several locations with a load instruction specifying a time-to-live value, and then calls procedure Q. If Q loads an address with a time-to-live value of one, and then uses the release instruction to release that address, the release instruction will have the unintended side-effect of indiscriminately decrementing the time-to-live value of every address previously loaded by P, which may be (and should be) unaware that Q used the release instruction. The system described in the related application has the further disadvantage that it uses “release counters,” which limit the total number of releases that can be performed by a successful transaction; the need to know and reason about this is a significant problem for programmers.
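The side effect described above can be reproduced with a small model. The class `TtlReadSet`, the procedures `procedure_p` and `procedure_q`, and the address names are hypothetical; the model only assumes, as stated above, that a release decrements the time-to-live value of every location loaded with one and drops those that reach zero.

```python
class TtlReadSet:
    """Toy model of a read set whose entries carry time-to-live values."""

    def __init__(self):
        self.ttl = {}  # address -> remaining time-to-live

    def load_with_ttl(self, addr, ttl):
        # Load a location and record its explicit time-to-live value.
        self.ttl[addr] = ttl

    def release(self):
        # Decrement every entry, not just the caller's own, and drop
        # the entries whose time-to-live reaches zero.
        for addr in list(self.ttl):
            self.ttl[addr] -= 1
            if self.ttl[addr] == 0:
                del self.ttl[addr]


def procedure_q(read_set):
    # Q only intends to load and then release its own temporary address.
    read_set.load_with_ttl("q_tmp", 1)
    read_set.release()


def procedure_p(read_set):
    # P loads two addresses and expects both to remain in the read set.
    read_set.load_with_ttl("p_head", 1)
    read_set.load_with_ttl("p_next", 2)
    procedure_q(read_set)
    return dict(read_set.ttl)
```

Calling `procedure_p(TtlReadSet())` returns `{"p_next": 1}`: the release issued inside Q has dropped `p_head` from the read set and shortened the lifetime of `p_next`, even though P never issued a release itself.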
Moreover, the techniques described above do not support modular software design in that they do not permit transactions to be nested: if P starts a transaction and calls Q, and then Q itself starts and commits a transaction, the effects of Q's transaction should become part of P's transaction in a seamless way. The techniques described above do not address how nested transactions can be supported and, in particular, how nested transactions would interact with early release functionality.
Hence, what is needed is a method and an apparatus to facilitate early release of transactional memory without the above-described problems.