1. Field of the Invention
The present invention relates to techniques for improving the performance of computer systems. More specifically, the present invention relates to a method for controlling contention between conflicting transactions in a transactional memory system.
2. Related Art
Computer system designers are presently developing mechanisms to support multi-threading within the latest generation of Chip-Multiprocessors (CMPs) as well as more traditional Shared Memory Multiprocessors (SMPs). With proper hardware support, multi-threading can dramatically increase the performance of numerous applications. However, the requirements of concurrent software can diminish these gains in performance. In concurrent software, it is important to guarantee that one thread cannot observe partial results of an operation being executed by another thread. These guarantees are necessary for practical and productive software development because without them, it is extremely difficult to reason about the interactions of concurrent threads.
One method to provide these guarantees is to use locks to prevent other threads from accessing the data affected by an ongoing operation. Such use of locks gives rise to well-known problems, both in terms of software engineering and in terms of performance. First, the right balance of locking must be achieved so that correctness is maintained, while ensuring that a given lock does not protect an unnecessarily large amount of unrelated data, which can cause other threads to wait when they do not have to. Furthermore, if proper care is not taken, the use of locks can result in deadlock, causing software to freeze up. While well understood, these problems are pervasive in concurrent programming, and addressing them often results in code that is more complicated and expensive.
No matter how carefully used, locks always have the problem that if a thread is delayed while holding a lock, then other threads must wait for at least the duration of that delay before being able to acquire that lock. In general, operating systems and other runtime environments cannot avoid this problem because they cannot accurately predict how long a particular lock will be held, and they cannot revoke the lock without jeopardizing correctness.
Another method to provide these guarantees is the obstruction-free algorithm approach, which provides weak progress guarantees algorithmically, along with hooks to a software-controlled contention-management system. Unfortunately, it is considerably more difficult to implement an operation in an obstruction-free manner without transactional memory than with it.
Another method is to use transactional memory. Transactional memory allows the programmer to think as if multiple memory locations can be accessed and/or modified in a single atomic step. Thus, in many cases, it is possible to complete an operation with no possibility of another thread observing partial results, even without holding any locks. This significantly simplifies the design of concurrent programs.
Transactional memory can be implemented in hardware with the hardware directly ensuring that a transaction is atomic. It can also be implemented in software that provides the illusion that the transaction is atomic, even though in fact it is executed in smaller atomic steps by the underlying hardware. A hybrid approach can be used for taking advantage of the best features of hardware and software transactional memory. This hybrid approach first attempts a transaction in hardware, and if it fails some number of times, it retries the transaction in software. Note that transactions executed in software are significantly slower than those executed in hardware, so it is preferable that transactions succeed in hardware as often as possible.
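The hybrid retry policy described above can be sketched as follows. This is an illustrative model only; the names `hw_attempt`, `sw_commit`, and `MAX_HW_RETRIES` are hypothetical and not taken from any particular design.

```python
# Hypothetical sketch of a hybrid transactional memory retry policy: attempt
# the transaction in hardware up to a fixed number of times, then fall back
# to the (slower) software transactional memory path.

MAX_HW_RETRIES = 3  # illustrative threshold

def run_transaction(hw_attempt, sw_commit):
    """Try the fast hardware path first; fall back to software on repeated failure."""
    for _ in range(MAX_HW_RETRIES):
        if hw_attempt():          # returns True if the hardware transaction committed
            return "hardware"
    sw_commit()                   # software path, assumed to eventually succeed
    return "software"
```

Because software transactions are significantly slower, the threshold is typically chosen so that transiently contended transactions still commit in hardware.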
Most designs for hardware transactional memory and related mechanisms use the same basic approach, building on the system's caches and cache coherence protocols to effect atomic transactions. While executing a transaction, when the processor acquires either shared or exclusive ownership of a cache line in order to access a variable stored in that cache line, the cache line is marked as transactional. Stores executed during the transaction are recorded in a store buffer, and do not become visible to other processors before the transaction successfully commits. If the processor retains ownership of all transactional cache lines until the transaction completes, the processor can then atomically commit the transaction by transferring values from the store buffer into the corresponding cache lines before relinquishing ownership of any of the transactional cache lines. If the processor loses ownership of a transactional cache line before the transaction completes, however, the transaction is aborted, and must be retried.
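The commit/abort behavior described above can be modeled in a few lines. This is a toy model under simplifying assumptions (one processor, a flat address-to-value memory, no coherence protocol); all names are illustrative.

```python
# Toy model of the cache-based approach described above: stores go into a
# private store buffer, invisible to other processors; commit atomically
# applies the buffer to memory only while every transactional line is still
# owned; losing ownership of a transactional line aborts the transaction.

class TxnProcessor:
    def __init__(self, memory):
        self.memory = memory          # shared memory, modeled as {address: value}
        self.store_buffer = {}        # private until commit
        self.transactional = set()    # cache lines touched by the transaction
        self.owned = set()            # lines this processor currently owns

    def load(self, addr):
        self.owned.add(addr)
        self.transactional.add(addr)  # mark the line as transactional
        return self.store_buffer.get(addr, self.memory[addr])

    def store(self, addr, value):
        self.owned.add(addr)
        self.transactional.add(addr)
        self.store_buffer[addr] = value   # not yet visible to other processors

    def lose_ownership(self, addr):
        self.owned.discard(addr)
        if addr in self.transactional:    # conflict: the transaction must abort
            self.store_buffer.clear()
            self.transactional.clear()
            return "aborted"
        return "ok"

    def commit(self):
        # Only possible while all transactional lines are still owned.
        assert self.transactional <= self.owned
        self.memory.update(self.store_buffer)  # values become visible atomically
        self.store_buffer.clear()
        self.transactional.clear()
        return "committed"
```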
Unfortunately, when a processor that implements transactional memory receives a request for a cache line that it owns, it immediately relinquishes the cache line, and aborts the current transaction if the cache line is marked as transactional. Note that if the transaction is then retried, it is likely to request the same cache lines it accessed previously. This behavior can cause a “ping pong” effect, in which two or more transactions repeatedly cause each other to abort, leading to livelock.
This livelock problem can be addressed by various contention control techniques. One such technique is to “backoff” by making a thread wait for some time before retrying an aborted transaction, in the hope that other threads can make progress in the meantime. If a thread's transaction fails repeatedly, it can increase its waiting time between each retry. This way, the hope is that eventually contention is reduced sufficiently for transactions to commit.
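The backoff technique described above can be sketched as an exponential-backoff retry loop. The constants and function names below are illustrative; the delay is randomized so that two threads are unlikely to retry in lockstep and abort each other again.

```python
import random

# Illustrative exponential-backoff retry loop for an aborted transaction.
# Each failed attempt doubles the maximum wait, up to a cap, in the hope
# that contention eventually subsides enough for the transaction to commit.

BASE_DELAY = 1      # arbitrary time units (hypothetical tuning parameter)
MAX_DELAY = 64      # cap on the backoff window

def retry_with_backoff(attempt, sleep):
    """Retry attempt() until it commits, waiting a randomized, growing delay."""
    delay = BASE_DELAY
    while not attempt():                    # attempt() returns True on commit
        sleep(random.uniform(0, delay))     # randomized wait before retrying
        delay = min(delay * 2, MAX_DELAY)   # exponential increase, capped
```

The need to tune `BASE_DELAY` and `MAX_DELAY` for a given workload is precisely the parameter-tuning drawback noted below.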
While backoff can be effective in some cases, it has a number of drawbacks, including the need to tune backoff parameters, and being ineffective when faced with a mix of short and long transactions. More general contention control mechanisms can be employed, in which threads share information about their transactions in order to better control which threads back off, when, and for how long. Nevertheless, the degree of contention control possible with most previous transactional memory designs is limited. A contended-with transaction is always aborted, so the only possibility for managing contention is to attempt to avoid it altogether by delaying entire transactions.
Transactional Lock Removal (TLR) is another technique that aims to address the issue of deciding what to do when there is contention for a cache line. TLR allows critical sections protected by locks to be executed atomically without acquiring the lock, thereby allowing multiple threads to execute critical sections protected by the same lock concurrently when they do not conflict. Note that TLR is not transactional memory because it does not provide an interface for transactional programming. Nevertheless, the underlying mechanisms used by TLR to ensure atomicity are similar to those used in transactional memory designs.
When a processor executing a transaction using TLR receives a request for a cache line that is marked as transactional, it does not necessarily abort the current transaction immediately. Instead, TLR can effectively queue such requests, and continue to execute the current transaction, hopefully to completion. Notice that such a scheme must deal with multiple outstanding requests, and provisioning resources in each processor for the maximum possible number of requests may be excessive. One way to deal with this problem is to distribute the queue among the participating processors. This is achieved by exploiting a property of some cache coherence protocols in which ownership of a cache line and possession of that cache line do not necessarily coincide. When a processor receives a request for a cache line that it owns, it can grant ownership to the requester immediately, but delay sending the data for the cache line until the current transaction has completed. Because the requester has become the owner, subsequent requests for the same cache line will be routed to the new owner, which can itself grant ownership to a subsequent requester, even before receiving the data. In this way, a queue of requests is distributed amongst the participating processors, with each having to remember (for each transactional cache line) only the identity of the processor to which it granted ownership. The decision between immediately responding to a request for a transactional cache line, thus causing the current transaction to fail, and deferring the request in the hope that the current transaction can succeed, is made based on a timestamp-based priority scheme. This scheme can result in deadlock, so a special mechanism for dealing with this problem is required.
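The distributed-queue behavior described above can be illustrated with a toy model. This sketch assumes a single contended cache line and omits the timestamp-based priority and deadlock-avoidance machinery; the class and method names are hypothetical.

```python
# Toy model of TLR's distributed request queue: an owner grants ownership
# immediately but defers the data until its transaction completes, so each
# processor need remember only the single successor to which it granted
# ownership. Chaining these grants forms a queue distributed across processors.

class TLRProcessor:
    def __init__(self, name):
        self.name = name
        self.next_owner = None     # the one deferred requester this processor remembers
        self.has_data = False      # whether the line's data has arrived

    def handle_request(self, requester):
        # Grant ownership at once; defer sending the data until commit.
        self.next_owner = requester
        return "ownership granted, data deferred"

    def finish_transaction(self):
        if self.next_owner is not None:
            self.next_owner.receive_data()   # data finally flows to the successor

    def receive_data(self):
        self.has_data = True

def queue_chain(processors):
    """Each processor forwards ownership to its successor, forming the queue."""
    for cur, nxt in zip(processors, processors[1:]):
        cur.handle_request(nxt)
    return [p.next_owner.name if p.next_owner else None for p in processors]
```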
The TLR technique has several drawbacks. First, it is applicable only in conjunction with cache coherence protocols in which ownership of a cache line and possession of its data can be held separately. Second, it hardcodes a single strategy for dealing with contention, which may be effective for some workloads but ineffective for others. Finally, the approach devotes significant hardware resources and complexity to attempting to ensure lock-free execution. However, it does not in general achieve this goal, because some transactions will exceed the resources of the cache, requiring the scheme to fall back on the original approach of acquiring the lock.
The disadvantages of the TLR approach are partly due to the fact that TLR is not a transactional memory implementation. It is intended to improve the performance of existing lock-based code. Therefore, there is no opportunity for software to exploit knowledge of the particular application, or to adapt to current load and conditions.
Yet another technique is transient blocking synchronization (TBS), which proposes a middle ground between non-blocking and blocking approaches to synchronization. In TBS, one thread can block another, but only for a fixed amount of time, thus avoiding the long delays associated with standard blocking techniques. In TBS, a processor can have exclusive control of a resource (for example, a memory location) for an amount of time it predicts will be sufficient to complete its operation. The processor holds a lease on the location for this period. In contrast to simple blocking approaches, if the lease expires without completion, the resource can be revoked without violating correctness. However, the performance of TBS-based proposals suffers because cache-based TBS implementations suitable for constructing transactional memory have not yet been proposed.
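The lease mechanism described above can be sketched as follows. The clock is an injected counter so the example is deterministic; all names are illustrative and not drawn from any published TBS proposal.

```python
# Sketch of a transient-blocking lease: exclusive access is granted only for
# a fixed duration, after which the resource may be taken by another thread
# without violating correctness.

class Lease:
    def __init__(self, clock):
        self.clock = clock        # callable returning the current time
        self.holder = None
        self.expires = 0

    def acquire(self, thread, duration):
        now = self.clock()
        if self.holder is not None and now < self.expires:
            return False          # still leased: the requester must wait or retry
        self.holder = thread      # a free or expired lease is granted (revoking
        self.expires = now + duration  # any expired holder implicitly)
        return True
```

Note how this differs from a simple lock: an expired holder loses the resource automatically, so a delayed thread cannot block others indefinitely.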
Hence, what is needed is a method and an apparatus to facilitate less expensive and more flexible contention control between conflicting transactions in hardware transactional memory for shared-memory multiprocessors without the above-described problems.