This invention relates to operation of computer systems having shared memory, and more particularly to a transactional memory for use in multiprocessor systems.
Synchronizing access to shared data structures is one of the oldest and most difficult problems in designing software for shared-memory multiprocessors. Without careful synchronization, a data structure may be left in an inconsistent state if different processes try to modify it at the same time. Conventional techniques for synchronizing access to shared data structures in shared-memory multiprocessors center around mutual exclusion protocols. Each data structure has an associated lock. Before a process can access a data structure, it must acquire the lock, and as long as it holds that lock, no other process may access the data structure.
Nevertheless, locking is poorly suited for modern shared-memory multiprocessor architectures, for several reasons. First, locking is poorly suited for processes that must modify multiple data objects, particularly if the set of objects to be modified is not known in advance. Care must be taken to avoid deadlocks that arise when processes attempt to acquire the same locks in different orders. Second, if the process holding a lock is descheduled, perhaps by exhausting its scheduling quantum, by a page fault, or by some other kind of interrupt, then other processes capable of running may be unable to progress. Third, locking interacts poorly with priority systems. A lower-priority process may be preempted in favor of a higher-priority process, but if the preempted process is holding a lock, then other, perhaps higher-priority, processes will be unable to progress (this phenomenon is sometimes called "priority inversion"). And fourth, locking can produce "hot-spot" contention. In particular, spin locking techniques, in which processes repeatedly poll a lock until it becomes free, perform poorly because of excessive memory contention.
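The deadlock hazard named in the first point can be illustrated with a minimal Python sketch. The names `Account`, `transfer`, and `run_demo` are hypothetical and appear nowhere in the source; the sketch shows the conventional remedy of acquiring locks in one fixed global order, which is workable only when the set of objects to be locked is known before any lock is taken.

```python
import threading

class Account:
    """Hypothetical shared object guarded by its own lock."""
    _next_uid = 0

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()
        self.uid = Account._next_uid  # fixed rank used to order lock acquisition
        Account._next_uid += 1

def transfer(src, dst, amount):
    """Move `amount` from src to dst while holding both locks."""
    if src is dst:
        return  # nothing to do, and taking the same lock twice would hang
    # Acquire in ascending uid order regardless of argument order, so two
    # concurrent transfers can never wait on each other in a cycle.
    first, second = sorted((src, dst), key=lambda acct: acct.uid)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

def run_demo(rounds):
    """Two threads transfer in opposite directions; returns final balances."""
    a, b = Account(100), Account(100)

    def worker(src, dst):
        for _ in range(rounds):
            transfer(src, dst, 1)

    t1 = threading.Thread(target=worker, args=(a, b))
    t2 = threading.Thread(target=worker, args=(b, a))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.balance, b.balance
```

If `transfer` instead locked `src` first and `dst` second, the two threads in `run_demo` could each hold one lock while waiting for the other. The ordering trick does not address the remaining three problems (descheduling, priority inversion, and hot-spot contention).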
By contrast, a concurrent object implementation is non-blocking if some process is guaranteed to complete an operation after the system as a whole takes a finite number of steps (referred to as atomicity, as will be described). This condition rules out the use of locking, since a process that halts while holding a lock may force other processes trying to acquire that lock to run forever without making progress.
Described herein is a new multiprocessor architecture that permits programmers to construct non-blocking implementations of complex data structures in a simple and efficient way. This architecture is referred to as transactional memory, and consists of two parts: (1) a collection of special machine instructions, and (2) particular techniques for implementing these instructions.
Most of the programming language constructs proposed for concurrent programming in the shared-memory multiprocessor model employ locks, either explicitly or implicitly (for a survey, see Andrews et al, "Concepts and notations for concurrent programming," ACM Computing Surveys, Vol. 15, No. 1, pp. 3-43, March 1983). Early locking algorithms used only load and store operations, as disclosed in Dijkstra, "Co-operating sequential processes," pp. 43-112, Academic Press, New York, 1965, in Knuth, "Additional comments on a problem in concurrent programming control," Communications of the ACM, Vol. 9, No. 5, pp. 321-322, May 1966, in Peterson, "Myths about the mutual exclusion problem," Information Processing Letters, Vol. 12, pp. 115-116, June 1981, and in Lamport, "A new solution of Dijkstra's concurrent programming problem," Communications of the ACM, Vol. 17, No. 8, pp. 453-455, August 1974.
These algorithms using only load and store operations, however, are cumbersome and inefficient, so current practice is to provide support for read-modify-write (RMW) operations directly in hardware. A read-modify-write operation is parameterized by a function f. It atomically (1) reads the value v from a location, (2) computes f(v) and stores it in that location, and (3) returns v to the caller. Common read-modify-write operations include TEST&SET, atomic register-to-memory SWAP (see Graunke et al, "Synchronization algorithms for shared memory multiprocessors," IEEE Computer, Vol. 23, No. 6, pp. 60-70, June 1990), FETCH&ADD (see Gottlieb et al, "The NYU Ultracomputer--designing an MIMD parallel computer," IEEE Trans. on Computers, Vol. C-32, No. 2, pp. 175-189, February 1984), COMPARE&SWAP (see IBM, "System/370 principles of operation," Order No. GA22-7000), and LOAD_LINKED and STORE_CONDITIONAL (see Jensen et al, "A new approach to exclusive data access in shared memory multiprocessors," Technical Report UCRL-97663, Lawrence Livermore National Laboratory, November 1987).
Although these hardware primitives were originally developed to support locking, they can sometimes be used to avoid locking for certain data structures. A systematic analysis of the relative power of different read-modify-write primitives for this purpose is given in Herlihy, "Wait-free synchronization," ACM Trans. on Programming Languages and Systems, Vol. 13, No. 1, pp. 123-149, January 1991. If an architecture provides only read and write operations, then it is provably impossible to construct non-blocking implementations of many simple and familiar data types, such as stacks, queues, lists, etc. Moreover, many of the "classical" synchronization primitives such as TEST&SET, SWAP, and FETCH&ADD are also computationally weak. Nevertheless, there do exist simple universal primitives from which one can construct a non-blocking implementation of any object. Examples of universal primitives include COMPARE&SWAP, LOAD_LINKED and STORE_CONDITIONAL, and others.
Although the universal primitives are powerful enough in theory to support non-blocking implementations of any concurrent data structure, they may perform poorly in practice because they can update only one word of memory at a time. To modify a complex data structure, it is necessary to copy the object, modify the copy, and then use a read-modify-write operation to swing a base pointer from the old version to the new version. Detailed protocols of this kind for COMPARE&SWAP and for LOAD_LINKED and STORE_CONDITIONAL have been published (see Herlihy, "A methodology for implementing highly concurrent data structures," Proc. 2nd ACM SIGPLAN Symp. on Princ. and Practice of Parallel Programming, pp. 197-206, March 1990, and Herlihy, "A methodology for implementing highly concurrent data objects," Tech. Rpt. No. 91/10, Digital Equipment Corporation, Cambridge Research Laboratory, Cambridge, Mass. 02139, October 1991). One exception to the single-word limitation is the Motorola 68030 architecture, which provides a two-word COMPARE&SWAP instruction. Massalin and Pu exploit this primitive to construct an operating system kernel that employs a number of non-blocking data structures.
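The copy, modify, and swing-the-pointer pattern can be sketched as follows. This is an illustrative model rather than any of the published protocols cited above: `Root`, `push`, and `run_pushers` are hypothetical names, the object is a stack held in an immutable tuple, and the lock inside `compare_and_swap` only models the atomicity of a single-word hardware primitive.

```python
import threading

class Root:
    """Base pointer to the current (immutable) version of the object."""

    def __init__(self, version):
        self._version = version
        self._atomic = threading.Lock()  # models single-word CAS atomicity

    def load(self):
        return self._version

    def compare_and_swap(self, expected, new):
        with self._atomic:
            if self._version is expected:  # pointer (identity) comparison
                self._version = new
                return True
            return False

def push(root, item):
    """Non-blocking push: no caller ever holds a lock across the update."""
    while True:
        old = root.load()                    # 1. read the base pointer
        new = old + (item,)                  # 2. copy the object, modify the copy
        if root.compare_and_swap(old, new):  # 3. swing the pointer
            return
        # Another process installed a newer version first; retry against it.

def run_pushers(n_threads, per_thread):
    """Run concurrent pushers and return the final version of the stack."""
    root = Root(())

    def worker(k):
        for i in range(per_thread):
            push(root, (k, i))

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return root.load()
```

Step 2 copies the whole object on every update, so the cost of one `push` grows with the size of the stack. That copying overhead is the practical weakness of single-word primitives that the paragraph above describes.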
In copending application Ser. No. 547,618, filed Jun. 29, 1990, by Sites and Witek, for "Ensuring Data Integrity in Multiprocessor or Pipelined Processor," assigned to Digital Equipment Corporation, a processor of a 64-bit RISC architecture is disclosed. Atomic byte writes are implemented by providing load/locked and store/conditional instructions. To write to a byte address in a quadword-aligned memory location, the processor loads a quadword, performs an internal byte write in the processor's register set, then conditionally stores the updated quadword in memory, depending upon whether the quadword has been written by another processor since the load/locked operation. As with the conditional stores discussed above, this operation is limited to a fixed-size memory reference, and does not provide the full range of functions needed for the transactional memory of the present invention.
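The byte-write sequence described above can be modeled with a toy single-quadword memory. Everything here is an assumption made for illustration: the class `QuadwordMemory`, the per-processor link set, and the choice of byte 0 as the least significant byte are not taken from the cited application, and interleaving is shown by explicit calls rather than real concurrency.

```python
MASK64 = (1 << 64) - 1

class QuadwordMemory:
    """Toy model of one 64-bit location with load/locked and store/conditional."""

    def __init__(self, value=0):
        self.value = value & MASK64
        self._linked = set()  # processors whose load/locked is still valid

    def load_locked(self, cpu):
        self._linked.add(cpu)
        return self.value

    def store_conditional(self, cpu, new):
        if cpu not in self._linked:
            return False              # quadword was written since our load/locked
        self.value = new & MASK64
        self._linked.clear()          # a successful write breaks every link
        return True

def write_byte(mem, cpu, byte_index, byte):
    """Replace one byte of the quadword (index 0 = least significant, assumed)."""
    shift = 8 * byte_index
    while True:
        q = mem.load_locked(cpu)                                # load the quadword
        q = (q & ~(0xFF << shift)) | ((byte & 0xFF) << shift)   # byte write in a register
        if mem.store_conditional(cpu, q):                       # conditional store
            return
        # The store failed because another processor wrote first; retry.
```

For example, if processor 0 performs a `load_locked`, processor 1 then completes a `write_byte`, and processor 0 finally attempts its `store_conditional`, the store is rejected and processor 1's byte is preserved. The sequence still covers exactly one fixed-size quadword, which is the limitation noted in the paragraph above.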
Transactional memory according to this invention improves on these primitives by effectively providing the ability to implement customized read-modify-write operations that affect arbitrary regions of memory. This ability avoids the need for copying imposed by the single-word primitives, since complex atomic updates can now be applied in place. Moreover, since the number of locations affected by a transaction is limited only by processor cache size, transactional memory is strictly more flexible than two-word COMPARE&SWAP.
A second feature of transactional memory is its close integration with cache consistency protocols. There are several ways read-modify-write instructions may interact with the cache. Perhaps the simplest is to bypass the cache entirely, locking the bus for several cycles while making updates directly to memory. (Some of these protocols invalidate any cached copies of that location, while some do not.) As described in the survey article by Glew and Hwu, "A feature taxonomy and survey of synchronization primitive implementations," Technical Report CRHC-91-7, Univ. of Illinois at Urbana-Champaign, 1101 W. Springfield, Urbana, Ill. 61801, December 1990, this approach is taken by the BBN TC2000, by the early version of the Encore Multimax, by the Motorola MC68030, by Pyramid, by the VAX 6200, and others. Locking the bus is clearly unsuitable for transactional memory.
A more sophisticated approach is to cache an exclusive copy of a memory location and to prevent other processors from acquiring that value while the read-modify-write is in progress. This technique is used by the Berkeley Protocol (see Katz et al, "Implementing a cache consistency protocol," Proc. 12th Annual Int'l Symp. on Computer Architecture, pp. 276-286, IEEE, June 1985), by the Sequent Symmetry (see Graunke et al, cited above), by the later version of the Encore Multimax, and by the SPUR processor (the last two as described in Glew and Hwu, cited above). This technique is essentially equivalent to locking. It works well for predefined read-modify-write instructions, where the duration of locking is fixed in advance, but it is not suitable for transactional memory, where there may be an arbitrarily long interval between reading a location and modifying it.
The S1 project (see Jensen et al, cited above) uses a kind of cache invalidation scheme to implement LOAD_LINKED and STORE_CONDITIONAL synchronization primitives. When a process executes a LOAD_LINKED, it effectively caches the variable in exclusive mode. The STORE_CONDITIONAL will succeed only if that entry has not been invalidated. The transactional memory implementation, according to the invention, in contrast, is applied to multiple memory locations, and to a large class of cache consistency protocols.
The notion of a transaction, in the sense of a sequence of operations executed atomically with respect to other transactions, has been used in database technology. Many techniques have been proposed for synchronizing transactions. Although the properties provided by transactional memory have some similarity to the properties provided by database transactions, there are important differences. Most mechanisms for synchronizing database transactions are based on some form of locking (e.g., as described by Eswaran et al, "The notion of consistency and predicate locks in a database system," Communications of the ACM, Vol. 19, No. 11, pp. 624-633, November 1976, by Moss, "Nested transactions: An approach to reliable distributed computing," Technical Report MIT/LCS/TR-260, M.I.T. Laboratory for Computer Science, April 1981, and by Reed, "Implementing atomic actions on decentralized data," ACM Trans. on Computer Systems, Vol. 1, No. 1, pp. 2-23, February 1983). The database mechanisms that most closely resemble transactional memory are optimistic techniques (as described by Kung et al, "On optimistic methods for concurrency control," ACM Trans. on Database Systems, Vol. 6, No. 2, pp. 213-226, June 1981) in which transactions execute without synchronization, but each transaction must be validated before it is allowed to commit. The techniques for validating transactions are entirely different in a transactional memory, however. The database techniques rely on software validation, while in a transactional memory validation is integrated with the multiprocessor's cache consistency protocol.
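For contrast with the hardware approach, the software validation used by optimistic database techniques can be sketched roughly as below. This is a simplified version-counter variant written for illustration, not the algorithm of Kung et al; `VersionedStore` and `Transaction` are hypothetical names. A transaction runs without synchronization, records the version of each item it reads, buffers its writes, and is validated only at commit time.

```python
import threading

class VersionedStore:
    """Shared data: each key maps to a (value, version) pair."""

    def __init__(self):
        self.data = {}
        self.commit_lock = threading.Lock()  # serializes validate + write phases only

    def read(self, key):
        return self.data.get(key, (None, 0))

class Transaction:
    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # key -> version first observed
        self.writes = {}         # buffered updates, invisible until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        value, version = self.store.read(key)
        self.read_versions.setdefault(key, version)
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        """Validate in software, then install the writes. Returns success."""
        with self.store.commit_lock:
            # Validation: every item read must still be at the version seen.
            for key, seen in self.read_versions.items():
                if self.store.read(key)[1] != seen:
                    return False  # conflict detected; caller may retry
            # Write phase: install buffered values and bump their versions.
            for key, value in self.writes.items():
                _, version = self.store.read(key)
                self.store.data[key] = (value, version + 1)
            return True
```

When two transactions read the same item and both try to update it, the first to commit succeeds and the second fails validation. In a transactional memory the corresponding check is performed by the cache consistency protocol rather than by a software loop over a read set.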