The present invention relates to distributed shared memory and, in particular, a method for replicating state to result in reliable distributed shared memory.
Distributed Shared Memory (DSM) has been an active field of research for a number of years. A variety of sophisticated approaches have been developed to allow processes on distinct systems to share a virtual memory address space, but nearly all of this work has been focussed on enabling shared memory based parallel scientific applications to be run on distributed systems. Examples of such scientific applications appear in computational fluid dynamics, biology, weather simulation and galaxy evolution. In studying such parallel systems, the principal focus is on achieving a high degree of performance.
While the domain of parallel scientific applications is important, distributed shared memory can also play a valuable role in the design of distributed applications. The rapid adoption of distributed object frameworks (e.g., CORBA and Java RMI) is leading to an increased number of distributed applications, whose functionality is partitioned into coarse-grained components which communicate through object interfaces. These distributed object frameworks are well suited for locating and invoking distributed functionality, and may transparently provide “failover” capabilities, failover being the capability of a system to detect failure of a component and to transfer operations to another functioning component. Many distributed applications, however, require the ability to share simple state (i.e., data) across distributed components, for which distributed shared memory can play a role.
To illustrate the need for the ability to share simple state across distributed components, consider a typical web-based service framework which allows new services to be readily added to the system. Some components of the framework deal with authenticating the user, establishing a session and presenting the user with a menu of services. The services are implemented as distinct distributed components, as are the various components of the framework itself. In this type of system, a user-session object, encapsulating information about a user's session with the system, would represent simple state that must be available to every component and which is accessed frequently during the handling of each user request. If such a user-session object were only accessible through a remote interface, obtaining information such as the user's “customer-id” would be very expensive. Ideally, the user-session object would be replicated on nodes where it is required, and a session identifier would be used to identify each session.
As soon as data replication is considered, data consistency becomes an issue. There are a number of approaches that can be used for this purpose. As just mentioned, a typical starting point is to store shared objects on a single server and use remote object communication to access various fields. When performance is important, one will typically introduce caching mechanisms to allow local access to certain objects whenever possible. In practice, ad hoc caching and consistency schemes are used for this purpose, individually tailored for each object in question. Given the complexity and ensuing maintainability issues, such steps are not undertaken lightly.
Distributed shared memory (DSM), however, is ideally suited to this problem domain. Using “weak consistency” DSM techniques, state can be very efficiently replicated onto nodes where it is required, with very little additional software complexity. When an object is first accessed on a node, its data pages are brought onto the local processor and subsequent accesses occur at memory access speeds. An underlying DSM layer maintains consistency among the various copies.
Weak consistency refers to the way in which shared memory that is replicated on different nodes is kept consistent. With weak consistency, accesses to synchronization variables are sequentially consistent, no access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere, and no data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed (see M. Dubois, C. Scheurich, and F. Briggs, “Memory Access Buffering in Multiprocessors,” International Symposia on Computer Architecture 1986, pp. 434-442, incorporated herein by reference).
Since existing DSM research has been focussed on parallel scientific computation, there are a number of issues that have not been addressed. First, existing DSM systems typically assume that all the nodes and processes involved in a computation are known in advance, which is not true of most distributed applications. Second, many existing DSM systems are not designed to tolerate failures, either at the node level or the application level, which will almost certainly occur in any long-running distributed application. Third, distributed applications will often have several processes running on a given node, which should be taken into account in the design of the DSM system. Finally, distributed applications have a much greater need for general-purpose memory allocation and reclamation facilities, when one is not dealing with fixed-sized multidimensional arrays allocated during application initialisation (which is typical of scientific applications).
In addition, DSM systems can be augmented to be fault tolerant by ensuring that all data is replicated to a parameterizable degree at all times. Although doing so leads to some level of overhead (on write operations), this cost may be warranted for some types of data and may still provide much better performance than storing data in secondary storage (via a database). By using a fault-tolerant DSM system for all in-memory critical data, a distributed application can easily be made highly available. A highly available system is one that continues to function in the presence of faults. However, unlike in most fault-tolerant systems, failures in a highly available system are not transparent to clients.
Traditional in-memory data-replication schemes include primary site replication and active replica replication. In primary site replication, read and write requests for data are made to a primary (or master) site, which is responsible for ensuring that all replicas are kept consistent. If the primary fails, one of the replicas is chosen as the primary site. In active replica replication, write requests for data are made to all replica sites, using an algorithm that ensures that all writes are performed in the same order on all hosts. Read requests can be made to any replica.
Adapting a DSM system for fault tolerance is quite different from traditional in-memory data replication schemes in that the set of nodes replicating data are the ones that are actively using it. In the best case, where a single node is the principal accessor for an object and where the majority of memory operations are read operations (a read-mostly object), performance using distributed lock leasing algorithms approaches that of a local object. A distributed lock leasing algorithm is an algorithm that allows one node among a set of nodes to acquire a “lock” on a unit of shared memory for some period of time. A lock is an example of a synchronization variable since it synchronizes the modification of a unit of memory, i.e., ensures the unit is only modified by one processor at a time. If a node should fail while holding a lock, the lock is reclaimed and granted to some other node in such a way that all correctly functioning nodes agree as to the state of the lock. Read-mostly objects that are actively shared amongst several nodes may be more costly, as more lock requests will involve remote communications. Most expensive may be highly shared objects that are frequently modified.
Some alternative approaches to introducing fault tolerance have been to make the DSM “recoverable”, that is, allowing the system to be recovered to a previous consistent checkpoint in the event of failure. This approach is very well suited to long running parallel scientific applications, where the loss of a partial computation can be costly. However, in the context of an interactive (e.g., web-based) application, recovering to an outdated previous state provides no benefit. As such, a distributed shared memory system that offers transactional-like guarantees is desirable.
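The lock leasing behaviour described above can be illustrated with a minimal sketch. It is not the algorithm of any particular DSM system: the names `Lease` and `LeaseManager`, the fixed lease duration and the explicit `now` argument are all assumptions made for illustration, and a single in-process table stands in for the agreement that correctly functioning nodes would have to reach in a real distributed setting.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    holder: str        # node currently holding the lock
    expires_at: float  # time after which the lock may be reclaimed


class LeaseManager:
    """Hypothetical, single-process stand-in for a distributed lease manager."""

    def __init__(self, duration: float) -> None:
        self.duration = duration
        self.leases: Dict[str, Lease] = {}

    def acquire(self, lock_id: str, node: str, now: float) -> bool:
        lease = self.leases.get(lock_id)
        # Refuse only if another node holds a lease that has not yet expired.
        if lease is not None and lease.expires_at > now and lease.holder != node:
            return False
        # Grant (or renew) the lease; an expired lease is reclaimed here.
        self.leases[lock_id] = Lease(node, now + self.duration)
        return True

    def release(self, lock_id: str, node: str) -> None:
        lease = self.leases.get(lock_id)
        if lease is not None and lease.holder == node:
            del self.leases[lock_id]

    def holder(self, lock_id: str, now: float) -> Optional[str]:
        lease = self.leases.get(lock_id)
        if lease is None or lease.expires_at <= now:
            return None
        return lease.holder
```

In this sketch a node that fails while holding a lock simply stops renewing it, so the lease expires and the next `acquire` by another node succeeds.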
A weak consistency shared memory model is modified to result in reliable distributed shared memory that ensures that all vital data structures are properly replicated at all times. Whenever a node records changes to a unit of shared memory according to a weak consistency protocol, the node sends to a secondary node vital data structures related to that change.
In accordance with an aspect of the present invention there is provided, at a first node in a distributed shared memory system, the system implemented using the lazy release consistency protocol, a method of replicating state including completing access to a synchronization variable, and, after completing the access, sending a message to a second node. The message includes an indication of a global ordering of access to the synchronization variable, an indication that a page of shared memory has undergone a modification, the page of shared memory including memory referenced by the synchronization variable, and a record of the modification.
In accordance with an aspect of the present invention there is provided, at a first node in a distributed shared memory system, the system implemented using a weak consistency protocol, a method of replicating state including releasing a lock on a unit of shared memory and, after releasing the lock, sending a message to a second node. The message includes a vector timestamp, a write notice indicating that a page of shared memory underwent a modification while the lock was held, and a record of the modification.
In accordance with a further aspect of the present invention there is provided a computer data signal including an indication of a global ordering of access to a synchronization variable, an indication that a page of shared memory has undergone a modification, the page of shared memory including memory referenced by the synchronization variable, and a record of the modification.
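As a rough illustration of the message sent after a lock is released, the following sketch assembles a vector timestamp, write notices and modification records. The names `ReplicationMessage`, `make_diff`, `apply_diff` and `on_release` are hypothetical, and the representation of the record of the modification as a byte-level diff against an unmodified copy of the page (a "twin") is an assumption borrowed from common lazy release consistency implementations, not something stated in the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Diff = List[Tuple[int, int]]  # (offset within page, new byte value)


@dataclass
class ReplicationMessage:
    vector_timestamp: Dict[str, int]  # global ordering of lock accesses
    write_notices: List[int]          # numbers of pages modified under the lock
    diffs: Dict[int, Diff]            # record of the modification, per page


def make_diff(twin: bytes, current: bytes) -> Diff:
    """Record every byte that differs from the pre-modification copy."""
    return [(i, b) for i, (a, b) in enumerate(zip(twin, current)) if a != b]


def apply_diff(page: bytearray, diff: Diff) -> None:
    for offset, value in diff:
        page[offset] = value


def on_release(node: str, clock: Dict[str, int],
               twins: Dict[int, bytes],
               pages: Dict[int, bytearray]) -> ReplicationMessage:
    """Build the message a node would send to its secondary after a release."""
    stamp = dict(clock)
    stamp[node] = stamp.get(node, 0) + 1  # advance this node's own component
    diffs: Dict[int, Diff] = {}
    for page_no, twin in twins.items():
        diff = make_diff(twin, bytes(pages[page_no]))
        if diff:  # only pages that actually changed get a write notice
            diffs[page_no] = diff
    return ReplicationMessage(stamp, sorted(diffs), diffs)
```

A secondary node that receives such a message could bring its copy of a page up to date by calling `apply_diff` with the corresponding entry of `diffs`.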
In accordance with another aspect of the present invention there is provided a method for managing a synchronization variable in a distributed shared memory system, the system implemented using a weak consistency protocol, the method including receiving an access request related to a synchronization variable, where the synchronization variable is for a unit of shared memory, and determining a most recent node to have held the synchronization variable. If the most recent node to have held the synchronization variable has failed, and the failure has occurred subsequent to sending a replication message, the method further includes determining a node possessed of the replication message, the replication message including an indication of a global ordering of access to the synchronization variable, an indication that a page has undergone a modification while the synchronization variable was held, the page of shared memory including memory referenced by the synchronization variable, and a record of the modification. The method also includes forwarding the access request to the node determined to be possessed of the replication message. In a further aspect of the invention a processor, in a node manager, for carrying out this method is provided. In a still further aspect of the invention a software medium permits a general purpose computer to carry out this method.
In accordance with another aspect of the present invention there is provided, at a first node in a group of nodes in a distributed shared memory system implemented using a weak consistency protocol, a method of recovering from a failure of a second node in the group including detecting, via a group membership protocol, the failure in the second node, releasing each currently held synchronization variable, waiting for each currently held synchronization variable to be released or expire, and entering a recovery operation. The recovery operation includes sending, to all nodes in the group, an indication of a global ordering of access to each synchronization variable along with an indication of each page that has undergone a modification while one synchronization variable was held, and a record of the modification, receiving from other nodes in the group a plurality of indications of global ordering of access to each synchronization variable currently held by other nodes, each indication of global ordering sent with an indication of each page that has undergone a modification while one synchronization variable was held, and a record of the modification, and, subsequent to completion of the sending and receiving, applying each received record to a shared memory.
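The forwarding decision in this method can be sketched as a small routing function. The function name `route_request` and the three lookup structures (`last_holder`, `replica_of` and `alive`) are hypothetical; how a manager actually learns which node is possessed of the replication message is not specified here and is simply assumed to be available as a table.

```python
from typing import Dict, Optional, Set


def route_request(lock_id: str,
                  last_holder: Dict[str, str],
                  replica_of: Dict[str, str],
                  alive: Set[str]) -> Optional[str]:
    """Return the node to which an access request should be forwarded.

    last_holder: lock id -> most recent node to have held the lock
    replica_of:  lock id -> node that received the replication message
    alive:       nodes currently believed to be functioning
    """
    holder = last_holder.get(lock_id)
    if holder is None:
        return None        # lock has never been held; nothing to forward to
    if holder in alive:
        return holder      # normal case: ask the most recent holder
    # Most recent holder has failed: fall back to the node that holds the
    # replication message, if it is known and still functioning.
    replica = replica_of.get(lock_id)
    if replica is not None and replica in alive:
        return replica
    return None
```

Returning `None` when neither node is available leaves the choice of further recovery steps to the caller, which is a design choice of this sketch rather than part of the described method.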
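The final step of the recovery operation, applying each received record to shared memory, can be sketched as follows. This assumes, beyond what is stated above, that the indication of global ordering is a vector timestamp and that each record is a list of byte-level changes to one page; `Record`, `recovery_order` and `apply_records` are hypothetical names. Sorting by the sum of the timestamp components is one simple way to obtain an order that respects happened-before: if one timestamp precedes another, each of its components is less than or equal to the other's and at least one is strictly smaller, so its sum is strictly smaller. Records with concurrent timestamps are applied in an arbitrary relative order.

```python
from typing import Dict, List, Tuple

# (vector timestamp, page number, list of (offset, new byte value))
Record = Tuple[Dict[str, int], int, List[Tuple[int, int]]]


def recovery_order(records: List[Record]) -> List[Record]:
    """Order records so that causally earlier modifications come first."""
    return sorted(records, key=lambda r: sum(r[0].values()))


def apply_records(memory: Dict[int, bytearray], records: List[Record]) -> None:
    """Apply every received record to the local copy of shared memory."""
    for _stamp, page_no, diff in recovery_order(records):
        page = memory[page_no]
        for offset, value in diff:
            page[offset] = value
```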
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.